AI TL;DR
Anthropic releases Claude Opus 4.6 with revolutionary agent teams feature, 1M token context window, and state-of-the-art performance on coding and reasoning benchmarks. Here's everything you need to know.
Claude Opus 4.6 Review: Agent Teams, 1M Context, and Industry-Leading Performance
On February 5, 2026, Anthropic released Claude Opus 4.6—the most significant upgrade to their flagship model since Opus 4.5 launched in November 2025. This isn't just an incremental improvement. Opus 4.6 introduces agent teams, a 1-million token context window, and performance that outpaces GPT-5.2 across multiple benchmarks.
What's New in Claude Opus 4.6
Agent Teams: AI Collaboration at Scale
The headline feature is Agent Teams—a research preview that lets you spin up multiple Claude agents working in parallel as a coordinated team.
Agent Teams Architecture:
├── Main orchestrating agent
├── Subagent 1: Code review
├── Subagent 2: Documentation
├── Subagent 3: Testing
└── Autonomous coordination between agents
According to Anthropic's Head of Product Scott White:
"Instead of one agent working through tasks sequentially, you can split the work across multiple agents—each owning its piece and coordinating directly with the others."
White compared it to having a talented team of humans working for you, noting that agents "coordinate in parallel [and work] faster."
Best Use Cases for Agent Teams:
- Tasks that split into independent, read-heavy work
- Codebase reviews across multiple repositories
- Large documentation projects
- Complex research requiring parallel investigation
You can take over any subagent directly using Shift+Up/Down or tmux integration.
1M Token Context Window (Beta)
Opus 4.6 is the first Opus-class model with a 1-million token context window. This is comparable to what Sonnet 4 and 4.5 offer, but now available in Anthropic's most powerful model.
Context Window Pricing:
- Standard (≤200k input tokens): $5 input / $25 output per million tokens
- Premium (>200k input tokens): $10 input / $37.50 output per million tokens
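A small helper makes the tiered billing concrete. One caveat: whether premium rates apply to the entire request or only to the tokens beyond 200k isn't stated here, so this sketch assumes the whole request is billed at the premium rate once input exceeds 200k. The optional 1.1× factor reflects the US-only inference multiplier mentioned later in the article.

```python
def opus_46_cost(input_tokens: int, output_tokens: int, us_only: bool = False) -> float:
    """Estimate a request's cost in USD from the published per-million-token rates.

    Assumption: a request whose input exceeds 200k tokens is billed entirely
    at the premium rates (actual tiering details may differ).
    """
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50   # premium tier (>200k input)
    else:
        in_rate, out_rate = 5.00, 25.00    # standard tier
    cost = input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate
    if us_only:
        cost *= 1.1                        # US-only inference multiplier
    return round(cost, 4)

# 500k tokens in, 10k out: premium rates apply to the whole request
print(opus_46_cost(500_000, 10_000))  # 5.375
```

Under this assumption, crossing the 200k boundary doubles the effective input rate, so splitting very large jobs into sub-200k requests can be cheaper when the task allows it.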
Why 1M Context Matters:
- Work with larger codebases without splitting
- Process massive documents in a single session
- Maintain coherence over extremely long conversations
128K Output Tokens
Opus 4.6 supports outputs of up to 128k tokens—allowing Claude to complete larger tasks without breaking them into multiple requests.
Benchmark Performance: State of the Art
Anthropic has positioned Opus 4.6 as an industry leader across multiple categories:
Knowledge Work (GDPval-AA)
On GDPval-AA—an evaluation of economically valuable knowledge work in finance, legal, and other domains:
| Model | Elo (relative to Opus 4.6) |
|---|---|
| Claude Opus 4.6 | 0 (highest) |
| GPT-5.2 | −144 |
| Claude Opus 4.5 | −190 |
Opus 4.6 outperforms GPT-5.2 by approximately 144 Elo points; under the standard Elo model, that corresponds to winning roughly 70% of head-to-head comparisons.
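The ~70% figure follows directly from the standard Elo expected-score formula, which you can verify in a couple of lines:

```python
def elo_win_probability(elo_diff: float) -> float:
    """Expected score of the higher-rated side under the standard Elo model."""
    return 1 / (1 + 10 ** (-elo_diff / 400))

# A 144-point Elo gap implies winning about 70% of direct comparisons
print(round(elo_win_probability(144), 3))  # 0.696
```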
Agentic Coding (Terminal-Bench 2.0)
Opus 4.6 achieves the highest score on Terminal-Bench 2.0, the leading agentic coding evaluation.
Reasoning (Humanity's Last Exam)
On Humanity's Last Exam—a complex multidisciplinary reasoning test—Opus 4.6 leads all other frontier models.
Agentic Search (BrowseComp)
Opus 4.6 outperforms every other model on BrowseComp, which measures ability to locate hard-to-find information online. With a multi-agent harness, scores increased to 86.8%.
Long-Context Performance
One of the most significant improvements is in long-context handling:
MRCR v2 (8-needle 1M variant):
| Model | Score |
|---|---|
| Claude Opus 4.6 | 76% |
| Claude Sonnet 4.5 | 18.5% |
This is roughly a 4x improvement in the model's ability to retrieve information "hidden" in vast amounts of text.
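As a sanity check, the improvement factor implied by the table works out to about 4.1:

```python
# Improvement ratio on MRCR v2 (8-needle, 1M variant), from the scores above
opus_46, sonnet_45 = 0.76, 0.185
ratio = opus_46 / sonnet_45
print(round(ratio, 1))  # 4.1
```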
Specialized Domain Performance
- Harvey Legal (BigLaw Bench): 90.2% score—highest of any Claude model
- NBIM Cybersecurity: Best results in 38 out of 40 investigations in blind ranking
- Box Multi-Source Analysis: 68% vs 58% baseline (a 10-percentage-point lift)
Early Access Partner Testimonials
Anthropic shared feedback from major tech companies using Opus 4.6:
Notion
"Claude Opus 4.6 is the strongest model Anthropic has shipped. It takes complicated requests and actually follows through, breaking them into concrete steps, executing, and producing polished work." — Sarah Sachs, AI Lead
GitHub
"Early testing shows Claude Opus 4.6 delivering on the complex, multi-step coding work developers face every day—especially agentic workflows that demand planning and tool calling." — Mario Rodriguez, Chief Product Officer
Replit
"Claude Opus 4.6 is a huge leap for agentic planning. It breaks complex tasks into independent subtasks, runs tools and subagents in parallel, and identifies blockers with real precision." — Michele Catasta, President
Cursor
"Claude Opus 4.6 is the new frontier on long-running tasks from our internal benchmarks and testing. It's also been highly effective at reviewing code." — Michael Truell, Co-founder & CEO
Cognition (Devin)
"Claude Opus 4.6 reasons through complex problems at a level we haven't seen before. It considers edge cases that other models miss." — Scott Wu, CEO
SentinelOne
"Claude Opus 4.6 handled a multi-million-line codebase migration like a senior engineer. It planned up front, adapted its strategy as it learned, and finished in half the time." — Gregor Stewart, Chief AI Officer
Rakuten
"Claude Opus 4.6 autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories." — Yusuke Kaji, General Manager, AI
New API Features
Adaptive Thinking
Previously, developers only had a binary choice between enabling or disabling extended thinking. Now with adaptive thinking, Claude can decide when deeper reasoning would be helpful.
At the default effort level (high), the model uses extended thinking when useful, but developers can adjust this behavior.
Effort Levels
Four new effort levels give developers control over intelligence, speed, and cost:
| Level | Description | Best For |
|---|---|---|
| Low | Minimal thinking | Simple queries, high speed |
| Medium | Balanced | General tasks |
| High (default) | Extended when useful | Complex reasoning |
| Max | Maximum reasoning | Hardest problems |
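The article doesn't show the request shape, so here is a hypothetical sketch of how an effort level might be passed alongside a model ID. The `effort` field name and its string values are assumptions for illustration only; consult the official API documentation for the real syntax.

```python
# Hypothetical request payload -- the "effort" field name and values are
# assumed for illustration, not confirmed API syntax.
def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a request body selecting one of the four documented effort levels."""
    allowed = {"low", "medium", "high", "max"}
    if effort not in allowed:
        raise ValueError(f"effort must be one of {sorted(allowed)}")
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "effort": effort,  # assumed parameter name
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Summarize this design doc.", effort="max")
print(req["effort"])  # max
```

The point of the sketch is the trade-off itself: defaulting to `high` and dropping to `low` for simple, latency-sensitive queries is the pattern the four levels are designed for.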
Context Compaction (Beta)
Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold.
This lets Claude perform longer tasks without hitting limits—essential for autonomous agent workflows.
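A conceptual sketch of the compaction idea, to make the mechanism concrete. This is not Anthropic's implementation: token counting and summarization are crude stand-ins (character counts and a placeholder string) for what would really be model calls.

```python
# Conceptual sketch of context compaction -- not Anthropic's implementation.
def compact(messages: list[str], threshold: int, keep_recent: int = 2) -> list[str]:
    """Replace older messages with a summary once total size passes the threshold."""
    size = sum(len(m) for m in messages)  # character count as a crude token proxy
    if size <= threshold or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = f"[summary of {len(old)} earlier messages]"  # would be a model call
    return [summary] + recent

history = ["a" * 400, "b" * 400, "recent reply", "latest question"]
compacted = compact(history, threshold=300)
print(len(compacted))  # 3: one summary plus the two most recent messages
```

The essential design choice is visible even in the stub: recent turns are preserved verbatim while older context is traded for a compressed summary, so the conversation can keep growing without ever exceeding the window.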
US-Only Inference
For workloads requiring US data residency, US-only inference is available at 1.1× token pricing.
Product Updates
Claude in PowerPoint (Research Preview)
Claude now integrates directly into PowerPoint as an accessible side panel. Previously, you had to export presentations from Claude and import them separately. Now presentations can be crafted directly within PowerPoint.
Available for Max, Team, and Enterprise plans.
Claude in Excel Upgrades
Claude in Excel now handles:
- Long-running and harder tasks with improved performance
- Pre-planning before acting
- Ingesting unstructured data and inferring correct structure
- Multi-step changes in one pass
Safety and Alignment
These intelligence gains do not come at the cost of safety. According to Anthropic's automated behavioral audit:
Misaligned Behavior Rates:
- Claude Opus 4.6: Low rates of deception, sycophancy, user delusion encouragement
- Overall alignment: As good as or better than Opus 4.5 (most-aligned frontier model to date)
Over-Refusals: Opus 4.6 shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.
Enhanced Safety Testing
For Opus 4.6, Anthropic ran their most comprehensive safety evaluations ever:
- New evaluations for user wellbeing
- More complex tests of dangerous request refusal
- Updated evaluations for surreptitious harmful actions
- Interpretability methods to understand model behavior
Cybersecurity Safeguards
Since Opus 4.6 shows enhanced cybersecurity abilities, Anthropic developed six new cybersecurity probes to detect harmful responses. They're also using the model for cyberdefense—finding and patching vulnerabilities in open-source software.
Pricing
Pricing remains the same as Opus 4.5:
| Type | Price |
|---|---|
| Input tokens | $5 per million |
| Output tokens | $25 per million |
| Premium context (>200k input) | $10 input / $37.50 output per million |
| US-only inference | 1.1× multiplier |
How to Access
Claude.ai: Available now at claude.ai
API: Use claude-opus-4-6 via the Claude API
Cloud Platforms: Available on Amazon Bedrock and Google Cloud Vertex AI
What This Means for Developers
The Agent Teams Shift
Agent Teams represents a fundamental shift in how AI coding assistants work. Instead of a single agent working sequentially, you now have:
- Parallel execution - Multiple agents work simultaneously
- Autonomous coordination - Agents communicate without human intervention
- Specialization - Each agent can focus on its piece
- Scalability - Add more agents for larger tasks
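The parallel-execution pattern behind these points can be sketched with standard-library concurrency. This is an illustration of the fan-out/fan-in shape only, not the actual Agent Teams API, which is a Claude Code feature; the "agents" here are plain functions standing in for model-backed subagents.

```python
# Sketch of the fan-out pattern behind agent teams -- the "agents" are
# stand-in functions, not the real Agent Teams feature.
from concurrent.futures import ThreadPoolExecutor

def run_agent(role: str, task: str) -> str:
    # A real subagent would call the model here; we just label the work.
    return f"{role}: completed '{task}'"

assignments = [
    ("code-review", "audit auth module"),
    ("documentation", "update API reference"),
    ("testing", "extend integration suite"),
]

with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
    results = list(pool.map(lambda a: run_agent(*a), assignments))

for line in results:
    print(line)
```

Each role owns an independent slice of work and runs concurrently; the orchestrator only fans the tasks out and collects results, which mirrors the "independent, read-heavy work" guidance Anthropic gives for when agent teams help most.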
Practical Applications
For Software Teams:
- Assign one agent to code review, another to testing, another to documentation
- Complete multi-hour tasks in parallel
- Handle codebase-wide refactoring across multiple repositories
For Enterprise:
- Process massive document sets in single sessions
- Run complex analysis with longer coherence
- Build agent orchestration systems
The Competitive Landscape
The release came just 15 minutes before OpenAI launched GPT-5.3 Codex—a clear signal that the AI coding war is intensifying.
Opus 4.6 vs GPT-5.2 (per Anthropic benchmarks):
- Knowledge work: Opus 4.6 wins by 144 Elo
- Coding: Opus 4.6 leads Terminal-Bench 2.0
- Search: Opus 4.6 leads BrowseComp
- Reasoning: Opus 4.6 leads Humanity's Last Exam
We'll need to wait for independent benchmarks comparing Opus 4.6 to the newly released GPT-5.3 Codex.
The Bottom Line
Claude Opus 4.6 is a substantial upgrade that delivers on Anthropic's promise of "smarter models that work harder, longer, and more autonomously."
Key Takeaways:
- Agent Teams enables parallel AI collaboration
- 1M context window opens new use cases
- State-of-the-art on coding, reasoning, and search
- Lowest over-refusal rate of any Claude model
- Same pricing as Opus 4.5
For developers building agentic applications, Opus 4.6's combination of agent teams, long context, and context compaction creates a compelling platform. For enterprise users, the PowerPoint and Excel integrations make Claude increasingly useful for everyday knowledge work.
The AI coding war just got more interesting.
Have you tried Claude Opus 4.6? Share your experience in the comments.
