Deep Dive
1M token context window. Autonomous agent teams. 144 Elo points ahead of GPT-5.2 on knowledge work. A technical breakdown of what Opus 4.6 actually delivers.
Opus 4.6 introduces a 1M token context window (currently in beta), the first Opus-class model to reach this scale. More importantly, it addresses "context rot," the degradation in performance as context length increases.
76% accuracy vs 18.5% — a 4x improvement in retrieving information from massive contexts.
The 128K token output capacity is significant for enterprise workflows. Previously, large generation tasks required multiple API calls or post-processing. Now a single request can generate entire code migrations, comprehensive reports, or multi-file projects without returning incomplete results.
The model implements a more sophisticated reasoning loop in which it "thinks more deeply and more carefully revisits its reasoning before settling on an answer." This reflects a deeper change in how the model approaches problem-solving.
The tradeoff is latency and cost: simple queries might be slower because the model does unnecessary deep thinking. To address this, Anthropic introduced the /effort parameter.
Beyond the manual /effort control, the model also learns to recognize when a problem needs deep reasoning and when quick processing will do, optimizing for both quality and efficiency in context.
Quick responses for simple queries. Minimal deliberation overhead.
Balanced reasoning for standard tasks. Good quality-to-latency ratio.
Deep analysis for complex problems. Extended reasoning chains.
Full deliberation for the hardest problems. The model revisits and self-corrects extensively.
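The four tiers above suggest a simple client-side heuristic for picking an effort setting per request. The sketch below is illustrative only: the tier names ("low", "medium", "high", "max" — "max" is mentioned later in this piece, the others are assumptions) and the complexity signals are not confirmed API values.

```python
# Hypothetical mapping from rough task-complexity signals to an /effort tier.
# Tier names are assumptions, not confirmed API values ("max" appears in
# Anthropic's guidance; the lower tiers are inferred).

def choose_effort(requires_planning: bool,
                  touches_many_files: bool,
                  needs_self_correction: bool) -> str:
    """Pick an effort tier based on how many complexity signals are present."""
    score = sum([requires_planning, touches_many_files, needs_self_correction])
    return ["low", "medium", "high", "max"][score]

# Simple lookup: 0 signals -> quick response, 3 signals -> full deliberation.
print(choose_effort(False, False, False))  # low
print(choose_effort(True, True, True))     # max
```

The point of a heuristic like this is cost control: as the piece notes later, running maximum effort on trivial queries wastes tokens and adds latency.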
These aren't marginal improvements: the life sciences jump and the Elo advantage over GPT-5.2 in particular represent meaningful capability leaps.
Economically Valuable Knowledge Work
144 Elo points ahead of GPT-5.2, 190 points ahead of Opus 4.5
Agentic Coding
Highest scores — demonstrating superior planning and self-correction in large codebases
Multidisciplinary Reasoning
Top performance across multidisciplinary reasoning challenges
Information Retrieval
Best industry scores for locating hard-to-find information
Computational & Structural Biology
Nearly 2x performance improvement over Opus 4.5 on computational/structural biology
Legal Reasoning
90.2% accuracy — strong document understanding and legal reasoning
Opus 4.6 points toward a future where AI systems don't just respond to prompts but actively drive workflows with minimal human oversight.
A fundamental shift in how Claude Code works. Rather than a single AI assistant, multiple agents collaborate autonomously.
One organization's Opus 4.6 agent "autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization."
Automatically summarizes older conversation context when approaching the 1M token limit, enabling effectively infinite agent task duration. Critical for long-running operations that previously would have hit context walls.
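The compaction loop can be sketched in a few lines. Everything below is a stand-in: real compaction happens on Anthropic's side, the summaries come from the model itself, and token counting uses a real tokenizer rather than the word-count proxy here; the 80% trigger threshold is likewise an assumption.

```python
# Sketch of context compaction: when the transcript nears the context limit,
# older turns are replaced with a summary while recent turns stay verbatim.
# The word-count tokenizer, naive summarizer, and 80% threshold are all
# stand-ins for the real server-side mechanism.

CONTEXT_LIMIT = 1_000_000
COMPACT_AT = 0.8  # assumed trigger: 80% of the limit

def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy for a real tokenizer

def summarize(turns: list[str]) -> str:
    # Stand-in summarizer: keep only the first sentence of each old turn.
    return " ".join(t.split(".")[0] + "." for t in turns)

def compact(history: list[str],
            limit: int = CONTEXT_LIMIT,
            frac: float = COMPACT_AT) -> list[str]:
    total = sum(count_tokens(t) for t in history)
    if total < limit * frac:
        return history                       # plenty of room, no compaction
    keep = history[-10:]                     # recent turns stay verbatim
    summary = summarize(history[:-10])       # older turns get summarized
    return ["[summary] " + summary] + keep
```

Because the summary re-enters the context as a single short turn, the loop can repeat indefinitely, which is what makes "effectively infinite" task duration possible.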
Partners highlight its ability to "navigate large codebases and identify the right changes to make at state-of-the-art levels." Combined with Terminal-Bench dominance, this positions the model as a serious tool for code review, refactoring, and bug-fixing in complex systems.
90.2% accuracy on BigLaw Bench suggests strong document understanding and legal reasoning. The long context window is particularly valuable for reviewing multi-document contracts and understanding regulatory frameworks holistically.
The life sciences improvements point toward accelerating scientific workflows. The ability to synthesize information from multiple sources with proper long-context retrieval creates new possibilities for literature reviews, hypothesis generation, and data interpretation.
The model's long-context capabilities and reasoning depth translate directly into better document analysis, data interpretation, and structured output generation for enterprise productivity workflows.
Enhanced Claude in Excel handles longer-running analyses and multi-step spreadsheet transformations. Claude in PowerPoint (research preview) reads design systems and slide masters to generate on-brand presentations — showing architectural understanding of document structure beyond simple pattern matching.
Flat pricing, unchanged from previous Opus tiers. The 5:1 output-to-input price ratio reflects the higher computational cost of generation.
For inputs exceeding 200K tokens. The 2x input and 1.5x output premiums reflect the computational cost of large-context processing.
1.1x standard pricing for US-only inference — for data residency and compliance requirements.
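Putting the three tiers together, a request's cost can be estimated as below. Only the multipliers (2x/1.5x above 200K input tokens, 1.1x for US-only inference) come from the pricing notes above; the base per-token rates are an assumption chosen to match the stated 5:1 output-to-input ratio, and applying the premium to the whole request once input exceeds 200K is also an assumption.

```python
# Cost estimator for a single request. Base rates are ASSUMED (consistent
# with the stated 5:1 output-to-input ratio); the >200K premium and the
# 1.1x US-only multiplier come from the pricing notes.

INPUT_PER_MTOK = 5.00    # assumed $/million input tokens
OUTPUT_PER_MTOK = 25.00  # assumed: 5:1 over input

def estimate_cost(input_tokens: int, output_tokens: int,
                  us_only: bool = False) -> float:
    # Assumption: premium rates apply to the whole request once the
    # input exceeds 200K tokens.
    long_context = input_tokens > 200_000
    in_rate = INPUT_PER_MTOK * (2.0 if long_context else 1.0)
    out_rate = OUTPUT_PER_MTOK * (1.5 if long_context else 1.0)
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    if us_only:
        cost *= 1.1  # US-only inference premium
    return round(cost, 4)
```

For example, under these assumed base rates a 100K-input / 10K-output request lands at $0.75, while the same output attached to a 300K-token input costs $3.375 because both premiums kick in.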
Immediately available across all platforms — suggesting Anthropic's confidence in the model's stability and performance.
Anthropic characterizes Opus 4.6 as maintaining parity with Opus 4.5 — their "most-aligned frontier model."
Low misalignment rates across deception, sycophancy, and other behavioral axes.
Lowest over-refusal rates of recent Claude versions — the model is less likely to refuse legitimate requests.
Enhanced cybersecurity testing given the model's strengthened security capabilities.
Extensive automated behavioral audits across safety dimensions.
The emphasis on "lowest over-refusal rates" is notable — previous Claudes sometimes erred toward refusing legitimate requests. Opus 4.6 appears to have better calibration, accepting legitimate requests while maintaining safety boundaries. However, Anthropic is implementing enhanced safeguards given the model's increased capabilities, particularly around defensive cybersecurity (which they encourage) versus offensive use cases.
The 144 Elo point advantage over GPT-5.2 on economically valuable knowledge work is significant.
Partner feedback from GitHub, Replit, Asana, and Cognition consistently emphasizes that Opus 4.6 "actually follows through" on complex requests — breaking them into concrete steps, executing reliably, and producing polished results.
The advantage isn't just in raw capability but in reliability and planning quality — traits that matter more than raw throughput for professional users.
For developers upgrading to Opus 4.6, the main decision is calibrating the /effort parameter. Setting it to "max" on simple queries wastes tokens and increases latency.
The model's intelligence means overly explicit prompts may underperform — trusting the model with ambiguous instructions often produces better results than hand-holding it through every step.
The context compaction feature being in beta suggests it's still stabilizing — early adopters may encounter edge cases where summarization loses nuance.
Opus 4.6 represents a meaningful jump in autonomous reasoning capability, long-context handling, and agentic task execution. The improvements are particularly pronounced for professional workflows involving code, documents, research, and multi-step reasoning.
The 1M context window and context compaction solve fundamental limitations that have constrained AI usefulness in enterprise scenarios. Combined with lower over-refusal rates and strong safety profiles, it appears positioned to set a new bar for frontier models in professional use cases.