Definition: What Are AI Benchmarks and Why Do They Matter?
An AI benchmark is a standardized test designed to measure a specific capability of an AI model. Unlike vague claims ("our model is smarter"), benchmarks give researchers, developers, and businesses an objective, reproducible number they can compare across models and over time.
The challenge is that benchmarks can be gamed. A model can be fine-tuned specifically on benchmark problems — a practice called "benchmark contamination" — producing an inflated score that does not reflect real-world ability. The best benchmarks are those that:
- Use real-world tasks or newly created problems not found in any training dataset
- Measure skills that matter in production — not trivia or pattern matching
- Score pass/fail by running actual code, not by subjective human rating
Claude Mythos's benchmark scores matter because SWE-bench and USAMO meet all three criteria. SWE-bench uses real GitHub issues with automated test suites — you cannot fake a passing test. USAMO 2026 problems were brand-new at the time of testing, making contamination impossible. These scores represent genuine capability, not clever overfitting.
What the Numbers Mean: Every Claude Mythos Score Explained
[Chart: Model Benchmark Comparison (2026). Source: Anthropic system card, April 2026. GPT-5.4 SWE-bench Multimodal figure estimated.]
SWE-bench Verified: 93.9%
World record. SWE-bench Verified presents an AI model with real GitHub issues from major open-source Python repositories. The model receives the issue description and the codebase, and must produce a working patch — no hints, no guided steps. The patch is scored by running the repository's actual test suite.
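To make that scoring concrete, here is a minimal sketch of how a SWE-bench-style grading harness works, assuming a local git checkout and a pytest-based test suite. The function name and invocation details are illustrative assumptions, not the official SWE-bench harness: apply the model's candidate patch, run the project's tests, and decide pass/fail from the exit code.

```python
import subprocess
from pathlib import Path

def grade_patch(repo_dir: str, patch_file: str) -> bool:
    """Sketch of SWE-bench-style grading: a task counts as resolved only if the
    repository's own test suite passes after the candidate patch is applied.
    Simplified illustration -- not the official harness."""
    repo = Path(repo_dir)

    # Apply the model-generated patch; a patch that does not apply cleanly fails.
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo, capture_output=True, text=True
    )
    if applied.returncode != 0:
        return False

    # Run the project's tests (pytest assumed here for illustration).
    tests = subprocess.run(["pytest"], cwd=repo, capture_output=True, text=True)
    return tests.returncode == 0  # pass/fail decided by executing code, not by a rater
```

The real benchmark is more involved (it tracks which tests must flip from failing to passing and which must keep passing), but the principle is the same: no patch can score without making real tests pass.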
At 93.9%, Claude Mythos successfully resolves nearly 19 out of every 20 real software issues thrown at it autonomously. For context, a strong senior engineer solving unfamiliar repository issues might resolve 60–70% in the same conditions. The gap is stark.
Previous best scores: GPT-5.4 at 57.7% and Claude Opus 4.6 at 53.4%.
[Chart: SWE-bench Score Breakdown by Task Type. Estimated task-level breakdown based on Anthropic's system card categories, April 2026.]
SWE-bench Pro: 77.8%
World record. SWE-bench Pro is the "hard mode" version — issues sourced from large, complex production codebases with multi-file changes, intricate dependency chains, and subtle interaction bugs. These are the kinds of problems that keep senior engineers busy for days.
Mythos solved 77.8% of SWE-bench Pro tasks. This is the score that has the engineering world paying attention — not just academic benchmark chasers. It means Claude Mythos can autonomously deliver production-quality patches on real, messy codebases, not just clean toy problems.
SWE-bench Multimodal: 59.0%
SWE-bench Multimodal is the newest and hardest variant. It adds visual context — architecture diagrams, screenshots of UI bugs, error screenshots — alongside the code. The model must understand the image to understand the problem.
At 59.0% — more than double Claude Opus 4.6's 27.1% — Mythos is the first model to clear the 50% threshold on this benchmark. This matters for real-world software work where bug reports often include screenshots, and architecture decisions often live in diagrams.
USAMO 2026 Math: 97.6%
World record. The United States of America Mathematical Olympiad (USAMO) is a proof-based competition for the country's top high-school mathematicians. Problems require multi-step reasoning, creative mathematical insight, and rigorous written proofs — not just correct numerical answers.
Claude Mythos scored 97.6% on the 2026 USAMO, which was administered after Mythos's training cutoff — meaning no contamination was possible. Mythos outperforms virtually every human who has ever competed at USAMO.
[Chart: Estimated topic-level breakdown based on USAMO 2026 problem categories.]
When: The Timeline of AI Coding Benchmark Progress
To understand what 93.9% means, you need to see where we came from. SWE-bench was introduced in 2023, and progress was slow for years — then Mythos made a discontinuous leap.
- Task introduced; no model could solve real GitHub issues reliably
- First meaningful autonomous code repair with explanation
- Breakthrough — first model that felt developer-grade at scale
- Plateau visible — progress slows for traditional approaches
- Discontinuous jump — new architecture + cybersecurity reasoning
The key insight: Progress from GPT-4's 3.9% to Opus 4.6's 53.4% took roughly 3 years of incremental improvement. Claude Mythos then added another +40 percentage points in a single generation. This is the kind of discontinuous jump that reshapes industries.
How Claude Mythos Achieves These Scores
Anthropic has not published Mythos's full architecture. But based on the 244-page system card and external analysis, five factors explain the benchmark breakthrough:
Extended context reasoning
Mythos can hold and reason over extremely large codebases in a single context window. Where earlier models lost coherence across multiple files, Mythos maintains consistent understanding of how thousands of functions interact — essential for SWE-bench Pro's multi-file tasks.
Verified chain-of-thought
Mythos uses extended thinking — long internal reasoning chains that it self-checks before committing to an answer. For complex proofs and multi-step code refactors, this means fewer logical errors propagating through 10+ reasoning steps.
Cybersecurity-optimized training
Mythos was trained specifically for deep code analysis — not just writing new code but understanding existing code at the level needed to find subtle security vulnerabilities. This same capability translates to finding subtle bugs in SWE-bench issues.
Tool use and agent loops
Mythos can run tests iteratively — generate a patch, run the test suite, read the failure, revise the patch. This agentic loop is how it reaches 93.9%: it does not guess once; it iterates until the tests pass.
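A rough sketch of that loop, assuming a local git checkout, a pytest test suite, and a caller-supplied propose_patch function standing in for the model — all names here are illustrative, not Anthropic's actual agent scaffolding:

```python
import subprocess
from pathlib import Path
from typing import Callable

def iterate_until_green(
    repo_dir: str,
    issue_text: str,
    propose_patch: Callable[[str, str], str],  # (issue, last failure log) -> unified diff
    max_rounds: int = 5,
) -> bool:
    """Generate-test-revise loop sketch: propose a patch, run the tests,
    feed the failure output back to the model, and try again."""
    repo = Path(repo_dir)
    failure_log = ""

    for _ in range(max_rounds):
        diff = propose_patch(issue_text, failure_log)

        # Discard any previous attempt, then apply the new candidate patch from stdin.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo, check=True)
        applied = subprocess.run(
            ["git", "apply", "-"], cwd=repo, input=diff, capture_output=True, text=True
        )
        if applied.returncode != 0:
            failure_log = "patch did not apply:\n" + applied.stderr
            continue

        # Run the project's test suite; a zero exit code means the task is resolved.
        tests = subprocess.run(["pytest"], cwd=repo, capture_output=True, text=True)
        if tests.returncode == 0:
            return True
        failure_log = tests.stdout[-4000:]  # give the model the tail of the failure output

    return False
```

The important property is the stopping condition: the loop only ends early when real tests pass, which is exactly what SWE-bench rewards.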
Multimodal understanding
Vision capabilities allow Mythos to read architecture diagrams, UI screenshots, and error images. For SWE-bench Multimodal this is necessary to understand the problem at all. For real-world engineering, it means fewer lost hours translating diagrams to text descriptions.
Why These Benchmarks Matter — and Their Limits
Benchmark scores are important but they are not the full picture. Here is an honest assessment of what Mythos's scores tell you — and what they do not.
| Benchmark | Claude Mythos | GPT-5.4 | Claude Opus 4.6 | Mythos Lead |
|---|---|---|---|---|
| SWE-bench Verified | 93.9% | 57.7% | 53.4% | +36.2 pts |
| SWE-bench Pro | 77.8% | 57.7% | 53.4% | +20.1 pts |
| SWE-bench Multimodal | 59.0% | ~31% | 27.1% | +28.0 pts |
| USAMO 2026 | 97.6% | 71.4% | ~66% | +26.2 pts |
What These Scores Confirm
- Autonomous code repair at senior-engineer level
- Real math reasoning, not memorized solutions
- Multi-file, production-grade codebase understanding
- First model to surpass 50% on multimodal code tasks
- Largest single-generation benchmark jump in SWE-bench history
- Reliable enough for high-stakes security work (Project Glasswing)
What These Scores Do NOT Tell You
- How it performs on entirely new programming languages
- Latency, cost, or token efficiency
- Reliability on ambiguous, poorly-specified requirements
- Performance on non-Python or non-open-source codebases
- How it handles conflicting requirements or edge-case specs
- Whether it will be publicly available (it will not)
Industry Implications of 93.9% SWE-bench
A model that autonomously resolves 93.9% of real GitHub issues is not an autocomplete tool — it is a code collaborator that rivals senior engineers on well-specified tasks. The practical consequences for software development, security research, and open-source maintenance are enormous:
Security research
Automated triage of thousands of CVEs and zero-days at machine speed — Project Glasswing is already doing this
Legacy modernization
Migrate legacy C++ or Java codebases to modern equivalents without a team of specialists
Open-source maintenance
AI-driven patch generation for critical infrastructure libraries that are under-resourced
Developer productivity
Senior engineers focus on system design while AI handles implementation and bug-fixing
⚠️ The Reason These Benchmarks Led to Restricted Access
There is a direct line between Claude Mythos's benchmark scores and Anthropic's decision to restrict access entirely. A model with the code understanding behind a 93.9% SWE-bench score can also autonomously find and chain zero-day vulnerabilities, and that capability is not just useful for defense; it is a powerful offensive tool in the wrong hands.
Anthropic tested Mythos extensively for misuse potential. Researchers found that with minimal prompting, Mythos could autonomously identify and chain exploits in real software at a level that previously required expert human attackers. This is the primary reason for the restricted Project Glasswing model.
The 244-page system card accompanying the model launch is the longest safety document Anthropic has ever released. It includes novel evaluation methods: emotion probes to detect internal deception, and over 20 hours of clinical psychiatrist sessions to test model alignment under adversarial pressure. The benchmarks are impressive. The safety work required to deploy them responsibly is equally unprecedented.
Frequently Asked Questions
Is SWE-bench a fair benchmark?
SWE-bench Verified is considered the most rigorous publicly available coding benchmark. It uses real GitHub issues with automated test suites — you cannot pass without writing code that actually works. The "Verified" subset removes ambiguous tasks that are disputed or unsolvable, making it a cleaner signal of genuine capability.
Can Claude Opus 4.6 solve the same tasks at a lower success rate?
Yes, but with a key difference. Claude Opus 4.6 at 53.4% can solve straightforward, well-specified issues. The remaining 46.6% includes complex multi-file changes and subtle bugs. Claude Mythos's jump to 93.9% specifically covers that harder portion — the messy, real-world cases that matter most in production.
Why is the SWE-bench Multimodal score lower than the others?
SWE-bench Multimodal is a genuinely harder task — it requires integrating visual and textual understanding simultaneously. 59.0% on multimodal is actually extraordinary given that it is more than double the previous best (Opus 4.6 at 27.1%). The gap reflects the fundamental difficulty of vision-language-code reasoning rather than a weakness of Mythos.
How does Mythos's math score compare to a top human student?
The top USAMO competitors typically score 36–42 out of 42 points. Claude Mythos scored 97.6%, which extrapolates to roughly 41 of the 42 available points. This places it above the historical median for USAMO winners — students who have spent years preparing. The 2026 USAMO problems were new at the time of testing, ruling out memorization.
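For readers who want the conversion spelled out, the extrapolation is simply the reported percentage applied to USAMO's 42-point scale (6 problems worth 7 points each); mapping the percentage linearly onto points is an assumption made here for illustration.

```python
# USAMO is scored out of 42 points (6 problems x 7 points each).
# Applying the reported 97.6% linearly to that scale is an assumption.
total_points = 6 * 7
reported_fraction = 0.976
print(round(reported_fraction * total_points, 2))  # 40.99, i.e. roughly 41 of 42
```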
Will these benchmark scores eventually be available on Claude.ai?
Claude Mythos Preview is not planned for public release. The capabilities that produce 93.9% SWE-bench and 2,000+ zero-day discoveries are also the capabilities that make Anthropic unwilling to offer it via API. A constrained version of Mythos's capabilities may eventually appear in future public Claude models, but the full model will remain restricted.