# AI Leaderboards
Top 10 models across major benchmarks
## Scores across benchmarks
| # | Model | LM Arena (Elo) | SWE-bench (% resolved) | ARC-AGI-2 |
|---|---|---|---|---|
| 🥇 | Claude Opus 4.5 | 1468 | 74.4% | - |
| 🥈 | GPT-5.2 | - | 71.8% | - |
| 🥉 | Gemini 3 Pro | 1489 | 74.2% | - |
| 4 | Grok 4 | 1477 | - | - |
| 5 | DeepSeek V3 | - | 60.0% | - |
| 6 | Qwen3 | - | 55.4% | - |

"-" = no reported score.
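The summary table above merges the per-benchmark leaderboards, with models lacking a score on a benchmark shown as "-". A minimal sketch of that merge, using only the scores listed above and assuming the summary is ordered by SWE-bench % resolved (the ordering the medal column follows):

```python
# Per-benchmark scores copied from the tables in this document.
swe_bench = {
    "Claude Opus 4.5": 74.4,
    "Gemini 3 Pro": 74.2,
    "GPT-5.2": 71.8,
    "DeepSeek V3": 60.0,
    "Qwen3": 55.4,
}
lm_arena = {
    "Gemini 3 Pro": 1489,
    "Grok 4": 1477,
    "Claude Opus 4.5": 1468,
}

# One row per model that appears on either leaderboard; "-" marks a
# missing score, matching the table's convention.
models = set(swe_bench) | set(lm_arena)
rows = [(m, lm_arena.get(m, "-"), swe_bench.get(m, "-")) for m in models]

# Order by SWE-bench score where available, highest first (assumption:
# models with no SWE-bench score sort to the bottom).
rows.sort(key=lambda r: r[2] if r[2] != "-" else -1.0, reverse=True)

for rank, (name, arena, swe) in enumerate(rows, start=1):
    print(f"{rank}. {name}: LM Arena {arena}, SWE-bench {swe}")
```

This reproduces the summary's order: Claude Opus 4.5, Gemini 3 Pro, GPT-5.2, then the SWE-bench-only models, with Grok 4 (no SWE-bench score here) last.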
## LM Arena (crowdsourced human evaluations)
| # | Model | Score | Votes |
|---|---|---|---|
| 1 | gemini-3-pro | 1489 | 26,385 |
| 2 | grok-4.1-thinking | 1477 | 26,505 |
| 3 | gemini-3-flash | 1471 | 11,599 |
| 4 | claude-opus-4-5-20251101-thinking-32k | 1468 | 18,518 |
| 5 | claude-opus-4-5-20251101 | 1467 | 19,770 |
| 6 | grok-4.1 | 1466 | 30,490 |
| 7 | gemini-3-flash (thinking-minimal) | 1464 | 5,530 |
| 9 | gpt-5.1-high | 1460 | 23,068 |
| 10 | claude-sonnet-4-5-20250929-thinking-32k | 1452 | 37,043 |
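The scores above are Elo-style ratings fitted from pairwise human votes, so only the gaps between models are meaningful, not the absolute numbers. As a rough intuition aid, here is a minimal sketch of the classic Elo model; LM Arena's actual methodology (a Bradley-Terry-style fit over all votes) differs in details, so this is illustrative only:

```python
# Sketch of the classic Elo pairwise model. Function names are ours,
# not an LM Arena API; the 400-point scale is the standard Elo convention.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Modeled probability that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return (new_rating_a, new_rating_b) after one head-to-head vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return rating_a + delta, rating_b - delta

# The 21-point gap between gemini-3-pro (1489) and the opus-4-5 thinking
# variant (1468) corresponds to only about a 53% modeled win rate.
p_top = expected_score(1489, 1468)
```

In other words, a two-digit rating gap on this table implies a near-coin-flip preference rate, which is why vote counts in the tens of thousands are needed to separate the top models.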
## SWE-bench (real-world software engineering tasks)
| # | Model | % Resolved |
|---|---|---|
| 1 | Claude 4.5 Opus medium (20251101) | 74.4% |
| 2 | Gemini 3 Pro Preview (2025-11-18) | 74.2% |
| 3 | GPT-5.2 (2025-12-11) (high reasoning) | 71.8% |
| 4 | Claude 4.5 Sonnet (20250929) | 70.6% |
| 5 | GPT-5.2 (2025-12-11) | 69.0% |
| 6 | Claude 4 Opus (20250514) | 67.6% |
| 7 | GPT-5.1-codex (medium reasoning) | 66.0% |
| 8 | GPT-5.1 (2025-11-13) (medium reasoning) | 66.0% |
| 9 | GPT-5 (2025-08-07) (medium reasoning) | 65.0% |
| 10 | Claude 4 Sonnet (20250514) | 64.9% |