AI Leaderboard

Socratic Score = mean of normalized benchmarks. LM Arena: (ELO - 1000) / 400 × 100. Vending-Bench: balance / $10k × 100. SWE-bench, ARC-AGI, and HLE are used as-is (0-100%).

AI Socratic Leaderboard

Scores across benchmarks

#	Model	Score	LM Arena ↗	SWE-bench ↗	ARC-AGI-2 ↗	HLE ↗	Vending ↗	Vibe Bench
🥇	GPT-5.2	81.9	1482	72.8%	72.9%($38.99)	-	-	24%
🥈	Claude Opus 4.6	81.2	1502	75.6%	69.2%($3.47)	-	$8,017.59	50%
🥉	Gemini 3 Pro	63.3	1486	69.6%	54.0%($30.57)	38.3%	$5,478.16	16%
4	Grok 4	50.1	1492	-	29.4%($30.40)	24.5%	$4,662.85	3%
5	DeepSeek V3	37.0	-	70.0%	4.0%($0.12)	-	-	-
6	Qwen3	1.3	-	-	1.3%($0.00)	-	-	-

Vibe Bench

Community favorites · Socratic Feb · 38 responses

Claude	19
Claude Code	12
ChatGPT	9
Gemini	6
Codex	6
Open Source	5
Cursor	4
Other Tools	2
Grok	1
Perplexity	0
Windsurf	0

Vibe Bench: trends

Mentions per event (absolute count)

SWE-bench Bash

Verified

Real-world software engineering tasks

#	Model	% Resolved
🥇	Claude 4.5 Opus (high reasoning)	76.8%
🥈	Gemini 3 Flash (high reasoning)	75.8%
🥉	MiniMax M2.5 (high reasoning)	75.8%
4	Claude Opus 4.6	75.6%
5	GPT-5-2 Codex	72.8%
6	GLM-5 (high reasoning)	72.8%
7	GPT-5-2 (high reasoning)	72.8%
8	GPT 5.2 Codex	72.8%
9	Claude 4.5 Sonnet (high reasoning)	71.4%
10	Kimi K2.5 (high reasoning)	70.8%

View full leaderboard →

ARC-AGI-2

Semi-Private

Abstract reasoning capabilities

#	Model	Score
🥇	Gemini 3 Deep Think (2/26)	84.6%
🥈	GPT-5.4 Pro (xHigh)	83.3%
🥉	Gemini 3.1 Pro (Preview)	77.1%
4	GPT-5.4 (xHigh)	74.0%
5	GPT-5.2 (Refine.)	72.9%
6	Claude Opus 4.6 (120K, High)	69.2%
7	Claude Opus 4.6 (120K, Max)	68.8%
8	GPT-5.4 (High)	67.5%
9	Claude Opus 4.6 (120K, Medium)	66.3%
10	Claude Opus 4.6 (120K, Low)	64.6%

View full leaderboard →

Humanity's Last Exam

HLE

Expert-level reasoning across disciplines

#	Model	Accuracy
🥇	Gemini 3 Pro	38.3%
🥈	GPT-5	25.3%
🥉	Grok 4	24.5%
4	Gemini 2.5 Pro	21.6%
5	GPT-5-mini	19.4%
6	Claude 4.5 Sonnet	13.7%
7	Gemini 2.5 Flash	12.1%
8	DeepSeek-R1*	8.5%
9	o1	8.0%
10	GPT-4o	2.7%

View full leaderboard →

LM Arena - Text

1 day ago

Crowdsourced human evaluations

#	Model	Score	Votes
🥇	claude-opus-4-6-thinking	1502	11,801
🥈	claude-opus-4-6	1501	12,546
🥉	gemini-3.1-pro-preview	1493	14,677
4	grok-4.20-beta1	1492	7,396
5	gemini-3-pro	1486	41,762
6	gpt-5.4-high	1485	4,965
7	gpt-5.2-chat-latest-20260210	1482	10,140
8	grok-4.20-beta-0309-reasoning	1481	4,504
9	gemini-3-flash	1475	31,060
10	claude-opus-4-5-20251101-thinking-32k	1474	37,036

View full leaderboard →

Vending-Bench 2

Andon Labs

Long-term agentic coherence

#	Model	Balance
🥇	Claude Opus 4.6	$8,017.59
🥈	Claude Sonnet 4.6	$7,204.14
🥉	GPT-5.4 New	$6,144.18
4	GPT-5.3-Codex	$5,940.12
5	Gemini 3 Pro	$5,478.16
6	Claude Opus 4.5	$4,967.06
7	Grok 4.20 New	$4,662.85
8	GLM-5	$4,432.12
9	Claude Sonnet 4.5	$3,838.74
10	Gemini 3.1 Pro Custom Tools	$3,774.25

View full leaderboard →

AI Leaderboard

AI Socratic Leaderboard

Vibe Bench

Vibe Bench: trends

SWE-bench Bash

ARC-AGI-2

Humanity's Last Exam

LM Arena - Text

Vending-Bench 2

Search