Gaia2 Leaderboard 🏆

Gaia2 is a benchmark designed to measure general agent capabilities. Beyond traditional search and execution tasks, Gaia2 runs asynchronously, requiring agents to handle ambiguities, adapt to dynamic environments, and operate under temporal constraints.

Gaia2 evaluates agents across the following dimensions: Execution (instruction following, multi-step tool-use), Search (information retrieval), Ambiguity (handling unclear or incomplete instructions), Adaptability (responding to dynamic environment changes), and Time (managing temporal constraints and scheduling).

⚠️ All scores on this page are self-reported by the submitting teams.

Gaia2-CLI Leaderboard

6 models · higher is better · to test your own model or harness, see "Run the Benchmark" below

| Rank | Model | Provider | Harness | pass@1 | Search | Execution | Adaptability | Ambiguity | Time | Date |
|------|-------|----------|---------|--------|--------|-----------|--------------|-----------|------|------|
| 🥇 | Claude Opus 4.6 (high) | Anthropic | OpenClaw 2026.4.1 | 57.0% | 88.1% | 82.9% | 61.9% | 48.3% | 3.8% | 2026-04-13 |
| 🥈 | GPT-5.4 (high) | OpenAI | OpenClaw 2026.4.1 | 55.6% | 94.8% | 78.8% | 54.8% | 47.3% | 2.5% | 2026-04-13 |
| 🥉 | Gemini 3.1 Pro (high) | Google | OpenClaw 2026.4.1 | 52.0% | 92.8% | 78.6% | 45.9% | 40.6% | 2.1% | 2026-04-14 |
| 4 | Claude Sonnet 4.6 (high) | Anthropic | OpenClaw 2026.4.1 | 51.9% | 82.5% | 75.8% | 55.7% | 40.4% | 5.0% | 2026-04-13 |
| 5 | GLM 5.1 (enabled) | OpenRouter* | OpenClaw 2026.4.1 | 50.5% | 83.8% | 71.2% | 56.9% | 39.4% | 1.2% | 2026-04-13 |
| 6 | Kimi-K2.5 (enabled) | OpenRouter* | OpenClaw 2026.4.1 | 34.0% | 62.2% | 47.0% | 43.4% | 16.6% | 0.8% | 2026-04-14 |
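
In every row, pass@1 matches (to rounding) the unweighted mean of the five split scores. A quick sanity check in Python, using the Claude Opus 4.6 row:

```python
# pass@1 is the unweighted mean of the five split scores;
# values below are the Claude Opus 4.6 row from the table above.
splits = {
    "search": 88.1,
    "execution": 82.9,
    "adaptability": 61.9,
    "ambiguity": 48.3,
    "time": 3.8,
}
pass_at_1 = sum(splits.values()) / len(splits)
print(f"pass@1 = {pass_at_1:.1f}%")  # -> pass@1 = 57.0%
```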

* Accessed via OpenRouter. The harness does not round-trip reasoning context between turns for this provider, which may affect multi-step performance.
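
To illustrate the footnote: "round-tripping reasoning context" means re-sending the model's reasoning traces from earlier turns on each subsequent request. The sketch below is schematic only; the `reasoning` field and the helper function are hypothetical, not a real provider schema or part of the harness.

```python
# Schematic illustration of "round-tripping reasoning context".
# The "reasoning" field and this helper are hypothetical, not a real
# provider schema or part of the OpenClaw harness.

def build_next_request(history: list[dict], round_trip_reasoning: bool) -> list[dict]:
    """Return the message history to send on the next turn.

    When the harness does not round-trip reasoning, prior reasoning
    traces are stripped, so later turns cannot build on them, which
    can hurt multi-step performance.
    """
    if round_trip_reasoning:
        return history
    return [{k: v for k, v in m.items() if k != "reasoning"} for m in history]

history = [
    {"role": "user", "content": "Book the earliest free slot tomorrow."},
    {
        "role": "assistant",
        "content": "Checking the calendar...",
        "reasoning": "turn-1 chain of thought (dropped if not round-tripped)",
    },
]
print(build_next_request(history, round_trip_reasoning=False))
```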


Vanilla Gaia2 Leaderboard

The original benchmark, which additionally includes the noise and agent-to-agent splits


Run the Benchmark

  1. Clone the gaia2-cli repository and follow the setup instructions
  2. Run the benchmark on all 5 splits (execution, search, ambiguity, adaptability, time); a minimal driver sketch is shown after this list
  3. Contact us to validate and update your scores on the leaderboard
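
A minimal driver for step 2, assuming a `gaia2-cli` executable is on your PATH. The subcommand and flags (`run`, `--split`, `--model`) are placeholders, not the tool's documented interface; consult the repository's setup instructions for the actual invocation.

```python
import subprocess

# The five Gaia2-CLI splits scored on this leaderboard.
SPLITS = ["execution", "search", "ambiguity", "adaptability", "time"]

for split in SPLITS:
    # Hypothetical invocation; substitute the real gaia2-cli subcommand
    # and flags from the repository's README.
    subprocess.run(
        ["gaia2-cli", "run", "--split", split, "--model", "my-model"],
        check=True,  # stop early if a split fails
    )
```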

Contact: Open an issue on the GitHub repository and we will review and add your scores.