Gaia2 Leaderboard 🏆

Gaia2 is a benchmark designed to measure general agent capabilities. Beyond traditional search and execution tasks, Gaia2 runs asynchronously, requiring agents to handle ambiguities, adapt to dynamic environments, and operate under temporal constraints.

Gaia2 evaluates agents across the following dimensions: Execution (instruction following, multi-step tool-use), Search (information retrieval), Ambiguity (handling unclear or incomplete instructions), Adaptability (responding to dynamic environment changes), and Time (managing temporal constraints and scheduling).

⚠️ All scores on this page are self-reported by the submitting teams.

Gaia2-CLI Leaderboard

6 models · higher is better · to test your own model or harness, see "Run the Benchmark" below

| Rank | Model | Provider | Harness | pass@1 | Search | Execution | Adaptability | Ambiguity | Time | Date |
|------|-------|----------|---------|--------|--------|-----------|--------------|-----------|------|------|
| 🥇 | Claude Opus 4.6 (high) | Anthropic | OpenClaw 2026.4.1 | 57.0% | 88.1% | 82.9% | 61.9% | 48.3% | 3.8% | 2026-04-13 |
| 🥈 | GPT-5.4 (high) | OpenAI | OpenClaw 2026.4.1 | 55.6% | 94.8% | 78.8% | 54.8% | 47.3% | 2.5% | 2026-04-13 |
| 🥉 | Gemini 3.1 Pro (high) | Google | OpenClaw 2026.4.1 | 52.0% | 92.8% | 78.6% | 45.9% | 40.6% | 2.1% | 2026-04-14 |
| 4 | Claude Sonnet 4.6 (high) | Anthropic | OpenClaw 2026.4.1 | 51.9% | 82.5% | 75.8% | 55.7% | 40.4% | 5.0% | 2026-04-13 |
| 5 | GLM 5.1 (enabled) | OpenRouter* | OpenClaw 2026.4.1 | 50.5% | 83.8% | 71.2% | 56.9% | 39.4% | 1.2% | 2026-04-13 |
| 6 | Kimi-K2.5 (enabled) | OpenRouter* | OpenClaw 2026.4.1 | 34.0% | 62.2% | 47.0% | 43.4% | 16.6% | 0.8% | 2026-04-14 |
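
In every row, pass@1 matches (to rounding) the unweighted mean of the five split scores. A quick sanity check in Python, using the Claude Opus 4.6 row:

```python
# pass@1 is the unweighted mean of the five split scores;
# values below are the Claude Opus 4.6 row from the table above.
splits = {
    "search": 88.1,
    "execution": 82.9,
    "adaptability": 61.9,
    "ambiguity": 48.3,
    "time": 3.8,
}
pass_at_1 = sum(splits.values()) / len(splits)
print(f"pass@1 = {pass_at_1:.1f}%")  # -> pass@1 = 57.0%
```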

* Accessed via OpenRouter. The harness does not round-trip reasoning context between turns for this provider, which may affect multi-step performance.
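
To illustrate the footnote: "round-tripping reasoning context" means re-sending the model's reasoning traces from earlier turns on each subsequent request. The sketch below is schematic only; the `reasoning` field and the helper function are hypothetical, not a real provider schema or part of the harness.

```python
# Schematic illustration of "round-tripping reasoning context".
# The "reasoning" field and this helper are hypothetical, not a real
# provider schema or part of the OpenClaw harness.

def build_next_request(history: list[dict], round_trip_reasoning: bool) -> list[dict]:
    """Return the message history to send on the next turn.

    When the harness does not round-trip reasoning, prior reasoning
    traces are stripped, so later turns cannot build on them, which
    can hurt multi-step performance.
    """
    if round_trip_reasoning:
        return history
    return [{k: v for k, v in m.items() if k != "reasoning"} for m in history]

history = [
    {"role": "user", "content": "Book the earliest free slot tomorrow."},
    {
        "role": "assistant",
        "content": "Checking the calendar...",
        "reasoning": "turn-1 chain of thought (dropped if not round-tripped)",
    },
]
print(build_next_request(history, round_trip_reasoning=False))
```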


Vanilla Gaia2 Leaderboard

The original benchmark, which additionally includes the noise and agent-to-agent splits


Run the Benchmark

  1. Clone the gaia2-cli repository and follow the setup instructions
  2. Run the benchmark on all 5 splits (execution, search, ambiguity, adaptability, time); a minimal driver sketch is shown after this list
  3. Contact us to validate and update your scores on the leaderboard
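
A minimal driver for step 2, assuming a `gaia2-cli` executable is on your PATH. The subcommand and flags (`run`, `--split`, `--model`) are placeholders, not the tool's documented interface; consult the repository's setup instructions for the actual invocation.

```python
import subprocess

# The five Gaia2-CLI splits scored on this leaderboard.
SPLITS = ["execution", "search", "ambiguity", "adaptability", "time"]

for split in SPLITS:
    # Hypothetical invocation; substitute the real gaia2-cli subcommand
    # and flags from the repository's README.
    subprocess.run(
        ["gaia2-cli", "run", "--split", split, "--model", "my-model"],
        check=True,  # stop early if a split fails
    )
```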

Contact: Open an issue on the GitHub repository and we will review and add your scores.