Benchmark Scores (Difficulty-Weighted Accuracy) on PolyMath. "*" marks a reasoning model; models without "*" are non-reasoning models.
# | Model | Snapshot | Overall | en | zh | ar | bn | de | es | fr | id | it | ja | ko | ms | pt | ru | sw | te | th | vi |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | *Qwen-3-235B-A22B-Thinking🥇 | --- | 54.6 | 53.3 | 54.0 | 50.5 | 52.5 | 55.8 | 54.1 | 53.7 | 55.4 | 57.3 | 57.4 | 53.5 | 57.4 | 52.7 | 55.7 | 58.3 | 53.1 | 53.6 | 55.1 |
2 | *Gemini-2.5-pro-preview🥈 | 2025-03-25 | 52.2 | 50.8 | 51.4 | 49.4 | 49.5 | 54.5 | 52.9 | 52.6 | 53.8 | 54.4 | 52.8 | 51.3 | 54.1 | 55.4 | 50.0 | 52.2 | 52.3 | 50.7 | 52.1 |
3 | *Deepseek-R1-671B🥉 | --- | 47.0 | 46.5 | 46.4 | 45.5 | 43.3 | 47.4 | 45.2 | 45.9 | 49.9 | 51.0 | 47.3 | 46.9 | 50.1 | 46.0 | 42.2 | 49.1 | 48.9 | 44.7 | 50.2 |
4 | *Qwen-QwQ-32B | --- | 45.9 | 44.9 | 48.7 | 41.0 | 38.5 | 52.1 | 47.5 | 41.3 | 51.0 | 52.2 | 49.4 | 47.3 | 49.3 | 39.9 | 40.0 | 48.7 | 43.6 | 41.3 | 49.2 |
5 | *OpenAI-o3-mini-medium | 2025-01-31 | 38.6 | 36.6 | 38.8 | 42.1 | 35.0 | 40.9 | 40.4 | 40.5 | 40.9 | 37.9 | 37.7 | 39.5 | 39.2 | 40.8 | 39.5 | 36.1 | 31.2 | 37.4 | 40.3 |
6 | *Gemini-2.0-flash-thinking | 2025-01-21 | 37.1 | 40.4 | 37.0 | 36.5 | 35.4 | 38.4 | 39.8 | 38.8 | 37.7 | 40.5 | 33.0 | 33.9 | 37.9 | 38.6 | 38.3 | 36.7 | 31.8 | 37.6 | 35.3 |
7 | *OpenAI-o1-mini | 2024-09-12 | 36.6 | 37.7 | 38.7 | 37.9 | 35.1 | 36.6 | 38.0 | 36.6 | 35.4 | 39.0 | 38.5 | 36.8 | 35.2 | 36.4 | 36.9 | 35.3 | 32.6 | 36.0 | 35.5 |
8 | *Claude-3.7-sonnet-thinking | 2025-02-19 | 33.5 | 35.7 | 32.6 | 34.6 | 33.7 | 32.3 | 30.2 | 32.7 | 34.0 | 34.6 | 34.2 | 32.8 | 34.8 | 31.8 | 35.6 | 34.2 | 30.9 | 36.0 | 31.5 |
9 | *Grok-3-mini | 2025-02-17 | 32.3 | 29.9 | 33.0 | 29.5 | 27.7 | 37.2 | 33.9 | 31.2 | 32.5 | 36.4 | 33.8 | 32.3 | 33.1 | 30.9 | 33.1 | 31.6 | 31.0 | 33.8 | 30.5 |
10 | GPT-4.5-Preview | 2025-02-27 | 26.9 | 31.1 | 28.6 | 26.5 | 25.3 | 25.6 | 26.9 | 28.1 | 27.5 | 29.0 | 26.6 | 25.4 | 27.4 | 27.8 | 29.3 | 25.0 | 23.1 | 24.1 | 27.8 |
11 | ChatGPT-4o-latest | 2025-03-26 | 24.3 | 27.9 | 26.9 | 23.0 | 23.1 | 24.7 | 25.4 | 26.7 | 24.2 | 27.0 | 21.8 | 22.9 | 24.6 | 23.6 | 25.4 | 22.0 | 22.4 | 21.7 | 23.6 |
12 | Qwen2.5-Math-72B-Instruct | --- | 21.0 | 21.2 | 22.0 | 22.5 | 20.9 | 22.0 | 21.8 | 23.6 | 19.4 | 22.0 | 20.2 | 20.6 | 21.8 | 22.0 | 19.7 | 17.5 | 17.9 | 20.9 | 21.3 |
13 | Deepseek-v3 | 2024-12-26 | 20.4 | 21.5 | 21.1 | 21.5 | 17.6 | 20.1 | 22.1 | 22.6 | 20.2 | 22.6 | 21.0 | 19.6 | 21.3 | 20.4 | 21.0 | 19.0 | 15.7 | 20.3 | 20.3 |
14 | Claude-3.7-sonnet | 2025-02-19 | 19.7 | 22.3 | 21.2 | 21.0 | 17.2 | 20.5 | 20.6 | 20.9 | 21.2 | 20.6 | 17.2 | 16.8 | 20.8 | 19.8 | 18.8 | 19.3 | 17.3 | 19.0 | 20.5 |
15 | Qwen-2.5-Max | --- | 19.4 | 22.1 | 17.4 | 20.8 | 16.1 | 20.1 | 21.2 | 20.8 | 20.8 | 20.2 | 17.1 | 19.0 | 20.6 | 20.6 | 20.1 | 17.3 | 16.9 | 18.1 | 19.3 |
16 | Qwen2.5-72B-Instruct | --- | 16.9 | 19.8 | 18.1 | 15.6 | 17.7 | 16.3 | 19.6 | 14.7 | 19.3 | 18.8 | 15.7 | 16.6 | 16.7 | 17.7 | 17.7 | 11.3 | 13.3 | 17.4 | 18.6 |
17 | Llama-3.3-70B-Instruct | --- | 11.5 | 21.0 | 9.9 | 9.7 | 6.5 | 11.1 | 11.2 | 11.0 | 12.5 | 14.4 | 7.0 | 8.4 | 14.5 | 16.0 | 12.1 | 10.5 | 8.3 | 10.9 | 12.1 |
en: English   zh: Chinese   ar: Arabic   bn: Bengali   de: German   es: Spanish   fr: French   id: Indonesian   it: Italian  
ja: Japanese   ko: Korean   ms: Malay   pt: Portuguese   ru: Russian   sw: Swahili   te: Telugu   th: Thai   vi: Vietnamese  
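To illustrate how a difficulty-weighted accuracy differs from a plain average, here is a minimal sketch: per-level accuracies are combined as a weighted mean, with harder levels given larger weights. The level names (`low`/`medium`/`high`/`top`) and the weights below are illustrative assumptions, not the official PolyMath weighting; see the paper for the exact definition.

```python
def difficulty_weighted_accuracy(acc_by_level, weight_by_level):
    """Weighted mean of per-level accuracies.

    Both arguments are dicts keyed by difficulty level; weights need not
    sum to 1 (they are normalized here).
    """
    total_weight = sum(weight_by_level.values())
    return sum(acc_by_level[lvl] * weight_by_level[lvl]
               for lvl in acc_by_level) / total_weight

# Hypothetical example: harder levels count more toward the overall score.
acc = {"low": 0.90, "medium": 0.70, "high": 0.40, "top": 0.20}
weights = {"low": 1, "medium": 2, "high": 4, "top": 8}
print(round(difficulty_weighted_accuracy(acc, weights), 4))  # → 0.3667
```

Under such a weighting, a model that solves mostly easy problems scores noticeably lower than its unweighted accuracy (here 0.3667 vs. an unweighted mean of 0.55), which is the point of difficulty weighting.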
🚨 We keep the leaderboard results up to date!
🚨 For more evaluation details, please refer to our paper.