PolyMath

Evaluating Mathematical Reasoning in Multilingual Contexts

¹Qwen Team, Alibaba Group
²Shanghai Jiao Tong University

Overall benchmark scores of various advanced LLMs on PolyMath.

Introduction

Mathematics is a fundamental domain for evaluating the reasoning intelligence of LLMs. However, the deeper relationship between “language” and “reasoning” remains underexplored. Popular multilingual mathematical datasets are too simple to effectively evaluate the reasoning capabilities of advanced LLMs: multilingual mathematical reasoning benchmarks have not kept pace with progress in LLM reasoning, as most advanced and challenging benchmarks are available only in English. A challenging multilingual mathematical reasoning benchmark is therefore crucial for studying multilingual reasoning abilities in the current era of LLMs.

To bridge this gap, we build PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels, with 9,000 samples in total. The benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs.

With PolyMath, we conduct extensive experiments on advanced non-reasoning and reasoning LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro-preview achieve benchmark scores of only 54.6 and 52.2, respectively. Crucially, reasoning performance varies widely across languages, with differences of up to 10 points even when overall accuracy is low. Beyond raw performance, we further examine the consistency between input and output languages, the variation in thinking length across languages, and the impact of constraining the model's response language on performance. These analyses offer insights into the slow-thinking pattern in multilingual contexts. We hope PolyMath serves as a strong benchmark to advance research on multilingual mathematical reasoning.
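As an illustration of this cross-language variation, the spread between a model's strongest and weakest language can be computed directly from per-language scores. The sketch below uses the Qwen-QwQ-32B row from the leaderboard table:

```python
# Per-language benchmark scores for Qwen-QwQ-32B, taken from the leaderboard.
scores = {
    "en": 44.9, "zh": 48.7, "ar": 41.0, "bn": 38.5, "de": 52.1, "es": 47.5,
    "fr": 41.3, "id": 51.0, "it": 52.2, "ja": 49.4, "ko": 47.3, "ms": 49.3,
    "pt": 39.9, "ru": 40.0, "sw": 48.7, "te": 43.6, "th": 41.3, "vi": 49.2,
}

best = max(scores, key=scores.get)   # strongest language
worst = min(scores, key=scores.get)  # weakest language
spread = scores[best] - scores[worst]
print(f"best: {best} ({scores[best]}), worst: {worst} ({scores[worst]}), "
      f"spread: {spread:.1f}")
# → best: it (52.2), worst: bn (38.5), spread: 13.7
```

For this model the gap between Italian and Bengali exceeds 13 points, which makes the per-language columns in the leaderboard worth reading alongside the overall score.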

Leaderboard on PolyMath

Benchmark Scores (Difficulty-Weighted Accuracy) of PolyMath. “*” marks a reasoning model; all others are non-reasoning models.

# Model Snapshot Overall en zh ar bn de es fr id it ja ko ms pt ru sw te th vi
1 *Qwen-3-235B-A22B-Thinking🥇 --- 54.6 53.3 54.0 50.5 52.5 55.8 54.1 53.7 55.4 57.3 57.4 53.5 57.4 52.7 55.7 58.3 53.1 53.6 55.1
2 *Gemini-2.5-pro-preview🥈 2025-03-25 52.2 50.8 51.4 49.4 49.5 54.5 52.9 52.6 53.8 54.4 52.8 51.3 54.1 55.4 50.0 52.2 52.3 50.7 52.1
3 *Deepseek-R1-671B🥉 --- 47.0 46.5 46.4 45.5 43.3 47.4 45.2 45.9 49.9 51.0 47.3 46.9 50.1 46.0 42.2 49.1 48.9 44.7 50.2
4 *Qwen-QwQ-32B --- 45.9 44.9 48.7 41.0 38.5 52.1 47.5 41.3 51.0 52.2 49.4 47.3 49.3 39.9 40.0 48.7 43.6 41.3 49.2
5 *OpenAI-o3-mini-medium 2025-01-31 38.6 36.6 38.8 42.1 35.0 40.9 40.4 40.5 40.9 37.9 37.7 39.5 39.2 40.8 39.5 36.1 31.2 37.4 40.3
6 *Gemini-2.0-flash-thinking 2025-01-21 37.1 40.4 37.0 36.5 35.4 38.4 39.8 38.8 37.7 40.5 33.0 33.9 37.9 38.6 38.3 36.7 31.8 37.6 35.3
7 *OpenAI-o1-mini 2024-09-12 36.6 37.7 38.7 37.9 35.1 36.6 38.0 36.6 35.4 39.0 38.5 36.8 35.2 36.4 36.9 35.3 32.6 36.0 35.5
8 *Claude-3.7-sonnet-thinking 2025-02-19 33.5 35.7 32.6 34.6 33.7 32.3 30.2 32.7 34.0 34.6 34.2 32.8 34.8 31.8 35.6 34.2 30.9 36.0 31.5
9 *Grok-3-mini 2025-02-17 32.3 29.9 33.0 29.5 27.7 37.2 33.9 31.2 32.5 36.4 33.8 32.3 33.1 30.9 33.1 31.6 31.0 33.8 30.5
10 GPT-4.5-Preview 2025-02-27 26.9 31.1 28.6 26.5 25.3 25.6 26.9 28.1 27.5 29.0 26.6 25.4 27.4 27.8 29.3 25.0 23.1 24.1 27.8
11 ChatGPT-4o-latest 2025-03-26 24.3 27.9 26.9 23.0 23.1 24.7 25.4 26.7 24.2 27.0 21.8 22.9 24.6 23.6 25.4 22.0 22.4 21.7 23.6
12 Qwen2.5-Math-72B-Instruct --- 21.0 21.2 22.0 22.5 20.9 22.0 21.8 23.6 19.4 22.0 20.2 20.6 21.8 22.0 19.7 17.5 17.9 20.9 21.3
13 Deepseek-v3 2024-12-26 20.4 21.5 21.1 21.5 17.6 20.1 22.1 22.6 20.2 22.6 21.0 19.6 21.3 20.4 21.0 19.0 15.7 20.3 20.3
14 Claude-3.7-sonnet 2025-02-19 19.7 22.3 21.2 21.0 17.2 20.5 20.6 20.9 21.2 20.6 17.2 16.8 20.8 19.8 18.8 19.3 17.3 19.0 20.5
15 Qwen-2.5-Max --- 19.4 22.1 17.4 20.8 16.1 20.1 21.2 20.8 20.8 20.2 17.1 19.0 20.6 20.6 20.1 17.3 16.9 18.1 19.3
16 Qwen2.5-72B-Instruct --- 16.9 19.8 18.1 15.6 17.7 16.3 19.6 14.7 19.3 18.8 15.7 16.6 16.7 17.7 17.7 11.3 13.3 17.4 18.6
17 Llama-3.3-70B-Instruct --- 11.5 21.0 9.9 9.7 6.5 11.1 11.2 11.0 12.5 14.4 7.0 8.4 14.5 16.0 12.1 10.5 8.3 10.9 12.1
en: English    zh: Chinese    ar: Arabic    bn: Bengali    de: German    es: Spanish    fr: French    id: Indonesian    it: Italian   
ja: Japanese    ko: Korean    ms: Malay    pt: Portuguese    ru: Russian    sw: Swahili    te: Telugu    th: Thai    vi: Vietnamese   
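The overall score is a “Difficulty-Weighted Accuracy” that aggregates per-level accuracies with weights that grow with difficulty. A minimal sketch of this kind of aggregation is below; the doubling weights are an illustrative assumption, not necessarily the paper's exact weighting:

```python
def difficulty_weighted_accuracy(level_accuracies, weights=(1, 2, 4, 8)):
    """Weighted average of per-level accuracies, ordered low -> high difficulty.

    The doubling weights are an illustrative assumption: harder levels
    count more toward the overall score.
    """
    assert len(level_accuracies) == len(weights)
    total = sum(w * a for w, a in zip(weights, level_accuracies))
    return total / sum(weights)

# Example: accuracies for the four levels, from easiest to hardest.
print(difficulty_weighted_accuracy([0.9, 0.7, 0.5, 0.3]))
```

Under this weighting, a model that only solves easy problems scores far lower than one with the same average accuracy concentrated on hard problems.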

🚨 We keep updating the results in the leaderboard!

🚨 For more evaluation details, please refer to our paper.

PolyMath Dataset

Overview

PolyMath is designed with the following key principles:

  • Broad Difficulty Range: We define and partition difficulty levels in mathematics along two key dimensions: Thought Depth and Knowledge Breadth. Thought Depth is analogous to a person's reasoning ability (IQ), while Knowledge Breadth reflects the extent of their mathematical knowledge.
  • Language Diversity: Each problem in PolyMath is available in 18 parallel language versions, encompassing over 75% of the world’s native speakers and major language families, ensuring diversity across both high-resource and low-resource languages.
  • High-Quality Translations: Each translation is calibrated by language experts rather than taken directly from LLM output, ensuring precise terminology and logical clarity.
In total, PolyMath includes 9,000 examples (125 problems × 4 levels × 18 languages), collected by incorporating existing publicly available benchmarks and scraping official repositories on the internet.

You can download the dataset from Hugging Face Datasets.


Broad Difficulty Range: Difficulty level partition in PolyMath and corresponding explanations.


Language Diversity: Detailed information on all 18 languages supported by PolyMath.


High-Quality Translations: The number of samples, per language and level, where annotators found content errors (content disagreement) or fluency issues (fluency disagreement) in GPT-4o's pre-translation results.

Examples

Examples of each language version in PolyMath


Illustration and question-answer examples from the PolyMath benchmark.

Statistics

Notable statistics of PolyMath

BibTeX


    @misc{wang2025polymath,
          title={PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts}, 
          author={Yiming Wang and Pei Zhang and Jialong Tang and Haoran Wei and Baosong Yang and Rui Wang and Chenshu Sun and Feitong Sun and Jiran Zhang and Junxuan Wu and Qiqian Cang and Yichang Zhang and Fei Huang and Junyang Lin and Fei Huang and Jingren Zhou},
          year={2025},
          eprint={2504.18428},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2504.18428}, 
    }