Model Performance Leaderboard
Rank | Model | Accuracy | Correct | Total |
---|---|---|---|---|
#1 | O3 Mini High | 79.5% | 35 | 44 |
#2 | O1 High | 75.0% | 33 | 44 |
#3 | Deepseek_deepseek R1 | 63.6% | 28 | 44 |
#4 | O1 Mini | 50.0% | 21 | 42 |
#5 | Gemini 2.0 Flash Thinking Exp 1219 | 47.6% | 20 | 42 |
#6 | Gemini 2.0 Flash Exp | 45.2% | 19 | 42 |
#7 | Qwen_qwq 32b Preview | 33.3% | 14 | 42 |
#8 | Gemini Exp 1206 | 31.0% | 13 | 42 |
#9 | Deepseek_deepseek Chat V3 | 28.6% | 12 | 42 |
#10 | Deepseek_deepseek R1 Distill Llama 70b | 25.0% | 11 | 44 |
#11 | X Ai_grok 2 1212 | 21.4% | 9 | 42 |
#12 | Eurus 2 7b Prime Q8 0 | 19.0% | 8 | 42 |
#13 | Qwen_qvq 72b Preview | 19.0% | 8 | 42 |
#14 | Qwen_Qwen2.5 72B Instruct | 16.7% | 7 | 42 |
#15 | Minimax_minimax 01 | 15.9% | 7 | 44 |
#16 | Gpt 4o | 14.3% | 6 | 42 |
#17 | Mistralai_mistral Large 2411 | 14.3% | 6 | 42 |
#18 | Phi 4 Q4 K M | 11.9% | 5 | 42 |
#19 | Anthropic_claude 3.5 Sonnet | 11.4% | 5 | 44 |
#20 | Cohere_command R Plus 08 2024 | 9.5% | 4 | 42 |
#21 | PowerInfer_SmallThinker 3B Preview | 7.1% | 3 | 42 |
#22 | Amazon_nova Pro V1 | 7.1% | 3 | 42 |
#23 | Liquid_lfm 40b | 7.1% | 3 | 42 |
#24 | Qwen_qwen Max | 6.8% | 3 | 44 |
#25 | Qwen_qwen 2.5 7b Instruct | 4.8% | 2 | 42 |
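The Accuracy column is simply Correct divided by Total; note that totals vary between runs, with some models evaluated on 42 questions and others on 44. A minimal Python sketch reproducing the top three rows follows; the list literal is illustrative, not the benchmark's actual data format.

```python
# Recompute the Accuracy column from (model, correct, total) triples.
# These three entries are copied from the leaderboard above; the data
# structure itself is illustrative, not the benchmark's actual format.
results = [
    ("O3 Mini High", 35, 44),
    ("O1 High", 33, 44),
    ("Deepseek_deepseek R1", 28, 44),
]

for rank, (model, correct, total) in enumerate(results, start=1):
    accuracy = 100 * correct / total
    print(f"#{rank} | {model} | {accuracy:.1f}% | {correct} | {total}")
```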
About this Benchmark"LLMs haven't met Tekman yet." (joke :D)
This benchmark evaluates mathematical problem-solving across different models. It consists of a set of mathematical problems that test each model's ability to:
- Understand and interpret mathematical questions
- Apply correct mathematical concepts and formulas
- Provide accurate numerical answers
- Show detailed solution steps (where applicable)
Note: The actual questions are not displayed to maintain the integrity of the benchmark. Only aggregate performance metrics are shown.
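The grading procedure is likewise not published, so the sketch below is only an assumption of how a numerical-answer benchmark like this one might score a response: compare the model's final number against the reference value within a small tolerance. The function name and tolerance are hypothetical.

```python
import math

# Hypothetical scorer (the benchmark's real grading code is not public):
# a response counts as correct when its final numerical answer matches
# the reference answer within a small relative tolerance.
def is_correct(model_answer: float, reference: float, rel_tol: float = 1e-6) -> bool:
    return math.isclose(model_answer, reference, rel_tol=rel_tol)

# Example usage with an assumed reference answer of 1/3:
print(is_correct(0.3333333333, 1 / 3))  # True  (within tolerance)
print(is_correct(0.33, 1 / 3))          # False (too imprecise)
```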