Bilkent AI's Last Exam

MATH

Comparing mathematical problem-solving capabilities across different AI models

New benchmarks coming soon. Stay tuned (agents?)

Model Performance Leaderboard

Rank  Model                                   Accuracy  Correct  Total
#1    O3 Mini High                            79.5%     35       44
#2    O1 High                                 75.0%     33       44
#3    Deepseek_deepseek R1                    63.6%     28       44
#4    O1 Mini                                 50.0%     21       42
#5    Gemini 2.0 Flash Thinking Exp 1219      47.6%     20       42
#6    Gemini 2.0 Flash Exp                    45.2%     19       42
#7    Qwen_qwq 32b Preview                    33.3%     14       42
#8    Gemini Exp 1206                         31.0%     13       42
#9    Deepseek_deepseek Chat V3               28.6%     12       42
#10   Deepseek_deepseek R1 Distill Llama 70b  25.0%     11       44
#11   X Ai_grok 2 1212                        21.4%     9        42
#12   Eurus 2 7b Prime Q8 0                   19.0%     8        42
#13   Qwen_qvq 72b Preview                    19.0%     8        42
#14   Qwen_Qwen2.5 72B Instruct               16.7%     7        42
#15   Minimax_minimax 01                      15.9%     7        44
#16   Gpt 4o                                  14.3%     6        42
#17   Mistralai_mistral Large 2411            14.3%     6        42
#18   Phi 4 Q4 K M                            11.9%     5        42
#19   Anthropic_claude 3.5 Sonnet             11.4%     5        44
#20   Cohere_command R Plus 08 2024           9.5%      4        42
#21   PowerInfer_SmallThinker 3B Preview      7.1%      3        42
#22   Amazon_nova Pro V1                      7.1%      3        42
#23   Liquid_lfm 40b                          7.1%      3        42
#24   Qwen_qwen Max                           6.8%      3        44
#25   Qwen_qwen 2.5 7b Instruct               4.8%      2        42
[Chart: per-model results for the 25 models in the leaderboard; axis ticks run from 1 to 44.]

About this Benchmark
"LLMs haven't met Tekman yet." (joke :D)

This benchmark evaluates the mathematical problem-solving capabilities of different AI models. It consists of a range of mathematical problems that test each model's ability to:

  • Understand and interpret mathematical questions
  • Apply correct mathematical concepts and formulas
  • Provide accurate numerical answers
  • Show detailed solution steps (where applicable)

Note: The actual questions are not displayed to maintain the integrity of the benchmark. Only aggregate performance metrics are shown.
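
As a rough illustration of how the aggregate metrics in the leaderboard seem to be derived, the sketch below assumes each percentage is the number of correct answers divided by the number of questions that model was scored on (42 or 44), rounded to one decimal place. The page does not state this explicitly; it is an inference that matches the listed figures. The function names accuracy_percent and rank are hypothetical, and the rows are a small sample copied from the leaderboard.

```python
from fractions import Fraction

# A few rows copied from the leaderboard: (model, correct, total).
# Reading the two trailing numbers as "correct" and "total" is an assumption.
ROWS = [
    ("Anthropic_claude 3.5 Sonnet", 5, 44),
    ("O1 Mini", 21, 42),
    ("Gpt 4o", 6, 42),
    ("O3 Mini High", 35, 44),
]


def accuracy_percent(correct: int, total: int) -> float:
    """Return correct/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * correct / total, 1)


def rank(rows):
    """Sort rows by exact accuracy, highest first; ties keep input order."""
    # Fraction avoids float comparison issues when totals differ (42 vs 44).
    return sorted(rows, key=lambda r: Fraction(r[1], r[2]), reverse=True)


if __name__ == "__main__":
    for position, (model, correct, total) in enumerate(rank(ROWS), start=1):
        print(f"#{position} {model}: {accuracy_percent(correct, total)}% ({correct}/{total})")
```

Because the totals differ between models, equal counts of correct answers do not imply equal accuracy: 7 of 42 gives 16.7%, while 7 of 44 gives 15.9%, which is consistent with the ordering of ranks #14 and #15.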