Bilkent AI's Last Exam Benchmark

Model Performance Leaderboard


Rank	Model	Accuracy	Correct	Total
#1	O3 Mini High	79.5%	35	44
#2	O1 High	75.0%	33	44
#3	Deepseek_deepseek R1	63.6%	28	44
#4	O1 Mini	50.0%	21	42
#5	Gemini 2.0 Flash Thinking Exp 1219	47.6%	20	42
#6	Gemini 2.0 Flash Exp	45.2%	19	42
#7	Qwen_qwq 32b Preview	33.3%	14	42
#8	Gemini Exp 1206	31.0%	13	42
#9	Deepseek_deepseek Chat V3	28.6%	12	42
#10	Deepseek_deepseek R1 Distill Llama 70b	25.0%	11	44
#11	X Ai_grok 2 1212	21.4%	9	42
#12	Eurus 2 7b Prime Q8 0	19.0%	8	42
#13	Qwen_qvq 72b Preview	19.0%	8	42
#14	Qwen_Qwen2.5 72B Instruct	16.7%	7	42
#15	Minimax_minimax 01	15.9%	7	44
#16	Gpt 4o	14.3%	6	42
#17	Mistralai_mistral Large 2411	14.3%	6	42
#18	Phi 4 Q4 K M	11.9%	5	42
#19	Anthropic_claude 3.5 Sonnet	11.4%	5	44
#20	Cohere_command R Plus 08 2024	9.5%	4	42
#21	PowerInfer_SmallThinker 3B Preview	7.1%	3	42
#22	Amazon_nova Pro V1	7.1%	3	42
#23	Liquid_lfm 40b	7.1%	3	42
#24	Qwen_qwen Max	6.8%	3	44
#25	Qwen_qwen 2.5 7b Instruct	4.8%	2	42
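
The accuracy column appears to be the number of correct answers divided by the number of problems scored for that model, and the totals differ between models (42 or 44). The following is a minimal sketch of that arithmetic, not the leaderboard's actual scoring code, which is not shown. The helper name accuracy_percent and the rows list are hypothetical, and the sample rows are copied from the table above.

```python
# Hypothetical helper: the leaderboard's real scoring code is not published.
def accuracy_percent(correct: int, total: int) -> float:
    """Return correct/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 1)


# (model, correct, total) copied from a few rows of the table above.
rows = [
    ("Gpt 4o", 6, 42),
    ("O3 Mini High", 35, 44),
    ("Minimax_minimax 01", 7, 44),
    ("Qwen_Qwen2.5 72B Instruct", 7, 42),
]

# Sort by the accuracy ratio rather than the raw correct count,
# because the number of scored problems differs between models.
ranked = sorted(rows, key=lambda r: r[1] / r[2], reverse=True)

for rank, (model, correct, total) in enumerate(ranked, start=1):
    pct = accuracy_percent(correct, total)
    print(f"#{rank}\t{model}\t{pct:.1f}%\t{correct}\t{total}")
```

Sorting by ratio matters here: two models with 7 correct answers land on different ranks because one was scored on 42 problems and the other on 44.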

About this Benchmark

"LLMs haven't met Tekman yet." (joke :D)

This benchmark evaluates the mathematical problem-solving capabilities of different models. It consists of a set of mathematical problems (42 or 44 scored per model in the leaderboard above) that test each model's ability to:

• Understand and interpret mathematical questions
• Apply correct mathematical concepts and formulas
• Provide accurate numerical answers
• Show detailed solution steps (where applicable)

Note: The actual questions are not displayed to maintain the integrity of the benchmark. Only aggregate performance metrics are shown.
