| Benchmark | DeepSeek-R1 (%) | OpenAI o1-1217 (%) | Verdict |
| --- | --- | --- | --- |
| AIME 2024 (Pass@1) | 79.8 | 79.2 | DeepSeek-R1 wins (better math problem-solving) |
| Codeforces (Percentile) | 96.3 | 96.6 | OpenAI o1-1217 wins (better competitive coding) |
| GPQA Diamond (Pass@1) | 71.5 | 75.7 | OpenAI o1-1217 wins (better general QA performance) |
| MATH-500 (Pass@1) | 97.3 | 96.4 | DeepSeek-R1 wins (stronger math reasoning) |
| MMLU (Pass@1) | 90.8 | 91.8 | OpenAI o1-1217 wins (better general knowledge understanding) |
| SWE-bench Verified (Resolved) | 49.2 | 48.9 | DeepSeek-R1 wins (better software engineering task handling) |
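
Each verdict comes from comparing the two scores for a benchmark. As a minimal sketch (the scores are copied from the table above; `SCORES`, `verdict`, and `results` are hypothetical names chosen for illustration), the winner and the size of the gap can be recomputed like this:

```python
# Scores copied from the comparison table: (DeepSeek-R1, OpenAI o1-1217).
# The names SCORES, verdict, and results are illustrative assumptions.
SCORES = {
    "AIME 2024 (Pass@1)": (79.8, 79.2),
    "Codeforces (Percentile)": (96.3, 96.6),
    "GPQA Diamond (Pass@1)": (71.5, 75.7),
    "MATH-500 (Pass@1)": (97.3, 96.4),
    "MMLU (Pass@1)": (90.8, 91.8),
    "SWE-bench Verified (Resolved)": (49.2, 48.9),
}


def verdict(r1_score, o1_score):
    """Return (winner, margin in percentage points) for one benchmark."""
    # Round to one decimal to hide floating-point noise in the subtraction.
    margin = round(abs(r1_score - o1_score), 1)
    if r1_score > o1_score:
        winner = "DeepSeek-R1"
    elif o1_score > r1_score:
        winner = "OpenAI o1-1217"
    else:
        winner = "tie"
    return winner, margin


results = {name: verdict(*pair) for name, pair in SCORES.items()}

for name, (winner, margin) in results.items():
    print(f"{name}: {winner} leads by {margin} points")
```

Read this way, the two models split the six benchmarks three to three, and every gap except GPQA Diamond (4.2 points) is one point or less.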