OpenAI o3 is again a big step up from o3-mini

OpenAI's o3: A Leap Forward in Reasoning Capabilities

OpenAI's o3, announced in December 2024, is the successor to the o1 series and reportedly marks a significant leap forward in AI reasoning capabilities. OpenAI claims that o3 excels particularly at complex programming challenges and mathematical problem-solving, with significant performance gains over its predecessor, o1.

Performance on Benchmarks

o3 has reportedly achieved impressive results on several benchmarks:

  • Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI): o3 achieved nearly 90% accuracy on ARC-AGI, almost three times the score of the o1 models — a significant advance in OpenAI's model development.
  • FrontierMath Benchmark: o3 recorded 25% accuracy on FrontierMath, a massive leap from the previous best of about 2%, making it a standout in mathematical reasoning. This benchmark is particularly significant because it consists of novel, unpublished problems designed to be more challenging than standard datasets; many are at the level of mathematical research, pushing models beyond rote memorization and testing their ability to generalize and reason abstractly.
  • Codeforces Coding Test: o3 leads with a rating of 2727, significantly outperforming its predecessor o1 (1891) and DeepSeek's R1 (2029), demonstrating its enhanced coding proficiency.
  • SWE-bench Verified Benchmark: o3 scored 71.7%, surpassing DeepSeek R1 (49.2%) and OpenAI's o1 (48.9%), highlighting its strength on real-world software engineering problems.
  • American Invitational Mathematics Examination (AIME) Benchmark: o3 achieved 96.7% accuracy, outpacing DeepSeek R1 (79.8%) and OpenAI's o1 (78%), underscoring its exceptional mathematical reasoning.
  • Graduate-Level Google-Proof Q&A (GPQA) Benchmark: o3 scored 87.7% on GPQA-Diamond, significantly outperforming OpenAI's o1 (76.0%) and DeepSeek R1 (71.5%), indicating superior performance on graduate-level science questions.
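The head-to-head numbers reported above can be put side by side with a short script (scores are transcribed from the list; the dictionary layout and helper function are just for illustration):

```python
# Reported benchmark scores, transcribed from the list above.
# Codeforces is an Elo-style rating; the other rows are percentage accuracies.
SCORES = {
    "Codeforces (rating)":  {"o3": 2727, "o1": 1891, "DeepSeek R1": 2029},
    "SWE-bench Verified %": {"o3": 71.7, "o1": 48.9, "DeepSeek R1": 49.2},
    "AIME %":               {"o3": 96.7, "o1": 78.0, "DeepSeek R1": 79.8},
    "GPQA-Diamond %":       {"o3": 87.7, "o1": 76.0, "DeepSeek R1": 71.5},
}

def margin_over_best_rival(bench: str) -> float:
    """o3's lead over the stronger of the two rivals on one benchmark."""
    row = SCORES[bench]
    rival_best = max(v for name, v in row.items() if name != "o3")
    return round(row["o3"] - rival_best, 1)

for bench in SCORES:
    print(f"{bench}: o3 leads by {margin_over_best_rival(bench)}")
```

As the script makes plain, the smallest of these reported margins (GPQA-Diamond) is still over 11 points.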

All replies:

AIME math-competition accuracy is already at 96.7% -Bob007- 02/01/2025 22:04:24

Ordinary people had better stop showing off their math in front of AI -Bob007- 02/01/2025 22:06:08

On the research-level frontier-math benchmark, accuracy jumped from a few percent to 25% -Bob007- 02/01/2025 22:11:04

I suspect o3's running cost is so high that it will be hard to release it for general use. -監考老師- 02/01/2025 22:14:21

By the way, the o series are new models focused on stronger reasoning, distinct from ChatGPT. Some posters here may still think o1 is GPT-1. -監考老師- 02/01/2025 22:18:05

ChatGPT is actually a front end that can connect to different GPT back-end models, including o3. There is also an API for development. -大觀園的賈探春- 02/01/2025 22:22:27
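The point that the front end is decoupled from the back-end model can be illustrated with the shape of an OpenAI Chat Completions request: switching models is just changing one string in the payload. A minimal sketch (the endpoint URL and payload fields follow OpenAI's public API; no network call is made here):

```python
import json

# OpenAI's public Chat Completions endpoint (requests would also need an API key header).
API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(model: str, user_message: str) -> dict:
    """Build a Chat Completions payload; the front end only swaps the model name."""
    return {
        "model": model,  # e.g. "gpt-4o", "o1", "o3-mini" — the back end is a parameter
        "messages": [{"role": "user", "content": user_message}],
    }

# The same front-end code can target different back-end models:
for backend in ("gpt-4o", "o1", "o3-mini"):
    payload = build_request(backend, "Prove that sqrt(2) is irrational.")
    print(json.dumps(payload)[:60], "...")
```

This is the sense in which ChatGPT is "just" a front end: the conversation UI stays the same while the `model` field routes the request to a different back-end model.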

o1/o3 and GPT-1 through GPT-4 are series with different emphases. -監考老師- 02/01/2025 22:32:25

OpenAI's business model is to offer an API for development, not downloads for local install. -大觀園的賈探春- 02/01/2025 22:24:19

Right — you are using OpenAI's compute. o3 is powerful, but its high running cost may make it hard to offer at an attractive service price. -監考老師- 02/01/2025 22:36:35

Those who don't want to use an API can use Meta's Llama; the Llama download is free. -大觀園的賈探春- 02/01/2025 23:13:56

Running locally has its costs too! -監考老師- 02/01/2025 23:22:19

Of course. But so far there is no data showing that DeepSeek's local running cost is any lower than Llama's. -大觀園的賈探春- 02/01/2025 23:24:18

Which AI reasons best? You only know once you try. For example, ask it this problem -t130152- (2769 bytes) 02/02/2025 01:13:04
