OpenAI's latest o3 model has made a breakthrough that surprised the AI research community. o3 scored an unprecedented 75.7% on the highly challenging ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.
While the achievement on ARC-AGI is impressive, it does not yet prove that the code of artificial general intelligence (AGI) has been cracked.
The Abstraction and Reasoning Corpus
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus (ARC), which tests an AI system's ability to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of a set of visual puzzles that require understanding of basic concepts such as objects, boundaries, and spatial relationships. While humans can easily solve ARC puzzles after seeing just a few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most challenging benchmarks in AI.

ARC is designed so that it cannot be gamed by training models on millions of examples in the hope of covering every possible combination of puzzles.
The benchmark consists of a public training set of 400 simple examples, supplemented by a public evaluation set of 400 more challenging puzzles that assess the generalizability of AI systems. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each, which are not shared with the public. These are used to evaluate candidate AI systems without the risk of the data leaking to the public and contaminating future systems with prior knowledge. Furthermore, the competition limits the amount of compute participants can use, to ensure that the puzzles are not solved through brute-force methods.
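To make the structure concrete, here is a minimal sketch of an ARC-style task, assuming the public JSON format in which each task carries a few "train" demonstration pairs and a "test" input; grids are lists of lists of color indices. The task below is a hypothetical toy example, not one from the actual corpus.

```python
# A hypothetical ARC-style task: each pair demonstrates a hidden rule
# (here: reverse each row, i.e. swap the columns of a 2x2 grid).
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # the solver must predict this output
    ],
}

def solves(program, task):
    """Check a candidate program against every demonstration pair."""
    return all(program(pair["input"]) == pair["output"]
               for pair in task["train"])

# A program matching the hidden rule above:
swap_columns = lambda grid: [row[::-1] for row in grid]

print(solves(swap_columns, task))                  # True
print(swap_columns(task["test"][0]["input"]))      # [[0, 3], [3, 0]]
```

The point of the format is that the hidden rule must be inferred from two or three demonstrations alone, which is why memorizing training examples does not help on the private test sets.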
A breakthrough in solving novel tasks
o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another method, developed by researcher Jeremy Berman, used a hybrid approach combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to achieve 53%, the highest score before o3.
In a blog post, François Chollet, the creator of ARC, described o3's performance as "a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models."
It is important to note that throwing more compute at previous generations of models could not achieve these results. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don't know much about o3's architecture, we can be confident that it is not orders of magnitude larger than its predecessors.

"This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs," Chollet wrote. "o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain."
It is worth noting that o3's performance on ARC-AGI comes at a steep cost. On the low-compute configuration, the model costs $17 to $20 and 33 million tokens to solve each puzzle, while on the high-compute budget it uses around 172 times more compute and billions of tokens per problem. However, as the cost of inference continues to fall, we can expect these figures to become more reasonable.
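A quick back-of-envelope calculation from the figures above gives a sense of the scale, assuming cost grows roughly linearly with compute (an assumption, since OpenAI has not published exact high-compute pricing):

```python
# Rough cost estimate from the reported figures; linear scaling with
# compute is an assumption, not a published number.
low_cost_per_puzzle = 20      # upper end of the ~$17-20 per-puzzle cost
compute_multiplier = 172      # high-compute uses ~172x more compute

high_cost_per_puzzle = low_cost_per_puzzle * compute_multiplier
print(f"~${high_cost_per_puzzle:,} per puzzle on the high-compute budget")

# Across the 100-puzzle semi-private evaluation set:
print(f"~${high_cost_per_puzzle * 100:,} for a full 100-puzzle run")
```

Under that assumption, a single puzzle on the high-compute budget lands in the thousands of dollars, which is why falling inference prices matter so much here.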
A new paradigm in LLM reasoning?
The key to solving novel problems is what Chollet and other scientists call "program synthesis." A thinking system should be able to develop small programs that solve very specific problems, then combine those programs to tackle more complex ones. Classic language models have absorbed a great deal of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from figuring out puzzles that lie outside their training distribution.
Unfortunately, there is very little information about how o3 works under the hood, and here scientists' opinions diverge. Chollet speculates that o3 uses a type of program synthesis in which chain-of-thought (CoT) search is coupled with a reward model that evaluates and refines candidate solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in recent months.
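The search-plus-reward-model idea can be sketched as a simple best-of-n loop. This is speculation about o3, not a confirmed mechanism, and `generate_cot` and `reward_model` below are hypothetical stand-ins for an LLM sampler and a learned verifier:

```python
import random

def generate_cot(problem, seed):
    """Stand-in for sampling one chain-of-thought from an LLM.
    Returns a (reasoning text, quality) pair; quality is faked here."""
    random.seed(seed)
    return f"reasoning path {seed} for {problem!r}", random.random()

def reward_model(candidate):
    """Stand-in for a learned verifier scoring a candidate solution."""
    _, quality = candidate
    return quality

def best_of_n(problem, n=8):
    """Sample n candidate reasoning chains, score each with the reward
    model, and keep the highest-scoring one."""
    candidates = [generate_cot(problem, seed) for seed in range(n)]
    return max(candidates, key=reward_model)

best = best_of_n("solve this ARC puzzle")
print(best[0])
```

Real systems in this vein typically go further, pruning or refining partial chains mid-generation rather than scoring only finished ones, but the select-by-verifier structure is the same.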
Other scientists, such as Nathan Lambert of the Allen Institute for AI, suggest that "o1 and o3 could actually be just forward passes from a single language model." On the day o3 was announced, Nat McAleese, a researcher at OpenAI, posted on X that o1 was "just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1."

On the same day, Denny Zhou of Google DeepMind's reasoning team described the combination of search and current reinforcement learning approaches as a "dead end."
"The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g. MCTS) over the generation space, whether by a well-finetuned model or a carefully designed prompt," he posted on X.

While the details of how o3 reasons may seem trivial next to the breakthrough on ARC-AGI, they could very well define the next paradigm shift in training LLMs. There is an ongoing debate over whether the laws of scaling LLMs through training data and compute have hit a wall. Whether test-time scaling depends on better training data or on different inference architectures could determine the next path forward.
Not artificial general intelligence
The name ARC-AGI is misleading, and some have equated solving it with achieving AGI. However, Chollet stresses that "ARC-AGI is not an acid test for AGI."
"Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet," he wrote. "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."
Furthermore, he notes that o3 cannot learn these skills on its own, relying on external verifiers during inference and on human-labeled chains of reasoning during training.
Other scientists have pointed out flaws in OpenAI's reported results. For example, the model was fine-tuned on the ARC training set to achieve its state-of-the-art results. "The solver should not need much specific 'training,' either on the domain itself or on each specific task," wrote scientist Melanie Mitchell.
To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell proposes "seeing if these systems can adapt to variants of specific tasks or to reasoning tasks using the same concepts, but in other domains than ARC."
Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially dropping its score to under 30% even on a high-compute budget. Meanwhile, humans would be able to solve 95% of the puzzles without any training.
"You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible," Chollet wrote.