OpenAI claims that its new model has reached the human level in a test of “general intelligence.” What does that mean?



A new artificial intelligence (AI) model has just arrived, achieving human-level results on a test designed to measure “general intelligence.”

On December 20, OpenAI’s o3 system scored 85% on the ARC-AGI benchmark, well above the previous best AI score of 55% and on par with the average human score. It also scored well on a very difficult mathematics test.

Creating artificial general intelligence, or AGI, is the stated goal of all major AI research laboratories. At first glance, it appears that OpenAI has at least taken a significant step toward this goal.

Although doubts persist, many AI researchers and developers feel that something has changed. For many, the prospect of artificial general intelligence now seems more real, more urgent, and closer than expected. Are they right?

Generalization and intelligence

To understand what the o3 score means, you need to understand what the ARC-AGI test is about. Technically, it is a test of an AI system’s “sample efficiency” in adapting to something new: how many examples of a novel situation does the system need to see before it figures out how it works?

An AI system like ChatGPT (GPT-4) is not very sample-efficient. It was “trained” on millions of examples of human text, constructing probabilistic “rules” about which word combinations are most likely.

The result is that it is very good at common tasks, but bad at uncommon ones, because it has less data (fewer samples) about those tasks.

Until AI systems can learn from a small number of examples and adapt with greater sample efficiency, they will only be used for very repetitive jobs and ones where the occasional failure can be tolerated.

The ability to accurately solve unknown or new problems from limited samples of data is known as generalization ability. It is widely considered a necessary, even essential, component of intelligence.

Grids and patterns

The ARC-AGI benchmark tests for sample-efficient adaptation using little grid-square problems like the one below. The AI needs to work out the pattern that turns the grid on the left into the grid on the right.

A sample task from the ARC-AGI benchmark test: patterns of colored squares on a black grid background. (Image: ARC Prize)

Each question gives three examples to learn from. The AI system then needs to work out the rules that “generalize” from the three examples to the fourth.
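The setup can be sketched in code. The grids and the hidden rule below are invented toy data for illustration, not taken from the real benchmark; the rule here (mirror each row) stands in for whatever transformation a task hides:

```python
# Toy illustration of an ARC-style task. Grids are small lists of lists
# of integers (colors). The hidden rule in this made-up task is
# "mirror each row left to right".

def mirror_rows(grid):
    """A candidate rule: reflect every row of the grid left to right."""
    return [list(reversed(row)) for row in grid]

# Three demonstration pairs of (input grid, expected output grid).
examples = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0]],      [[0, 3, 3]]),
    ([[0], [4]],       [[0], [4]]),
]

# The rule must reproduce every demonstration...
assert all(mirror_rows(inp) == out for inp, out in examples)

# ...and then generalize to an unseen fourth grid.
print(mirror_rows([[5, 0, 7]]))  # [[7, 0, 5]]
```

Real ARC-AGI tasks use larger grids and far less obvious transformations, but the shape of the problem is the same: three worked examples, then one held-out test.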

These tests are quite similar to the IQ tests you might remember from school.

Weak rules and adaptation

We don’t know exactly how OpenAI has done it, but the results suggest the o3 model is highly adaptable. From just a few examples, it finds rules that can be generalized.

To figure out a pattern, we shouldn’t make any unnecessary assumptions, or be more specific than we really have to be. In theory, if you can identify the “weakest” rules that do what you want, then you have maximized your ability to adapt to new situations.

What do we mean by the weakest rules? The technical definition is complicated, but weaker rules are usually ones that can be described in simpler statements.

In the example above, a plain-English expression of the rule might be something like: “Any shape with a protruding line will move to the end of that line and ‘cover up’ any other shapes it overlaps with.”
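The link between weakness and generalization can be made concrete with a toy comparison. Both rules below (invented for illustration) fit the same three examples, and the length of a rule’s description serves as a crude stand-in for its weakness:

```python
# Two made-up candidate rules that both explain the same three examples.
# The length of each rule's plain-text description is used here as a
# crude proxy for "weakness" (simplicity).

rules = {
    "add 1": lambda x: x + 1,
    "look x up in the table {1: 2, 2: 3, 3: 4}, else return 0":
        lambda x: {1: 2, 2: 3, 3: 4}.get(x, 0),
}

examples = [(1, 2), (2, 3), (3, 4)]

# Both rules reproduce every example...
for rule in rules.values():
    assert all(rule(x) == y for x, y in examples)

# ...but the weakest (shortest-description) rule generalizes.
best = min(rules, key=len)
print(best)             # add 1
print(rules[best](10))  # 11; the overfitted lookup rule would return 0
```

The second rule makes an unnecessary assumption: that only the three seen inputs matter. The weaker rule carries no such baggage, so it keeps working on new inputs.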

Searching chains of thought?

Although we don’t know how OpenAI achieved this result just yet, it seems unlikely they deliberately optimized the o3 system to find weak rules. However, to succeed at the ARC-AGI tasks, it must be finding them.

We do know that OpenAI started with a general-purpose version of the o3 model (which differs from most other models because it can spend more time “thinking” about difficult questions) and then trained it specifically for the ARC-AGI test.

French AI researcher François Chollet, who designed the benchmark, believes o3 searches through different “chains of thought” describing steps to solve the task. It would then choose the “best” according to some loosely defined rule, or “heuristic.”

This would be no different from how Google’s AlphaGo system searched through different possible sequences of moves to beat the Go world champion.

We can think of these chains of thought as programs that fit the examples. Of course, if it is like the Go-playing AI, then it needs a heuristic, or loose rule, to decide which program is best.

Thousands of different, seemingly equally valid programs could be generated. That heuristic could be “choose the weakest” or “choose the simplest.”
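The hypothesized search can be sketched in miniature. This is an assumption about the process Chollet describes, not OpenAI’s actual method, and the candidate “programs” below are invented: generate candidates, discard any that contradict the worked examples, then apply a simplicity heuristic to the survivors:

```python
# A minimal sketch of the hypothesized search: generate candidate
# "programs" (here, named one-line functions), keep only those
# consistent with the worked examples, then pick one using a
# heuristic: "choose the simplest description".

candidates = {
    "add 1": lambda x: x + 1,
    "double": lambda x: 2 * x,
    "add 1 and multiply by 1": lambda x: (x + 1) * 1,
    "add 2 then subtract 1": lambda x: x + 2 - 1,
}

examples = [(0, 1), (5, 6), (9, 10)]

# Step 1: discard candidates that fail any worked example.
consistent = {name: f for name, f in candidates.items()
              if all(f(x) == y for x, y in examples)}
assert "double" not in consistent  # fails on (5, 6)

# Step 2: among the survivors, the heuristic prefers the
# shortest description.
best = min(consistent, key=len)
print(best)  # add 1
```

In a real system the candidates would be generated by the language model itself and the heuristic would be learned rather than hand-written, but the two-step structure (generate, then rank) is the same.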

However, if it is like AlphaGo, then they simply had an AI create a heuristic. This was the process for AlphaGo: Google trained a model to rate different sequences of moves as better or worse than others.

What we don’t know yet

The question then is, is this really closer to artificial general intelligence? If that is how o3 works, then the underlying model might not be much better than previous models.

The concepts the model learns from language might not be any more suitable for generalization than before. Instead, we may just be seeing a more generalizable “chain of thought” found through the extra steps of training a heuristic specialized to this test. The proof, as always, will be in the pudding.

Almost everything about o3 remains unknown. OpenAI has limited its disclosure to a few media presentations and early testing by a handful of researchers, laboratories and AI safety institutions.

Truly understanding o3’s potential will require extensive work, including evaluations, an understanding of the distribution of its capacities, how often it fails and how often it succeeds.

When o3 is finally released, we’ll have a much better idea of whether it is approximately as adaptable as an average human.

If so, it could have a huge, revolutionary economic impact, ushering in a new era of self-improving accelerated intelligence. We will need new benchmarks for artificial general intelligence itself, and serious consideration of how it ought to be governed.

If not, then this will still be an impressive result. However, everyday life will remain much the same.

Michael Timothy Bennett, PhD Student, School of Computing, Australian National University, and Elijah Perrier, Research Fellow, Stanford Center for Responsible Quantum Technology, Stanford University

This article is republished from The Conversation under a Creative Commons license. Read the original article.


