Anthropic scientists reveal how AI actually “thinks” – and discover that it secretly plans ahead and sometimes lies



Anthropic has developed a new way to peer inside large language models such as Claude, revealing for the first time how these AI systems process information and make decisions.

The research, published today in two papers (available here and here), shows these models are more sophisticated than previously understood: they plan ahead when writing poetry, use the same internal blueprint to interpret ideas regardless of language, and sometimes work backward from a desired outcome rather than simply building up from the facts.

The work, which draws on neuroscience techniques used to study biological brains, represents a significant advance in AI interpretability. The approach could allow researchers to audit these systems for safety issues that might remain hidden during conventional external testing.

“We have created these AI systems with remarkable capabilities, but because of how they are trained, we have not understood how they actually work,” said Joshua Batson, a researcher at Anthropic, in an exclusive interview with VentureBeat. “Inside the model, it is just a set of numbers, the weights of the artificial neural network.”

New techniques illuminate AI’s previously hidden decision-making process

Large language models like OpenAI’s GPT-4o, Anthropic’s Claude, and Google’s Gemini have demonstrated remarkable capabilities, from writing code to synthesizing research papers. But these systems have largely operated as “black boxes”: even their creators often do not understand exactly how they arrive at particular responses.

Anthropic’s new interpretability techniques, which the company calls “circuit tracing” and “attribution graphs,” let researchers map out the specific pathways of neuron-like features that activate when models perform tasks. The approach borrows concepts from neuroscience, treating AI models as analogous to biological systems.
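To make the idea concrete, here is a minimal, hypothetical sketch (plain Python, not Anthropic’s actual tooling) of what an attribution graph amounts to: a weighted directed graph whose nodes are interpretable features and whose edges estimate how strongly one feature’s activation drives another, so that paths from input tokens to an output can be enumerated and scored. The feature names and weights below are invented for illustration.

```python
# Toy attribution graph: nodes are interpretable features, edge weights are
# rough estimates of how much one feature's activation contributes to another.
# All names and numbers are made up for illustration.
edges = {
    ("token:Dallas", "feature:Texas"): 0.82,
    ("token:capital", "feature:state-capital"): 0.58,
    ("feature:Texas", "feature:state-capital"): 0.64,
    ("feature:state-capital", "output:Austin"): 0.71,
}

def trace_paths(edges, source, target, path=None, strength=1.0):
    """Enumerate paths from an input node to an output node, multiplying
    edge weights as a crude proxy for end-to-end attribution strength."""
    path = (path or []) + [source]
    if source == target:
        yield path, strength
        return
    for (src, dst), weight in edges.items():
        if src == source:
            yield from trace_paths(edges, dst, target, path, strength * weight)

for path, strength in trace_paths(edges, "token:Dallas", "output:Austin"):
    print(" -> ".join(path), f"(attribution ~ {strength:.2f})")
```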

“This work turns questions that were almost philosophical – ‘Are models thinking? Are models planning? Are models just regurgitating information?’ – into concrete scientific inquiries about what is literally happening inside these systems,” Batson said.

Claude’s hidden planning: How it plots poem lines and solves geography questions

Among the most striking discoveries was evidence that Claude plans ahead when writing poetry. When asked to compose a rhyming couplet, the model identified potential rhyming words for the end of the next line before it began writing, a degree of sophistication that surprised even Anthropic’s researchers.

“This is probably happening all over the place,” Batson said. “If you had asked me before this research, I would have guessed the model is thinking ahead in various contexts. But this example provides the most compelling evidence we have seen of that capability.”

For example, when writing a poem intended to end with the word “rabbit,” the model activates features representing that word at the beginning of the line, then structures the sentence so it arrives naturally at that conclusion.
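As a rough illustration of this “choose the ending first” behavior (a made-up sketch, not Claude’s internals), the logic looks something like the following: the rhyme target is selected up front, and the rest of the line is composed toward it rather than improvised word by word.

```python
# Toy sketch of planning a line around a pre-selected rhyme word.
# The rhyme dictionary and phrasing are invented for illustration.
RHYMES = {"grab it": ["rabbit", "habit"]}

def compose_line(prior_line_end="grab it"):
    target = RHYMES[prior_line_end][0]       # ending chosen before writing begins
    stem = "He hopped along just like a"     # the line is built to land on it
    return f"{stem} {target}"

print(compose_line())  # "He hopped along just like a rabbit"
```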

The researchers also found that Claude performs genuine multi-step reasoning. In one test, given the prompt “The capital of the state containing Dallas is…”, the model first activates features representing “Texas,” then uses that representation to determine “Austin” as the correct answer. This suggests the model is actually carrying out a chain of reasoning rather than simply regurgitating memorized associations.

By intervening on these internal representations (for example, swapping “Texas” for “California”), researchers could cause the model to output “Sacramento” instead, confirming the causal relationship.
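The experiment resembles what interpretability researchers often call activation patching. A toy stand-in (a fake two-step model rather than Claude’s real representations) conveys the shape of the test: overwrite the intermediate “state” representation mid-computation and check whether the answer flips.

```python
# Toy "activation patching" sketch: replace an intermediate representation
# and observe whether the downstream answer changes. Not Claude's internals.
STATE_CAPITALS = {"Texas": "Austin", "California": "Sacramento"}

def toy_model(prompt, patched_state=None):
    # Step 1: infer the intermediate concept (which state contains the city).
    state = "Texas" if "Dallas" in prompt else "unknown"
    # Step 2: the intervention overwrites that representation mid-computation.
    if patched_state is not None:
        state = patched_state
    # Step 3: the downstream circuit maps the state to its capital.
    return STATE_CAPITALS.get(state, "I don't know")

prompt = "The capital of the state containing Dallas is"
print(toy_model(prompt))                              # Austin
print(toy_model(prompt, patched_state="California"))  # Sacramento
```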

Beyond translation: Claude’s universal concept network revealed

Another major discovery concerns how Claude handles multiple languages. Rather than maintaining separate systems for English, French, and Chinese, the model appears to translate concepts into a shared abstract representation before generating responses.

“We find the model uses a mixture of language-specific and abstract, language-independent circuits,” the researchers write in their paper. When asked for the opposite of “small” in different languages, the model uses the same internal features representing “opposites” and “smallness,” regardless of the input language.
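A simple way to picture this shared concept space (an assumed toy, not the model’s actual features): words from several languages map onto one abstract concept, the reasoning step operates on that concept, and only the final step depends on which language the answer should be produced in.

```python
# Toy sketch of a language-independent concept space. Mappings are illustrative.
CONCEPT_OF = {"small": "SMALL", "petit": "SMALL", "小": "SMALL"}
OPPOSITE = {"SMALL": "LARGE"}
SURFACE_FORM = {("LARGE", "en"): "large", ("LARGE", "fr"): "grand", ("LARGE", "zh"): "大"}

def opposite_of(word, lang):
    concept = CONCEPT_OF[word]             # language-specific word -> abstract concept
    result = OPPOSITE[concept]             # reasoning happens on the abstract concept
    return SURFACE_FORM[(result, lang)]    # abstract concept -> output language

print(opposite_of("small", "en"), opposite_of("petit", "fr"), opposite_of("小", "zh"))
```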

This finding has implications for how models transfer knowledge learned in one language to others, and it suggests that models with more parameters develop more language-independent representations.

When AI makes up answers: Detecting Claude’s mathematical fabrications

Perhaps most significantly, the research uncovered cases where Claude’s reasoning does not match what it claims. When given hard math problems, such as computing cosine values of large numbers, the model sometimes claims to follow a calculation process that is not reflected in its internal activity.

“We are able to distinguish between cases in which the model genuinely performs the steps it says it is performing, cases in which it makes up its reasoning without regard for truth, and cases in which it works backward from a clue provided by the human,” the researchers explain.

In one example, when a user suggests an answer to a difficult problem, the model works backward to construct a chain of reasoning that leads to that answer, rather than working forward from first principles.

“We mechanistically distinguish an example of Claude 3.5 Haiku using a faithful chain of thought from examples of unfaithful chains of thought,” the paper says. “In one, the model exhibits ‘bullshitting’… in the other, it exhibits motivated reasoning.”

Inside AI hallucinations: How Claude decides when to answer or decline questions

The research also offers insight into why language models hallucinate, making up information when they do not know an answer. Anthropic found evidence of a “default” circuit that causes Claude to decline to answer questions, and this circuit is inhibited when the model recognizes entities it knows about.

The researchers explain: “The model contains ‘default’ circuits that cause it to decline to answer questions. When it is asked about something it knows, it activates a pool of features that inhibit this default circuit, allowing the model to answer.”

When this mechanism misfires, recognizing an entity but lacking specific knowledge about it, hallucinations can occur. This explains why models may confidently supply incorrect information about well-known figures while declining to answer questions about obscure ones.
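A hedged toy model of that failure mode (the names and logic below are invented for illustration, not Claude’s actual circuitry) shows how a default refusal, an inhibiting “known entity” signal, and missing facts interact.

```python
# Toy sketch: a refusal circuit is active by default, recognition of a known
# entity inhibits it, and hallucination corresponds to the case where the
# entity is recognized but no specific fact can be retrieved.
KNOWN_ENTITIES = {"Ada Lovelace"}        # entities the toy model "recognizes"
FACTS = {"Ada Lovelace": "wrote early programs for the Analytical Engine"}

def answer(entity):
    refuse = True                        # the default circuit: decline to answer
    if entity in KNOWN_ENTITIES:
        refuse = False                   # recognition inhibits the refusal circuit
    if refuse:
        return "I don't know."
    fact = FACTS.get(entity)
    # Failure mode: refusal suppressed, but no fact retrievable -> confabulation risk.
    return fact if fact else "(risk of a made-up answer)"

print(answer("Ada Lovelace"))            # retrieved fact
print(answer("J. Q. Obscure"))           # unrecognized name, refusal stays active
KNOWN_ENTITIES.add("J. Q. Obscure")      # simulate recognition without knowledge
print(answer("J. Q. Obscure"))           # the hallucination-prone state
```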

Safety implications: Using circuit tracing to improve AI reliability and trustworthiness

This research represents an important step toward making AI systems more transparent, and potentially safer. By understanding how models reach their answers, researchers can identify and address problematic reasoning patterns.

Anthropic has long emphasized the safety potential of interpretability work. In its May 2024 Sonnet paper, the researchers wrote: “We hope that we can use these discoveries to make models safer. For example, it may be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors, such as deceiving the user, to steer them toward desirable outcomes, or to remove dangerous subject matter entirely.”

Today’s announcement builds on that foundation, though Batson cautions that the current techniques still have significant limitations. They capture only a fraction of the total computation these models perform, and analyzing the results remains labor-intensive.

“Even on simple prompts, our method captures only a small fraction of the total computation performed by Claude,” the researchers acknowledged in their latest work.

The future of AI transparency: Challenges and opportunities in model interpretability

Anthropic’s new techniques arrive amid growing concern about AI transparency and safety. As these models become more powerful and more widely deployed, understanding their internal mechanisms becomes increasingly important.

The research also has potential commercial implications. As enterprises increasingly rely on large language models to power applications, understanding when and why these systems can produce incorrect information becomes critical for managing risk.

“Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse, including in catastrophic scenarios,” the researchers write.

While this research represents significant progress, Batson emphasized that it is only the beginning of a much longer journey. “The work has really just begun,” he said. “Understanding the representations the model uses does not tell us how it uses them.”

For now, Anthropic’s circuit tracing offers a first tentative map of previously uncharted territory, much as early anatomists sketched the first crude diagrams of the human brain. The full atlas of AI cognition remains to be drawn, but we can now see the outlines of how these systems think.


