A new paper by researchers at Google Research and the University of California, Berkeley shows that a surprisingly simple test-time scaling approach can boost the reasoning capabilities of large language models (LLMs). The key? Scaling up sampling-based search, a technique that relies on generating multiple responses and using the model itself to verify them.
The core finding is that even a minimal implementation of sampling-based search, using random sampling and self-verification, can raise the reasoning performance of models such as Gemini 1.5 Pro beyond o1-preview on popular benchmarks. The results could have important implications for enterprise applications and challenge the assumption that highly specialized training or complex architectures are always necessary to achieve top performance.
The limits of current test-time scaling methods
The current popular method for scaling test-time compute in LLMs is to train the model through reinforcement learning to generate longer responses with chain-of-thought (CoT) traces. This approach is used in models such as OpenAI o1 and DeepSeek-R1. While useful, these methods usually require substantial investment in the training stage.
Another test-time scaling method is "self-consistency," where the model generates multiple responses to the query and chooses the answer that appears most often. Self-consistency reaches its limits on complex problems, where the most repeated answer is not necessarily the correct one.
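As a point of reference, self-consistency reduces to a majority vote over sampled answers. A minimal sketch, where `sample_answer` is a hypothetical stand-in for a call to the model:

```python
from collections import Counter

def self_consistency(sample_answer, prompt, n=10):
    """Sample n answers to the same prompt and return the most frequent one."""
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

On hard problems the modal answer can itself be wrong, which is exactly the limitation that motivates verification-based selection instead of a plain vote.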
Sampling-based search offers a simpler and highly scalable alternative for test-time scaling: let the model generate multiple responses and pick the best one through a verification mechanism. Sampling-based search can complement other test-time compute scaling strategies, and, as the researchers write in their paper, "it also has the unique advantage of being embarrassingly parallel and allowing for arbitrarily scaling: simply sample more responses."
More importantly, sampling-based search can be applied to any LLM, including those that have not been explicitly trained for reasoning.
How sampling-based search works
The researchers focus on a minimal implementation of sampling-based search, using a language model both to generate candidate responses and to verify them. This is a "self-verification" process, where the model evaluates its own outputs without relying on external ground-truth answers or symbolic verification systems.

The algorithm works in a few simple steps:
1. The algorithm begins by generating a set of candidate solutions to the given problem using the language model. It does this by giving the model the same prompt multiple times and using a non-zero temperature setting to create a diverse set of responses.
2. Each candidate response undergoes a verification process in which the LLM is prompted multiple times to determine whether the response is correct. The verification outcomes are then averaged to create a final verification score for the response.
3. The algorithm selects the highest-scoring response as the final answer. If several candidates score close to each other, the LLM is prompted to compare them pairwise and choose the best one. The response that wins the most pairwise comparisons is picked as the final answer.
The researchers considered two main axes for scaling test-time compute:
Sampling: the number of responses the model generates for each input problem.
Verification: the number of verification scores computed for each generated solution.
How sampling-based search compares to other techniques
The study found that reasoning performance continues to improve with sampling-based search, even when test-time compute is scaled far beyond the point where self-consistency saturates.
At sufficient scale, this minimal implementation significantly boosts reasoning accuracy on benchmarks such as AIME and MATH. For example, Gemini 1.5 Pro's performance surpassed that of o1-preview, which was explicitly trained on reasoning problems, and Gemini 1.5 Flash surpassed Gemini 1.5 Pro.

"This not only highlights the importance of sampling-based search for scaling capability, but also suggests the utility of sampling-based search as a simple baseline on which to compare other test-time compute scaling strategies and measure genuine improvements in models' search capabilities," the researchers write.
It is worth noting that while the results of sampling-based search are impressive, the costs can also become prohibitive. For example, with 200 samples and 50 verification steps per sample, a query from AIME generates around 130 million tokens, which costs $650 with Gemini 1.5 Pro. However, this is a very minimalistic approach to sampling-based search, and it is compatible with the optimization techniques proposed in other studies. With smarter sampling and verification methods, inference costs can be reduced considerably by using smaller models and generating fewer tokens. For example, by using Gemini 1.5 Flash to perform the verification, costs drop to $12 per question.
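As a sanity check on those numbers: at an assumed blended price of roughly $5 per million tokens for Gemini 1.5 Pro (an illustrative assumption, not a figure from the paper), 130 million tokens comes out to $650:

```python
def query_cost(total_tokens, price_per_million_tokens):
    """Rough cost of one sampling-based-search query at a flat per-token price."""
    return total_tokens / 1_000_000 * price_per_million_tokens

# 200 samples x 50 verifications ~= 130M tokens per AIME question (figure from the article)
print(query_cost(130_000_000, 5.0))  # at an assumed $5/M tokens -> 650.0
```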
Effective self-verification strategies
There is an ongoing debate about whether LLMs can verify their own answers. The researchers identified two key strategies for improving self-verification with test-time compute:
Comparing across candidate responses: Disagreements between candidate solutions strongly indicate potential errors. By providing the verifier with multiple responses to compare, the model can better identify mistakes and hallucinations, addressing a core weakness of LLMs. The researchers describe this as an instance of "implicit scaling."
Task-specific rewriting: The researchers suggest that the optimal output style of an LLM depends on the task. Chain-of-thought is effective for solving reasoning tasks, but responses are easier to verify when written in a more formal, mathematically conventional style. Verifiers can rewrite candidate responses into a more structured format (e.g., theorem-lemma-proof) before evaluating them.
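Both strategies can be combined in the verification prompt itself. The sketch below assembles a hypothetical verifier prompt; the wording is illustrative, not taken from the paper:

```python
def build_verification_prompt(problem, candidate, rivals):
    """Combine both self-verification strategies in one prompt: ask for a
    formal rewrite, then surface disagreements with rival candidates."""
    rival_block = "\n\n".join(
        f"Candidate {i + 1}:\n{r}" for i, r in enumerate(rivals)
    )
    return (
        f"Problem:\n{problem}\n\n"
        f"Response to verify:\n{candidate}\n\n"
        "Step 1: Rewrite the response in a formal theorem-lemma-proof style.\n"
        "Step 2: Compare it with the alternative candidates below and flag any "
        "disagreements, which often signal errors.\n\n"
        f"{rival_block}\n\n"
        "Step 3: Conclude with CORRECT or INCORRECT."
    )
```

The verifier's final CORRECT/INCORRECT token can then be parsed into the boolean vote that the scoring step averages.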
"We anticipate that model self-verification capabilities will rapidly improve in the short term, as models learn to leverage the principles of implicit scaling and output style suitability, and drive improved scaling rates for sampling-based search," the researchers write.
Implications for real-world applications
The study shows that a relatively simple technique can achieve impressive results, potentially reducing the need for complex and expensive model architectures or training regimes.
It is also a scalable technique, allowing enterprises to increase performance by allocating more compute resources to sampling and verification. It also enables developers to push frontier language models beyond their limitations on complex tasks.
"Given that it complements other test-time compute scaling strategies, is parallelizable and allows for arbitrarily scaling, and admits simple implementations that are demonstrably effective, we expect sampling-based search to play a crucial role as language models are tasked with solving increasingly complex problems with increasingly large compute budgets," the researchers write.