In a new case study, researchers at Hugging Face show how small language models (SLMs) can be configured to outperform much larger models. Their results show that a Llama 3 model with 3B parameters can outperform the 70B version of the model on complex math problems.
Hugging Face has fully documented the entire process and provides a roadmap for organizations that want to create their own custom reasoning models.

Scaling test-time compute
The work is inspired by OpenAI o1, which uses extra "thinking" to solve complex math, coding, and reasoning problems.
The key idea behind models like o1 is to scale "test-time compute," which effectively means using more compute cycles during inference to test and verify different answers and reasoning paths before producing the final answer. Scaling test-time compute is especially useful when there is not enough memory to run a large model.
Since o1 is a proprietary model and OpenAI has remained tight-lipped about its inner workings, researchers have been speculating about how it works and trying to reverse-engineer the process. There are already several open alternatives to o1.
The Hugging Face work builds on a DeepMind study released in August, which investigates the trade-offs between inference-time and pre-training compute. The study provides comprehensive guidance on how to balance training and inference compute to get the best results on a fixed budget.
In addition to spending extra inference-time compute, the success of the technique hinges on two key components: a reward model that evaluates the SLM's answers, and a search algorithm that optimizes the path it takes to refine its answers.

Different reasoning algorithms
The simplest way to use test-time compute is "majority voting," in which the same prompt is sent to the model multiple times and the most frequent answer is chosen. On simple problems, majority voting can be useful, but its gains quickly plateau on complex reasoning problems or tasks where errors are consistent across generations.
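Majority voting reduces to a frequency count over sampled answers. A minimal sketch (the sample answers are hypothetical, not from the study):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the answer that appears most often across sampled generations."""
    return Counter(answers).most_common(1)[0][0]

# Five hypothetical samples for the same math prompt
samples = ["42", "41", "42", "42", "17"]
print(majority_vote(samples))  # "42" wins with three votes
```

Note that if the model makes the same mistake in most generations, the wrong answer wins the vote, which is exactly why gains plateau on harder problems.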
A more advanced method is "best-of-N." In this technique, the SLM generates multiple answers, but instead of taking a majority vote, a reward model is used to evaluate the answers and choose the best one. "Weighted best-of-N," a more refined version of this method, factors in consistency to choose answers that are both high-scoring and occur more frequently than others.
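The difference between the two variants can be sketched in a few lines; the reward scores below are hypothetical stand-ins for a reward model's output:

```python
from collections import defaultdict

def best_of_n(answers, rewards):
    """Plain best-of-N: return the single answer with the highest reward."""
    return max(zip(answers, rewards), key=lambda pair: pair[1])[0]

def weighted_best_of_n(answers, rewards):
    """Weighted best-of-N: sum rewards over identical answers, so answers
    that are both high-scoring and frequent win."""
    totals = defaultdict(float)
    for answer, reward in zip(answers, rewards):
        totals[answer] += reward
    return max(totals, key=totals.get)

answers = ["12", "15", "12", "9"]
rewards = [0.6, 0.9, 0.7, 0.2]  # hypothetical reward-model scores
print(best_of_n(answers, rewards))           # "15" (single top score)
print(weighted_best_of_n(answers, rewards))  # "12" (0.6 + 0.7 beats 0.9)
```

The example shows how the weighted variant can overrule a single high-scoring outlier in favor of a consistent answer.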
The researchers used a process reward model (PRM), which scores the SLM's response not only on the final answer but also on the multiple steps it goes through to reach it. Their experiments showed that weighted best-of-N with PRMs brought Llama-3.2 1B near the level of Llama-3.2 8B on the difficult MATH-500 benchmark.
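A PRM yields one score per reasoning step, which must be collapsed into a single answer-level score. The reductions below are common conventions in the literature, not necessarily the study's exact setup:

```python
import math

def prm_aggregate(step_scores, reduction="prod"):
    """Collapse per-step PRM scores into one answer-level score.
    'prod' penalizes any weak step multiplicatively, 'min' scores an
    answer by its weakest step, 'last' trusts the final-step score."""
    if reduction == "prod":
        return math.prod(step_scores)
    if reduction == "min":
        return min(step_scores)
    if reduction == "last":
        return step_scores[-1]
    raise ValueError(f"unknown reduction: {reduction}")

# An answer with one weak intermediate step is penalized
print(prm_aggregate([0.9, 0.2, 0.95], reduction="min"))  # 0.2
```

This is what distinguishes a PRM from an outcome-only reward model: a confident final answer reached through a shaky derivation still scores poorly.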

Add search
To further improve the model's performance, the researchers added search algorithms to the model's reasoning process. Instead of generating the answer in a single pass, they used "beam search," an algorithm that guides the model's answer-generation process step by step.
At each step, the SLM generates multiple partial answers. The search algorithm uses the reward model to evaluate them and selects a subset worth exploring further. The process repeats until the model exhausts its inference budget or reaches the correct answer. This way, the inference budget is concentrated on the most promising answers.
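The expand-score-prune loop can be sketched as follows. Here `expand` (which proposes next steps) and `score` (which stands in for the PRM) are hypothetical stubs, not Hugging Face's actual implementation:

```python
def beam_search(prompt, expand, score, beam_width=2, max_steps=3):
    """Step-wise beam search over partial reasoning paths."""
    beams = [[prompt]]
    for _ in range(max_steps):
        # Extend every surviving path with every candidate next step
        candidates = [path + [step] for path in beams for step in expand(path)]
        # Keep only the highest-scoring partial answers
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy stubs: two possible steps per state; the scorer rewards "good" steps
expand = lambda path: ["good", "bad"]
score = lambda path: path.count("good")
print(beam_search("prompt", expand, score))  # ['prompt', 'good', 'good', 'good']
```

With a beam width of 2, at most two partial paths survive each round, which is how the inference budget stays focused instead of growing exponentially.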
The researchers found that although beam search improves model performance on complex problems, it tends to underperform other techniques on simple problems. To address this challenge, they added two more elements to their inference strategy.
The first was Diverse Verifier Tree Search (DVTS), a variant of beam search that ensures the SLM does not get stuck in faulty reasoning paths and diversifies its response branches. Second, they developed a "compute-optimal scaling strategy," as proposed in the DeepMind paper, which dynamically chooses the best test-time scaling strategy based on the difficulty of the input problem.
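The routing idea behind compute-optimal scaling can be illustrated with a small dispatcher. The difficulty thresholds and strategy assignments below are illustrative assumptions, not the paper's exact policy:

```python
def pick_strategy(difficulty):
    """Hypothetical sketch of compute-optimal scaling: route each problem
    to the strategy that tends to work best at its difficulty level
    (difficulty is a score in [0, 1], e.g. from a model-based estimate)."""
    if difficulty < 0.3:   # easy: simple sampling already works well
        return "best_of_n"
    if difficulty < 0.7:   # medium: step-wise search pays off
        return "beam_search"
    return "dvts"          # hard: diversify to avoid dead-end paths

for d in (0.1, 0.5, 0.9):
    print(d, "->", pick_strategy(d))
```

The point is that no single strategy dominates at every difficulty level, so the budget is spent differently per problem rather than uniformly.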
The combination of these techniques enabled Llama-3.2 1B to punch above its weight and outperform the 8B model by a significant margin. They also found that the strategy was scalable: when applied to Llama-3.2 3B, it outperformed the much larger 70B model.

Not a perfect solution yet
Scaling test-time compute changes the cost dynamics of models. Organizations now have the ability to choose where to allocate their compute resources. For example, if you are short on memory or can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.
However, test-time scaling also has its limitations. For example, in the Hugging Face experiments, the researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models in parallel (even if this is far more resource-efficient than the 70B model). The holy grail of test-time scaling, the researchers note, is "self-verification," where the original model verifies its own answers rather than relying on an external verifier. This remains an open area of research.
The test-time scaling technique presented in this study is also limited to problems whose answers can be clearly evaluated, such as coding and math. Creating reward models and verifiers for subjective tasks such as creative writing and product design requires further research.
What is clear, however, is that test-time scaling has generated a lot of interest and activity, and we can expect more tools and techniques to emerge in the coming months. Companies would be wise to keep an eye on how the landscape evolves.