Microsoft is doubling down on the potential of small language models (SLMs) with the unveiling of rStar-Math, a new reasoning technique that can be applied to small models to boost their performance on math problems, delivering results similar to, and in some cases exceeding, those of OpenAI's o1 model.
While still in the research phase, as described in a paper published on the pre-review site arXiv.org and credited to eight authors at Microsoft, Peking University and Tsinghua University in China, the technique has been applied to several different smaller open-source models, including Microsoft's Phi-3 Mini, Alibaba's Qwen-1.5B (a 1.5-billion-parameter model) and Qwen-7B (a 7-billion-parameter model). It showed improved performance on all of them, even surpassing OpenAI's previously most advanced model on the MATH (word problem solving) benchmark, a third-party standard of 12,500 questions covering branches such as geometry and algebra, at all difficulty levels.

According to a post on Hugging Face, the researchers plan to make their code and data available on GitHub at https://github.com/microsoft/rStar, although one of the paper's authors, Li Lyna Zhang, wrote in the comments on the Hugging Face post that the team is "still going through the internal review process for the open source version." As such, the repository remains private for the time being. Please stay tuned!
Community members expressed enthusiasm, calling the innovations "impressive" and praising the combination of Monte Carlo Tree Search (MCTS) with step-by-step reasoning. One commenter highlighted the simplicity and usefulness of using Q-values to score intermediate steps, while others speculated about future applications in geometric proofs and symbolic reasoning.
The news follows closely on the heels of Microsoft's open-sourcing of its Phi-4 model, a smaller 14-billion-parameter AI system now available on Hugging Face under a permissive MIT license.
While the Phi-4 version has expanded access to small, high-performance models, rStar-Math showcases a specialized approach: using smaller AI systems to achieve state-of-the-art results in mathematical reasoning.
rStar-Math works by using several different models and components to help a target small model "self-evolve."
The key to rStar-Math is that it leverages Monte Carlo Tree Search (MCTS), a method that mimics human "deep thinking" by iteratively refining step-by-step solutions to mathematical problems.
The researchers used MCTS because it “breaks down complex mathematical problems into simpler one-step tasks, reducing difficulty” for smaller models.
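The decomposition idea can be illustrated with a toy MCTS loop. This is a hypothetical sketch, not the paper's implementation: the "problem" is simply reaching a target number from 1 using one-step arithmetic actions, and each step's quality is tracked with a Q-value, the same bookkeeping commenters praised.

```python
import math
import random

# Toy illustration (not rStar-Math's actual code): MCTS over one-step
# arithmetic actions, showing how a multi-step problem is broken into
# simpler single-step tasks whose quality is tracked with Q-values.
TARGET = 10          # hypothetical goal: reach 10 starting from 1
ACTIONS = [("+3", lambda x: x + 3), ("*2", lambda x: x * 2)]
MAX_DEPTH = 6

class Node:
    def __init__(self, value, parent=None, action=""):
        self.value, self.parent, self.action = value, parent, action
        self.children, self.visits, self.q = [], 0, 0.0

def ucb(child, parent_visits, c=1.4):
    # Standard UCB1: balance average Q-value against exploration.
    if child.visits == 0:
        return float("inf")
    return child.q / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def rollout(value, depth):
    # Random playout; reward 1.0 only if the target is reached.
    for _ in range(MAX_DEPTH - depth):
        if value == TARGET:
            return 1.0
        value = random.choice(ACTIONS)[1](value)
    return 1.0 if value == TARGET else 0.0

def mcts(root, iterations=2000):
    for _ in range(iterations):
        node, depth = root, 0
        # Selection: walk down the tree by UCB until reaching a leaf.
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
            depth += 1
        # Expansion: one child per possible single-step action.
        if depth < MAX_DEPTH and node.value != TARGET:
            node.children = [Node(f(node.value), node, name) for name, f in ACTIONS]
        # Simulation + backpropagation: fold the reward into each Q-value.
        reward = rollout(node.value, depth)
        while node:
            node.visits += 1
            node.q += reward
            node = node.parent

random.seed(0)
root = Node(1)
mcts(root)
# Read off the most promising first step by average Q-value.
best = max(root.children, key=lambda ch: ch.q / max(ch.visits, 1))
print(best.action)
```

In rStar-Math the rollouts are reasoning trajectories produced by a language model rather than random arithmetic, but the select-expand-simulate-backpropagate cycle is the same.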
However, they did not just apply MCTS as other researchers have done. Instead, in a stroke of brilliance, they also asked the model they trained to always output its "chain of thought" reasoning steps as both natural-language descriptions and Python code.
They mandated that the model embed its natural-language reasoning as comments in the Python code, and only outputs whose Python code executed successfully were used to train the model.
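This execution filter can be sketched as follows. The candidate steps below are hypothetical examples, not data from the paper; the point is that a step whose code raises an error is discarded rather than trained on.

```python
# Illustrative sketch, not the paper's released code: each candidate
# reasoning step pairs a natural-language rationale (as a Python comment)
# with executable code, and only steps that run cleanly are kept.
candidate_steps = [
    "# The total cost is 3 items at $4 each\ntotal = 3 * 4",
    "# Buggy step: references an undefined variable\ntotal = price * 3",
]

def step_executes(code: str, state: dict) -> bool:
    """Run one candidate step in `state`; True only if it executes cleanly."""
    try:
        exec(code, {}, state)
        return True
    except Exception:
        return False

kept = []
state = {}
for step in candidate_steps:
    trial = dict(state)          # don't corrupt state with failed steps
    if step_executes(step, trial):
        state = trial            # commit the verified step's effects
        kept.append(step)

print(len(kept))  # → 1: only the executable step survives
```

Executing each step gives a cheap, automatic correctness signal, which is what lets the pipeline filter its own training data without human labels.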

The researchers also trained a "policy model" to generate mathematical reasoning steps and a process preference model (PPM) to select the most promising steps toward solving the problem, then improved both over four rounds of "self-evolution," with each model improving the other.
For their initial data, the researchers said they used 747,000 math word problems from publicly available sources, along with their solutions, but generated new solution steps for them using the two models described above.
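The interplay of the two models over the four rounds can be shown schematically. Everything below is a stand-in to illustrate the data flow, not the paper's implementation: the policy proposes candidate steps, the PPM selects the best ones, and the selected trajectories become the training data that improves both models for the next round.

```python
# Schematic sketch of the four-round self-evolution loop; all functions
# are stubs standing in for the policy model and PPM described above.

def generate_steps(policy, problem):
    # Policy model proposes candidate one-step continuations (stubbed).
    return [f"{problem}/step{i}@round{policy['round']}" for i in range(3)]

def ppm_score(ppm, step):
    # Process preference model ranks candidate steps (stubbed scoring).
    return ppm["strength"] + len(step)

def self_evolve(problems, rounds=4):
    policy, ppm = {"round": 0}, {"strength": 0}
    for r in range(1, rounds + 1):
        policy["round"] = r
        training_data = []
        for problem in problems:
            candidates = generate_steps(policy, problem)
            # The PPM picks the most promising step to keep as training data.
            best = max(candidates, key=lambda s: ppm_score(ppm, s))
            training_data.append(best)
        # Each round, the kept trajectories "retrain" both models (stubbed),
        # so the policy and the PPM bootstrap each other.
        ppm["strength"] += len(training_data)
    return policy, ppm

policy, ppm = self_evolve(["p1", "p2"])
print(policy["round"], ppm["strength"])  # → 4 8
```

In the real system the "retraining" updates neural network weights rather than a counter, but the round structure, with each model's output curating the other's training set, is the core of the self-evolution recipe.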
Record results
After four rounds of self-evolution, rStar-Math achieved significant milestones:
• On the MATH benchmark, the accuracy of the Qwen2.5-Math-7B model jumped from 58.8% to 90.0%, outperforming OpenAI's o1-preview.
• On the American Invitational Mathematics Examination (AIME), it solved 53.3% of problems, placing it among the top 20% of high school competitors.
These results highlight the power of SLM techniques in handling complex mathematical reasoning, a domain traditionally dominated by larger systems.
Smaller is better?
In recent years, AI innovation has been largely driven by scaling language models, with increasing parameters seen as a way to improve performance. However, the high costs associated with such large models, from computational resources to power consumption, have raised questions about scalability.
Microsoft offers an alternative path focused on efficiency. The rStar-Math release underscores this commitment by showing how SLMs can rival, and in some cases exceed, the capabilities of their larger counterparts.
Microsoft’s dual releases of the Phi-4 and the rStar-Math paper suggest that compact and specialized models can provide powerful alternatives to the largest systems in the industry.
Moreover, by outperforming larger competitors on key benchmarks, these models challenge the notion that bigger is always better. They open the door for mid-sized organizations and academic researchers to access cutting-edge capabilities without the financial or environmental burden of massive models.