Nvidia's new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at less than half its size





Even as Meta fields questions and criticism over its new Llama 4 family, GPU giant Nvidia has released a new, fully open source large language model (LLM) based on Meta's Llama-3.1-405B-Instruct model, and it claims near top performance on a variety of third-party benchmarks, outperforming the open source model DeepSeek R1.

Llama-3.1-Nemotron-Ultra-253B-v1 is a dense 253-billion-parameter model designed to support advanced reasoning, instruction following, and AI assistant workflows. It was first mentioned at NVIDIA's annual GPU Technology Conference (GTC) in March.

The release reflects NVIDIA's continued focus on improving performance through architectural innovation and targeted post-training.

Announced last night, April 7, 2025, the model code is now publicly available on Hugging Face, with open weights and post-training data. It is designed to operate efficiently in both "reasoning on" and "reasoning off" modes, allowing developers to switch between complex, high-effort reasoning tasks and more straightforward outputs based on system prompts.

Designed for efficient inference

Llama-3.1-Nemotron-Ultra-253B builds on NVIDIA's previous work in inference-optimized LLM development. Its architecture, customized through Neural Architecture Search (NAS), introduces structural variations such as skipped attention layers, fused feedforward networks (FFNs), and variable FFN compression ratios.

This architectural overhaul reduces memory footprint and computational demands without severely impacting output quality, enabling deployment on a single 8x H100 GPU node.

The result, according to NVIDIA, is a model that offers strong performance while being more cost-effective to deploy in data center environments. Additional hardware compatibility includes support for NVIDIA's B100 and Hopper microarchitectures, with configurations validated in both BF16 and FP8 precision modes.
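For context, a rough back-of-the-envelope calculation (a sketch, not NVIDIA's published sizing) illustrates why a dense 253B-parameter model fits on a single 8x H100 node in these precision modes:

```python
# Rough weight-memory estimate for a dense 253B-parameter model.
# Illustrative arithmetic only; real deployments also need memory for
# activations, the KV cache, and framework overhead.

PARAMS = 253e9          # 253 billion parameters
BYTES_BF16 = 2          # BF16 stores each weight in 2 bytes
BYTES_FP8 = 1           # FP8 stores each weight in 1 byte
H100_HBM_GB = 80        # HBM per H100 GPU
GPUS = 8                # single 8x H100 node

weights_bf16_gb = PARAMS * BYTES_BF16 / 1e9   # ~506 GB
weights_fp8_gb = PARAMS * BYTES_FP8 / 1e9     # ~253 GB
node_memory_gb = H100_HBM_GB * GPUS           # 640 GB total

print(f"BF16 weights: ~{weights_bf16_gb:.0f} GB of {node_memory_gb} GB node HBM")
print(f"FP8 weights:  ~{weights_fp8_gb:.0f} GB of {node_memory_gb} GB node HBM")
```

In BF16 the weights alone consume roughly 506 GB of the node's 640 GB of HBM, which is why the NAS-driven compression and FP8 support matter for leaving headroom for activations and the KV cache.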

Post-training for reasoning and alignment

NVIDIA strengthened the base model through a multi-phase post-training pipeline. This included supervised fine-tuning across domains such as math, code generation, chat, and tool use, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to further boost instruction-following and reasoning performance.
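GRPO, introduced by the DeepSeek team, scores a group of sampled responses per prompt and uses the group's own statistics as the baseline instead of a separate value model. A minimal sketch of that advantage computation (illustrative only, not NVIDIA's training code) looks like this:

```python
from statistics import mean, stdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled response's reward
    against the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]          # no learning signal if all rewards tie
    return [(r - mu) / sigma for r in rewards]

# Example: four responses sampled for the same math prompt, scored 0/1 for correctness.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # correct answers get positive advantage
```

The policy is then updated to favor responses with positive advantage, which is what drives the gains in instruction following and reasoning described above.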

The model also underwent knowledge distillation over 65 billion tokens, followed by continued pretraining on an additional 88 billion tokens.

Training data included sources such as FineWeb, Buzz-V1.2, and Dolma. Post-training prompts and responses were drawn from a combination of public corpora and synthetic generation methods, including datasets that taught the model to differentiate between its reasoning modes.

Improved performance across domains and benchmarks

Evaluation results show notable gains when the model operates in reasoning-enabled mode. For example, on the MATH500 benchmark, performance increased from 80.40% in standard mode to 97.00% with reasoning enabled.

Likewise, results on the AIME25 benchmark rose from 16.67% to 72.50%, and LiveCodeBench scores more than doubled, jumping from 29.03% to 66.31%.

Performance gains were also observed in tool-based tasks such as BFCL V2 and function calling, as well as in general question answering (GPQA), where the model scored 76.01% in reasoning mode compared to 56.60% without.

These benchmarks were run with a maximum sequence length of 32,000 tokens, and each test was repeated up to 16 times to ensure accuracy.

Compared to DeepSeek R1, a state-of-the-art mixture-of-experts (MoE) model with 671 billion parameters, Llama-3.1-Nemotron-Ultra-253B shows competitive results despite having less than half the number of parameters (the model's internal settings), outperforming it on tasks such as GPQA (76.01 vs. 71.5), IFEval instruction following (89.45 vs. 83.3), and LiveCodeBench coding (66.31).

Meanwhile, DeepSeek R1 holds a clear advantage on certain math evaluations, especially AIME25 (79.8 vs. 72.50), and slightly edges out MATH500 (97.3 vs. 97.00).

These results suggest that despite being a dense model, NVIDIA's offering matches or exceeds MoE alternatives on general reasoning and instruction-alignment tasks, while trailing slightly in math-heavy categories.

Use and integration

The model is compatible with the Hugging Face Transformers library (version 4.48.3) and supports input and output sequences of up to 128,000 tokens.

Developers can control reasoning behavior through system prompts and select decoding strategies based on task requirements.

For reasoning tasks, NVIDIA recommends temperature sampling (0.6) with a top-p value of 0.95. For deterministic outputs, greedy decoding is preferred.
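A minimal usage sketch with the Hugging Face Transformers library follows. The repository ID, the exact system-prompt wording for toggling reasoning, and the loading settings are assumptions drawn from NVIDIA's general guidance rather than verbatim from the model card, and running the full 253B model requires a multi-GPU node such as 8x H100.

```python
# Hedged sketch: loading the model and switching between reasoning modes.
# The repo ID and the "detailed thinking on/off" system prompt are assumptions;
# consult the Hugging Face model card for the exact strings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # BF16 weights; FP8 is also supported per NVIDIA
    device_map="auto",            # shard across available GPUs
    trust_remote_code=True,       # NAS-derived custom architecture (assumption)
)

def generate(prompt: str, reasoning: bool) -> str:
    # Toggle the reasoning mode through the system prompt.
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    if reasoning:
        # NVIDIA's recommended sampling settings for reasoning tasks.
        outputs = model.generate(inputs, max_new_tokens=1024,
                                 do_sample=True, temperature=0.6, top_p=0.95)
    else:
        # Greedy decoding for deterministic outputs.
        outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

print(generate("Prove that the sum of two even numbers is even.", reasoning=True))
```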

Llama-3.1-Nemotron-Ultra-253B supports multilingual applications, with capabilities in English and several additional languages, including German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

It is also suited to common LLM use cases such as chatbot development, AI agent workflows, retrieval-augmented generation (RAG), and code generation.

Licensed for commercial use

The model is released under the NVIDIA Open Model License, governed by the Llama 3.1 Community License Agreement, and is ready for commercial use.

NVIDIA stressed the importance of responsible AI development, encouraging teams to evaluate the model's alignment, safety, and bias for their specific use cases.

Oleksii Kuchaiev, director of AI model post-training at NVIDIA, shared the announcement on X, saying the team was excited to share the open release, describing it as a dense 253B model designed with reasoning on/off switching and released with open weights and data.



