DeepCoder delivers top coding performance in an efficient 14B model



Researchers at Together AI and Agentica have released DeepCoder-14B, a new coding model that delivers impressive performance comparable to leading proprietary models such as OpenAI’s o3-mini.

Built on top of DeepSeek-R1, the model offers more flexibility for integrating high-performance code generation and reasoning capabilities into real-world applications. Just as importantly, the teams have fully open-sourced the model, its training data, code, logs and system optimizations, which can help researchers improve their work and accelerate progress.

Competitive coding capabilities in a smaller package

The research team’s experiments show that DeepCoder-14B performs strongly across several difficult coding benchmarks, including LiveCodeBench (LCB), Codeforces and HumanEval+.

“Our model demonstrates strong performance across all coding benchmarks … comparable to the performance of o3-mini (low) and o1,” the researchers write in a blog post describing the model.

Interestingly, although it was trained primarily on coding tasks, the model shows improved mathematical reasoning, scoring 73.8% on the AIME 2024 benchmark, a 4.1% improvement over its base model (DeepSeek-R1-Distill-Qwen-14B). This suggests that reasoning skills developed through RL on code can generalize effectively to other domains.

DeepCoder-14B performance
Credit: Together AI

Perhaps the most striking aspect is that this level of performance is achieved with just 14 billion parameters. That makes DeepCoder much smaller and potentially far more efficient to run than many frontier models.

The innovations powering DeepCoder

While developing the model, the researchers solved some of the key challenges in training coding models using reinforcement learning (RL).

The first challenge was curating the training data. Reinforcement learning requires reliable reward signals indicating that the model’s output is correct. As the researchers note, “Unlike math, where abundant high-quality, verifiable data is readily available on the Internet, the coding domain suffers from a relative scarcity of such data.”

To address this problem, the DeepCoder team implemented a strict pipeline that gathers examples from different datasets and filters them for validity, complexity and duplication. This process yielded 24,000 high-quality problems, providing a solid foundation for effective RL training.
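As an illustration only, a filtering pass of this kind might look like the following Python sketch (the field names and thresholds are hypothetical and are not taken from the released pipeline):

```python
# Hypothetical sketch of a data-curation pass: keep only problems that are
# verifiable (enough reference tests), sufficiently hard, and not duplicates.
from typing import Iterable

def curate(problems: Iterable[dict], min_tests: int = 5) -> list[dict]:
    seen_statements = set()
    kept = []
    for p in problems:
        # Validity: the problem must ship enough unit tests to verify solutions.
        if len(p.get("tests", [])) < min_tests:
            continue
        # Complexity: drop trivial problems (placeholder difficulty field).
        if p.get("difficulty", 0) < 2:
            continue
        # Deduplication: skip near-identical problem statements.
        key = " ".join(p["statement"].lower().split())
        if key in seen_statements:
            continue
        seen_statements.add(key)
        kept.append(p)
    return kept
```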

The team also designed a straightforward reward function that provides a positive signal only if the generated code passes all sampled unit tests for the problem within a specific time limit. Combined with the high-quality training examples, this outcome-focused reward system prevents the model from learning tricks such as printing memorized answers for public tests or optimizing for simple edge cases without solving the core problem.
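As a rough illustration, a sparse, outcome-based reward of this kind could be sketched as follows (assuming Python stdin/stdout problems; this is not the team’s actual evaluator):

```python
# Hypothetical sketch: binary reward that is 1 only if the generated program
# passes every sampled unit test within the time limit, otherwise 0.
import subprocess
import sys

def compute_reward(generated_code: str, unit_tests: list[dict],
                   time_limit_s: float = 6.0) -> float:
    for test in unit_tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", generated_code],  # assumed stdin/stdout problem format
                input=test["input"], capture_output=True,
                text=True, timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        if result.returncode != 0 or result.stdout.strip() != test["expected"].strip():
            return 0.0  # no partial credit, no reward shaping
    return 1.0
```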

The model’s core training algorithm is based on Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that proved very successful in DeepSeek-R1. However, the team made several modifications to the algorithm to make it more stable and allow the model to keep improving as training extends over a longer period.
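GRPO’s core idea is to score each sampled solution relative to the other solutions sampled for the same prompt, rather than against a separate value model. A minimal illustrative sketch of that group-relative advantage step follows (baseline GRPO only; the team’s GRPO+ modifications are not reproduced here):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize each reward by the mean and std of
    the group of completions sampled for the same prompt.

    rewards: shape (num_completions,), one scalar reward per sampled solution.
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Example: four sampled solutions for one coding problem, binary rewards.
print(group_relative_advantages(np.array([1.0, 0.0, 0.0, 1.0])))
```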

GRPO+
GRPO+ enabled DeepCoder-14B to keep improving over longer training runs without collapsing. Credit: Together AI

Finally, the team expanded the model’s context window iteratively, first training it on shorter reasoning sequences and gradually increasing the length. They also developed a filtering method to avoid penalizing the model when it created reasoning chains that exceeded the context limit while solving a hard prompt.

Iterative context lengthening
DeepCoder was trained on problems with a 32K context but could also solve 64K-context problems. Credit: Together AI

The researchers explain the core idea: “To preserve long-context reasoning while enabling efficient training, we incorporated overlong filtering … This technique masks out truncated sequences during training so that models aren’t penalized for generating thoughtful but lengthy outputs that exceed the current context limit.”
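A minimal sketch of this idea, using a hypothetical per-sample loss mask rather than the team’s verl code, would zero out the RL loss for any rollout that hit the context limit before finishing:

```python
import torch

def overlong_filter_mask(response_lengths: torch.Tensor,
                         finished: torch.Tensor,
                         max_len: int) -> torch.Tensor:
    """Return a per-sample mask for the RL loss.

    response_lengths: (batch,) number of tokens each rollout generated.
    finished:         (batch,) bool, True if the rollout emitted an end token.
    A rollout cut off at the context limit gets mask 0, so the model is not
    penalized for a long but potentially valid chain of thought.
    """
    truncated = (~finished) & (response_lengths >= max_len)
    return (~truncated).float()

# Usage: multiply the per-sample policy loss by this mask before averaging.
```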

Training was gradually scaled from a 16K to a 32K context window, and the resulting model could also solve problems that required up to 64K tokens.

Speeding up long-context RL training

Training large models with RL, especially on tasks that require long generated sequences such as coding or complex reasoning, is slow and compute-intensive. A major bottleneck is the sampling step, where the model generates thousands of tokens per example in the batch. Variance in response length means some responses finish much later than others, leaving GPUs idle and slowing down the entire training loop.

To speed this up, the team developed verl-pipeline, an optimized extension of the open-source verl library for reinforcement learning from human feedback (RLHF). The key innovation, which they call “One-Off Pipelining,” rearranges response sampling and model updates to reduce bottlenecks and accelerator idle time.
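The general idea can be illustrated with a short, hypothetical Python sketch (the `sample_rollouts` and `train_step` callables are placeholders, not the verl-pipeline API): sampling for the next batch is launched with the current, one-step-stale policy while the trainer updates on the batch that just finished, so accelerators spend less time idle.

```python
# Hypothetical sketch of one-off pipelining: sampling for batch i+1 runs
# concurrently with training on batch i, instead of strictly alternating.
from concurrent.futures import ThreadPoolExecutor

def train_pipelined(prompts_per_step, num_steps, sample_rollouts, train_step):
    with ThreadPoolExecutor(max_workers=1) as sampler:
        # Kick off sampling for the first batch with the initial policy.
        future = sampler.submit(sample_rollouts, prompts_per_step[0])
        for step in range(num_steps):
            rollouts = future.result()  # wait only if sampling is still running
            if step + 1 < num_steps:
                # Start sampling the next batch with the current (one-step-stale)
                # policy while this batch is used for the model update.
                future = sampler.submit(sample_rollouts, prompts_per_step[step + 1])
            train_step(rollouts)        # policy update overlaps the next sampling
```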

One-Off Pipelining

Their experiments showed that one-off pipelining provides up to a 2x speedup for coding RL tasks compared to baseline implementations. This optimization was crucial for training DeepCoder within a reasonable time frame (2.5 weeks on 32 H100s) and is now open-sourced as part of verl-pipeline for the community to use and build on.

Enterprise impact

The researchers have made all the artifacts for training and running DeepCoder-14B available on GitHub and Hugging Face under a permissive license.

“By fully sharing our dataset, code and training recipe, we empower the community to reproduce our work and make RL training accessible to all,” the researchers write.

DeepCoder-14B powerfully illustrates a broader, accelerating trend in the AI landscape: the rise of highly capable yet efficient and openly accessible models.

For the enterprise world, this shift means more options and greater accessibility for advanced models. Cutting-edge performance is no longer reserved only for hyperscalers or those willing to pay premium API fees. Models like DeepCoder can empower organizations of all sizes to leverage sophisticated code generation and reasoning, customize solutions to their specific needs, and deploy them securely within their own environments.

This trend can lower the barrier to entry for AI adoption and foster a more competitive and innovative ecosystem, where progress is driven by open-source collaboration.
