Chinese AI startup DeepSeek, known for challenging leading AI vendors with its innovative open source technologies, today launched a new ultra-large model: DeepSeek-V3.
Available via Hugging Face under the company's licensing agreement, the new model comes with 671B parameters but uses a mixture-of-experts architecture to activate only select parameters, in order to handle given tasks accurately and efficiently. According to benchmarks shared by DeepSeek, the offering is already topping the charts, outperforming leading open-source models, including Meta's Llama-3.1-405B, and closely matching the performance of closed models from Anthropic and OpenAI.
The release marks another major development closing the gap between closed and open-source AI. Ultimately, DeepSeek, which started as an offshoot of Chinese quantitative hedge fund High-Flyer Capital Management, hopes these developments will pave the way for artificial general intelligence (AGI), where models will have the ability to understand or learn any intellectual task that a human being can.
What does DeepSeek-V3 bring to the table?
Just like its predecessor, DeepSeek-V2, the new ultra-large model uses the same core architecture built around multi-head latent attention (MLA) and DeepSeekMoE. This approach ensures efficient training and inference, with specialized and shared "experts" (smaller individual neural networks within the larger model) activating 37B of the 671B parameters for each token.
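The idea of activating only a subset of parameters per token can be illustrated with a minimal top-k routing sketch. This is not DeepSeek-V3's actual implementation; all sizes, names and the gating scheme here are simplified illustrations of how a mixture-of-experts layer selects experts.

```python
import numpy as np

# Toy sketch of top-k expert routing in a mixture-of-experts layer.
# All sizes are illustrative, not DeepSeek-V3's real configuration.
rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02  # gating network

def moe_forward(x):
    """Route a token vector to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
# Only top_k of n_experts run per token, so the active parameter count is a
# fraction of the total -- the same principle behind 37B active of 671B.
```

Because the unselected experts are never computed, per-token cost scales with the active parameters rather than the full model size.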
While the infrastructure ensures solid performance for DeepSeek-V3, the company has also launched two innovations to push the bar further.
The first is an auxiliary loss-free load-balancing strategy. It dynamically monitors and adjusts the load on experts to utilize them in a balanced way without compromising overall model performance. The second is multi-token prediction (MTP), which allows the model to predict multiple future tokens simultaneously. This innovation not only enhances training efficiency but enables the model to run three times faster, generating 60 tokens per second.
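Conceptually, multi-token prediction replaces a single next-token head with several heads that each predict one of the next few positions. The sketch below is a deliberately simplified illustration of that idea, not DeepSeek-V3's MTP module; the head structure and sizes are assumptions for demonstration only.

```python
import numpy as np

# Toy illustration of multi-token prediction (MTP): several output heads
# each predict one of the next n_future tokens from the same hidden state.
# Sizes and the shared-hidden-state simplification are illustrative only.
rng = np.random.default_rng(1)

d_model, vocab, n_future = 16, 100, 3
heads = [rng.standard_normal((d_model, vocab)) * 0.02 for _ in range(n_future)]

def predict_next_tokens(hidden):
    """Return one predicted token id per future position."""
    return [int(np.argmax(hidden @ h)) for h in heads]

hidden_state = rng.standard_normal(d_model)
preds = predict_next_tokens(hidden_state)  # n_future token ids in one pass
```

Predicting several positions per forward pass is what allows this style of decoding to raise throughput relative to strictly one-token-at-a-time generation.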
“During pre-training, we trained DeepSeek-V3 on 14.8T high-quality and diverse tokens… Next, we conducted two-stage context length extension for DeepSeek-V3,” the company wrote in a technical paper detailing the new model. “In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conducted post-training, including supervised fine-tuning (SFT) and reinforcement learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, while carefully maintaining the balance between model accuracy and generation length.”
Notably, during the training phase, DeepSeek used several hardware and algorithmic optimizations, including the FP8 mixed-precision training framework and the DualPipe algorithm for pipeline parallelism, to cut down on the costs of the process.
Overall, the company claims to have completed DeepSeek-V3's entire training in about 2,788 thousand H800 GPU hours, or about $5.57 million, assuming a rental price of $2 per GPU hour. This is far less than the hundreds of millions of dollars usually spent pre-training large language models.
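The arithmetic behind the headline figure is straightforward: the reported GPU hours multiplied by the assumed rental rate.

```python
# Cost estimate from the figures in the text:
gpu_hours = 2_788_000   # ~2.788M H800 GPU hours reported by DeepSeek
rate = 2.0              # assumed rental price, $ per GPU hour
cost = gpu_hours * rate  # = $5,576,000, i.e. the ~$5.57M cited
```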
For example, it is estimated that Llama-3.1 was trained with an investment of more than $500 million.
The most powerful open source model currently available
Despite the economical training, DeepSeek-V3 has emerged as the strongest open-source model on the market.
The company ran multiple benchmarks to compare the AI's performance and noted that it convincingly outperforms leading open models, including Llama-3.1-405B and Qwen 2.5-72B. It even outperforms the closed-source GPT-4o on most benchmarks, with the exception of the English-focused SimpleQA and FRAMES, where the OpenAI model delivered with scores of 38.2 and 80.5 (vs. 24.9 and 73.3), respectively.
Notably, DeepSeek-V3's performance particularly stood out on the Chinese and math-centric benchmarks, where it scored better than all of its counterparts. On the Math-500 test, it scored 90.2, with Qwen's score of 80 the next best.
The only model that managed to challenge DeepSeek-V3 was Anthropic's Claude 3.5 Sonnet, which outperformed it with higher scores on MMLU-Pro, IF-Eval, GPQA-Diamond, SWE-bench Verified and Aider-Edit.
Introducing DeepSeek-V3!

Biggest leap forward yet: 60 tokens/second (3x faster than V2!)

Enhanced capabilities

API compatibility intact

Fully open-source models & papers

1/n pic.twitter.com/p1dV9gJ2Sd

— DeepSeek (@deepseek_ai) December 26, 2024
The work shows that open source is closing in on closed-source models, promising roughly equivalent performance across different tasks. The development of such systems is extremely good for the industry as it potentially eliminates the chances of one big AI player dominating the game. It also gives enterprises multiple options to choose from and work with while orchestrating their stacks.
Currently, the code for DeepSeek-V3 is available via GitHub under the MIT license, while the model itself is provided under the company's model license. Enterprises can also test the new model via DeepSeek Chat, a ChatGPT-like platform, and access the API for commercial use. DeepSeek is providing the API at the same price as DeepSeek-V2 until February 8. After that, it will charge $0.27/million tokens for input ($0.07/million tokens with cache hits) and $1.10/million tokens for output.
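Given the rates in the text, estimating a workload's API bill is a simple per-million-token calculation. The helper below is an illustrative sketch based on the post-February-8 prices quoted above; the function name and interface are invented for this example.

```python
def api_cost(input_tokens, output_tokens, cached=False):
    """Estimate DeepSeek-V3 API cost in dollars at the quoted rates.

    Rates from the article: $0.27/M input tokens ($0.07/M on cache hits)
    and $1.10/M output tokens.
    """
    in_rate = 0.07 if cached else 0.27   # $ per million input tokens
    out_rate = 1.10                      # $ per million output tokens
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: 1M input tokens + 1M output tokens, no cache hits:
cost = api_cost(1_000_000, 1_000_000)  # 0.27 + 1.10 = $1.37
```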