DeepSeek’s success shows why motivation is the key to AI innovation


By Debasish Ray Chawdhuri




January 2025 shook the AI world. OpenAI and the seemingly unassailable American technology giants were stunned by what we can fairly call an underdog in the field of large language models (LLMs). DeepSeek, a Chinese company that was on nobody’s radar, suddenly challenged OpenAI. It is not that DeepSeek-R1 was better than the top models from the American giants; it was slightly behind on benchmarks. But it suddenly made everyone think about efficiency in terms of hardware and energy use.

Given the unavailability of the most advanced hardware, DeepSeek appears to have been motivated to innovate in the area of efficiency, which was a lesser concern for the bigger players. OpenAI has claimed it has evidence suggesting DeepSeek may have used its models for training, but there is no concrete proof to support this. So whether that is true, or whether OpenAI is simply trying to appease its investors, is a topic of debate. However, DeepSeek has published its work, and people have verified that the results are reproducible, at least at a much smaller scale.

But how could DeepSeek achieve such cost savings while American companies could not? The short answer is simple: they had more motivation. The long answer requires a little more technical explanation.

How DeepSeek used KV-cache optimization

One important GPU cost saving was the optimization of the key-value (KV) cache used in every attention layer of an LLM.

LLMs are made up of transformer blocks, each of which comprises an attention layer followed by a regular vanilla feed-forward network. The feed-forward network can in principle model arbitrary relationships, but in practice it struggles to always determine patterns in the data on its own. The attention layer solves this problem for language modeling.

The model processes text using tokens, but for simplicity we will refer to them as words. In an LLM, each word is assigned a vector in a high-dimensional space (say, a few thousand dimensions). Conceptually, each dimension represents a concept, like being hot or cold, being green, being soft, being a noun. A word’s vector representation is its meaning expressed as values along each of these dimensions.
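As a minimal illustration of this idea, here is a toy embedding table in Python. The dimension labels and values are made up purely for intuition; in a real LLM the embedding matrix is learned during training and its dimensions are not human-interpretable.

```python
import numpy as np

# Toy "embedding table": a real LLM uses a learned matrix of shape
# (vocab_size, d_model) with d_model in the thousands.
vocab = {"apple": 0, "green": 1, "table": 2}
# Columns (purely illustrative): [hotness, green-ness, softness, noun-ness]
embeddings = np.array([
    [0.1, 0.3, 0.4, 0.9],   # "apple"
    [0.0, 0.9, 0.1, 0.2],   # "green"
    [0.0, 0.1, 0.0, 0.9],   # "table"
])

def embed(word: str) -> np.ndarray:
    """Look up the vector that represents the word's meaning."""
    return embeddings[vocab[word]]

print(embed("apple"))   # [0.1 0.3 0.4 0.9]
```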

However, our language allows other words to modify the meaning of a word. For example, an apple has a meaning. But we can have a green apple as a modified version. A more extreme example of modification would be that an apple in an iPhone context is different from an apple in a meadow context. How do we let our system modify the vector meaning of a word based on another word? This is where attention comes in.

The attention model assigns two additional vectors to each word: a key and a query. The query represents the qualities of a word’s meaning that can be modified, and the key represents the type of modifications it can provide to other words. For example, the word “green” can provide information about color and green-ness, so the key of the word “green” will have a high value on the “green-ness” dimension. On the other hand, the word “apple” can be green or not, so the query vector of “apple” will also have a high value on the green-ness dimension. If we take the dot product of the key of “green” with the query of “apple”, the product should be relatively large compared to the product of the key of “table” and the query of “apple”. The attention layer then adds a small fraction of the value of the word “green” to the value of the word “apple”. This way, the value of the word “apple” is modified to be a little greener.
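To make the dot-product intuition concrete, here is a minimal sketch of single-head attention in Python with NumPy. The dimension sizes and random projection matrices are illustrative assumptions, not anyone’s actual trained weights.

```python
import numpy as np

d_model, d_head = 16, 8          # toy sizes; real LLMs use thousands of dimensions
rng = np.random.default_rng(0)

# Illustrative projection matrices (learned during training in a real model)
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

def attention(x):
    """x: (seq_len, d_model) word vectors -> contextualized vectors."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_head)           # dot products: query . key
    # Causal mask: a word may only attend to itself and earlier words
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # mix in fractions of other words' values

x = rng.normal(size=(3, d_model))   # e.g. "a", "green", "apple"
print(attention(x).shape)           # (3, 8)
```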

When an LLM generates text, it does so one word at a time. When it generates a word, all the previously generated words become part of its context. However, the keys and values of those words have already been computed. When a new word is added to the context, its value needs to be computed based on its own query and the keys and values of all the previous words. That is why all these values are stored in GPU memory. This is the KV cache.
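Here is a minimal sketch of why the KV cache exists: during generation, the keys and values of earlier tokens are computed once and reused, so each step only computes the projections of the newest token. The class and variable names are illustrative, not from any particular library.

```python
import numpy as np

d_model, d_head = 16, 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

class KVCache:
    """Keeps the keys and values of all previously generated tokens in memory."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_new):
        """Process one newly generated token; reuse cached keys/values of the past."""
        q = x_new @ W_q                       # only the new token's projections
        self.keys.append(x_new @ W_k)         # are computed at this step
        self.values.append(x_new @ W_v)
        K, V = np.stack(self.keys), np.stack(self.values)
        scores = K @ q / np.sqrt(d_head)      # attend over the whole context so far
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V

cache = KVCache()
for _ in range(3):                            # generate three tokens one at a time
    out = cache.step(rng.normal(size=d_model))
print(out.shape)                              # (8,)
```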

DeepSeek’s insight was that the key and the value of a word are related. The meaning of the word green and its ability to affect greenness are obviously closely related. So it is possible to compress both into a single (and perhaps smaller) vector and decompress it very easily during processing. DeepSeek found that this slightly affects performance on benchmarks, but it saves a lot of GPU memory.
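Below is a simplified sketch in the spirit of what the article describes: caching one small latent vector per token instead of a full key and value, and reconstructing both on the fly. This is only an illustration of the compression idea; DeepSeek’s actual multi-head latent attention has more moving parts, and all sizes and matrix names here are assumptions.

```python
import numpy as np

d_model, d_head, d_latent = 16, 8, 4   # toy sizes; d_latent < 2 * d_head
rng = np.random.default_rng(0)

# Down-projection: one small latent per token instead of a separate key and value
W_down = rng.normal(size=(d_model, d_latent))
# Up-projections reconstruct the key and value from the latent when needed
W_up_k = rng.normal(size=(d_latent, d_head))
W_up_v = rng.normal(size=(d_latent, d_head))

def compress(x_token):
    """Cache only this small latent vector instead of a full key and value."""
    return x_token @ W_down            # shape (d_latent,)

def decompress(latent):
    """Recover an (approximate) key and value during attention computation."""
    return latent @ W_up_k, latent @ W_up_v

latent = compress(rng.normal(size=d_model))
k, v = decompress(latent)
# Cache shrinks from 2 * d_head = 16 floats per token to d_latent = 4
print(latent.shape, k.shape, v.shape)  # (4,) (8,) (8,)
```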

DeepSeek’s application of mixture-of-experts

The nature of a neural network is that the entire network needs to be evaluated (or computed) for every query. However, not all of this is useful computation. Knowledge of the world sits in the weights, or parameters, of the network. Knowledge about the Eiffel Tower is not used to answer questions about the history of South American tribes. Knowing that an apple is a fruit is not useful while answering questions about the general theory of relativity. Yet when the network is computed, all parts of it are processed regardless. This incurs huge computational costs during text generation that should ideally be avoided. This is where the idea of mixture-of-experts (MoE) comes in.

In an MoE model, the neural network is divided into multiple smaller networks called experts. Note that the “expert” in the subject matter is not explicitly defined; the network figures it out during training. However, the network assigns a relevance score to each query and activates only the parts with the highest matching scores. This provides huge savings in computation. Note that some questions need expertise in multiple areas to be answered properly, and the performance on such queries will be degraded. However, because the areas are discovered from the data, the number of such questions is minimized.
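Here is a minimal sketch of mixture-of-experts routing with top-k gating. The sizes, the softmax router and the single-matrix “experts” are illustrative assumptions; real MoE layers use full feed-forward experts and add details such as load-balancing losses that are not shown.

```python
import numpy as np

d_model, n_experts, top_k = 16, 8, 2
rng = np.random.default_rng(0)

# Each "expert" is a small feed-forward network (here just one matrix for brevity)
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
W_router = rng.normal(size=(d_model, n_experts))

def moe_layer(x):
    """Route the token vector x to only its top-k experts; skip the rest."""
    logits = x @ W_router
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()
    chosen = np.argsort(gates)[-top_k:]            # indices of the most relevant experts
    out = np.zeros_like(x)
    for i in chosen:                               # only top_k of n_experts are computed
        out += gates[i] * (x @ experts[i])
    return out

print(moe_layer(rng.normal(size=d_model)).shape)   # (16,)
```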

The importance of reinforcement learning

An LLM is taught to think through a chain of thought, with the model fine-tuned to imitate thinking before delivering the answer. The model is asked to verbalize its thought (generate the thought before generating the answer). The model is then evaluated on both the thought and the answer, and trained with reinforcement learning (rewarded for a correct match and penalized for an incorrect match with the training data).

This requires expensive training data annotated with the thoughts. DeepSeek instead only asked the system to generate its thoughts between the tags <think> and </think> and to generate the answers between the tags <answer> and </answer>. The model is rewarded or penalized purely on form (the use of the tags) and on the answers matching. This required far less expensive training data. During the early phase of RL, the model tried generating very little thought, which resulted in incorrect answers. Eventually, the model learned to generate long and coherent thoughts, which is what DeepSeek calls the “a-ha” moment. After this point, the quality of the answers improved a great deal.
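A minimal sketch of the kind of rule-based reward this describes: one term for using the <think>/<answer> tags correctly and one for the final answer matching a reference. The tag names follow the article; the exact scoring weights are illustrative assumptions, not DeepSeek’s published values.

```python
import re

def reward(completion: str, reference_answer: str) -> float:
    """Score a completion on format (tag usage) and answer correctness."""
    score = 0.0

    # Format reward: the model must wrap its reasoning and its answer in tags.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.S):
        score += 0.5

    # Accuracy reward: the content of <answer> must match the reference.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    if m and m.group(1).strip() == reference_answer.strip():
        score += 1.0

    return score

print(reward("<think>2 + 2 is 4</think><answer>4</answer>", "4"))  # 1.5
```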

DeepSeek employs several additional optimization tricks. However, they are highly technical, so I will not delve into them here.

Final thoughts about DeepSeek and the larger market

In any technology research, we first need to explore what is possible before improving efficiency. This is a natural progression. DeepSeek’s contribution to the LLM landscape is enormous. Its academic contribution cannot be ignored, whether or not it was trained using OpenAI output. It can also transform the way startups operate. But there is no reason for OpenAI or the other American giants to despair. This is how research works: one group benefits from the research of other groups. DeepSeek certainly benefited from the earlier research carried out by Google, OpenAI and numerous other researchers.

However, the idea that OpenAI will dominate the LLM world indefinitely now looks very unlikely. No amount of regulatory pressure or finger-pointing will change that. The technology is already in the hands of many and out in the open, which makes its progress unstoppable. Although this may be a bit of a headache for OpenAI’s investors, it is ultimately a win for the rest of us. While the future belongs to many, we will always be grateful to early contributors such as Google and OpenAI.

Debasish Ray Chawdhuri is a senior principal engineer at Talentica Software.



