Language models can generalize better when they are left to develop their own solutions, according to a new study by the University of Hong Kong and the University of California, Berkeley. The findings, which apply to both large language models (LLMs) and vision-language models (VLMs), challenge one of the main beliefs of the LLM community: that models require hand-labeled training examples. In fact, the researchers show that training models on too many hand-crafted examples can harm the model's ability to generalize to unseen data.
SFT vs. RL in model training
For a long time, supervised fine-tuning (SFT) has been the gold standard for training LLMs and VLMs. Once a model is pre-trained on raw text and image data, companies and AI labs usually fine-tune it on a large dataset of hand-crafted examples in question/answer or request/response format. After SFT, the model can undergo additional training stages, such as reinforcement learning from human feedback (RLHF), where the model tries to learn implicit human preferences based on signals such as answer rankings or approval/rejection of the model's responses.
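To make the distinction concrete, here is a hypothetical sketch of what a single hand-crafted SFT record looks like in request/response format; the field names and content are illustrative, not taken from any specific dataset.

```python
# Illustrative SFT record (field names and content are assumptions).
sft_example = {
    "prompt": "Given the cards 10, 4, 6 and 2, write an equation that equals 24.",
    "response": "(10 - 4) * (6 - 2) = 24",
}

# During SFT, the model is trained to reproduce `response` token by token
# given `prompt`; RLHF later adjusts behavior using preference signals
# (rankings, approvals) rather than literal target text.
```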
SFT is useful for steering the model's behavior toward the kinds of tasks its creators have designed it for. However, gathering that data is a slow and costly process, which makes it a bottleneck for many companies and labs.
Recent developments in LLMs have generated interest in pure reinforcement learning (RL) approaches, where the model is given a task and left to learn it on its own without hand-crafted examples. The most prominent case is DeepSeek-R1, the OpenAI o1 competitor that mostly used reinforcement learning to learn complex reasoning tasks.
Generalization vs. memorization
One of the key problems of machine learning (ML) systems is overfitting, where the model performs well on its training data but fails to generalize to unseen examples. During training, the model gives the false impression of having learned the task, when in practice it has memorized the training examples. In large and complex AI models, separating generalization from memorization can be difficult.
The new study focuses on the generalization capabilities of RL and SFT training in textual and visual reasoning tasks. For textual reasoning, an LLM trained on one set of rules should be able to generalize to variants of those rules. In visual reasoning, a VLM should remain consistent in task performance against changes in different aspects of the visual input, such as color and spatial layout.

In their experiments, the researchers used two representative tasks. The first was GeneralPoints, a benchmark that evaluates a model's arithmetic reasoning capabilities. The model is given four cards, as text descriptions or images, and is asked to combine them to reach a target number. To study rule-based generalization, the researchers trained the model on one set of rules, then evaluated it on a different rule. For visual generalization, they trained the model on cards of one color and tested its performance on other colors and numbering schemes.
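Because the task has a verifiable answer, correctness can be checked automatically. The sketch below shows one minimal way such a check could work, assuming a target of 24 and plain integer card values; it is an illustration, not the paper's actual reward implementation.

```python
# Minimal sketch of a GeneralPoints-style correctness check (target number
# and reward values are assumptions, not the paper's exact setup).
import ast
from collections import Counter

TARGET = 24  # assumed target number

def card_values(expr: str) -> list[int]:
    """Extract the integer literals used in the model's proposed equation."""
    return [n.value for n in ast.walk(ast.parse(expr, mode="eval"))
            if isinstance(n, ast.Constant) and isinstance(n.value, int)]

def reward(expr: str, cards: list[int]) -> float:
    """Return 1.0 if the equation uses exactly the dealt cards and hits the target."""
    try:
        uses_all_cards = Counter(card_values(expr)) == Counter(cards)
        hits_target = eval(expr, {"__builtins__": {}}) == TARGET
        return 1.0 if uses_all_cards and hits_target else 0.0
    except (SyntaxError, ZeroDivisionError, TypeError):
        return 0.0  # malformed or invalid equations earn no reward

print(reward("(10 - 4) * (6 - 2)", [10, 4, 6, 2]))  # 1.0
print(reward("10 + 4 + 6 + 2", [10, 4, 6, 2]))      # 0.0 (sums to 22)
```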
The second task was V-IRL, which tests the model's spatial reasoning capabilities in an open-world navigation domain that uses realistic visual input. This task also comes in pure-language and vision-language versions. The researchers evaluated generalization by changing the kind of visual instructions and representations the model was trained and tested on.

They ran their tests on Llama-3.2-Vision-11B, warming the model up by training it on a small SFT dataset, then creating separate versions for each task and training paradigm. For each task, they scaled up training separately with RL and SFT. The SFT process trains the model on hand-crafted solutions, while RL lets the model generate many solutions for each problem, evaluate the results and train itself on the correct answers.
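The sketch below illustrates the difference between the two regimes at the level of a single training step. The `policy`, `generate`, `verify` and `finetune_on` functions are stand-ins, not the authors' code, and the sampling count and filtering scheme are assumptions.

```python
# Illustrative contrast between SFT and outcome-reward RL training steps.
import random

def generate(policy, prompt: str, k: int) -> list[str]:
    """Stub: sample k candidate solutions from the current policy."""
    return [f"{prompt} -> candidate_{i}_{random.random():.2f}" for i in range(k)]

def verify(candidate: str) -> bool:
    """Stub: automatic checker (e.g., a GeneralPoints-style reward)."""
    return random.random() < 0.3

def finetune_on(policy, examples: list[tuple[str, str]]) -> None:
    """Stub: one gradient update on (prompt, solution) pairs."""
    print(f"updating on {len(examples)} examples")

def sft_step(policy, batch: list[tuple[str, str]]) -> None:
    # SFT: imitate hand-crafted gold solutions verbatim.
    finetune_on(policy, batch)

def rl_step(policy, prompts: list[str], k: int = 8) -> None:
    # RL (outcome-reward style): sample several solutions per prompt,
    # keep only those the verifier accepts, and reinforce them.
    accepted = []
    for prompt in prompts:
        accepted += [(prompt, c) for c in generate(policy, prompt, k) if verify(c)]
    if accepted:
        finetune_on(policy, accepted)

rl_step(policy=None, prompts=["cards: 10 4 6 2", "cards: 3 3 8 8"])
```

The key design difference is where the supervision comes from: SFT copies human-written solutions, while RL relies only on an automatic check of the model's own attempts.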
The results show that reinforcement learning consistently improves performance on examples that differ substantially from the training data. SFT, on the other hand, appears to memorize the training rules and fails to generalize to out-of-distribution examples. These observations hold in both text-only and multimodal settings.

Implications for real-world applications
While their experiments show that RL generalizes better than SFT, the researchers also found that SFT is helpful for stabilizing the model's output format, which is critical for enabling RL to achieve its performance gains. Without the initial SFT stage, they found, RL training did not achieve desirable results.
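One plausible reason format matters is that an automatic reward can only score answers it can parse. The tag format in the sketch below is an assumption for illustration, not the format used in the paper.

```python
# Illustrative sketch: a verifier can only reward outputs it can parse,
# so a consistent output format (here, an assumed <answer> tag) is what
# the SFT warm-up would instill before RL begins.
import re

ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_answer(output: str) -> str | None:
    """Return the content of the <answer> tag, or None if the format is wrong."""
    match = ANSWER_PATTERN.search(output)
    return match.group(1).strip() if match else None

# A model that never learned the format gives the reward nothing to score:
print(parse_answer("The result is 24."))                    # None -> no reward signal
print(parse_answer("<answer>(10 - 4) * (6 - 2)</answer>"))  # "(10 - 4) * (6 - 2)"
```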
This differs somewhat from the results obtained with DeepSeek-R1-Zero, which was post-trained with pure RL. The researchers suggest the discrepancy could be due to the different backbone model used in their experiments.
It is clear that there is a lot of untapped potential in RL-heavy approaches. For use cases with verifiable results, letting models learn on their own can lead to unanticipated results that humans could not have crafted themselves. This could be very useful in settings where creating hand-crafted examples is tedious and expensive.