Meta's benchmarks for its new AI models are a bit misleading



One of the new AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a benchmark in which human raters compare model outputs and choose which they prefer. But the version of Maverick that Meta deployed to LM Arena appears to differ from the version that is widely available to developers.

As a number of AI researchers noted, Meta indicated in its announcement that the Maverick on LM Arena is an "experimental chat version." A chart on the official Llama website, meanwhile, disclosed that Meta's LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."

As we've written before, LM Arena has never been the most reliable measure of an AI model's performance, for a variety of reasons. But AI companies generally haven't customized or otherwise fine-tuned their models to score better on LM Arena, or at least haven't admitted to doing so.

The problem with tailoring a model to a benchmark, withholding that version, and then releasing a "vanilla" variant of the same model is that it makes it difficult for developers to predict exactly how the model will perform in particular contexts. It's also misleading. Ideally, benchmarks, woefully inadequate as they are, provide a snapshot of a single model's strengths and weaknesses across a range of tasks.

Indeed, researchers on X have observed stark differences in the behavior of the downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version appears to use a lot of emojis and gives incredibly long-winded answers.

We've reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.




