AI labs are increasingly relying on crowdsourced benchmarking platforms such as Chatbot Arena to probe the strengths and weaknesses of their latest models. But some experts say there are serious problems with this approach, from both an ethical and an academic perspective.
Over the past few years, labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate the capabilities of upcoming models. When a model scores favorably, the lab behind it often touts that score as evidence of a meaningful improvement.
It is a flawed approach, however, according to Emily Bender, a University of Washington linguistics professor and co-author of the book “The AI Con.” Bender takes particular issue with Chatbot Arena, which tasks volunteers with prompting two anonymous models and selecting the response they prefer.
“To be valid, a benchmark needs to measure something specific, and it needs to have construct validity: that is, there has to be evidence that the construct of interest is well defined and that the measurements actually relate to that construct,” Bender said. “Chatbot Arena hasn’t shown that voting for one output over another actually correlates with preferences, however they may be defined.”
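To make Bender’s point concrete, here is a minimal sketch of how pairwise votes like Chatbot Arena’s can be turned into a leaderboard. LMArena has publicly described computing Elo-style (and later Bradley-Terry) ratings from such votes; the code below is an illustrative Elo update, not LMArena’s actual implementation, and the K-factor, initial rating, and names are assumptions.

```python
# Illustrative sketch only: an Elo-style aggregation of blind pairwise votes
# into model ratings. Parameters and names are hypothetical, not LMArena's.

from collections import defaultdict

K = 32           # assumed K-factor controlling how much one vote moves a rating
INITIAL = 1000   # assumed starting rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(votes):
    """votes: iterable of (model_a, model_b, winner) tuples, where winner
    is 'a' or 'b' as chosen by a voter. Ties are ignored in this sketch."""
    ratings = defaultdict(lambda: float(INITIAL))
    for a, b, winner in votes:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = 1.0 if winner == "a" else 0.0
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

if __name__ == "__main__":
    sample = [("model-x", "model-y", "a"),
              ("model-y", "model-x", "b"),
              ("model-x", "model-z", "a")]
    print(update_ratings(sample))
```

Whatever the exact rating scheme, the ranking it produces is only as meaningful as the votes feeding it, which is precisely the construct-validity question Bender raises.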
Asmelash Teka Hadgu, co-founder of the AI firm Lesan and a fellow at the Distributed AI Research Institute, said he believes benchmarks like Chatbot Arena are being co-opted by AI labs to promote exaggerated claims. Hadgu pointed to a recent controversy involving Meta’s Llama 4 Maverick model: Meta tuned a version of Maverick to score well on Chatbot Arena, only to withhold that model in favor of releasing a worse-performing version.
“Benchmarks should be dynamic rather than static datasets,” Hadgu said.
Hadgu and Kristine Gloria, who formerly led the Aspen Institute’s emergent and intelligent technologies initiative, also argued that model evaluators should be compensated for their work. Gloria said AI labs should learn from the mistakes of the data labeling industry, which is notorious for its exploitative practices. (Some labs have been accused of the same.)
“In general, the crowdsourced benchmarking process is valuable, and it reminds me of citizen science initiatives,” Gloria said. “Ideally, it helps bring in additional perspectives to provide some depth in both the evaluation and fine-tuning of data. But benchmarks should never be the only metric for evaluation. With the industry and innovation moving quickly, benchmarks can rapidly become unreliable.”
Matt Fredrikson, CEO of Gray Swan AI, which runs crowdsourced red-teaming campaigns for models, acknowledged that public benchmarks are no substitute for paid, private evaluations. (Gray Swan also awards cash prizes for some tests.)
“[D]evelopers also need to rely on internal benchmarks, algorithmic red teams, and contracted red teamers who can take a more open-ended approach or bring specific domain expertise,” said Fredrikson. “It’s important for both model developers and benchmark creators, crowdsourced or otherwise, to communicate results clearly to those who follow, and to be responsive when those results are called into question.”
Alex Atallah, CEO of the model marketplace OpenRouter, which recently partnered with OpenAI to give users early access to OpenAI’s GPT-4.1 models, said that open testing and benchmarking of models alone “isn’t sufficient.” So did Wei-Lin Chiang, a doctoral student at the University of California, Berkeley and one of the founders of LMArena, which maintains Chatbot Arena.
“We definitely support the use of other tests,” Chiang said. “Our goal is to create a trustworthy, open space that measures our community’s preferences about different AI models.”
Chiang said that incidents like the Maverick benchmark discrepancy are not the result of a flaw in Chatbot Arena’s design, but of labs misinterpreting its policy. LM Arena has taken steps to prevent future discrepancies, Chiang said, including updating its policies to “reinforce our commitment to fair, reproducible evaluations.”
“Our community isn’t here as volunteers or model testers,” Chiang said. “People use LM Arena because we give them an open, transparent place to engage with AI and give collective feedback. As long as the leaderboard faithfully reflects the community’s voice, we welcome it being shared.”