Every AI model release inevitably includes charts touting how it outperformed its competitors on this benchmark test or that evaluation matrix.
However, these benchmarks often test for general capabilities. For organizations that want to use models and large language model-based agents, it is harder to evaluate how well the agent or model actually understands their specific needs.
Model repository Hugging Face has launched YourBench, an open-source tool where developers and enterprises can create their own benchmarks to test model performance against their internal data.
Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced YourBench on X. The feature offers "custom benchmarking and synthetic data generation from ANY of your documents. It's a big step towards improving how model evaluations work."
He added that Hugging Face knows that "for many use cases, what really matters is how well a model performs your specific task."
Creating custom evaluations
Hugging Face said in a paper that YourBench works by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark "using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative model performance rankings."
Organizations need to pre-process their documents before YourBench can work. This involves three stages:
- Document ingestion to "normalize" file formats.
- Semantic chunking to break down the documents to fit context window limits and focus the model's attention.
- Document summarization.
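The three pre-processing stages above can be sketched roughly as follows. This is an illustrative toy pipeline, not YourBench's actual API; all function names and the naive word-count chunking are stand-ins (a real pipeline would parse PDFs and HTML, and use an LLM for summarization).

```python
# Toy sketch of the three pre-processing stages: ingestion, chunking,
# summarization. Names and logic are illustrative, not YourBench's code.

def ingest(raw_bytes: bytes) -> str:
    """Ingestion: 'normalize' arbitrary file formats to plain text.
    (A real pipeline would dispatch on PDF, DOCX, HTML, etc.)"""
    return raw_bytes.decode("utf-8", errors="replace")

def chunk(text: str, max_words: int = 200) -> list[str]:
    """Semantic chunking: split text so each piece fits a context
    window. Here we split naively on word count."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarize(chunks: list[str]) -> str:
    """Summarization: a real pipeline would condense with an LLM;
    here we just keep the first sentence fragment of each chunk."""
    return " ".join(c.split(". ")[0] for c in chunks)

doc = b"Quarterly revenue rose 12 percent. Costs fell. Margins improved."
text = ingest(doc)
pieces = chunk(text, max_words=4)   # 9 words -> 3 chunks
summary = summarize(pieces)
```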
Next comes the question-and-answer generation process, which creates questions from the information in the documents. This is where the user brings in their chosen LLM to see which one best answers the questions.
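The question-generation and grading step can be sketched like this. The generator, candidate models, and exact-match grading below are all hypothetical stand-ins for real LLM calls (a real harness would typically use an LLM judge rather than string equality):

```python
# Hypothetical sketch: a "generator" model writes Q&A pairs from
# document chunks, then each candidate model is scored on how many
# reference answers it matches. All callables here are toy stand-ins.

def generate_qa(chunks, generator):
    """Ask the generator to produce (question, answer) pairs."""
    return [generator(c) for c in chunks]

def grade(candidate, qa_pairs):
    """Fraction of questions the candidate answers correctly
    (exact match here; a real harness would use an LLM judge)."""
    correct = sum(candidate(q) == a for q, a in qa_pairs)
    return correct / len(qa_pairs)

# Toy stand-ins for LLM calls:
generator = lambda chunk: (f"What does this say? {chunk}", chunk)
good_model = lambda q: q.split("? ", 1)[1]   # echoes the source text
bad_model = lambda q: "I don't know"

chunks = ["Revenue rose 12%.", "Margins improved."]
qa = generate_qa(chunks, generator)
scores = {"good": grade(good_model, qa), "bad": grade(bad_model, qa)}
# scores == {"good": 1.0, "bad": 0.0}
```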
Hugging Face tested YourBench with DeepSeek V3 and R1 models, Alibaba's Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411, Mistral Small 3.1, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, Claude 3.5 Sonnet and Claude 3.5 Haiku.
Shashidhar said Hugging Face also ran a cost analysis of the models and found that Qwen and Gemini 2.0 Flash "produce tremendous value for very low costs."
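A cost analysis like this boils down to simple per-token arithmetic. The helper and the prices below are illustrative placeholders, not real vendor rates or the paper's figures:

```python
# Back-of-envelope inference cost from per-million-token prices.
# Prices here are illustrative placeholders, not real vendor rates.

def inference_cost(input_tokens: int, output_tokens: int,
                   price_in_per_m: float, price_out_per_m: float) -> float:
    """Total dollar cost given prices per one million tokens."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# e.g. 2M input tokens and 0.5M output tokens at $0.10 / $0.40 per 1M:
cost = inference_cost(2_000_000, 500_000, 0.10, 0.40)
# cost is about $0.40 total ($0.20 input + $0.20 output)
```

Under such pricing, an entire benchmark run over a modest document set stays in the single-digit-dollar range, which is how sub-$15 total inference costs become plausible.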
Compute limitations
However, creating custom LLM benchmarks based on an organization's documents comes at a cost. YourBench requires a lot of compute power to work. Shashidhar said on X that the company is "adding capacity" as fast as possible.
Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat reached out to Hugging Face about YourBench's compute usage.
Benchmarking isn't perfect
Benchmarks and other evaluation methods give users an idea of how well models perform, but they do not capture how the models will work day to day.
Some have even voiced skepticism that benchmark tests reveal models' limitations, arguing they can lead to false conclusions about models' safety and performance. A study has also warned that benchmarking agents could be "misleading."
However, enterprises cannot avoid evaluating models now that there are so many choices on the market, and technology leaders must justify the rising cost of using AI models. This has led to a variety of methods for testing model performance and reliability.
Google DeepMind introduced FACTS Grounding, which tests a model's ability to generate factually accurate responses based on information from documents. Some Yale and Tsinghua University researchers developed self-invoking code benchmarks to guide enterprises on which coding LLMs work for them.