In one of my first stints as a product manager for machine learning (ML), a simple question sparked passionate debates across functions and leaders: How do we know whether this product is actually working? The product in question served both internal and external customers. The model enabled internal teams to identify the top issues facing our customers so they could prioritize the right set of experiments to remediate those issues. With such a complex web of interdependencies among internal and external customers, choosing the right metrics to capture the product's impact was critical to steering it toward success.
Not tracking whether your product is working well is like landing a plane without any instructions from air traffic control. There is absolutely no way you can make informed decisions for your customer without knowing what is going right or wrong. Moreover, if you do not actively define the metrics, your team will identify their own fallback metrics. The risk of having multiple flavors of “accuracy” or “quality” metrics is that everyone develops their own version, leading to a scenario in which you might not all be working toward the same outcome.
For example, when I reviewed my annual goal and its underlying metric with our engineering team, the immediate feedback was: “But this is a business metric; we already track precision and recall.”
First, define what you want to know about your AI product
Once you are tasked with defining the metrics for your product, where do you start? In my experience, the complexity of operating an ML product with multiple customers translates into defining metrics for the model as well. What do I use to measure whether the model is working well? Measuring the outcomes of internal teams prioritizing launches based on our models would not be quick enough; measuring whether customers adopted the solutions recommended by our model risked drawing conclusions from a very broad adoption metric (what if a customer did not adopt the solution because they just wanted to reach a support agent?).
Fast-forward to the era of large language models (LLMs), where we do not have just one output from a single ML model; we have text, images and music as outputs, too. The number of product dimensions that need metrics quickly multiplies: formats, customer segments, types of writing … the list goes on.
Across all of my products, when I try to come up with metrics, my first step is to distill what I want to know about the product's impact on customers into a few key questions. Identifying the right set of questions makes it easier to identify the right set of metrics. Here are a few examples (a short code sketch after the list shows how the first two could be computed):
- Did the customer get an output? → Coverage metric
- How long did it take the product to provide an output? → Latency metric
- Did the user like the output? → Customer feedback, customer adoption and retention metrics
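To make these concrete, here is a minimal sketch of how the first two questions could be turned into computed metrics. The `Session` schema and the field names (`had_output`, `latency_ms`) are hypothetical stand-ins for whatever your own event logging captures, not a prescribed format.

```python
# Minimal sketch: mapping the first two key questions to computable metrics.
# The Session schema and field names are illustrative assumptions.
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Session:
    session_id: str
    had_output: bool           # Q1: did the product return anything to the customer?
    latency_ms: Optional[int]  # Q2: time to deliver the output, if one was produced


def coverage(sessions: list[Session]) -> float:
    """Q1 'Did the customer get an output?' -> share of sessions with an output."""
    return sum(s.had_output for s in sessions) / len(sessions)


def median_latency_ms(sessions: list[Session]) -> float:
    """Q2 'How long did it take?' -> median time to deliver an output."""
    return median(s.latency_ms for s in sessions if s.latency_ms is not None)
```

The third question usually needs both explicit feedback (thumbs up/down) and behavioral signals (adoption, retention), which the example tables later in this piece break out in more detail.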
Once your key questions are identified, the next step is to identify a set of sub-questions for “input” and “output” signals. Output metrics are lagging indicators, where you measure an event that has already happened. Input metrics and leading indicators can be used to identify trends or predict outcomes. See below for ways to add the right sub-questions for lagging and leading indicators to the questions above (a small catalog sketch after the list shows one way to keep the two kinds straight). Not all questions need to have leading/lagging indicators.
- Did the customer get an output? → Coverage
- How long did it take the product to provide an output? → Latency
- Did the user like the output? → Customer feedback, customer adoption and retention
  - Did the user flag the output as right/wrong? (Output)
  - Was the output good/fair? (Input)
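One lightweight way to keep the two kinds of signals straight is a small metric catalog that tags each metric as a lagging “output” signal or a leading “input” signal. The sketch below uses illustrative names and is not a prescribed schema.

```python
# Minimal sketch: a metric catalog that tags each metric as a lagging "output"
# signal or a leading "input" signal. Names and structure are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    question: str  # the key question the metric answers
    name: str      # the metric name as it would appear on a dashboard
    signal: str    # "output" (lagging) or "input" (leading)


METRIC_CATALOG = [
    MetricSpec("Did the customer get an output?", "coverage", "output"),
    MetricSpec("How long did it take the product to provide an output?", "latency", "output"),
    MetricSpec("Did the user flag the output as right/wrong?", "thumbs_up_rate", "output"),
    MetricSpec("Was the output good/fair?", "eval_good_or_fair_rate", "input"),
]

# Group by signal type, e.g. to build separate lagging vs. leading dashboard views.
lagging = [m.name for m in METRIC_CATALOG if m.signal == "output"]
leading = [m.name for m in METRIC_CATALOG if m.signal == "input"]
```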
The third and final step is to identify the method for gathering the metrics. Most metrics are gathered at scale through new instrumentation built by data engineering. In some instances, however (like question 3 above), and especially for ML-based products, you have the option of manual or automated evaluations that assess the model outputs. While it is always best to move toward automated evaluations, starting with manual evaluations of “was the output good/fair” and creating a rubric with definitions of good, fair and not good will lay the foundation for a rigorous, well-tested automated evaluation process as well, as sketched below.
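Here is a minimal sketch of that progression. The rubric wording, the grade names and the `automated_grader` hook are assumptions for illustration, not a specific evaluation tool or standard.

```python
# Minimal sketch: start with a manual rubric, then reuse the same rubric and the
# human labels when you automate evaluation. All names here are illustrative.
from collections import Counter
from typing import Callable

RUBRIC = {
    "good": "Output fully addresses the request with no factual or formatting issues.",
    "fair": "Output is usable but needs minor edits.",
    "not_good": "Output is wrong, unusable or off-topic.",
}


def good_or_fair_rate(grades: list[str]) -> float:
    """The leading 'input' metric: % of sampled outputs graded good or fair."""
    counts = Counter(grades)
    return (counts["good"] + counts["fair"]) / len(grades)


def grader_agreement(human_grades: list[str],
                     outputs: list[str],
                     automated_grader: Callable[[str], str]) -> float:
    """Before trusting an automated evaluator, measure how often it agrees with
    manual grades produced against the same rubric."""
    machine_grades = [automated_grader(o) for o in outputs]
    return sum(h == m for h, m in zip(human_grades, machine_grades)) / len(human_grades)
```

Starting manual also gives you a labeled sample you can later use to check how closely an automated grader tracks human judgment before you rely on it.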
Example use cases: AI search, listing descriptions
The framework above can be applied to any ML-based product to identify the list of primary metrics for your product. Let’s take AI search as an example (a short instrumentation sketch follows the table).
| Question | Metrics | Nature of metric |
|---|---|---|
| Did the customer get an output? → Coverage | % of search sessions with search results displayed to the customer | Output |
| How long did it take the product to provide an output? → Latency | Time taken to display search results to the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention<br>Did the user flag the output as right/wrong? (Output)<br>Was the output good/fair? (Input) | % of search sessions with a “thumbs up” on the search results from the customer, or % of search sessions with clicks from the customer<br>% of search results marked “good/fair” per search term, per quality rubric | Output<br>Input |
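Here is a minimal sketch of how the customer-feedback row of this table could be instrumented. The `SearchSession` fields are hypothetical stand-ins for your own event schema.

```python
# Minimal sketch: instrumenting the feedback metrics from the AI search table.
# The SearchSession fields are illustrative assumptions, not a real schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchSession:
    results_shown: bool        # were any search results displayed?
    clicked: bool              # implicit feedback: did the customer click a result?
    thumbs_up: Optional[bool]  # explicit feedback; None if none was given


def pct_sessions_with_thumbs_up(sessions: list[SearchSession]) -> float:
    shown = [s for s in sessions if s.results_shown]
    return sum(bool(s.thumbs_up) for s in shown) / len(shown) if shown else 0.0


def pct_sessions_with_clicks(sessions: list[SearchSession]) -> float:
    shown = [s for s in sessions if s.results_shown]
    return sum(s.clicked for s in shown) / len(shown) if shown else 0.0
```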
What about a product that generates descriptions for a listing (whether a menu listing on DoorDash or a product listing on Amazon)?
| Question | Metrics | Nature of metric |
|---|---|---|
| Did the customer get an output? → Coverage | % of listings with a generated description | Output |
| How long did it take the product to provide an output? → Latency | Time taken to generate descriptions for the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention<br>Did the user flag the output as right/wrong? (Output)<br>Was the output good/fair? (Input) | % of listings with generated descriptions that required edits from the technical content team/seller/customer<br>% of listing descriptions marked “good/fair”, per quality rubric | Output<br>Input |
The approach shown above can be extended to many ML-based products. I hope this framework helps you define the right set of metrics for your ML model.
Sharanya Rao is a group product manager at Intuit.