In one of my first stints as a product manager for machine learning (ML), a simple question sparked passionate debates across functions and leaders: How do we know whether this product is actually working? The product in question served both internal and external customers. The model enabled internal teams to identify the top issues facing our customers so they could prioritize the right set of experiments to remediate those issues. With such a complex web of interdependencies among internal and external customers, choosing the right metrics to capture the product's impact was critical to steering it toward success.
Not tracking whether your product is working well is like landing a plane without any instructions from air traffic control. There is absolutely no way you can make informed decisions for your customer without knowing what is going right or wrong. Moreover, if you do not actively define the metrics, your team will identify their own fallback metrics. The risk of having multiple flavors of “accuracy” or “quality” metrics is that everyone develops their own version, leading to a scenario in which you might not all be working toward the same outcome.
For example, when I reviewed my annual goal and its underlying metric with our engineering team, the immediate feedback was: “But this is a business metric; we already track precision and recall.”
First, define what you want to know about your AI product
Once you are tasked with defining the metrics for your product, where do you start? In my experience, the complexity of operating an ML product with multiple customers translates into defining metrics for the model as well. What do I use to measure whether the model is working well? Measuring the outcomes of internal teams prioritizing launches based on our models would not be quick enough; measuring whether customers adopted the solutions recommended by our model risked drawing conclusions from a very broad adoption metric (what if a customer did not adopt the solution because they just wanted to reach a support agent?).
Fast-forward to the era of large language models (LLMs), where we do not have just one output from a single ML model; we have text, images and music as outputs, too. The number of product dimensions that need metrics quickly multiplies: formats, customer segments, types of writing … the list goes on.
Across all of my products, when I try to come up with metrics, my first step is to distill what I want to know about the product's impact on customers into a few key questions. Identifying the right set of questions makes it easier to identify the right set of metrics. Here are a few examples (a short code sketch after the list shows how the first two could be computed):
- Did the customer get an output? → Coverage metric
- How long did it take the product to provide an output? → Latency metric
- Did the user like the output? → Customer feedback, customer adoption and retention metrics
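To make these concrete, here is a minimal sketch of how the first two questions could be turned into computed metrics. The `Session` schema and the field names (`had_output`, `latency_ms`) are hypothetical stand-ins for whatever your own event logging captures, not a prescribed format.

```python
# Minimal sketch: mapping the first two key questions to computable metrics.
# The Session schema and field names are illustrative assumptions.
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Session:
    session_id: str
    had_output: bool           # Q1: did the product return anything to the customer?
    latency_ms: Optional[int]  # Q2: time to deliver the output, if one was produced


def coverage(sessions: list[Session]) -> float:
    """Q1 'Did the customer get an output?' -> share of sessions with an output."""
    return sum(s.had_output for s in sessions) / len(sessions)


def median_latency_ms(sessions: list[Session]) -> float:
    """Q2 'How long did it take?' -> median time to deliver an output."""
    return median(s.latency_ms for s in sessions if s.latency_ms is not None)
```

The third question usually needs both explicit feedback (thumbs up/down) and behavioral signals (adoption, retention), which the example tables later in this piece break out in more detail.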
Once your key questions are identified, the next step is to identify a set of sub-questions for “input” and “output” signals. Output metrics are lagging indicators, where you measure an event that has already happened. Input metrics and leading indicators can be used to identify trends or predict outcomes. See below for ways to add the right sub-questions for lagging and leading indicators to the questions above (a small catalog sketch after the list shows one way to keep the two kinds straight). Not all questions need to have leading/lagging indicators.
- Did the customer get an output? → Coverage
- How long did it take the product to provide an output? → Latency
- Did the user like the output? → Customer feedback, customer adoption and retention
  - Did the user flag the output as right/wrong? (Output)
  - Was the output good/fair? (Input)
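One lightweight way to keep the two kinds of signals straight is a small metric catalog that tags each metric as a lagging “output” signal or a leading “input” signal. The sketch below uses illustrative names and is not a prescribed schema.

```python
# Minimal sketch: a metric catalog that tags each metric as a lagging "output"
# signal or a leading "input" signal. Names and structure are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    question: str  # the key question the metric answers
    name: str      # the metric name as it would appear on a dashboard
    signal: str    # "output" (lagging) or "input" (leading)


METRIC_CATALOG = [
    MetricSpec("Did the customer get an output?", "coverage", "output"),
    MetricSpec("How long did it take the product to provide an output?", "latency", "output"),
    MetricSpec("Did the user flag the output as right/wrong?", "thumbs_up_rate", "output"),
    MetricSpec("Was the output good/fair?", "eval_good_or_fair_rate", "input"),
]

# Group by signal type, e.g. to build separate lagging vs. leading dashboard views.
lagging = [m.name for m in METRIC_CATALOG if m.signal == "output"]
leading = [m.name for m in METRIC_CATALOG if m.signal == "input"]
```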
The third and final step is to identify the method for gathering the metrics. Most metrics are gathered at scale through new instrumentation built by data engineering. In some instances, however (like question 3 above), and especially for ML-based products, you have the option of manual or automated evaluations that assess the model outputs. While it is always best to move toward automated evaluations, starting with manual evaluations of “was the output good/fair” and creating a rubric with definitions of good, fair and not good will lay the foundation for a rigorous, well-tested automated evaluation process as well, as sketched below.
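Here is a minimal sketch of that progression. The rubric wording, the grade names and the `automated_grader` hook are assumptions for illustration, not a specific evaluation tool or standard.

```python
# Minimal sketch: start with a manual rubric, then reuse the same rubric and the
# human labels when you automate evaluation. All names here are illustrative.
from collections import Counter
from typing import Callable

RUBRIC = {
    "good": "Output fully addresses the request with no factual or formatting issues.",
    "fair": "Output is usable but needs minor edits.",
    "not_good": "Output is wrong, unusable or off-topic.",
}


def good_or_fair_rate(grades: list[str]) -> float:
    """The leading 'input' metric: % of sampled outputs graded good or fair."""
    counts = Counter(grades)
    return (counts["good"] + counts["fair"]) / len(grades)


def grader_agreement(human_grades: list[str],
                     outputs: list[str],
                     automated_grader: Callable[[str], str]) -> float:
    """Before trusting an automated evaluator, measure how often it agrees with
    manual grades produced against the same rubric."""
    machine_grades = [automated_grader(o) for o in outputs]
    return sum(h == m for h, m in zip(human_grades, machine_grades)) / len(human_grades)
```

Starting manual also gives you a labeled sample you can later use to check how closely an automated grader tracks human judgment before you rely on it.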
Example use cases: AI search, listing descriptions
The framework above can be applied to any ML-based product to identify the list of primary metrics for your product. Let’s take AI search as an example (a short instrumentation sketch follows the table).
| Question | Metrics | Nature of metric |
|---|---|---|
| Did the customer get an output? → Coverage | % of search sessions with search results displayed to the customer | Output |
| How long did it take the product to provide an output? → Latency | Time taken to display search results to the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention<br>Did the user flag the output as right/wrong? (Output)<br>Was the output good/fair? (Input) | % of search sessions with a “thumbs up” on the search results from the customer, or % of search sessions with clicks from the customer<br>% of search results marked “good/fair” per search term, per quality rubric | Output<br>Input |
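Here is a minimal sketch of how the customer-feedback row of this table could be instrumented. The `SearchSession` fields are hypothetical stand-ins for your own event schema.

```python
# Minimal sketch: instrumenting the feedback metrics from the AI search table.
# The SearchSession fields are illustrative assumptions, not a real schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchSession:
    results_shown: bool        # were any search results displayed?
    clicked: bool              # implicit feedback: did the customer click a result?
    thumbs_up: Optional[bool]  # explicit feedback; None if none was given


def pct_sessions_with_thumbs_up(sessions: list[SearchSession]) -> float:
    shown = [s for s in sessions if s.results_shown]
    return sum(bool(s.thumbs_up) for s in shown) / len(shown) if shown else 0.0


def pct_sessions_with_clicks(sessions: list[SearchSession]) -> float:
    shown = [s for s in sessions if s.results_shown]
    return sum(s.clicked for s in shown) / len(shown) if shown else 0.0
```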
What about a product that generates descriptions for a listing (whether a menu listing on DoorDash or a product listing on Amazon)?
| Question | Metrics | Nature of metric |
|---|---|---|
| Did the customer get an output? → Coverage | % of listings with a generated description | Output |
| How long did it take the product to provide an output? → Latency | Time taken to generate descriptions for the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention<br>Did the user flag the output as right/wrong? (Output)<br>Was the output good/fair? (Input) | % of listings with generated descriptions that required edits from the technical content team/seller/customer<br>% of listing descriptions marked “good/fair”, per quality rubric | Output<br>Input |
The approach shown above can be extended to many ML-based products. I hope this framework helps you define the right set of metrics for your ML model.
Sharanya Rao is a group product manager at Intuit.