Until recently, human experts have been the ultimate judges of many AI models. A fraud-detection model is considered good if it matches the performance of professional fraud analysts. A medical-diagnosis model is deemed effective if its recommendations align with those of experienced doctors.
Human experts can often be substituted with labels. Instead of fraud analysts reviewing every AI prediction, they can label a set of transactions with their judgments: This transaction is fraudulent; this one is not. A model then makes predictions on these same transactions, and those predictions are compared to the experts' prerecorded judgments.
This approach has two problems. First, the judgments of human experts aren't always available. In many cases, users turn to AI because human experts are too expensive, too slow, or inaccessible. Second, as AI becomes more powerful, AI's capabilities can surpass those of human experts for certain tasks. For these tasks, human experts are no longer the gold standard against which AI's performance can be benchmarked.
For many use cases, application developers have to evaluate AI models that outperform them. It's like a student having to evaluate the teacher's solution. How do they do it? Are there strategies for evaluating AI models that are smarter than us? This article considers three such strategies.
• Functional correctness – evaluating AI by how well it accomplishes its intended tasks.
• AI-as-a-judge – using AI instead of human experts to evaluate AI outputs.
• Comparative evaluation – evaluating AI systems in relation to each other instead of independently.
The discussion centers around foundation models—the most powerful category of models as of this writing—which include both LLMs (large language models) and LMMs (large multimodal models).
The article does not go into the philosophical debate of whether AI is smarter than humans or what it means to be intelligent. Regardless of your philosophical view, the reality is that users already use AI to perform tasks that the users themselves can't comfortably perform, such as understanding complex documents, solving challenging math questions, and creating beautiful designs. Having a reliable strategy to evaluate the outputs of AI for these tasks will help users gain more confidence in using AI for increasingly high-stakes tasks.
Evaluating AI that's smarter than us is closely related to scalable oversight: the problem of supervising systems that potentially outperform humans on most skills relevant to the task at hand (Amodei et al., 2016; Bowman et al., 2020). Scalable oversight is concerned with training models that are more capable than humans. Doing so, however, requires a reliable way to evaluate these models and benchmark their progress.
The more AI is being used, the more opportunities there are for catastrophic failures. Many such failures have already occurred in the short time that foundation models have been around. A man committed suicide after being encouraged by a chatbot. Lawyers have submitted false evidence hallucinated by AI. Air Canada was ordered to pay damages when its AI-powered chatbot gave false information to a passenger. If there is no method for quality control of AI outputs, the risk of AI might outweigh its benefits for many applications.
Before investing time, money, and resources into building an application, it's important to understand how this application will be evaluated. I call this approach evaluation-driven development, which means defining evaluation criteria before building. The name is inspired by test-driven development in software engineering, which refers to the process of writing tests before writing code.
While some companies chase the latest hype, sensible business decisions are still made based on return on investment. Applications should demonstrate value to be deployed. As a result, the most common enterprise applications in production are those with clear evaluation criteria:
• Recommender systems are common because their successes can be evaluated by an increase in engagement or purchase-through rates. (Recommendations can increase purchases, but increased purchases are not always because of good recommendations. Other factors such as promotional campaigns and new product launches can also increase purchases. It's important to do A/B testing to differentiate impact.)
• The success of a fraud-detection system can be measured by how much money is saved from prevented fraud.
• Coding is a common generative AI use case because, unlike other generation tasks, generated code can be evaluated using functional correctness.
While the evaluation-driven development approach makes sense from a business perspective, focusing only on applications whose outcomes can be measured is similar to looking for the lost key under the lamppost (at night). It's easier to do, but it doesn't mean you'll find the key. Many potentially game-changing applications may be dismissed because there is no easy way to evaluate them.
Evaluation is the biggest bottleneck to AI adoption. Being able to build reliable evaluation pipelines will unlock many new applications.
Evaluating AI models has always been difficult. Evaluating AI models that are smarter than us is even more so.
First, the more intelligent AI models become, the harder it is to evaluate them. Most people can tell if a first grader's math solution is wrong. Few can do the same for a Ph.D.-level math solution. When OpenAI's o1 came out in September 2024, the Fields medalist Terence Tao compared his experience with this model to working with "a mediocre, but not completely incompetent, graduate student." He speculated that it may take only one or two further iterations until AI reaches the level of a "competent graduate student." In response to his assessment, many people joked that if we're already at the point where we need the brightest human minds to evaluate AI models, we'll have no one qualified to evaluate future models.
AI can also be much more efficient than humans, making the task of verifying its outputs time consuming. It's easy to tell if a book summary is bad if it's gibberish, but a lot harder if the summary is coherent. To validate the quality of this summary, you might need to read the book first. You'll also need to fact-check, reason, and even incorporate domain expertise, all of which can take tremendous time and effort, limiting how many outputs you can evaluate.
Second, the open-ended nature of foundation models undermines the traditional approach of evaluating a model against ground truths (labels). With traditional ML (machine learning), most tasks are close-ended. For example, a classification model can output among only the expected categories. To evaluate a classification model, you can evaluate its outputs against the labels. If the label is category X but the model's output is category Y, the model is wrong. For an open-ended task, however, for a given input, there are so many possible correct responses that it's impossible to curate a comprehensive list of correct outputs to compare against.
At the same time, publicly available evaluation benchmarks have proven to be inadequate for evaluating foundation models. Ideally, evaluation benchmarks should capture the full range of model capabilities. As AI progresses, benchmarks need to evolve to catch up. A benchmark becomes saturated for a model once the model achieves a perfect score. With foundation models, benchmarks are becoming saturated fast. The benchmark GLUE (General Language Understanding Evaluation) came out in 2018 and became saturated in just a year, necessitating the introduction of SuperGLUE in 2019. Similarly, NaturalInstructions (2021) was replaced by Super-NaturalInstructions (2022). MMLU (Massive Multitask Language Understanding), which was a strong benchmark introduced in 2020 that many early foundation models relied on, was largely replaced by MMLU-Pro in 2024.
As teams rush to adopt AI, many quickly realize that the biggest hurdle to bringing AI applications to reality is evaluation. For some applications, figuring out evaluation can take up the majority of the development effort.
Because evaluation is difficult, many people settle for word of mouth (e.g., someone says that model X is good) or eyeballing the results (also known as vibe check). This creates even more risk and slows down application iteration. Instead, an investment in systematic evaluation is needed to make the results more reliable.
Given the importance of evaluation, tremendous effort is being invested into studying methods to evaluate increasingly intelligent AI. While many techniques are still actively being developed, the three common approaches are functional correctness, AI-as-a-judge, and comparative evaluation.
The most reliable metric to evaluate AI is how well it accomplishes the task it is assigned to do—its functional correctness. For example, if you ask a model to create a website, does the generated website meet your requirements? If you ask a model to make a reservation at a certain restaurant, does the model succeed? If you ask a model to speed up a piece of code, how much faster does the optimized code run?
Code generation is an example of a task where functional correctness measurement can be automated. Functional correctness in coding is sometimes called execution accuracy. Say you ask the model to write a Python function, gcd(num1, num2), to find the GCD (greatest common divisor) of two numbers, num1 and num2. The generated code can then be fed to a Python interpreter to check whether the code is valid (can be compiled) and, if it is, whether it outputs the correct result for a given pair (num1, num2). For example, given the pair (num1=15, num2=20), if the function call gcd(15, 20) doesn't return 5, the correct answer, then you know that the function is wrong.
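To make this concrete, here is a minimal sketch of such a check. The generated code string, the reference implementation, and the test pairs are illustrative assumptions; a real evaluation harness would execute the generated code in a sandboxed interpreter rather than calling exec in its own process.

from math import gcd as reference_gcd

# Illustrative stand-in for code returned by a model.
model_output = """
def gcd(num1, num2):
    while num2:
        num1, num2 = num2, num1 % num2
    return num1
"""

def passes_all_tests(code: str) -> bool:
    namespace = {}
    try:
        exec(code, namespace)  # a real harness would sandbox this
    except SyntaxError:
        return False  # the generated code isn't even valid Python
    candidate = namespace.get("gcd")
    if candidate is None:
        return False
    test_pairs = [(15, 20), (12, 18), (7, 13)]
    return all(candidate(a, b) == reference_gcd(a, b) for a, b in test_pairs)

print(passes_all_tests(model_output))  # True: the generated function is functionally correct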
Long before AI was used for writing code, automatically verifying code's functional correctness was standard practice in software engineering. Code is typically validated with unit tests, in which the code is executed in different scenarios to ensure that it generates the expected outputs. Functional correctness evaluation is how coding platforms such as LeetCode and HackerRank validate submitted solutions.
Popular benchmarks for evaluating AI's code-generation capabilities, such as OpenAI's HumanEval and Google's MBPP (Mostly Basic Python Problems) dataset, use functional correctness as their metric. Benchmarks for text-to-SQL (generating SQL queries from natural language), such as Spider (Yu et al., 2018), BIRD-SQL (Big Bench for Large-scale Database Grounded Text-to-SQL Evaluation) (Li et al., 2023), and WikiSQL (Zhong, Xiong, and Socher, 2017), also rely on functional correctness.
A benchmark problem comes with a set of test cases. Each test case consists of a scenario the code should run and the expected output for that scenario.
Here's an example of a problem and its test cases in HumanEval:
Problem
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers
    closer to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
Test cases (each assert statement represents a test case)
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False
When evaluating a model, HumanEval generates a number of code samples, denoted as k, for each problem. A model solves a problem if any of the k code samples it generated pass all of that problem's test cases. The final score, called pass@k, is the fraction of problems solved out of all problems. If there are 10 problems and a model solves five with k=3, then that model's pass@3 score is 50 percent. The more code samples a model generates, the more chances the model has of solving each problem; hence, the greater the final score. This means that, in expectation, the pass@1 score should be lower than pass@3, which, in turn, should be lower than pass@10.
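Here is a minimal sketch of that calculation, assuming you already have, for each problem, a list of booleans recording whether each of its k generated samples passed all test cases (the example data is made up):

def pass_at_k(results: dict) -> float:
    # results maps each problem to a list of booleans, one per generated sample,
    # indicating whether that sample passed all of the problem's test cases.
    solved = sum(1 for samples in results.values() if any(samples))
    return solved / len(results)

results = {
    "problem_1": [False, True, False],   # solved: one of the k=3 samples passed
    "problem_2": [False, False, False],  # not solved
}
print(pass_at_k(results))  # 0.5, i.e., a pass@3 score of 50 percent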
Another category of tasks whose functional correctness can be automatically evaluated is game playing. If you create a bot to play Tetris, you can tell how good the bot is by the score it achieves. Tasks with measurable objectives can typically be evaluated using functional correctness. For example, if you ask AI to schedule your workloads to optimize energy consumption, the AI's performance can be measured by how much energy it saves.
If your task can be evaluated by functional correctness, that's what you should do. The challenge with functional correctness is that it isn't always straightforward to measure, or its measurement can't be easily automated. Many tasks don't have well-defined, measurable objectives. For example, if you ask a model to write a good essay, how do you define what "good" means?
For many tasks with measurable objectives, AI isn't yet good enough to perform them end to end, so AI can form only part of the solution. Sometimes, evaluating a part of a solution is harder than evaluating the outcome. Imagine you want to evaluate someone's ability to play chess. It's easier to evaluate the end-game outcome (win/lose/draw) than to evaluate the strength of just one move. As AI systems become increasingly more capable, however, they will be able to accomplish more complex tasks end to end, and more applications will be able to rely on functional correctness evaluation.
If AI already outperforms humans in many tasks, can it also outperform humans in evaluating AI? The approach of using AI to evaluate AI is called AI-as-a-judge or LLM-as-a-judge. An AI model that is used to evaluate other AI models is called an AI judge.
As of this writing, AI-as-a-judge has become one of the most, if not the most, common methods for evaluating AI models in production. Most demos of AI evaluation startups I saw in 2023 and 2024 leveraged AI-as-a-judge in one way or another. LangChain's State of AI report in 2023 mentioned that 58 percent of evaluations on its platform were done by AI judges. In my experience, the number of teams willing to adopt this approach has only increased since then.
A reason for AI-as-a-judge's rapid rise in popularity is that it's fast, easy to use, and relatively cheap compared with human evaluators. While it can take hours for a human evaluator to read a book to evaluate the quality of an AI-generated summary, an AI system can do the same in seconds.
You can ask AI models to judge an output based on any criteria: quality, correctness, toxicity, hallucinations, and more. This is similar to how you can ask a person for an opinion about anything. You might think, "But you can't always trust people's opinions." That's true, and you can't always trust AI's judgments either. Since each AI model is an aggregation of the masses, however, it's possible for AI models to make judgments representative of the masses. With the right prompt for the right model, you can get reasonably good judgments on a wide range of topics.
Studies have shown that certain AI judges are strongly correlated to human evaluators, despite being much faster and potentially much cheaper. In 2023 Zheng et al. found that on their evaluation benchmark, MT-Bench, the agreement between GPT-4 and humans reached 85 percent, which is even higher than the agreement among humans (81 percent). AlpacaEval authors (Dubois et al., 2023) also found that their AI judges have a near-perfect (0.98) correlation with LMSYS's (Large Model Systems) Chat Arena leaderboard, which is evaluated by humans.
Not only can AI evaluate a response, but it can also explain its decision, which can be especially useful when you want to audit your evaluation results. Figure 1 shows an example of GPT-4 explaining its judgment.
Its flexibility makes AI-as-a-judge useful for a wide range of applications, and for some applications, it's the only automatic evaluation option. This approach can work even if there are no labels to compare the predictions, making it suitable for the production environment.
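In practice, an AI judge is just a model called with an evaluation prompt. Here is a minimal sketch of a faithfulness-style judge, assuming the OpenAI Python client; the prompt wording, scoring scale, and model choice are illustrative assumptions, not a standard.

from openai import OpenAI

client = OpenAI()

# Illustrative judge prompt; the criterion and the 1-5 scale are assumptions, not a standard.
JUDGE_PROMPT = """Given the question, the context, and the answer below, rate how
faithful the answer is to the context on a scale from 1 (contradicts the context)
to 5 (fully supported by the context). Reply with the score and a one-sentence
explanation.

Question: {question}
Context: {context}
Answer: {answer}"""

def judge_faithfulness(question: str, context: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any sufficiently capable model can serve as the judge
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,  # reduces, but doesn't eliminate, inconsistency
    )
    return response.choices[0].message.content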
Limitations of AI-as-a-judge
Despite the many advantages of AI-as-a-judge, some teams are hesitant to adopt this approach. Using AI to evaluate AI seems tautological. The probabilistic nature of AI makes it seem too unreliable to act as an evaluator. AI judges can potentially introduce nontrivial costs and latency to an application. Given these limitations, some teams see AI-as-a-judge as a fallback option when they don't have any other way of evaluating their systems.
Inconsistency. For an evaluation method to be trustworthy, its results should be consistent. Yet AI judges, like all AI applications, are probabilistic. The same judge, on the same input, can output different scores if prompted differently. Even the same judge, prompted with the same instruction, can output different scores if run twice. This inconsistency makes it hard to reproduce or trust evaluation results.
It's possible to get an AI judge to be more consistent. Zheng et al. (2023) showed that including evaluation examples in the prompt can increase the consistency of GPT-4 from 65 percent to 77.5 percent. They acknowledged, however, that high consistency may not imply high accuracy—the judge might consistently make the same mistakes. On top of that, including more examples makes prompts longer, and longer prompts mean higher inference costs. In Zheng et al.'s experiment, including more examples in their prompts caused their GPT-4 spending to quadruple.
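Before trusting a judge, it's worth measuring its self-consistency directly: run it several times on the same input and see how often it returns the same score. A minimal sketch, where judge_score is assumed to be any function that returns a numeric score for a given output:

from collections import Counter

def self_consistency(judge_score, output, n_runs: int = 10) -> float:
    # Fraction of runs that agree with the judge's most common score for this output.
    scores = [judge_score(output) for _ in range(n_runs)]
    most_common_count = Counter(scores).most_common(1)[0][1]
    return most_common_count / n_runs  # 1.0 means the judge is perfectly consistent here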
Criteria ambiguity. Unlike many human-designed metrics, AI-as-a-judge metrics aren't standardized, making it easy to misinterpret and misuse them. For example, as of this writing, the open source tools MLflow, Ragas, and LlamaIndex all have the built-in criterion faithfulness to measure how faithful a generated output is to the given context, but their instructions and scoring systems are all different. As shown in table 1, MLflow uses a scoring system from 1 to 5, Ragas uses 0 and 1, whereas LlamaIndex's prompt asks the judge to output YES and NO.
The faithfulness scores outputted by these three tools won't be comparable. If, given a (context, answer) pair, MLflow gives a faithfulness score of 3, Ragas outputs 1, and LlamaIndex outputs NO, which score would you trust?
An application evolves, but the way it's evaluated should ideally be fixed. This way, evaluation metrics can be used to monitor the application's changes. AI judges are also AI applications, however, which means that they can also change over time.
Imagine that last month your application's coherence score was 90 percent, and this month the score is 92 percent. Does this mean that your application's coherence has improved? It's hard to answer this question unless you know for sure that the AI judges used in both cases are the same. What if the judge's prompt this month is different from the one last month? Maybe you switched to a slightly better-performing prompt, or a coworker fixed a typo in last month's prompt, and the judge this month is more lenient.
This can become especially confusing if the application and the AI judge are managed by different teams. The AI judge team might change the judges without informing the application team. As a result, the application team might mistakenly attribute the changes in the evaluation results to changes in the application rather than changes in the judges.
Evaluation methods take time to standardize. As the field evolves and more guardrails are introduced, let's hope that AI judges will become a lot more standardized and reliable in the future.
Increased costs and latency. Many teams use AI judges as guardrails in production to reduce risks, showing users only those generated responses deemed good by the AI judge. However, this can significantly increase the application's latency and cost, compared with not using a judge at all.
Powerful AI judges can be expensive. If you use the same model both to generate and to evaluate responses, you'll make twice as many inference calls, approximately doubling your API costs. If you have three evaluation prompts because you want to evaluate three criteria—say, overall response quality, factual consistency, and toxicity—you'll quadruple your number of API calls. In some cases, evaluation can take up the majority of the budget, even more than response generation.
You can reduce costs by using weaker models as the judges (see the section, "What Models Can Act as Judges?" later in this article). You can also reduce costs with spot-checking: evaluating only a subset of responses. Spot-checking means you might fail to catch some failures. The larger the percentage of samples you evaluate, the more confidence you will have in your evaluation results—but also the higher the costs. Finding the right balance between cost and confidence might take trial and error. All things considered, AI judges are much cheaper than human evaluators.
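Spot-checking itself takes only a few lines. A minimal sketch, where evaluate is assumed to be whatever judge you use and sample_rate is set by your evaluation budget:

import random

def spot_check(responses: list, evaluate, sample_rate: float = 0.1) -> list:
    # Evaluate only a random fraction of responses, trading coverage for cost.
    sample_size = max(1, int(len(responses) * sample_rate))
    sampled = random.sample(responses, sample_size)
    return [evaluate(response) for response in sampled]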
Implementing AI judges in your production pipeline can add latency. If you evaluate responses before returning them to users, you face a tradeoff: reduced risk but increased latency. The added latency might make this option a nonstarter for applications with strict latency requirements.
Biases of AI-as-a-judge. Human evaluators have biases, and so do AI judges. Different AI judges have different biases. This section looks at some of the common ones. Being aware of your AI judges' biases helps you interpret their scores correctly and even mitigate these biases.
AI judges tend to have self-bias, where a model favors its own responses over responses generated by other models. The same mechanism that helps a model compute the most likely response to generate will also give this response a high score. In Zheng et al.'s 2023 experiment, GPT-4 favors itself with a 10 percent higher win rate, while Claude-v1 favors itself with a 25 percent higher win rate.
Many AI models have first-position bias. An AI judge may favor the first answer in a pairwise comparison or the first in a list of options. This can be mitigated by repeating the same test multiple times with different orderings or with carefully crafted prompts. The position bias of AI is the opposite of that of humans. Humans tend to favor the answer they see last, which is called recency bias.
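Here is a minimal sketch of the order-swapping mitigation, assuming a compare function that returns "first" or "second" for whichever of the two answers it prefers in the order shown:

def position_robust_verdict(compare, question, answer_a, answer_b) -> str:
    # Query the judge twice, swapping the order of the two answers.
    verdict_original = compare(question, answer_a, answer_b)
    verdict_swapped = compare(question, answer_b, answer_a)
    if verdict_original == "first" and verdict_swapped == "second":
        return "a"  # the judge prefers answer_a regardless of position
    if verdict_original == "second" and verdict_swapped == "first":
        return "b"  # the judge prefers answer_b regardless of position
    return "tie"  # the verdict flipped with the ordering, so treat it as inconclusive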
Some AI judges have verbosity bias, favoring lengthier answers, regardless of their quality. Wu and Aji (2023) found that both GPT-4 and Claude-1 prefer longer responses (~100 words) with factual errors over shorter, correct responses (~50 words). Saito et al. (2023) studied this bias for creative tasks and found that when the length difference is large enough (e.g., one response is twice as long as the other), the judge almost always prefers the longer one. Saito et al. (2023) found that humans tend to favor longer responses too, but to a much lesser extent. Both Zheng et al. (2023) and Saito et al. (2023), however, discovered that GPT-4 is less prone to this bias than GPT-3.5, suggesting that this bias might go away as models become stronger.
On top of all these biases, AI judges have the same limitations as all AI applications, including privacy and IP. If you were to use a proprietary model as your judge, you would need to send your data to this model. If the model provider doesn't disclose the training data, you won't know for sure if the judge is commercially safe to use.
Despite the limitations of the AI-as-a-judge approach, its many advantages would suggest that its adoption will continue to grow.
What models can act as judges?
One big question is whether the AI judge needs to be stronger than the model being evaluated. At first glance, a stronger judge makes sense. Shouldn't the exam grader be more knowledgeable than the exam taker? Not only can stronger models make better judgments, but they can also help improve weaker models by guiding them to generate better responses.
You might wonder: If you already have access to the stronger model, why bother using a weaker model to generate responses? The answer is cost and latency. You might not have the budget to use the stronger model to generate all responses, so you use it to evaluate a subset of responses. For example, you may use a cheap in-house model to generate responses and the best commercial model to evaluate one percent of the responses.
The stronger model might also be too slow for your application. You can use a fast model to generate responses while the stronger, but slower, model does evaluation in the background. If the strong model thinks that the weak model's response is bad, remedy actions might be taken, such as updating the response with that of the strong model. Note that the opposite pattern is also common. You use a strong model to generate responses and a weak model running in the background to do the evaluation.
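A minimal sketch of that first pattern is shown below. The generate_fast, judge_slow, generate_strong, and update_response callables are hypothetical stand-ins for the weak generator, the background judge, the stronger model, and whatever remedy action your application takes.

import asyncio

async def serve_request(prompt, generate_fast, judge_slow, generate_strong, update_response):
    # The cheap, fast model answers immediately; the user isn't kept waiting.
    response = await generate_fast(prompt)
    # The stronger, slower judge reviews the response in the background.
    asyncio.create_task(review_in_background(prompt, response, judge_slow,
                                             generate_strong, update_response))
    return response

async def review_in_background(prompt, response, judge_slow, generate_strong, update_response):
    verdict = await judge_slow(prompt, response)
    if verdict == "bad":
        better_response = await generate_strong(prompt)  # remedy: regenerate with the strong model
        await update_response(prompt, better_response)   # e.g., replace the stored response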
Using the stronger model as a judge poses two challenges. First, the strongest model will be left with no eligible judge. Second, an alternative evaluation method is needed to determine which model is the strongest.
Using a model to judge itself—self-evaluation or self-critique—sounds like cheating, especially because of self-bias. Self-evaluation can be great for sanity checks, however. If a model thinks its response is incorrect, the model might not be that reliable. Beyond sanity checks, asking a model to evaluate itself can nudge the model to revise and improve its responses (Press et al., 2022; Gou et al., 2023; Valmeekam, Marquez, and Kambhampati, 2023). This example shows what self-evaluation might look like:
Prompt [from user]: What's 10+3?
First response [from AI]: 30
Self-critique [from AI]: Is this answer correct?
Final response [from AI]: No, it's not. The correct answer is 13.
One open question is whether the judge can be weaker than the model being judged. Some argue that judging is an easier task than generating. Anyone can have an opinion about whether a song is good, but not everyone can write a song. Weaker models should be able to judge the outputs of stronger models.
One exciting research direction is toward small, specialized judges. These judges are trained to make specific judgments, using specific criteria and following specific scoring systems. A small, specialized judge can be more reliable than larger, general-purpose judges for specific judgments.
In production, many teams have already successfully used weaker, cheaper models to evaluate stronger, more expensive models.
If the assumption holds that judging is easier than generating, it's excellent news for users. This suggests we can effectively evaluate the outputs of AI models more powerful than ourselves. Bowman et al. (2022), in "Measuring Progress on Scalable Oversight for Large Language Models," demonstrate that users can collaborate with high-performing AI by probing models for consistency, reviewing multiple outputs for the same query, and synthesizing the results using their judgment.
A newer but already promising evaluation approach is comparative evaluation. The core idea behind this approach is that even when we can't give outputs absolute scores, we can still tell which of two outputs is better. For example, even though I might not be able to give a song a concrete score, I can still tell which of two songs I like more.
Comparative evaluation is the opposite of pointwise evaluation. With pointwise evaluation, you evaluate each model or model output independently and then rank them by their scores. With comparative evaluation, you evaluate models against each other and compute a ranking from comparison results. For responses whose quality is subjective, comparative evaluation is typically easier to do than pointwise evaluation.
In AI, comparative evaluation was first used in 2021 by Anthropic to rank different models. It also powers the popular LMSYS Chatbot Arena leaderboard that ranks models using scores computed from pairwise model comparisons from the community.
Many model providers use comparative evaluation to evaluate their models in production. Figure 2 shows an example of ChatGPT asking its users to compare two outputs side by side. These outputs could be generated by different models or by the same model with different sampling variables.
For each request, two or more models are selected to respond. An evaluator, which can be human or AI, picks the winner. These preferential signals can be used not only to rank models but also to train models to better align them with human preference.
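Turning those pairwise verdicts into a ranking is typically done with a rating system such as Elo or Bradley-Terry. Here is a minimal Elo-style sketch for illustration; real leaderboards use more careful statistical models.

from collections import defaultdict

def elo_ratings(comparisons, k: float = 32.0, base: float = 1000.0) -> dict:
    # comparisons is a list of (winner, loser) pairs of model names.
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return dict(ratings)

comparisons = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
print(sorted(elo_ratings(comparisons).items(), key=lambda item: -item[1]))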
Comparative evaluation shouldn't be confused with A/B testing. In A/B testing, a user sees the output from one candidate model at a time. In comparative evaluation, a user sees outputs from multiple models at the same time.
As models become stronger, surpassing human performance, it might become impossible for human evaluators to give model responses concrete scores. Human evaluators might still be able to detect the difference, however, and comparative evaluation might remain the only option. For example, the Llama 2 paper shared that when the model ventures into the kind of writing beyond the ability of the best human annotators, humans can still provide valuable feedback when comparing two answers (Touvron et al., 2023).
Comparative evaluation also aims to capture the quality we care about: human preference. It reduces the pressure to constantly create more benchmarks to catch up with the ever-expanding capabilities of AI. Unlike benchmarks that become useless when model performance achieves perfect scores, comparative evaluations will never get saturated as long as newer, stronger models are being introduced.
Comparative evaluation is relatively hard to game, as there's no easy way to cheat—for example, training your model on reference data. For this reason, many trust the results of public comparative leaderboards more than any other public leaderboards.
I believe that comparative evaluation can provide discriminating signals about models that can't be obtained otherwise. For offline evaluation, it can be a great addition to evaluation benchmarks. For online evaluation, it can be complementary to A/B testing.
An important point to keep in mind when using comparative evaluation is that not all questions should be answered by preference. Many questions should be answered by correctness instead. Imagine asking the model, "Is there a link between cell phone radiation and brain tumors?" and the model presents two options—"Yes" and "No"—for you to choose from. Preference-based voting can lead to wrong signals that, if used to train your model, can result in misaligned behaviors.
Asking users to choose can also cause user frustration. Imagine asking the model a math question because you don't know the answer, and the model gives you two different answers and asks you to pick the one you prefer. If you had known the right answer, you wouldn't have asked the model in the first place.
Evaluating AI models that surpass human expertise in the task at hand presents unique challenges, and these challenges only grow as AI becomes more intelligent. The three strategies presented in this article offer ways to address these hurdles.
• Functional correctness measures whether AI outputs achieve intended objectives, offering a clear and reliable metric for tasks with measurable outcomes, such as code generation or gamebot performance. Automating this evaluation for more complex tasks that AI can't yet solve end to end, however, remains challenging.
• AI-as-a-judge is the approach of using AI models as evaluators of other models. This approach has become possible only recently, as AI models have become powerful enough to do so. Like all other AI applications, AI judges are probabilistic, with potential inconsistencies and biases that make many teams hesitant to use them. Despite its limitations, AI-as-a-judge offers scalability and efficiency, especially when human evaluators are unavailable or their use is impractical.
• Comparative evaluation, which pits models against one another, captures the quality that many application developers care about: human preferences. For many tasks where humans can't effectively evaluate outcomes, they can still differentiate which outcome is better.
Ultimately, business decisions are still made based on returns on investment. Having a systematic, reliable way to evaluate AI applications is crucial in showing the impact of AI while reducing risks. By adopting strategies like those discussed in this article, application developers can build trust in both users and stakeholders and enable more applications to be built in the future.
Chip Huyen works at the intersection of AI and storytelling. Previously, she was with Snorkel AI and Nvidia, founded an AI infrastructure startup (acquired), and taught Machine Learning Systems Design at Stanford. She was a core developer of NeMo, Nvidia's generative AI framework. Her book, Designing Machine Learning Systems, is an Amazon bestseller in AI and has been translated into 10-plus languages. Her latest book, AI Engineering, came out in January 2025.
Copyright © 2025 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 23, no. 1—