September 2, 2025
Volume 23, issue 4

Download PDF version of this article PDF

Unsolved Problems in MLOps

Either find a better paradigm or fix the ones we're using now.

Niall Murphy and Todd Underwood

The current state of AI is miraculous. Long-running, difficult problems of AI and CS that were thought to be unsolvable or intractable have fallen under the twin assaults of deep learning and sheer scale. A generation of computer scientists, trained by the reflective prose of Douglas Hofstadter to introspect on cognition's limits, have abandoned that introspection and are frolicking in the success of sheer compute. As various graphs climb up and to the right seemingly without cease, it's hard not to feel a dizzying sense of both wonder and disorder.

For the moment, let's leave aside vitally important questions of AGI (artificial general intelligence), ASI (artificial superintelligence), and social impact. Many will already know where they stand on these issues. Instead, presuming a reasonable use of ML, we ask the question: How can we make it work well—work reliably?

This is a pressing question everywhere but is particularly pressing in MLOps, which is thought of as the act of building, running, deploying, monitoring, managing, and decommissioning models and their associated data.

Yet when we as practitioners take a moment to reflect on how we got here, it's sobering how little we actually understand about how to do the above well. (This is distinct from our understanding of model internals generally, which is currently also at Dark Ages level, though it is somewhat related, as we'll see.) Most of the work placed under the heading of MLOps is similar to nonfunctional requirements in the "classical" non-ML world. Unfortunately, although concepts, frameworks, and actual software tools (and a wide array of them, most of which are known to work well) are available to help accomplish our goals in the classical software world, the ML world has different problems for which much of the existing approaches are not suitable. In other words, we have been disrupted.

At the heart of that disruption are two central facts: that classical systems are (ostensibly) deterministic and ML systems are not, and that data is as much a driver of system behavior as is code or configuration. In the classical world, we rely on the opposite of those propositions every day. Need to check whether a service is up and running well? Send it a GET / or a SELECT 1 or a GetIndex, and test for the expected response. Need to roll out new versions safely? Build a new binary, trickling the code changes through the CI/CD (continuous integration/continuous delivery) system and associated tests, then roll out across your fleet as you watch for discontinuous changes in graphs, just in case. Need to be made aware of when things are going wrong? Set an alert threshold on SLO (service-level objective) violations for connection requests and walk away (hopefully with a pager). Need to fix something quickly because it's dying in prod? Cherry-pick from a particular branch and build a new binary, fast-pushing through the CI/CD system to get it into production.

We can't easily reuse any of those classical approaches in ML.

Compare and Contrast

(Note: For the following discussion, the terminology is that a model is built from data, processed in a training phase, and after training has finished, you have a file, which is used in serving or inference. The model itself has some kind of structure, which will supply answers, given some inputs, spanning from perhaps a single real number to arbitrarily complex text.)

An ML system might send back an arbitrary response to a query (especially true for LLMs, but also true for models), and what's in that arbitrary response depends hugely on the data the model was trained on, which may in turn depend on previous user actions.

Rolling out a new version of a model is often more an exercise in vibes than anything else. At SRECon Americas 2025, Brendan Burns of Microsoft told the audience that Azure uses two ways of validating that new models in the Azure UI are working well: first, a set of LLMs judging the output of the new LLMs, and second, whether or not enough Microsoft employees have hit the equivalent of the "thumbs up" button when exposed to its recommendations. (This approach helps to eliminate outrageous assertions on behalf of the model but does not provide a clear quality gradient.) While Microsoft surely has enough employees for this to be some kind of relevant safeguard, not many other organizations do, and the audience was audibly surprised at Burns's statement that LLMs were judging LLM output. Even when there are stable evaluations (most published benchmarks produce broadly reproducible results on the same model), it doesn't always capture the real-world experience of the same model. The best efforts to evaluate model performance are still imperfect.

For similar reasons, it's quite hard to have meaningful alerting for ML systems. Of course, the standard "is-it-hard-down" style alerts are still relevant and useful. But anything that relies on a quality or behavior threshold is tricky to define. The more complex the model, the more difficult it is to alert on misbehavior, even for something as simple as latency. If the model is making insurance decisions based on a small set of inputs, the resulting page should render within a few seconds, but if the model is summarizing hundreds of pages of text, the likely latency is proportional to the length of the input text.

Some questions—for example, difficult mathematical questions or tough analysis problems—will take a long time to answer, even if the input prompt is very short. Evaluating correctness successfully in real time is almost impossible; at the very least, checking the answer as it's being emitted would presumably consume proportional resources to generating it. Some example questions that might help to illustrate this: What are the key points of this very paragraph? Are Economic Sciences prize winners included in the list of Nobel Prize winners, even though it's technically a different prize?

When it comes to resolving errors, too, presuming you can detect any, there's the question of what you can actually do. In the classical world, you can build a new binary and push it out—with a strong presumption that new code paths will fix bad behavior. On rare occasions, you sometimes re-create configuration, or even restore a backup of "the" database.

In the ML world, the relationship between serving binary, model, and data makes the decision about how and what to revert decidedly hard. In particular, to recycle classical language, if building a new binary is building a serving binary designed to load the model into memory and copy request and response back and forth, there's usually no point since it's such a thin layer. If building a new binary for ML is building a whole new model, then that can cost millions of dollars and take multiple months. Even if "building a new binary" is changing the system prompt or redoing some of the fine tuning, it still might take several days or weeks to make the changes and systematically test them.

We hope that the preceding examples have given you some feeling for the problems of the domain. The examples are far from complete, but now that you have an intuition that things are difficult here, let's look at what we believe are key unsolved problems of MLOps. These are problems actively impeding the safe deployment of ML today, increasing risk for both practitioners and users and highlighting gaps in our intellectual framework around service management and what it means to run computing systems generally.

Measuring Model Quality

We've argued that end-to-end model quality is the only metric that matters for ML reliability. Measuring during model development is comparatively easy. How is it measured today in production?

There are two approaches from the classical world that are typically reused here, though neither of them is entirely satisfactory.

The first approach is a set of replayed queries and surveyed responses to those queries across a mixture of automated and manual actions. (Usually in the classical world the queries and responses are hard-coded.) Although this is done often, making it work is harder than you might think. The questions and responses need to be carefully evaluated for reproducibility and ease of evaluation. For example, if the question is "How many legs do humans have?" and the expected answer is "Two," what happens when a new, more thoughtful version of the model says, "Most humans have two legs, but it's not uncommon for some humans to have only one leg or even none due to accident, illness, or congenital difference." That answer isn't wrong, but it's going to be flagged as a deviation. There are emerging frameworks that try to do this automatically, which is more tractable in spaces where the answers are more straightforward.

This approach takes us part of the way, but the experienced reality of such systems is poor. Any sufficiently complex business domain and quality becomes as hard to measure as the questions are to answer. So, as per the Microsoft story in the previous section, the only thing known to work for sure is actual feedback, in production use, of users—clicks, queries, copy-and-pastes, etc.—and in a real time format. Let's face the reality that this is, in essence, outsourcing model quality assurance to the users of the system, which is the opposite of what we strive to do in the classical world.

The second approach we reuse is the well-known canarying technique for testing a new binary/application/etc., even when a definitive testing suite is not available. As outlined, it involves shipping the new model to production, exposing it to a small amount of production traffic, watching its behavior closely, and increasing that traffic fraction over time according to some reasonable schedule. Watch for long enough and you can have quite a high confidence that the model is behaving correctly before exposing it to the full production onslaught.

In the ML world, however, this technique has two difficulties that prevent it from solving our problem: First, model behavior is very sensitive to user behavior and therefore to time, so if your model needs to run for a long while before you assemble definitive evidence on its behavior, it ends up interfering with your rollout schedule. Additionally, it's very common to have multiple models in action at once, and these models can end up influencing each other or make it impossible to compare with a stable baseline.

Consequences of poor measurement

There are, unfortunately, numerous cases of model-quality problems escaping into production and occasionally having serious effects. For example, according to OpenAI, one particular release, which was intended to "improv[e] the model's default personality to make it feel more intuitive and effective across a variety of tasks," ended up favoring responses "that were overly supportive, but disingenuous"—the so-called sycophantic release. It's important to note that there were multiple contributing factors to this outcome, including existing user feedback that validated short-term model behavior, but you should take this as a reminder that good quality control is hard, even when you have quite a sophisticated assessment system.

There are related issues worth considering even when you're not at foundation model scale or an in-house consumer of an in-house model.

For example, as an end consumer, if you partake of model services over an API, the lack of model versioning means you can't know if a different answer to the same question is because of model data changes or model internal state changes. For LLMs, the system prompt could also get changed—currently generally kept secret by design, unless you are Pliny the Prompter—and this is quite likely to change the response of the model. None of this is easily detectable by the bulk of model consumers.

Additionally, model-quality variance, as opposed to pure behavior change, is also a significant problem for API consumers. If your provider can't keep model quality within acceptable limits, you'll need to instantiate a model-quality verification process. (You don't even have to presume malfeasance, just that they don't know what you care about.) Multiply this work across all consumers of a model and it very quickly starts to incur significant overhead—imagine the size of the additional work that would be done in the world if everyone had to perform their own food-quality analysis, search-query analysis, and so on.

The practical upshot is that if you are a provider of model services and you lack proper testing, you spend a lot more time on debugging, incidents, and postproduction unplanned work in general and have bad failures in production—sometimes headline-generating ones.

Given all this, we say the unsolved problems are:

Finding a way to bound the stochasticism of a model response in such a way that it's reasonable for testing
Finding a good approach in canarying for quality control across multiple models and preserving a baseline

Model provenance and versioning

Another pervasive problem is how to do model versioning and understand provenance.

Versioning of files in the classical world is a solved problem and has been for a long while. To do it, you need a tag, handle, integer, or metadata of some kind associated with a file. The mapping from a particular version to the particular sequence of bytes composing that file can be handled either in the file, in some associated metadata, in a separate database, or by any number of arbitrary mechanisms.

Most software engineers are familiar with software version control in general or tools to accomplish same—these are not new and unfamiliar ideas. But they are simply much less prevalent in MLOps than they should be. There is no guarantee that the system named "GPT-4o" that answers the question at 16:18 will be the same system named "GPT-4o" at 18:16, for a variety of reasons, including the provider shipping a new trained model, changing the system prompt, implementing new trust and safety controls, or any number of other changes.

Definitionally, versioning of models is being able to identify the dataset trained on, the set (if any) of post-training transformations, the model produced, the associated policies (including filters, blocks, system prompt), and even changes in architecture that should be model-neutral but occasionally aren't. (Note that we are a long way from being able to implement the strong convention that pertains elsewhere in software in this domain—the division into major, minor, and patch-level versioning that enables people to tell that 1.5 is likely to be less capable than 2.0, even though 1.5 will probably be more stable.)

As it happens, this closely overlaps with the question of model provenance. Let's begin with the basics: Anyone doing model training has to organize their datasets, track what data fed what training run, and track permissible and/or suitable uses, taking in legal compliance, problem domain relevance, quality scoring, and so on.

Most organizations are not doing this carefully or sufficiently publicly, and almost none has it automated or standardized in any meaningful way. A model trained in America for use in France would need to select training data representative of French use cases, be legally permitted to be used there, be responsive to French concerns, and so on. We opine that today the practical answer to this is "Don't treat France separately in any meaningful way and cross your fingers against lawsuits," with a side order of "Don't offer service in France." There are emerging frameworks for managing data provenance, but they are frameworks and toolkits providing the ability to trace and segment, not actual suitable segments themselves.

Data provenance does occasionally matter in non-ML cases, but no well-accepted paradigm or software suggests itself as a useful precursor.

In effect, this once more pushes the work of model validation onto the model user, which is (as the number of consumers scales) quite inefficient. Furthermore, one interesting trap is that if you can't handle provenance or versioning well, your tactical flexibility seems higher—hey, none of these annoying barriers to pushing stuff to production!—but you generally end up paying the cost post-deployment and in unplanned work. (In that sense, growth is covering up large problems and creating more of its own.)

Again, given all this, we claim the unsolved problems are:

Providing widespread tooling support for versioning and a widely agreed set of things that constitute a worthy-of-versioning set
Arriving at a good way to do data-set management more broadly (this is closely related to the first unsolved problem)
Persuading others that versioning and better approaches for quality control are worthwhile to pursue and worthwhile exposing to the user (this is a human problem)

Monitoring and observability

Monitoring and observability are deeply connected with the preceding section.

Monitoring model quality in production is what the industry should be doing, but, in large part, isn't. Half of the respondents in a recent survey of ML practitioners indicated that they did not monitor the performance of their model in production, according to a 2024 report on the State of Production ML by the Institute of Ethical AI and Machine Learning. This is, on some level, astonishing; again, in the classical world, leaving your app unmonitored, or effectively monitored only by customers, is clear professional negligence. But model builders are really struggling with this—monitoring and observability is the single largest problem category cited in the aforementioned State of Production ML report.

Let's assume for now that the major reason for this is the already stated difficulty with establishing model quality when being fed novel queries by actual users. Note that even with a realtime user feedback stream in place, you are in the position of monitoring only the output, not the constituent parts contributing to the outputs.

That isn't the end of the questions that have to be asked.

For a start, who does that monitoring, and who responds to the related alerts? The current answer is the users and model builder staff, but many larger organizations have many teams who could conceivably be involved. (At one point Google chose to have SREs—site reliability engineers—do model-quality tests, but they lived with the model-builder teams for a long while.)

Furthermore, alerting itself is specifically hard in this domain because both "standard" threshold alerting and SLO alerting need to be tied to a specific metric, and the thresholds of that metric need to stay stable enough to alert on them. Both of these are a challenge in this brave new world. If it's too hard to alert on business metrics, you can alert on the state of infrastructure instead—but that will miss many things that matter.

Either way, you can see why half of the community is finding this difficult.

Given all this, we claim the unsolved problems are:

Arriving at a broadly understood consensus about who and how to do monitoring, organizationally speaking
Establishing some best practices so we can make it easy for the nonmonitoring greater than 50 percent of practitioners to start monitoring
Aligning with emerging industry-practice approaches in other contexts, particularly around efficiency of metrics transmission and storage; for example, it might be possible to provide model inspectability by recording every model weight lookup, but this would be immensely costly and present its own significant search problem

Efficiency, cost management, and stranded capacity

At this particular moment, open-market GPU hardware that specifically enables AI has two methods of pricing: new, more performant, and nonlinearly expensive versus older, more reliable, and with commodity pricing. Providers using the open-market GPU hardware often spend a lot of money to get access to the latest technology—in other words, prototype pricing structure—and as a result, many of those accelerators are hideously expensive. (For example, in 2024, NVIDIA's DGX B200 with eight cards was about $500,000.) As a result, using those resources efficiently can make or break a company.

The classical world actually has good solutions for this type of problem. Not every optimization trick we have is available for every provider/problem combination, but in general the techniques are broadly applicable. Foremost among these are load balancing and query cost estimation. Load balancing permits constructing a set of machines that are equally capable of answering some query class and therefore helps with both efficient usage and reliability—for example, there is no machine specifically dedicated to Spanish queries, so if there aren't many Spanish queries for some reason, the resources aren't lying idle; also, if a particular machine is broken or has some other problem, it can be removed from the set of generally available machines. In order to do this well, though, you have to have some idea of how difficult a query is going to be in advance. This is called query cost estimation, and there are a number of computationally cheap and effective approaches in the classical world (e.g., URL path or SQL query length).

Those classical techniques, however, fall apart in the more complex ML world. Cost estimation in LLM serving is extremely difficult. First, given that tokenization is a huge part of how LLMs understand their input, even just figuring out the number of tokens in a prompt represents a significant portion of total prompt-processing cost. This means that you might get an accurate proxy for total cost at the end, but it's not cheap. Second, it's even harder to figure out how many tokens a response to a prompt will be. It varies by model, by context length, and by system prompt. But those are not even the largest part of the query routing problem.

The problem, to use the previous terminology, is that there are multiple, disjoint query classes that have to be routed separately. This is largely an architectural requirement imposed by the fact that models are quite large and require specialized hardware to run on, but it is nonetheless necessary. Model proliferation means that a provider may have many different models optimized for a number of different use cases: for example, smaller and larger versions of the same model architecture, deployments with different maximum context length so users can optimize for cost or performance, etc.

Moreover, if there is any kind of prompt caching, prioritization of certain requests, or long-running queries (which can be up to tens of minutes long in some cases), these factors impose additional constraints on query routing—essentially so many constraints have to be honored that the balancing "space" is partitioned to impractical sizes. As a result, the classical approaches can't be recycled easily—simple token-based load balancing won't correctly estimate the costs, and the available pools to balance between are partitioned too finely for real efficiency. Thus, every additional pool that is created naturally creates more stranded GPU resources, and the more instances you have of almost the same model, the more money you're losing.

It is possible to partially address the problem of stranded resources by using them for batch processing so the GPUs are used for something, but this is hardly a real resolution. Historical data at a number of large compute organizations familiar to the authors indicates clearly that batch demand never grows large enough to substantially improve efficiency.

Given all this, we claim the unsolved problems are:

Arriving at a practical way to do some kind of query cost estimation
Developing better load-balancing techniques
Performing better capacity planning and capacity management, such that the footprint problem and the proliferation problem can be reasonably managed

At the moment ad-hoc, proprietary approaches dominate. It would be preferable if we weren't all solving this in our own individual ways.

Data leakage, injection, and security with LLMs

In this case, data leakage is defined as "the LLM outputting to the user something it shouldn't, because the data in question is either confidential, inappropriate to release to that person/API consumer, or has some other constraint (such as copyright)." Injection is sending something to the model that causes it to behave in ways not desired by the model provider (often, as described previously, outputting something to that user that it shouldn't). This is also known as jailbreaking.

This problem inarguably has an architectural component, but it also has an operational component, and it's under this heading that we mention it. Today, the only technique we know that works to prevent inappropriate data leakage from an LLM is to remove what you care about from the training data. (That won't prevent it from being emitted, since of course the LLM could hallucinate the same string in some circumstances, but it will go a long way toward moderating it.) This is in the case where model behavior is not being subverted! Conversely, jailbreaking and injection are currently not preventable in principle, and the only effective practical response relies on defense in depth such as strong live monitoring, fast mitigation rules, and failover.

Ultimately, LLMs are difficult to control. Data controls might have some prospect of preventing leakage at emission time but obviously limit the full potential success of the model. A number of operational approaches are possible—specifically implementing some kind of filter at egress time and looking for bad classes of data there to remove them—but, again, we are addressing this in proprietary ways (see, for example, Anthropic's Responsible Scaling Policy and Opus 4, including AI Safety Level 3 controls.), and there isn't yet a body of practice here.

Given all this, we claim that the unsolved problems are:

Inventing a way to prevent things being emitted at source rather than filtered out, presuming they need to stay in the training data
Generalizing or otherwise making universally accessible a way to safely filter inbound requests such that jailbreaking is impossible or effectively rendered impractical

Not Unsolved Problems

There are a wide variety of problems in MLOps that don't rise to the level of being an "unsolved problem." For example, the training phase is often fragile to network-performance problems in particular—losing a single link causing a training slowdown of 50 percent is not unheard of—but there are a variety of known solutions for this. Unfortunately, most of those involve provisioning redundant/additional links or making the model state more redundant and distributed, both of which add significant costs. There's no inherently novel work that has to be done here to make the situation better. (We, of course, welcome novel work to correct such problems cheaply.)

Summary

As discussed at the beginning of this article, the excitement with AI is carrying us along in a big wave, but the practitioners whose job it is to make this all work are scrambling behind the scenes, often more in dread than excitement. In some cases, they are using outdated techniques; in others, approaches that only work for now; and every so often they are doing nothing at all in order to meet significant operational, technical, and business challenges.

In MLOps terms, it sometimes feels that we are using older paradigms to manage a thoroughly new situation, and it's not entirely clear that we really see it like this. We should be casting about for either a better paradigm or a better patching-up of the existing paradigms than is available today. Regardless, we hope that the summary of the problems presented here is a useful stimulant to people attempting to think about them more holistically and, hopefully, helps to provide some answers.

Acknowledgments

Special thanks to John Lunney, Demetrios Brinkmann, and Maria Jackson for their support in writing this article.

Niall Richard Murphy has worked in computing infrastructure since the mid-1990s and has been employed by every major cloud provider (specifically Amazon, Google, and Microsoft) from their Dublin, Ireland, offices in a variety of roles from individual contributor to director. He is currently CEO/founder of Stanza Systems, a small startup in the ML/AI/reliability space. He is the instigator, co-author, and editor of multiple award-winning books on networking, reliability, and machine learning, and he is probably one of the few people in the world to hold degrees in computer science, mathematics, and poetry studies. He lives in Dublin with his wife and two children.

Todd Underwood leads reliability at Anthropic, a company trying to create AI systems that are safe, reliable, and beneficial to society. Prior to that he briefly led reliability for the Research Platform at Open AI. Before that he was a senior engineering director at Google, leading ML capacity engineering at Alphabet. He also founded and led ML Site Reliability Engineering, a set of teams that build and scale internal and external AI/ML services. He was also the site lead for Google's Pittsburgh office. Along with several colleagues, he published Reliable Machine Learning: Applying SRE Principles to ML in Production (O'Reilly Press, 2022). Underwood has a B.A. in philosophy from Columbia University and an M.S. in computer science from the University of New Mexico.

Originally published in Queue vol. 23, no. 4—
Comment on this article in the ACM Digital Library