The Morning Paper


Putting Machine Learning into Production Systems

Data validation and software engineering for machine learning

Adrian Colyer

This time around with The Morning Paper I've chosen two papers that address different aspects of putting machine learning into production systems. In "Data Validation for Machine Learning," Breck et al. share details of the pipelines used at Google to validate petabytes of production data every day. With so many moving parts, it's important to be able to detect and investigate changes in data distributions before they can impact model performance. As a bonus, the data-validation library at the core of Google's approach has also been made available in open source so that you can experiment with it, too (https://github.com/tensorflow/data-validation).

"Software Engineering for Machine Learning: A Case Study" shares lessons learned at Microsoft as machine learning started to pervade more and more of the company's systems, moving from specialized machine-learning products to simply being an integral part of many products and services. This means that software-engineering processes and practices on those projects have had to adapt. This paper demonstrates once again the importance of a rock-solid data pipeline, as well as some of the unique challenges that machine learning presents to development projects.

 

Data Validation for Machine Learning

Breck, et al., SysML'19 (Conference on Systems and Machine Learning)

https://www.sysml.cc/doc/2019/167.pdf

(alternate link: https://mlsys.org/Conferences/2019/doc/2019/167.pdf)

 

Previously in The Morning Paper we looked at continuous integration testing of ML (machine learning) models, but arguably even more important than the model is the data. Garbage in, garbage out.

 

In this paper we focus on the problem of validating the input data fed to ML pipelines. The importance of this problem is hard to overstate, especially for production pipelines. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model.

 

Breck et al. describe the data-validation pipeline deployed in production at Google, "used by hundreds of product teams to continuously monitor and validate several petabytes of production data per day." That's trillions of training and serving examples per day, across more than 700 ML pipelines—more than enough to have accumulated some hard-won experience on what can go wrong and the kinds of safeguards it is useful to have in place!

 

What could possibly go wrong?

The motivating example is based on an actual production outage at Google and demonstrates a couple of the trickier issues: feedback loops caused by training on corrupted data; and distance between data providers and data consumers.

An ML model is trained daily on batches of data, with real queries from the previous day joined with labels to create the next day's training data. Somewhere upstream, a data-fetching RPC (remote procedure call) starts failing on a subset of the data and returns -1 (error code) instead of the desired data value. The -1 error codes are propagated into the serving data and everything looks normal on the surface since -1 is a valid value for the int feature. The serving data eventually becomes training data, and the model quickly learns to predict -1 for the feature value. The model will now underperform for the affected slice of data.

 

This example illustrates a common setup where the generation (and ownership!) of the data is decoupled from the ML pipeline... a lack of visibility by the ML pipeline into this data generation logic except through side effects (e.g., the fact that -1 became more common on a slice of the data) makes detecting such slice-specific problems significantly harder.

 

Errors caused by bugs in code are common and tend to be different from the types of errors commonly considered in the data-cleaning literature.

 

Integrating data validation in ML pipelines

Data validation at Google is an integral part of ML pipelines, as shown in the following figure.

 

Pipelines typically work in a continuous fashion with the arrival of a new batch of data triggering a new run. The pipeline ingests the training data, validates it, sends it to a training algorithm to generate a model, and then pushes the trained model to a serving infrastructure for inference.

The data-validation stage has three main components: the data analyzer computes statistics over the new data batch; the data validator checks properties of the data against a schema; and the model unit tester looks for errors in the training code using synthetic data (schema-led fuzzing).
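
The open-source TFDV (TensorFlow Data Validation) library linked above packages the first two of these components behind a small Python API. Here is a minimal sketch of the analyzer and validator steps, assuming a CSV batch and illustrative file names:

# A minimal sketch of the analyzer + validator steps using the open-source
# TensorFlow Data Validation (TFDV) library; file names are illustrative.
import tensorflow_data_validation as tfdv

# Data analyzer: compute summary statistics over the new batch of data.
train_stats = tfdv.generate_statistics_from_csv('batch_2019_04_01.csv')

# First run only: synthesize an initial schema, which is then checked in
# and maintained under version control by the pipeline owners.
schema = tfdv.infer_schema(statistics=train_stats)

# Data validator: check the batch's statistics against the schema.
anomalies = tfdv.validate_statistics(statistics=train_stats, schema=schema)
for feature_name, info in anomalies.anomaly_info.items():
    # Each anomaly names the offending feature and carries a human-readable
    # description, which is what gets surfaced to the on-call engineer.
    print(feature_name, info.description)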

 

Testing one batch of data

Given a single batch of incoming data, the first question to answer is whether or not it contains any anomalies. If so, the on-call engineer will be alerted to kick-start an investigation.

 

We expect the data characteristics to remain stable within each batch, as the latter corresponds to a single run of the data-generation code. We also expect some characteristics to remain stable across several batches that are close in time, since it is uncommon to have frequent drastic changes to the data-generation code. For these reasons, we consider any deviation within a batch from the expected data characteristics, given expert domain knowledge, as an anomaly.

 

The expected data characteristics are captured by a schema, as in the following figure.

 

Constraints specified in the schema can be used to ensure that a certain feature is present (for example), or contains one of an expected set of values, and so on.

An initial version of the schema is synthesized automatically, after which it is version controlled and updated by the engineers. With an initial schema in place, the data validator recommends updates as new data is ingested and analyzed. For example, given the training data on the left in the following figure, the schema on the right is derived.

 

If some data arrives with a previously unseen value for event, then the user will be prompted to consider adding the new value to the domain.
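
In terms of the open-source library, accepting that value is a small, reviewable edit to the version-controlled schema. A sketch, assuming event is a string feature with an explicit domain and using an illustrative new value:

# Sketch: extend the domain of the categorical feature 'event' so that a
# previously unseen (but now expected) value no longer triggers an anomaly.
import tensorflow_data_validation as tfdv

schema = tfdv.load_schema_text('schema.pbtxt')   # the version-controlled asset

event_domain = tfdv.get_domain(schema, 'event')  # StringDomain proto
event_domain.value.append('PURCHASE')            # illustrative new value

tfdv.write_schema_text(schema, 'schema.pbtxt')   # goes back through code review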

 

We expect owners of pipelines to treat the schema as a production asset at par with source code and adopt best practices for reviewing, versioning, and maintaining the schema.

 

Detecting skew

Some anomalies show up only when comparing data across different batches—for example, skew between training and serving data:

• Feature skew occurs when a particular feature assumes different values in training versus serving time. For example, a developer may have added or removed a feature. Or harder to detect, data may be obtained by calling a time-sensitive API, such as retrieving the number of clicks so far, and the elapsed time could be different in training and serving.

• Distribution skew occurs when the distribution of feature values over a batch of training data is different from that seen at serving time. For example, sampling of today's data is used for training the next day's model, and there is a bug in the sampling code.

• Scoring/serving skew occurs when the way results are presented to the user feeds back into the training data. For example, scoring 100 videos but presenting only the top 10 means the other 90 will not receive any clicks.

Google's ML serving infrastructure logs samples of the serving data, and this is imported back into the training pipeline where the data validator uses it to detect skew.

To detect feature skew, the validator does a key-join between corresponding batches of training and serving data followed by a featurewise comparison.
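
Conceptually, that join-and-compare step looks something like the following hypothetical sketch, with pandas standing in for Google's internal infrastructure and with illustrative file, key, and feature names:

# Hypothetical sketch of feature-skew detection: join training and serving
# examples on a shared key and report how often each feature value differs.
import pandas as pd

train = pd.read_csv('training_batch.csv')    # columns: example_id, features...
serving = pd.read_csv('serving_sample.csv')  # logged sample of serving data

joined = train.merge(serving, on='example_id', suffixes=('_train', '_serving'))
for feature in ['country', 'num_clicks']:    # illustrative feature names
    mismatch_rate = (joined[feature + '_train'] != joined[feature + '_serving']).mean()
    if mismatch_rate > 0:
        print('%s: %.2f%% of joined examples differ' % (feature, 100 * mismatch_rate))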

To detect distribution skew, the distance between the training and serving distributions is used. Some distance is expected, but if it is too high an alert will be generated. There are classic distance measures such as KL (Kullback-Leibler) divergence and cosine similarity, but product teams had a hard time understanding what they really meant and, hence, how to tune thresholds.

In the end Google settled on using as a distance measure the largest change in probability for any single value in the two distributions. This is easy to understand and configure (e.g., "allow changes of up to 1% for each value"), and each alert comes with a "culprit" value that can be used to start an investigation. Going back to our motivating example, the highest change in frequency would be associated with -1.
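
The open-source library exposes this measure as an L-infinity-norm skew comparator, configured per feature in the schema. A sketch, reusing the illustrative file names from above:

# Sketch: flag distribution skew on feature 'event' when any single value's
# frequency shifts by more than 1% between training and serving data.
import tensorflow_data_validation as tfdv

schema = tfdv.load_schema_text('schema.pbtxt')
tfdv.get_feature(schema, 'event').skew_comparator.infinity_norm.threshold = 0.01

train_stats = tfdv.generate_statistics_from_csv('training_batch.csv')
serving_stats = tfdv.generate_statistics_from_csv('serving_sample.csv')

# Any skew anomaly reported here identifies the "culprit" value with the
# largest change in frequency, -1 in the motivating example.
anomalies = tfdv.validate_statistics(statistics=train_stats,
                                     schema=schema,
                                     serving_statistics=serving_stats)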

 

Model unit testing

Model unit testing is a little different, because it isn't validation of the incoming data but rather validation of the training code's ability to handle the variety of data it may see. Model unit testing would fit very nicely into the CI (continuous integration) setup addressed previously.

 

...[training] code is mostly a black box for the remaining parts of the platform, including the data-validation system, and can perform arbitrary computations over the data. As we explain below, these computations may make assumptions that do not agree with the data and cause serious errors that propagate through the ML infrastructure.

 

For example, the training code may apply a logarithm over a numeric feature, making the implicit assumption that the value will always be positive. These assumptions may well not be present in the schema (which just specifies an integer feature). To flush these out, the schema is used to generate synthetic inputs in a manner similar to fuzz testing, and the generated data is then used to drive a few iterations of the training code.

 

In practice, we found that fuzz-testing can trigger common errors in the training code even with a modest number of randomly-generated examples (e.g., in the 100s). In fact, it has worked so well that we have packaged this type of testing as a unit test over training algorithms, and included the test in the standard templates of our ML platform.
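
The test harness is internal to Google's platform, but the idea is easy to sketch as a hypothetical unit test: draw random examples that satisfy the schema's (deliberately loose) constraints, including its boundary values, and run them through the training code so that any hidden assumption surfaces as an exception. The schema dictionary, generator, and training step below are illustrative stand-ins, not the paper's APIs:

# Hypothetical sketch of schema-led fuzzing as a model unit test; the schema
# dict, generator, and training step are illustrative stand-ins.
import math
import random

# The schema only says the feature is an int in [-1, 10000]; it does not
# capture the training code's implicit "always positive" assumption.
SCHEMA = {'num_clicks': {'type': int, 'min': -1, 'max': 10_000}}

def synthetic_batch(schema, n=100):
    """Random examples that satisfy the schema, plus its boundary values."""
    examples = [{name: spec['min'] for name, spec in schema.items()},
                {name: spec['max'] for name, spec in schema.items()}]
    examples += [{name: random.randint(spec['min'], spec['max'])
                  for name, spec in schema.items()} for _ in range(n)]
    return examples

def training_step(example):
    # Implicit assumption: num_clicks > 0. A -1 error code or a zero count
    # makes math.log raise ValueError, which fails the unit test.
    return math.log(example['num_clicks'])

def test_training_code_against_schema():
    for example in synthetic_batch(SCHEMA):
        training_step(example)

if __name__ == '__main__':
    test_training_code_against_schema()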

 

Experiences in production at Google

Users do take ownership of their schemas after the initial generation, but the number of edits required is typically small, as shown in the following figure.

 

...anecdotal evidence from some teams suggests a mental shift towards a data-centric view of ML, where the schema is not solely used for data validation but also provides a way to document new features that are used in the pipeline and thus disseminate information across the members of the team.

 

The following table shows the kinds of anomalies detected in a 30-day period, and whether or not the teams took any action as a result. Product teams fix the majority of detected anomalies.

 

Furthermore, six percent of all model-unit testing runs find some kind of error, indicating that either the training code had incorrect assumptions or the schema was underspecified.

 

Related work

Finally, I just want to give a quick call out to the related work section in the paper (§7), which contains a very useful summary of works in the data-validation, monitoring, and cleaning space.

Google has made its data-validation library available as open-source software at https://github.com/tensorflow/data-validation.

 

Software Engineering for Machine Learning: A Case Study

Amershi, et al., ICSE'19 (International Conference on Software Engineering)

https://www.microsoft.com/en-us/research/publication/software-engineering-for-machine-learning-a-case-study/

 

Previously in The Morning Paper we've looked at the spread of machine learning through Facebook and Google and some of the lessons learned together with processes and tools to address the challenges. Today it's Microsoft's turn. More specifically, we'll look at the results of an internal study with more than 500 participants designed to figure out how product development and software engineering is changing at Microsoft with the rise of AI and ML.

 

...integration of machine learning components is happening all over the company, not just on teams historically known for it.

 

A list of application areas includes search, advertising, machine translation, predicting customer purchases, voice recognition, image recognition, identifying customer leads, providing design advice for presentations and word processing documents, creating unique drawing features, health care, improving gameplay, sales forecasting, decision optimization, incident reporting, bug analysis, fraud detection, and security monitoring.

As you might imagine, these are underpinned by a variety of different ML models. The teams doing the work are also varied in their makeup, some containing data scientists with many years of experience, and others just starting out. In a manner that's reminiscent of the online experimentation evolution model at Microsoft we looked at previously, data science moves from a bolt-on specialized skill to a deeply integrated capability over time:

 

Some software teams employ polymath data scientists, who "do it all," but as data science needs to scale up, their roles specialize into domain experts who deeply understand the business problems, modelers who develop predictive models, and platform builders who create the cloud-based infrastructure.

 

To help spread these skills through the company, a variety of tactics are used: a twice-yearly internal conference on machine learning and data science dedicates at least one day to the basics of technologies, algorithms, and best practices; internal talks are given year round on engineering details behind projects and cutting-edge advances from academic conferences; several teams host weekly open forums on ML and deep learning; and there are mailing lists and online forums with thousands of participants.

A survey informed by conversations with 14 experienced ML leaders within Microsoft was sent to 4,195 members of those internal mailing lists, garnering 551 replies. Respondents were well spread across data and applied science (42%), software engineering (32%), program management (17%), research (7%), and other (1%). Of the 551 respondents, 21% were managers, and the rest were individual contributors.

 

A general process

The generic ML process looks like the following figure.

 


That diagram is pretty self-explanatory, so I won't spell out all of the individual stages.

 

For simplicity the view in Figure 1 is linear, however, machine learning workflows are highly non-linear and contain several feedback loops. For example, if engineers notice that there is a large distribution shift between the training data and the data in the real world, they might want to go back and collect more representative data and rerun the workflow... This workflow can become even more complex if the system is integrative, containing multiple ML components which interact together in complex and unexpected ways.

 

Learnings and emerging best practices

• Having a seamless development experience covering (possibly) all the different stages in the process outlined here is important to automation. But getting there is far from easy.

 

It is important to develop a "rock solid data pipeline, capable of continuously loading and massaging data, enabling engineers to try out many permutations of AI algorithms with different hyper-parameters without hassle."

 

• IDEs with visual tools are useful when starting out with machine learning, but teams tend to outgrow them with experience.

• The success of ML-centric projects depends heavily on data availability, quality, and management.

 

In addition to availability, our respondents focus most heavily on supporting the following data attributes: "accessibility, accuracy, authoritativeness, freshness, latency, structuredness, ontological typing, connectedness, and semantic joinability."

 

• Microsoft teams found a need to blend traditional data management tools with their ML frameworks and pipelines. Data sources are continuously changing, and rigorous data versioning and sharing techniques are required. Models carry a provenance tag explaining which data they have been trained on and which version of the model was used; data sets are tagged with information about where they came from and the version of the code used to extract them. (A minimal illustrative sketch of such a tag follows this list.)

• ML-centric software also sees frequent revisions initiated by model changes, parameter tuning, and data updates, the combination of which can have a significant impact on system performance. To address this, rigorous rollout processes are required.

 

... [teams] developed systematic processes by adopting combo-flighting techniques (i.e., flighting a combination of changes and updates), including multiple metrics in their experiment score cards, and performing human-driven evaluation for more sensitive data categories.

 

• Model building should be integrated with the rest of the software development process, including common code repositories and tightly coupled sprints and standups.

• The support a team requires changes with its level of experience with ML, but regardless of experience level, support for data availability, collection, cleaning, and management remains the number-one concern.
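
As a hypothetical illustration of the provenance tags mentioned in the data-management bullet above (field names are illustrative, not Microsoft's), a tag needs only to bind a trained model to the exact data snapshot and code versions that produced it:

# Hypothetical sketch of a model provenance tag: enough metadata to trace a
# trained model back to the data and code that produced it.
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvenanceTag:
    model_version: str           # e.g., a model-registry identifier
    dataset_name: str            # logical name of the training data
    dataset_version: str         # immutable snapshot / partition identifier
    extraction_code_commit: str  # revision of the code that extracted the data
    training_code_commit: str    # revision of the training code

tag = ProvenanceTag(
    model_version='2019.04.01-1',
    dataset_name='click_logs',
    dataset_version='2019-03-31.snapshot',
    extraction_code_commit='a1b2c3d',
    training_code_commit='d4e5f6a',
)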

 


 

The big three

We identified three aspects of the AI domain that make it fundamentally different than prior application domains. Their impact will require significant research efforts to address in the future.

 

1. Discovering, managing, and versioning the data needed for machine-learning applications is much more complex and difficult than other types of software engineering. "While there are very well-designed technologies to version code, the same is not true for data..."

2. Model customization and model reuse require very different skills from those typically found in software teams ("you can't simply change parameters with a text editor").

3. AI components are more difficult to handle as distinct modules than traditional software components—models may be "entangled" in complex ways and experience non-monotonic error behavior.

 

While the first two points are self-explanatory, the third warrants a little more unpacking.

 

Maintaining strict module boundaries between machine learned models is difficult for two reasons. First, models are not easily extensible. For example, one cannot (yet) take an NLP model of English and add a separate NLP model for ordering pizza and expect them to work properly together... Second, models interact in non-obvious ways. In large scale systems with more than a single model, each model's results will affect one another's training and tuning processes.

 

Under these conditions, even with separated code, one model's effectiveness can change as a result of changes in another model. This phenomenon is sometimes known as component entanglement and can lead to non-monotonic error propagation: Improvements in one part of the system may actually decrease the overall system quality.

 

Adrian Colyer is a venture partner with Accel in London, where it's his job to help find and build great technology companies across Europe and Israel. (If you're working on an interesting technology-related business, he would love to hear from you at [email protected].) Prior to joining Accel, he spent more than 20 years in technical roles, including CTO at Pivotal, VMware, and SpringSource.

Copyright © 2019 held by owner/author. Publication rights licensed to ACM.

 

Reprinted with permission from https://blog.acolyer.org.

 

acmqueue

Originally published in Queue vol. 17, no. 4




