
Improving Testing of Deep-learning Systems

A combination of differential and mutation testing results in better test data.

Harsh Deokuliar, Raghvinder S. Sangwan, Youakim Badr, Satish M. Srinivasan

AI (artificial intelligence) and ML (machine learning) are finding applications in many domains. With their continued success, however, come significant challenges and uncertainties. These include:

  • How much data is enough to train production-ready AI systems, and is it even feasible?
  • Is the data representative of the complete distribution of the problem being solved?
  • Are the results of the system transparent and explainable?
  • How is testing defined in the realm of AI systems?
  • Are AI systems truly usable in safety-critical environments?
  • What are the boundaries of the ethical use of AI systems?

    This article examines testing in the realm of AI systems, focusing on one aspect of this challenge: the quality of the test data (the data on which an ML model is evaluated) in deep-learning systems. These systems, a subset of ML, are data-driven, so it is critical that, after training, they are evaluated on a test dataset that is a diverse representation of the training data distribution. Often, the test data lacks such balanced representation, leading to incorrect conclusions about model performance.

    Differential testing was used to generate test data to improve the diversity of data points in the test dataset; then mutation testing was used to check the quality of the test data in terms of diversity. The differential testing was done using DeepXplore3 and mutation testing using DeepMutation.2 Combining differential and mutation testing in this fashion improves the mutation score, a test-data quality metric, indicating overall improvement in testing effectiveness and quality of the test data.


    Combining Differential and Mutation Testing

    DeepXplore is a differential testing technique that uses differences in the decision boundaries of multiple models to generate test data, enabling it to discover many erroneous behaviors in DNN (deep neural network) models. Using gradient ascent, it perturbs test inputs toward the decision boundaries of the DNN models, solving a joint optimization that maximizes both neuron coverage and the number of erroneous behaviors surfaced across the models.
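    The shape of that joint objective can be illustrated with a small sketch. The function names, the softmax-vector inputs, and the λ weights here are illustrative assumptions; the actual tool differentiates this objective with respect to the input image and performs gradient ascent on it.

```python
# Sketch of DeepXplore-style joint objective for one seed input, assuming
# we already have each model's softmax output and a target neuron's
# activation. lambda1/lambda2 weight the two optimization goals.

def differential_objective(probs, target_idx, label, lambda1=1.0):
    """Reward inputs that models[target_idx] classifies differently
    from the other models on class `label`."""
    others = sum(p[label] for i, p in enumerate(probs) if i != target_idx)
    return others - lambda1 * probs[target_idx][label]

def joint_objective(probs, target_idx, label, inactive_neuron, lambda1=1.0, lambda2=1.0):
    """Joint optimization: differential behavior plus a term pushing an
    inactive neuron's activation upward (neuron coverage)."""
    return differential_objective(probs, target_idx, label, lambda1) + lambda2 * inactive_neuron
```

Maximizing this objective moves the input toward a region where the target model disagrees with the others while also activating previously uncovered neurons.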

    Mutation testing, a well-established technique for testing software systems, introduces mutants (bugs/faults) into a system to check if these mutants are correctly identified when the system is tested. DeepMutation, a mutation testing framework for deep-learning systems, achieves the same purpose through a collection of data, program, and model mutation operators that are used to inject errors into DNN models. The extent to which the implanted flaws could be recognized by executing these models on a test dataset can be used to assess the quality of test data. Figure 1 shows the general workflow of mutation testing.

    [Figure 1: The general workflow of mutation testing]

    As shown in the figure, the complete test dataset T is executed against the deep-learning system S, and only the subset T' of tests that pass is used for mutation testing. All mutants in S' are executed on T'; when the test result of a mutant s' ∈ S' differs from that of S, s' is killed; otherwise, s' survives. The mutation score is calculated as the ratio of killed mutants to all generated mutants (i.e., number of mutants killed / total mutants) and indicates the quality of the test dataset: the higher the score, the better.
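    The score computation just described can be sketched in a few lines. The callable classifiers and test inputs here are stand-ins, not the study's artifacts.

```python
# Minimal sketch of the mutation score: a mutant is killed if it
# disagrees with the original system on at least one passing test input.

def mutation_score(original, mutants, T_prime):
    """original, mutants: callables mapping input -> predicted label.
    T_prime: the subset of tests the original system already passes."""
    killed = 0
    for mutant in mutants:
        if any(mutant(x) != original(x) for x in T_prime):
            killed += 1
    return killed / len(mutants)

original = lambda x: x % 10
killed_mutant = lambda x: (x + 1) % 10   # always disagrees -> killed
surviving_mutant = lambda x: x % 10      # identical -> survives
print(mutation_score(original, [killed_mutant, surviving_mutant], range(5)))  # → 0.5
```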

    The mutation score evaluates how well the test data covers mutated models in terms of target class variety. To improve this coverage, the test dataset T is augmented by generating additional test cases using DeepXplore. Figure 2 shows the combined workflow when using differential and mutation testing.

    [Figure 2: Combined workflow of differential and mutation testing]


    Experiments and Results

    The combined testing approach was used to run several experiments on the MNIST dataset,1 which contains handwritten digit images from 0 to 9. The dataset contains 60,000 training samples and 10,000 test samples. The experiments were run on three DNN models used in the DeepXplore study:

  • Model 1. Contains two conv2D layers and a Maxpooling2D layer. Conv2D layer 1 contains four filters with a 5×5 kernel, and conv2D layer 2 contains 12 filters with a 5×5 kernel. The single Maxpooling2D layer contains a 2×2 kernel. This is followed by a flatten layer and a dense layer with 10 units.

  • Model 2. Contains two conv2D layers and two Maxpooling2D layers. Conv2D layer 1 contains six filters with a 5×5 kernel, and conv2D layer 2 contains 16 filters with a 5×5 kernel. Each Maxpooling2D layer contains a 2×2 kernel. This is followed by a flatten layer and two dense layers with 84 and 10 units.

  • Model 3. Has the same structure as model 2 but with three dense layers of 120, 84, and 10 units, respectively.
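    As a rough sanity check on these layer specifications, per-layer trainable-parameter counts can be computed directly. This is a sketch: the single-channel MNIST input and the presence of bias terms are assumptions, since the article does not state them.

```python
# Per-layer trainable-parameter formulas for the conv2D and dense layers
# described above (assuming bias terms and a single-channel input).

def conv2d_params(kh, kw, in_channels, filters):
    # each filter has kh*kw*in_channels weights plus one bias
    return (kh * kw * in_channels + 1) * filters

def dense_params(in_units, out_units):
    # fully connected: one weight per input per unit, plus one bias per unit
    return (in_units + 1) * out_units

# Model 2's layers under these assumptions:
print(conv2d_params(5, 5, 1, 6))    # conv2D layer 1 → 156
print(conv2d_params(5, 5, 6, 16))   # conv2D layer 2 → 2416
print(dense_params(84, 10))         # final dense layer → 850
```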


    The model parameters are summarized in Table 1.

    [Table 1: Model parameters]

    The experiments ran iteratively on subsets of the training and test data: 5,000 images were used as training data, and the models were evaluated on 1,000 images of test data. Mutant models were created using 13 mutation operators at both the source-data level and the model level. These mutation operators are summarized in Table 2.

    [Table 2: Mutation operators]

    A mutated model was used for quantifying the quality of the test data only if the difference between its accuracy and the original model's accuracy on the test dataset was 20 percent or less.
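    That filtering rule can be sketched as follows. Interpreting the threshold as an absolute difference in accuracy is an assumption, and the mutant names and accuracy values are illustrative.

```python
# Keep a mutated model only if its test accuracy is within 20 percentage
# points of the original model's accuracy (threshold interpretation assumed).

def keep_mutant(original_acc, mutant_acc, threshold=0.20):
    return abs(original_acc - mutant_acc) <= threshold

mutant_accuracies = {"m1": 0.95, "m2": 0.75, "m3": 0.55}
kept = [m for m, acc in mutant_accuracies.items() if keep_mutant(0.92, acc)]
print(kept)  # → ['m1', 'm2']
```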

    These experiments led to a total of 12 × 10 = 120 iterations for each model, with different training and test data in each iteration (12 subsets of training data of 5,000 images each from the 60,000 images in the MNIST training dataset; 10 subsets of test data of 1,000 images each from the 10,000 images in the MNIST test dataset). The mutation score was calculated for each iteration and then averaged for each model over the 120 iterations. Table 3 shows the average mutation score for each model.
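    The experimental grid can be sketched as below; treating the subsets as contiguous index chunks is an assumption made for illustration.

```python
from itertools import product

# 12 disjoint training subsets of 5,000 images and 10 disjoint test
# subsets of 1,000 images give 12 x 10 = 120 (train, test) iterations.
train_subsets = [range(i * 5000, (i + 1) * 5000) for i in range(12)]
test_subsets = [range(j * 1000, (j + 1) * 1000) for j in range(10)]

iterations = list(product(train_subsets, test_subsets))
print(len(iterations))  # → 120
```

Each iteration yields one mutation score; the 120 scores are then averaged per model.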

    [Table 3: Average mutation score for each model]

    Next, differential testing was used to generate additional test cases, which were added to the existing test dataset; then the mutation testing experiments were rerun to see if the generated test cases improved the mutation score and, therefore, the quality of the test data and the testing effectiveness. To generate the new test cases, three modifications (occlusion, blackout, and light transformation) were applied to existing data points on which the models returned different outputs.
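    Minimal sketches of the three kinds of modification are below, operating on a grayscale image stored as a list of rows of pixel values in [0, 255]. The exact parameters used in the study are not given, so the patch geometry and brightness factor here are illustrative assumptions.

```python
def occlude(img, top, left, size):
    """Zero out a size×size patch (a small black occlusion)."""
    out = [row[:] for row in img]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[0]))):
            out[r][c] = 0
    return out

def blackout(img, row):
    """Black out one whole row of the image."""
    out = [r[:] for r in img]
    out[row] = [0] * len(out[row])
    return out

def light_transform(img, factor=1.5):
    """Scale brightness, clipping to the valid pixel range."""
    return [[min(255, int(p * factor)) for p in row] for row in img]
```

Each function returns a new image, leaving the original data point untouched so both versions can be kept in the test set.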

    For these experiments, the number of random seed inputs was 500, the number of gradient-ascent iterations was 10, and the neuron-activation threshold was 0.25; gradient ascent was performed only on model 1. The hyperparameters weighting neuron coverage and differential behavior in the joint optimization for data generation were both set to 1.

    Once the test cases were generated, a manual inspection was performed to remove unidentifiable generated images, the reasoning being that AI imitates human judgment: if humans cannot identify an image, neither should an AI system be expected to. Table 4 provides the number of test cases generated.

    [Table 4: Number of test cases generated]

    The new test cases were then partitioned into three sets, one per model. For each model, the generated test cases added were those that model predicted incorrectly; any such case can be considered a corner-case data point, since these points are generated from differences in the models' decision boundaries. For example, in Figure 3, the second image in the sequence is a differential test data point, and the third image is a modification of that data point. The generated point was added to the test data of model 1 because that model incorrectly predicts it as 8 instead of 4, making it a corner case for that model. Table 5 shows the counts of generated test data for each model.

    [Figure 3: A differential test data point and its modification]


    [Table 5: Counts of generated test data for each model]
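    The per-model assignment described above can be sketched as follows; the stand-in predict functions and labels are illustrative, not the study's models.

```python
# Assign each generated test case to the test set of every model that
# mispredicts it, since such cases sit near that model's decision boundary.

def assign_corner_cases(generated, models):
    """generated: list of (input, true_label); models: dict name -> callable."""
    per_model = {name: [] for name in models}
    for x, label in generated:
        for name, predict in models.items():
            if predict(x) != label:
                per_model[name].append((x, label))
    return per_model

# Illustrative stand-ins: model_a mispredicts odd digits, model_b is perfect.
models = {"model_a": lambda x: x if x % 2 == 0 else (x + 1) % 10,
          "model_b": lambda x: x}
generated = [(4, 4), (7, 7)]
print(assign_corner_cases(generated, models))
# → {'model_a': [(7, 7)], 'model_b': []}
```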

    Finally, the mutation testing experiments were rerun for all three models with the generated test cases included in the test set, to check whether they truly improved the average mutation scores. The same iterative approach as the previous experiment used 5,000 training samples, but this time the test datasets contained 1,105, 1,087, and 1,078 samples for models 1, 2, and 3, respectively. Table 6 shows the results of the mutation testing experiments on the new test data samples.

    [Table 6: Mutation testing results on the new test data]

    These experiments showed a significant increase in the average mutation score, which indicates that the new test dataset has higher class diversity in terms of covering the mutated models. A higher mutation score signifies that the test data is better able to kill the target classes of the mutated models, indicating higher test-data quality and testing effectiveness.



    This work studied the effect on the quality of testing of deep-learning systems when mutation testing is combined with test cases generated using differential testing. Mutation testing allows the test-data quality to be assessed using a mutation score that measures how much of the test data kills the target classes of the mutated models. On average, these experiments showed an increase of about 6 percent in the mutation score, indicating improved testing effectiveness and test-data quality when generated test cases from differential testing are included in the test dataset for mutation testing.



    This material is based on work funded and supported by the 2020 IndustryXchange Multidisciplinary Research Seed Grant from Pennsylvania State University.



    1. LeCun, Y., Cortes, C., Burges, C. The MNIST database of handwritten digits.

    2. Ma, L., Zhang, F., Sun, J., Xue, M., Li, B., Juefei-Xu, F., Xie, C., Li, L., Liu, Y., Zhao, J., Wang, Y. 2018. DeepMutation: mutation testing of deep learning systems. In IEEE 29th International Symposium on Software Reliability Engineering (ISSRE).

    3. Pei, K., Cao, Y., Yang, J., Jana, S. 2017. DeepXplore: automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles.


    Harsh Deokuliar is a graduate student pursuing an M.S. degree in data analytics at Pennsylvania State University. His research interests include testing and explainability of AI systems, and computer vision using drone-based imagery.

    Raghvinder S. Sangwan earned his Ph.D. in computer and information sciences from Temple University. He is the director of engineering programs and a professor of software engineering at Pennsylvania State University. His teaching and research involve analysis, design, and development of software-intensive systems, their architecture, and automatic/semiautomatic approaches to assessment of their design and complexity, technical debt management, and AI engineering. He actively consults for Siemens Corporate Technology in Princeton, N.J., and is affiliated as a visiting scientist with the Software Engineering Institute at Carnegie Mellon University. He is an IEEE distinguished contributor and senior member of the ACM.

    Youakim Badr received a Ph.D. in computer science from the National Institute of Applied Sciences (INSA-Lyon), France. He is a professor of data analytics and artificial intelligence, and professor-in-charge of the Master of Artificial Intelligence program at Pennsylvania State University Great Valley. Dr. Badr's research centers on the design and deployment of trustworthy AI service systems, taking a comprehensive and interdisciplinary approach that emphasizes data-centric AI analytics, trustworthy AI systems, and composable AI systems. He has more than 140 peer-reviewed publications, including three books, and serves as a reviewer for national and international research funding programs (NSF, ANR, NSERC, Horizon Europe). Dr. Badr is a lifetime member of ACM and an academic associate member of the Linux Foundation for AI and Data (LFAI&Data).

    Satish M. Srinivasan received a B.E. in information technology from Bharathidasan University in India and an M.S. in industrial engineering and management from the Indian Institute of Technology in Kharagpur, India. He earned his Ph.D. in information technology from the University of Nebraska at Omaha. Prior to joining Penn State Great Valley, he worked as a postdoctoral research associate at the University of Nebraska Medical Center, Omaha. He teaches courses related to database design; data mining; data collection and cleaning; computer, network, and web security; and business process management. His research interests include data aggregation in partially connected networks, fault-tolerance, software engineering, social network analysis, data mining, machine learning, big data, and predictive analytics and bioinformatics.

    Copyright © 2023 held by owner/author. Publication rights licensed to ACM


    Originally published in Queue vol. 21, no. 5

