
The Effects of Mixing Machine Learning and Human Judgment

Collaboration between humans and machines does not necessarily lead to better outcomes.

Michelle Vaccaro and Jim Waldo

In 1997 IBM's Deep Blue beat the World Chess Champion Garry Kasparov in a six-game match. Since then, other programs have beaten human players in games ranging from Jeopardy to Go. Inspired by his loss, Kasparov decided in 2005 to test the success of Human+AI pairs in an online chess tournament.2 He found that the Human+AI team bested the solo human. More surprisingly, he also found that the Human+AI team bested the solo computer, even though the solo machine outperformed solo humans.

Researchers explain this phenomenon by emphasizing that humans and machines excel in different dimensions of intelligence.9 Human chess players do well with long-term chess strategies, but they perform poorly at assessing the millions of possible configurations of pieces. The opposite holds for machines. Because of these differences, combining human and machine intelligence produces better outcomes than when each works separately. People also view this form of human-machine collaboration as a possible way to mitigate bias in machine learning, a problem that has taken center stage in recent months.12

We decided to investigate this type of collaboration between humans and machines using risk-assessment algorithms as a case study. In particular, we looked at the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) algorithm, a well-known (perhaps infamous) risk-prediction system, and its effect on human decisions about risk. Many state courts use algorithms such as COMPAS to predict defendants' risk of recidivism, and these results inform bail, sentencing, and parole decisions.

Prior work on risk-assessment algorithms has focused on their accuracy and fairness, but it has not addressed their interactions with the human decision makers who serve as the final arbiters. In one study from 2018, Julia Dressel and Hany Farid compared risk assessments from the COMPAS software and from Amazon Mechanical Turk workers, and found that the algorithm and the humans achieved similar levels of accuracy and fairness.6 This study signals an important shift in the literature on risk-assessment instruments by incorporating human subjects to contextualize the accuracy and fairness of the algorithms. Dressel and Farid's study, however, divorces the human decision makers from the algorithm when, in practice, the two work in tandem.

Our work, consisting of two experiments, therefore first explores the influence of algorithmic risk assessments on human decision-making and finds that providing the algorithm's predictions does not significantly affect human assessments of recidivism. The follow-up experiment, however, demonstrates that algorithmic risk scores act as anchors that induce a cognitive bias: If we change the risk prediction made by the algorithm, participants assimilate their predictions to the algorithm's score.

The results thus highlight potential shortcomings of existing human-in-the-loop frameworks. On the one hand, when algorithms and humans make sufficiently similar decisions, their collaboration does not achieve improved outcomes. On the other hand, when algorithms fail, humans may not be able to compensate for their errors. Even if algorithms do not officially make decisions, they anchor human decisions in serious ways.

 

Experiment One: Human-Algorithm Similarity, not Complementarity

The first experiment examines the impact of the COMPAS algorithm on human judgments concerning the risk of recidivism. COMPAS risk scores were used because of the public data available on that system, its prominence in prior work on algorithmic fairness, and its use in numerous states.

 

Methods

The experiment entailed a 1 x 3 between-subjects design with the following treatments: control, in which participants saw only the defendant profiles; score, in which participants saw the defendant profiles and the defendants' COMPAS scores; and disclaimer, in which participants saw the defendant profiles, the defendants' COMPAS scores, and a written advisement about the COMPAS algorithm.

Participants evaluated a sequence of defendant profiles that included data on gender, race, age, criminal charge, and criminal history. These profiles described real people arrested in Broward County, Florida, drawn from the dataset that ProPublica used in its analysis of risk-assessment algorithms.1 While this dataset originally contained 7,214 entries, this study applied the following filters before sampling the 40 profiles presented to participants:

• Limit to black and white defendants. Prior studies of the accuracy and fairness of the COMPAS algorithm limit their analyses to white and black defendants.3,4,6 To compare the results from this experiment with those in prior studies, this study considers only the subset of defendants who identify as either African-American (black) or Caucasian (white).

• Exclude cannabis crimes. The pilot study showed participant confusion about cannabis-related crimes such as possession, purchase, and delivery. In the free-response section of the survey, participants made comments such as "Cannabis is fully legal here." To avoid confusion about the legality of cannabis in different states, this study excludes defendants charged with crimes whose descriptions contain the term cannabis (both filters are sketched below).
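A minimal pandas sketch of these filters and the subsequent sampling of 40 profiles, assuming the public ProPublica release (compas-scores-two-years.csv) and its column names, not the study's actual code:

import pandas as pd

df = pd.read_csv("compas-scores-two-years.csv")

# Filter 1: keep only defendants identified as African-American or Caucasian.
df = df[df["race"].isin(["African-American", "Caucasian"])]

# Filter 2: drop charges whose description mentions cannabis.
df = df[~df["c_charge_desc"].str.contains("cannabis", case=False, na=False)]

# Randomly sample the 40 profiles shown to participants.
profiles = df.sample(n=40, random_state=0)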

From this filtered dataset, 40 defendants were randomly sampled. For each defendant in the sample, a profile was generated containing information about demographics, the alleged crime, criminal history, and the algorithmic risk assessment. The descriptive paragraph in the control treatment assumed the following format, which built upon that used in Dressel and Farid's study:6

 

The defendant is a [RACE] [SEX] aged [AGE]. They have been charged with: [CRIME CHARGE]. This crime is classified as a [CRIMINAL DEGREE]. They have been convicted of [NON-JUVENILE PRIOR COUNT] prior crimes. They have [JUVENILE-FELONY COUNT] juvenile felony charges and [JUVENILE-MISDEMEANOR COUNT] juvenile misdemeanor charges on their record.

 

The descriptive paragraph in the score treatment added the following information:

 

COMPAS is risk-assessment software that uses machine learning to predict whether a defendant will commit a crime within the next two years. The COMPAS risk score for this defendant is [SCORE NUMBER]: [SCORE LEVEL].

 

Finally, the descriptive paragraph in the disclaimer treatment provided the following information below the COMPAS score, which mirrored the language the Wisconsin Supreme Court recommended in State v Loomis:18

 

Some studies of COMPAS risk-assessment scores have raised questions about whether they disproportionately classify minority offenders as having a higher risk of recidivism.
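To make the construction concrete, here is a hypothetical helper that assembles these descriptive paragraphs from a sampled data row; the field names (decile_score, score_text, and so on) are assumptions based on the public ProPublica release, not the survey's actual implementation.

def render_profile(row, treatment="control"):
    # Base paragraph shown in every treatment.
    text = (
        f"The defendant is a {row['race']} {row['sex']} aged {row['age']}. "
        f"They have been charged with: {row['c_charge_desc']}. "
        f"This crime is classified as a {row['c_charge_degree']}. "
        f"They have been convicted of {row['priors_count']} prior crimes. "
        f"They have {row['juv_fel_count']} juvenile felony charges and "
        f"{row['juv_misd_count']} juvenile misdemeanor charges on their record."
    )
    if treatment in ("score", "disclaimer"):
        # Score treatment adds the COMPAS description and the defendant's score.
        text += (
            " COMPAS is risk-assessment software that uses machine learning to "
            "predict whether a defendant will commit a crime within the next two years. "
            f"The COMPAS risk score for this defendant is {row['decile_score']}: "
            f"{row['score_text']}."
        )
    if treatment == "disclaimer":
        # Disclaimer treatment appends the written advisement.
        text += (
            " Some studies of COMPAS risk-assessment scores have raised questions "
            "about whether they disproportionately classify minority offenders as "
            "having a higher risk of recidivism."
        )
    return text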

 

Upon seeing each profile, participants were asked to provide their own risk-assessment scores for the defendant and indicate if they believed the defendant would commit another crime within two years. Using dropdown menus, they answered the questions shown in figure 1.

[Figure 1: Survey questions participants answered for each defendant profile]

We deployed the task remotely through the Qualtrics platform and recruited 225 respondents through Amazon Mechanical Turk, 75 for each treatment group. All workers could view the task title, "Predicting Crime"; the task description, "Answer a survey about predicting crime"; and the keywords associated with the task, "survey, research, and criminal justice." Only workers living in the United States could complete the task, and they could do so only once. In a pilot study with an initial test group of five individuals, the survey required an average of 15 minutes to complete. Because the length and content of the survey resembled those of Dressel and Farid's,6 we adopted their payment scheme, giving workers $1 for completing the task and a $2 bonus if the overall accuracy of their predictions exceeded 65 percent. This payment structure was intended to motivate participants to pay close attention and provide their best responses throughout the task.6,17

 

Results

Figure 2 shows the average accuracy of participants in the control, score, and disclaimer treatments. The error bars represent the 95 percent confidence intervals. The results suggest that providing COMPAS scores did not significantly affect the overall accuracy of human predictions of recidivism: the overall accuracy of predictions in the control treatment (54.2 percent) did not differ significantly from that in the score treatment (51.0 percent; p = 0.1460).

[Figure 2: Average prediction accuracy by treatment, with 95 percent confidence intervals]

The inclusion of a written advisement about the limitations of the COMPAS algorithm did not significantly affect the accuracy of human predictions of recidivism, either. Participants in the disclaimer treatment achieved an average overall accuracy rate of 53.5 percent, whereas those in the score condition achieved 51.0 percent; a two-sided t-test indicated that this difference was not statistically significant (p = 0.1492).
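For readers who want to reproduce this kind of comparison, here is a minimal sketch assuming per-participant accuracy scores and an independent-samples t-test; the arrays below are simulated placeholders near the reported group means, not the study's data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated placeholder accuracies: each participant predicted 40 defendants.
control_acc = rng.binomial(40, 0.542, size=75) / 40   # control treatment, n = 75
score_acc = rng.binomial(40, 0.510, size=75) / 40     # score treatment, n = 75

# Two-sided independent-samples t-test on mean accuracy between treatments.
t_stat, p_value = stats.ttest_ind(control_acc, score_acc)
print(f"control mean = {control_acc.mean():.3f}, "
      f"score mean = {score_acc.mean():.3f}, p = {p_value:.4f}")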

In the exit survey at the conclusion of the task block, 99 percent of participants responded that they found the task instructions clear, and 99 percent found the task satisfying. In their feedback, participants indicated they had positive experiences with the study, leaving comments such as: "I thoroughly enjoyed this task"; "It was a good length and good payment"; and "Very good task."

Participants did not mention the advisement when asked how they took the COMPAS scores into account. Rather, their responses demonstrated that they used the COMPAS scores in different ways: some ignored them, some relied heavily on them, some used them as starting points, and others used them as sources of validation.

Figure 3 presents excerpts of participant responses, along with a summary of answers, to the free-response question: "How did you incorporate the COMPAS risk scores into your decisions?"

[Figure 3: Excerpts and summary of participant responses on how they used the COMPAS risk scores]

 

Discussion

When assessing the risk that a defendant will recidivate, the COMPAS algorithm achieves a significantly higher accuracy rate than participants who assess defendant profiles (65.0 percent vs. 54.2 percent). The results from this experiment, however, suggest that merely providing humans with algorithms that outperform them in terms of accuracy does not necessarily lead to better outcomes. When participants incorporated the algorithm's risk score into their decision-making process, the accuracy rate of their predictions did not significantly change. The inclusion of a written advisement providing information about potential biases in the algorithm did not affect participant accuracy, either.

Given research in complementary computing that shows coupling human and machine intelligence improves their performance,2,9,11 this finding seems counterintuitive. Yet successful instances of human and machine collaboration occur under circumstances in which humans and machines display different strengths. Dressel and Farid's study demonstrates the striking similarity between recidivism predictions by Mechanical Turk workers and the COMPAS algorithm.6 This similarity may preclude the possibility of complementarity. Our study reinforces this similarity, indicating that the combination of human and algorithm is slightly (although not statistically significantly) worse than the algorithm alone and similar to the human alone.

Moreover, this study shows that the accuracy of participants' recidivism predictions does not significantly change when a written advisement about the appropriate use of the COMPAS algorithm is included. The Wisconsin Supreme Court mandated the inclusion of such an advisement without indicating that its effect on officials' decision-making had been tested.18 Psychology and survey-design research indicate that people often skim over such disclaimers, so the disclaimers fail to serve their intended purpose.10 Consistent with those findings, the results here suggest that written advisements accompanying algorithmic outputs may not significantly affect the accuracy of decisions.

 

Experiment Two: Algorithms as Anchors

The first experiment suggested that COMPAS risk scores do not affect human risk assessments, but research in psychology implies that algorithmic predictions may influence human decisions through a subtle cognitive bias known as the anchoring effect, in which individuals assimilate their estimates to a previously considered standard. Amos Tversky and Daniel Kahneman first theorized the anchoring heuristic in 1974 in a comprehensive paper that explains the psychological basis of the anchoring effect and provides evidence of the phenomenon through numerous experiments.19 In one experiment, for example, participants spun a roulette wheel that was rigged to stop at either 10 (low anchor) or 65 (high anchor). After spinning the wheel, participants estimated the percentage of African nations in the United Nations. Tversky and Kahneman found that participants who spun a 10 provided an average guess of 25 percent, while those who spun a 65 provided an average guess of 45 percent. They explained these results by noting that people make estimates by starting from an initial value and adjusting away from it, and those adjustments are typically insufficient.

While initial experiments investigating the anchoring effect recruited amateur participants,19 researchers have also observed similar anchoring effects among experts. In their seminal study from 1987, Gregory Northcraft and Margaret Neale recruited real estate agents to visit a home, review a detailed booklet containing information about the property, and then assess the value of the house.16 The researchers listed a low asking price in the booklet for one group (low anchor) and a high asking price for another group (high anchor). The agents who viewed the high asking price provided significantly higher valuations than those who viewed the lower price; the anchoring index of the listing price was 41 percent. Northcraft and Neale conducted an identical experiment among business school students with no real estate experience and observed similar results: the students in the high-anchor treatment gave higher valuations than those in the low-anchor treatment, with an anchoring index of 48 percent. Their findings, therefore, suggested that anchors such as listing prices bias the decisions of trained professionals and inexperienced individuals similarly.
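The anchoring index referred to throughout these studies is commonly computed as the shift in mean responses between the high- and low-anchor groups, expressed as a fraction of the gap between the anchors themselves:

Anchoring index = (mean high-anchor response − mean low-anchor response) / (high anchor − low anchor)

An index of 0 percent indicates the anchor had no influence; 100 percent indicates that responses moved one-for-one with the anchor.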

More recent research finds evidence of the anchoring effect in the criminal justice system. In 2006 Birte Englich, Thomas Mussweiler, and Fritz Strack conducted a study in which judges threw a pair of dice and then provided a prison sentence for an individual convicted of shoplifting.7 The researchers rigged the dice so that they would land on a low number (low anchor) for half of the participants and a high number (high anchor) for the other half. The judges who rolled a low number provided an average sentence of five months, whereas the judges who rolled a high number provided an average sentence of eight months. The difference in responses was statistically significant, and the anchoring index of the dice roll was 67 percent. In fact, similar studies have shown that sentencing demands,7 motions to dismiss,13 and damages caps15 also act as anchors that bias judges' decision-making.

 

Methods

This second experiment thus sought to investigate whether algorithmic risk scores influence human decisions by serving as anchors. The experiment entailed a 1 x 2 between-subjects design with the following two treatments: low-score, in which participants viewed the defendant profile accompanied by a low risk score; and high-score, in which participants viewed the defendant profile accompanied by a high risk score.

The low-score and high-score treatments assigned risk scores based on the original COMPAS score according to the following formulas:

 

Low-score = max(0,COMPAS − 3)
High-score = min(10,COMPAS + 3)
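
As a minimal sketch, the anchor assignment amounts to shifting each COMPAS decile score (which ranges from 1 to 10) down or up by three points and clamping to the bounds in the formulas above:

def anchored_scores(compas_score):
    # Shift the original COMPAS decile score by 3 in each direction,
    # clamped per the formulas above.
    low = max(0, compas_score - 3)
    high = min(10, compas_score + 3)
    return low, high

# Example: a defendant with a COMPAS score of 8 is shown a 5 in the
# low-score treatment and a 10 in the high-score treatment.
print(anchored_scores(8))  # (5, 10)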

 

This new experiment mirrored the previous one: participants evaluated the same 40 defendants, met the same requirements, and received the same payment. The study also used the same format on the Qualtrics platform.

 

Results

Figure 4 shows the average scores that participants assigned to defendants versus the scores shown in the defendant profiles, for the low-score and high-score treatments. Error bars represent the 95 percent confidence intervals. The scores that participants assigned to defendants correlate strongly with those they viewed in the defendants' profile descriptions. Participants in the low-score treatment provided risk scores that were, on average, 42.3 percent lower than those provided by participants in the high-score treatment when assessing the same set of defendants. The average risk score from respondents in the low-score treatment was 3.88 (95 percent CI 3.39-4.36), while the average risk score from respondents in the high-score treatment was 5.96 (95 percent CI 5.36-6.56). A two-sided t-test revealed that this difference was statistically significant (p < 0.0001).

[Figure 4: Average participant risk scores versus displayed profile scores in the low-score and high-score treatments, with 95 percent confidence intervals]

At the end of the survey, when participants reflected on the role of the COMPAS algorithm in their decision-making, they described common themes, such as using the algorithm's score as a starting point or as a verification of their own decisions. The table in figure 5 summarizes these comments by treatment group and by the role the algorithm played in participants' decision-making.

[Figure 5: Participant comments on the algorithm's role in their decisions, by treatment group]

 

Discussion

The results from this study indicate that algorithmic risk predictions serve as anchors that bias human decision-making. Participants in the low-score treatment provided an average risk score of 3.88, while participants in the high-score treatment assigned an average risk score of 5.96. The average anchoring index across all 40 defendants was 56.71 percent, a value in line with anchoring indices reported in the prior psychology literature.8,14,16 For example, one study investigated anchoring bias in estimation by asking participants to guess the height of the tallest redwood tree.14 The researchers provided one group with a low anchor of 180 feet and another group with a high anchor of 1,200 feet, and they observed an anchoring index of 55 percent. Scholars have observed similar values of the anchoring index in contexts such as probability estimates,19 purchasing decisions,20 and sales forecasting.5
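For illustration, the index for a single defendant can be computed from the two group means and the anchors shown in that defendant's profiles; the numbers below are placeholders, not the study's data, and the reported 56.71 percent is the mean of this quantity over all 40 defendants.

def anchoring_index(mean_high, mean_low, anchor_high, anchor_low):
    # Fraction of the anchor gap that carries through to participants' responses.
    return (mean_high - mean_low) / (anchor_high - anchor_low)

# Placeholder example: a defendant with an original COMPAS score of 4 was shown
# anchors of 1 (low-score treatment) and 7 (high-score treatment).
print(f"{anchoring_index(5.5, 2.1, anchor_high=7, anchor_low=1):.1%}")  # 56.7%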

Although this type of cognitive bias was observed among participants with little training in the criminal justice system, prior work suggests that the anchoring effect varies little between non-experts and experts in a given field. Northcraft and Neale found that asking prices for homes influenced real estate agents and people with no real estate experience similarly.16 This suggests that the anchoring effect of algorithmic risk assessments on judges and on bail and parole officers would mirror that observed among the participants in this experiment. Numerous prior studies demonstrate that these officials are, in fact, susceptible to forms of cognitive bias such as anchoring.7,15

These findings also, importantly, highlight problems with existing frameworks to address machine bias. For example, many researchers advocate for putting a "human in the loop" to act in a supervisory capacity, and they claim that this measure will improve accuracy and, in the context of risk assessments, "ensure a sentence is just and reasonable."12 Even when humans make the final decisions, however, the machine-learning models exert influence by anchoring these decisions. An algorithm's output still shapes the ultimate treatment for defendants.

The subtle influence of algorithms via this type of cognitive bias may extend to other domains such as finance, hiring, and medicine. Future work should, no doubt, focus on the collaborative potential of humans and machines, as well as steps to promote algorithmic fairness. But this work must consider the susceptibility of humans when developing measures to address the shortcomings of machine-learning models.

 

Conclusion

The COMPAS algorithm was used here as a case study to investigate the role of algorithmic risk assessments in human decision-making. Prior work on the COMPAS algorithm and similar risk-assessment instruments has focused on the technical aspects of the tools, presenting methods to improve their accuracy and theorizing frameworks to evaluate the fairness of their predictions. That research has not considered the practical function of the algorithm as a decision-making aid rather than as a decision maker.

Based on the theoretical findings from the existing literature, some policymakers and software engineers contend that algorithmic risk assessments such as the COMPAS software can alleviate the incarceration epidemic and the occurrence of violent crimes by informing and improving decisions about policing, treatment, and sentencing.

The first experiment described here thus explored how the COMPAS algorithm affects the accuracy of human recidivism predictions in a controlled environment. When predicting the risk that a defendant will recidivate, the COMPAS algorithm achieved a significantly higher accuracy rate than the participants who assessed defendant profiles (65.0 percent vs. 54.2 percent). Yet when participants incorporated the algorithm's risk assessments into their decisions, their accuracy did not improve. The experiment also evaluated the effect of presenting an advisement designed to warn of the potential for disparate impact on minorities. The findings suggest, however, that the advisement did not significantly affect the accuracy of recidivism predictions.

Moreover, researchers have increasingly devoted attention to the fairness of risk-assessment software. While many people acknowledge the potential for algorithmic bias in these tools, they contend that keeping a human in the loop can ensure fair treatment for defendants. The results from the second experiment, however, indicate that the algorithmic risk scores acted as anchors that induced a cognitive bias: participants assimilated their predictions to the algorithm's score. Participants who viewed the low risk scores provided risk scores that were, on average, 42.3 percent lower than those provided by participants who viewed the high risk scores for the same set of defendants. Given this human susceptibility, an inaccurate algorithm may still result in erroneous decisions.

Considered in tandem, these findings indicate that collaboration between humans and machines does not necessarily lead to better outcomes, and human supervision does not sufficiently address problems when algorithms err or demonstrate concerning biases. If machines are to improve outcomes in the criminal justice system and beyond, future research must further investigate their practical role: an input to human decision makers.

 

References

1. Angwin, J., Larson, J. 2016. Machine bias. ProPublica (May 23).

2. Case, N. 2018. How to become a centaur. Journal of Design and Science (January).

3. Chouldechova, A. 2017. Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2), 153—163.

4. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., Huq, A. 2017. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 797—806.

5. Critcher, C. R., Gilovich, T. 2008. Incidental environmental anchors. Journal of Behavioral Decision Making 21(3), 241—251.

6. Dressel, J., Farid, H. 2018. The accuracy, fairness, and limits of predicting recidivism. Science Advances 4(1), eaao5580.

7. Englich, B., Mussweiler, T., Strack, F. 2006. Playing dice with criminal sentences: the influence of irrelevant anchors on experts' judicial decision making. Personality and Social Psychology Bulletin 32(2), 188—200.

8. Furnham, A., Boo, H. C. 2011. A literature review of the anchoring effect. The Journal of Socio-Economics 40(1), 35—42.

9. Goldstein, I. M., Lawrence, J., Miner, A. S. 2017. Human-machine collaboration in cancer and beyond: the Centaur Care Model. JAMA Oncology 3(10), 1303.

10. Green, K. C., Armstrong, J. S. 2012. Evidence on the effects of mandatory disclaimers in advertising. Journal of Public Policy & Marketing 31(2), 293—304.

11. Horvitz, E., Paek, T. 2007. Complementary computing: policies for transferring callers from dialog systems to human receptionists. User Modeling and User-Adapted Interaction, 17(1-2), 159—182.

12. Johnson, R. C. 2018. Overcoming AI bias with AI fairness. Communications of the ACM (December 6).

13. Jukier, R. 2014. Inside the judicial mind: exploring judicial methodology in the mixed legal system of Quebec. European Journal of Comparative Law and Governance (February).

14. Kahneman, D. 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux.

15. Mussweiler, T., Strack, F. 2000. Numeric judgments under uncertainty: the role of knowledge in anchoring. Journal of Experimental Social Psychology 36(5), 495—518.

16. Northcraft, G. B., Neale, M.A. 1987. Experts, amateurs, and real estate: an anchoring-and-adjustment perspective on property pricing decisions. Organizational Behavior and Human Decision Processes 39(1), 84—97.

17. Shaw, A. D., Horton, J. J., Chen, D. L. 2011. Designing incentives for inexpert human raters. In Proceedings of the ACM Conference on Computer-supported Cooperative Work. ACM Press, 275-284.

18. State v Loomis, 2016.

19. Tversky, A., Kahneman, D. 1974. Judgment under uncertainty: heuristics and biases. Science 185(4157), 1124—1131.

20. Wansink, B., Kent, R. J., Hoch, S. J. 1998. An anchoring and adjustment model of purchase quantity decisions. Journal of Marketing Research 35(1), 71.

 

Related articles

The Mythos of Model Interpretability
In machine learning, the concept of interpretability is both important and slippery.
Zachary C. Lipton
https://queue.acm.org/detail.cfm?id=3241340

The API Performance Contract
How can the expected interactions between caller and implementation be guaranteed?
Robert F. Sproull and Jim Waldo
https://queue.acm.org/detail.cfm?id=2576968

Accountability in Algorithmic Decision-making
A view from computational journalism
Nicholas Diakopoulos, University of Maryland, College Park
https://queue.acm.org/detail.cfm?id=2886105

 

Michelle Vaccaro received a bachelor's degree in computer science in 2019 from Harvard College. She is particularly interested in the social implications of new technologies, and she hopes to pursue further research opportunities in that area.

Jim Waldo is a Gordon McKay Professor of the practice of computer science at Harvard University, where he is also a professor of technology policy at the Harvard Kennedy School. His interests include distributed systems, the intersection of technology, policy, and ethics, and privacy-preserving mechanisms. Prior to joining Harvard, he spent more than 30 years in the industry, much of that at Sun Microsystems.

 

Copyright © 2019 held by owner/author. Publication rights licensed to ACM.


Originally published in Queue vol. 17, no. 4