Abstract
The U.S. criminal-justice system has consequentialist and retributivist goals: It considers what is best for society as well as how to punish people in a manner proportional to their crimes. In deciding on the degree of retribution that is called for, the system attempts to determine the blameworthiness (or culpability) of the people involved, weighing factors such as their ability to think rationally, their freedom from coercion, and whether their actions were out of character. These determinations hinge on social-scientific research that is not strong enough to justify such decisions. In this article, I challenge the social-scientific basis for determining culpability on three grounds: replicability, generalizability, and inferential strength. In light of the limitations of this research, I argue that the criminal-justice system should abandon its retributive goals and pursue a more consequentialist, and more reparative, form of justice.
Instead of asking whether anyone should be locked up or go free, why don’t we think about why we solve problems by repeating the kind of behavior that brought us the problem in the first place?
In many ways, the trial of Daryl Atkins was a step toward a more progressive and compassionate legal system. In Atkins v. Virginia (2002), the U.S. Supreme Court ruled 6–3 that it is unconstitutional to execute people with intellectual disabilities because doing so violates the Eighth Amendment’s ban on cruel and unusual punishment. In other words, the lines that demarcate levels of blameworthiness were redrawn to offer people with intellectual disabilities more leniency. Disconcertingly, however, these lines are drawn differently in different states (LaPrade & Worrall, 2020). In North Carolina the line is drawn at an IQ of 70; people who score a 71 can be executed. In Arkansas it is 65. In Florida, it is “two or more standard deviations from the mean score on a standardized intelligence test” (LaPrade & Worrall, 2020, p. 8), meaning that an individual could fall on one side of the line or the other depending on the performance of the broader population. In Atkins’s own precedent-setting case, a jury decided in 2005 that his interactions with his lawyers were so intellectually stimulating that his IQ had risen above 70, meaning the line drawn in Virginia no longer worked in his favor.
These decisions are a matter of life and death. If the supporting science does not provide a solid basis for setting the cutoff at 70, 65, or some other score (or a score on an entirely different test), how can we be confident that we are drawing these lines correctly? This question becomes even more significant when we consider that arbitrary IQ cutoffs are just one example of the ways in which sentencing decisions hinge on social-scientific research. Such research routinely informs assessments of past experiences (e.g., trauma), present mental states (e.g., rationality), and future propensity to commit crimes (e.g., dispositional aggression) in ways that influence determinations of culpability (Monahan & Walker, 2011). According to Monahan and Walker (2011), “in the past several years it has become more difficult to find a Supreme Court constitutional decision implicating an empirical question in which at least one side did not cite to social science research than it has been to find a decision in which at least one side cited such research” (p. 80).
Given the pervasive reliance on social science to determine levels of blameworthiness, it is important to evaluate whether this reliance is justified. I argue that social-scientific evidence is not strong enough to inform decisions about culpability because of serious limitations in three domains: replicability, generalizability, and inferential strength. Thus, we have an obligation to reenvision our evaluation of blameworthiness. Given this state of affairs, the criminal-justice system should give up the task of assigning blame (and of selecting destructive punishments to even the scales) and instead focus on promoting societal well-being.
The Criminal-Justice System Has Consequentialist and Retributivist Goals
Criminal sentences in the United States are currently designed to accomplish two types of goals. The first type—consequentialist goals—are those that attempt to maximize the greater good and include things such as protecting people from harm and deterring future crime. Although these are challenging goals to attain, and sometimes even to articulate, they are necessary if we expect the criminal-justice system to strive to enhance societal well-being.
The second type—retributivist goals—are those that attempt to give people the punishment they deserve. Retributivist goals are clearly expressed in the legal system’s principle of penal proportionality, which dictates that fair punishment should be calibrated to the actor’s blameworthiness (Steinberg & Scott, 2003). Or, in the flowerier words of Kant (1797/1887), people should be punished in proportion to their “internal wickedness” (p. 238). Although retributivist goals are deeply rooted in the legal system as it stands, they should be abandoned because we cannot be sure we will get it right, and we cannot afford to get it wrong. The social-scientific evidence used to shape our evaluation of blameworthiness—or, in legal terms, culpability—is simply not up to the very high standard required.
The Role of Behavioral Evidence in Establishing Culpability
Culpability is not automatic, even when there is no question that a person has committed a crime. According to the law, culpability is mitigated when an actor’s rationality is undermined, when an action is coerced, or when the action is out of character (Morse, 2006; Steinberg & Scott, 2003). For example, mental illness or extreme emotional distress could make someone less culpable for their actions. If a person commits a crime while someone else holds a gun to their head, their lack of personal agency will be considered. If a person has no previous record of breaking the law, the legal system will take this into account when determining the severity of their punishment. Thus, the law considers these mitigating factors when it evaluates culpability and then—in retributivist fashion—attempts to choose the proportional punishment.
One assumption underlying this approach to punishment is that some people are, in fact, culpable. In their article examining the intersection of neuroscience and the law, Greene and Cohen (2004) questioned this assumption. They argued that people do not have free will and therefore that no one meets the legal system’s requirements for culpability. If it is not the actor but instead some combination of uncontrollable factors (e.g., their brain, genes, and the environment) that is responsible for the crime, then it does not make sense to pursue retribution, and the legal system should focus solely on consequentialist goals. I make room for the possibility that Greene and Cohen are correct. However, it is not necessary to abandon the notion of free will to arrive at the same consequentialist vision.
In addition to assuming that some people are culpable, the law makes a second, more tenuous, assumption: It assumes that it can tell the difference between who is culpable and who is not. In other words, the law assumes that it can determine whether a person was acting rationally, whether an action was free of coercion, or whether an action was in line with a person’s character. According to Morse (2006), these determinations ultimately hinge on behavioral evidence provided by social science: “The specific criteria for prima facie responsibility and excuse are all behavioral, broadly conceived as conduct and mental states” (p. 399). Unfortunately, large leaps are required to go from the kinds of data even the most rigorous social-scientific investigations are capable of providing to the kinds of inferences that are made about the mental states of specific individuals; we underestimate the monumental task of establishing culpability and as a consequence mistakenly believe that social science can accomplish it.
At first glance, it may seem that there are many sentencing decisions that are made without reference to social-scientific evidence. Take, for example, a case in which an adult with no diagnosis of mental illness or history of abuse confesses to a third violent crime. Even in cases such as this one, in which mitigating or excusing factors are not seriously considered and the decision is made on the basis of forensic evidence, social-scientific evidence is still at play. This is because social science is used to distinguish between people who can be deemed fully culpable and people who cannot. When a person, such as the one in the example, is punished to the full extent of the law, it is because social science has placed them in a category that affords this outcome. Social science is always at play, albeit often behind the scenes.
The Problem With Culpability
Criticisms of the way that the legal system determines culpability are not uncommon. Sometimes these challenges are internal: They accept the concept of culpability and articulate ways of improving its application (Morse, 2006). Perhaps they propose a new mitigating factor, such as when the American Bar Association (2006) resolved that mentally ill defendants should not receive the death penalty. Or perhaps they bring new evidence to bear on existing factors, such as when Steinberg and Scott (2003) raised new social-scientific evidence that juveniles process information differently than adults. These arguments endorse the notion of retribution and aim to ensure that it is justly applied.
In other cases, challenges are external: They contend that the concept of culpability is incoherent or unjustifiable and push for it to be replaced or abandoned (Morse, 2006). This is the type of critique made by Greene and Cohen (2004) when they challenged the notion of free will. Culpability, they argued, becomes a nonsensical concept if we accept that we are living in a deterministic world. Others formulate external arguments that are rooted in a philosophical commitment to consequentialism, maintaining that retribution is not a valuable moral goal and thus culpability is irrelevant (Bentham, 1982). Critiques such as these reject the very notion that culpability matters.
The case I put forward here is most aptly characterized as an internal challenge, although it renders the notion of culpability effectively useless. Even if we accept that some people are more blameworthy than others and that it is morally defensible to assign punishment in a calibrated manner (and there are already good reasons to reject these premises), we still need to establish that we are capable of evaluating blameworthiness. I argue that we are not and that, although it is conceivable that social science could, at some point, allow us to do so effectively, this point is so far from our current position as to be almost unfathomable.
Conditions That Must Be Met for Social Science to Establish Culpability
In recent years, the limitations of social-scientific research have become increasingly apparent. This observation is not a condemnation of social science but rather an acknowledgment that our expectations of social science have been unreasonably lofty. Take, for instance, IJzerman and colleagues’ (2020) recent analysis of what would be required for social science to usefully inform COVID-19-related policy. This team of empirical psychologists argued that, despite the apparent relevance of psychological research to behavioral-health measures, the path to generating evidence strong enough to make sound decisions is formidable. Specifically, they proposed a nine-stage process for developing an empirically tested solution that is ready to be implemented. Much of social-scientific research currently stalls around the second or third stage, the point at which systematic reviews have been conducted but alternatives have not been ruled out and generalizability has not been established. IJzerman et al. (2020) concluded: “The way social and behavioral science research is often conducted makes it difficult to know whether our efforts will do more good than harm” (p. 1092).
It is worth considering, then, what would be required for social science to usefully inform decisions about culpability. It seems that, at a minimum, three conditions should be met. First, findings should be replicable; they should be reliably observed when a study is repeated. Second, findings should be generalizable; they should extend to real-world contexts rather than being contingent on methodological particulars (e.g., the specific psychological task, laboratory environment, or sample of participants). Third, the findings should provide inferential strength; they should allow us to draw confident conclusions about people who have been accused of crimes. In the following sections I consider each of these conditions in turn.
Replicability
Evidence regarding the replicability of social-scientific findings is often disappointing. Large-scale replication projects—which typically have a higher bar for methodological rigor and statistical power than original studies—have found replicability rates of 36% (Reproducibility Project: Psychology; Open Science Collaboration, 2015), 54% (Many Labs 2 in psychology; Klein et al., 2018), 30% (Many Labs 3 in psychology; Ebersole et al., 2016), 0% (Many Labs 5 in psychology; Ebersole et al., 2020), 61% (Experimental Economics Replication Project; Camerer et al., 2016), and 62% (Social Sciences Replication Project; Camerer et al., 2018). Many scientists have expressed concern about these numbers; 95% said that the rate of false positives is “somewhat” or “much” higher than it should be in psychology (Miranda et al., 2021) and 90% reported a “slight” or “significant” replicability crisis in the sciences (Baker, 2016).
A few key phenomena underlie these low levels of replicability in the social sciences, including publication bias, low statistical power, and infrequent replication efforts (Bakker et al., 2012; Fanelli, 2010a, 2010b, 2012; Franco et al., 2014; Ioannidis, 2012; Pashler & Harris, 2012). Simulation data show that when scientific fields have a bias toward publishing positive findings, low statistical power, and a bias against publishing replications, this can lead to a literature in which findings are more likely to be false than true (Ioannidis, 2005).
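The arithmetic behind this conclusion can be illustrated with a short calculation. The sketch below is a simplified illustration and not a reproduction of the simulations in Ioannidis (2005): it computes the share of statistically significant findings that reflect true effects, given an assumed proportion of tested hypotheses that are true, an assumed level of statistical power, and a significance threshold. The function name and every numerical input are hypothetical values chosen for illustration.

```python
def positive_predictive_value(prior_true, power, alpha=0.05):
    """Share of significant results that reflect true effects.

    prior_true: assumed proportion of tested hypotheses that are true
    power: assumed probability of detecting an effect that exists
    alpha: significance threshold (false-positive rate per test)
    """
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios; none of these inputs come from the studies cited above.
well_powered = positive_predictive_value(prior_true=0.5, power=0.8)
low_powered_long_shot = positive_predictive_value(prior_true=0.1, power=0.2)

print(f"Well-powered tests of plausible hypotheses: {well_powered:.2f}")
print(f"Low-powered tests of unlikely hypotheses:   {low_powered_long_shot:.2f}")
```

Under the second set of assumed inputs, fewer than half of the significant findings reflect true effects. If only significant findings are published, that proportion describes the published literature.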
Another factor compounding these issues is the use of questionable research practices (QRPs) and p-hacking (John et al., 2012; Simmons et al., 2011). If researchers do not specify their analysis plan in advance—a practice known as preregistration—the resulting array of reasonable analysis choices introduces the problem of capitalizing on chance, and analyses that yield significant (more publishable) results are reported while those that yield null results are ignored (Gelman & Loken, 2013). Researchers’ self-reports indicate that QRPs are not rare; most admit to selective reporting and some admit to outright fraud (Fiedler & Schwarz, 2016; John et al., 2012; Matthes et al., 2015). It is concerning that their use can cause false-positive rates to dramatically exceed the standard 5% cutoff and can thus further compromise the evidentiary value of the literature.
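To see how analytic flexibility inflates false-positive rates, consider a minimal simulation. The sketch below makes several simplifying assumptions that are mine rather than the cited authors': the null hypothesis is true in every study, each study runs several independent analyses, each analysis is a two-sided z test with a known standard deviation of 1, and a study counts as a "finding" if any analysis reaches p < .05. All names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def p_value_two_sided(sample):
    """Two-sided z-test p value for a true mean of zero, assuming SD = 1."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_analyses, n_per_analysis=20, n_studies=5000, seed=1):
    """Fraction of null studies that yield at least one p < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = [
            p_value_two_sided([rng.gauss(0, 1) for _ in range(n_per_analysis)])
            for _ in range(n_analyses)
        ]
        if min(p_values) < 0.05:  # report the study if any analysis "works"
            hits += 1
    return hits / n_studies


one_analysis = false_positive_rate(1)
four_analyses = false_positive_rate(4)
analytic_four = 1 - 0.95 ** 4  # expected rate for four independent tests

print(f"One analysis:   {one_analysis:.3f}")
print(f"Four analyses:  {four_analyses:.3f} (analytic: {analytic_four:.3f})")
```

With four independent analyses, the chance of at least one significant result under the null is 1 − .95^4, or roughly 19%, rather than 5%. Analyses run on the same participants are usually correlated, which would make the inflation somewhat smaller than this independent-tests figure but still above the nominal rate.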
Generalizability
Even if the literature had high replicability rates and we could be confident that the findings reported in a given study could be reliably repeated, this would not be enough to qualify the literature as a sound resource for legal decision-making. It is also critical to understand the generalizability of findings and whether they are relevant in legal contexts.
Much of the challenge here stems from the difficulty in translating between numbers and words. Legal criteria for culpability are expressed verbally: Things such as rationality, freedom from coercion, and consistency with identity need to be established. In contrast, the social science that is used to establish rules for determining who meets these criteria is often quantitative: Things such as survey scores, accuracy rates, and reaction times are being assessed. Ensuring a meaningful connection between statistics and verbal concepts is a deceptively challenging step that requires articulating and formalizing the various tenets of verbal theories (for a tutorial depicting a dialogue between two fictive characters called “Verbal” and “Formal,” see van Rooij & Blokpoel, 2020). This step is often neglected, and confident verbal assertions pervade the literature in the absence of the statistical evidence required to back them up.
Yarkoni (2020) documented the misalignment between verbal claims and statistical tests in psychology and concluded that the field is undergoing not only a replicability crisis but also a “generalizability crisis.” He pointed out that standard statistical approaches impose tight constraints on our ability to extend results beyond the particular circumstances in which they were obtained. These constraints are rarely acknowledged, leaving the verbal claims made in social-scientific reports—and used to inform legal reasoning—untethered from what the numbers actually show. In considering this disconnect, Yarkoni (2020) contended that “a good deal of what currently passes for empirical psychology is already best understood as insightful qualitative analysis trying to quietly pass for quantitative science” (p. 14).
The generalizability issues identified by Yarkoni (2020) operate at a number of levels relevant to the social-scientific research that currently informs decisions about culpability. Suppose, for example, researchers set out to test whether adolescents are more susceptible to coercion than adults. They might do so by recruiting a sample of adolescents and adults and exposing them to Asch’s conformity paradigm, in which participants experience social pressure to give wrong answers on a line-judgment task (Asch, 1956). Suppose that a standard statistical comparison reveals significantly higher rates of conformity among adolescents compared with adults. It could be tempting to consider this evidence that adolescents are more susceptible to coercion than adults—a conclusion that, if justified, would be relevant to discussions of culpability.
Innocuous as this line of reasoning may seem, it involves several leaps. First is the assumption that conforming by choosing the wrong line is the equivalent of conforming by breaking into a home or pulling a trigger. Second is the assumption that the participants are representative of the populations about which we hope to draw conclusions (in this case, all adolescents and all adults). Third is the assumption that one’s tendency to conform is synonymous with one’s “susceptibility to coercion,” the language used in legal contexts. These kinds of assumptions are rarely acknowledged; if they were, the shaky ground on which typical conclusions are based would be immediately apparent.
This state of affairs is, unfortunately, something not easily addressed. To garner the necessary evidence to support the broad generalizations we seek would be a gargantuan task requiring resources that are orders of magnitude above what we typically devote (Yarkoni, 2020). These limitations raise serious questions about what social science can offer to legal decision-making. If the social sciences cannot accomplish the task that distinguishes them from other ways of knowing—substantiating verbal claims with empirical data—then they should not be afforded added credibility.
Inferential strength
For the social sciences to meaningfully inform decisions about culpability, a third condition also needs to be met: Findings about human behavior need to be so clear that we can confidently use them as a basis for making inferences about individuals (a process sometimes called “group-to-individual inference”; Faigman et al., 2014). In other words, we need to be so sure about what a characteristic (e.g., age, mental-illness diagnosis, brain activity) reveals about culpability that we know it has this meaning for the specific person accused of a crime.
As an example of the challenges with group-to-individual inference, consider a study by Glenn et al. (2010) that showed that the average volume of the striatum—a brain region associated with impulsivity—is larger among people with psychopathy compared with controls. Imagine that having a large striatum could be considered a brain abnormality that systematically causes people to commit crimes. This might prompt us to conclude that people with psychopathy should be considered less blameworthy than others because of their unique neural anatomy. A closer look at the data (Fig. 1), however, reveals the problem with this kind of inference. Although the black bars in Figure 1—representing the means for each group—are statistically different between the psychopathy and control groups, it is clear that many individuals with psychopathy have lower striatal volumes than many individuals without psychopathy. In other words, we cannot use this finding to infer that a person diagnosed with psychopathy must have a brain that predisposes them to criminal acts.

Fig. 1. Scatterplots and means (horizontal bars) for volumes of the left and right striatum in the control group (n = 22) and the psychopathy group (n = 22). Figure reprinted from Glenn, A. L., Raine, A., Yaralian, P. S., & Yang, Y. (2010). Increased volume of the striatum in psychopathic individuals. Biological Psychiatry, 67, 52–58. Copyright 2010 Elsevier. Used with permission.
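The overlap between groups can be quantified with a simple calculation. The sketch below assumes that scores in both groups are normally distributed with equal variances, and it computes the probability that a randomly chosen member of the higher-mean group scores below a randomly chosen member of the lower-mean group. The effect sizes are conventional small, medium, and large benchmarks for the standardized mean difference (Cohen's d); they are illustrative assumptions and are not estimates taken from Glenn et al. (2010).

```python
from statistics import NormalDist


def probability_of_reversal(cohens_d):
    """Chance that a random member of the higher-mean group scores BELOW
    a random member of the lower-mean group.

    Assumes normal distributions with equal variances, so the difference
    between two random draws has mean d and variance 2 (in SD units).
    """
    return NormalDist().cdf(-cohens_d / 2 ** 0.5)


# Hypothetical effect sizes, not values from any study cited in this article.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: probability of reversal = {probability_of_reversal(d):.2f}")
```

Even for an effect conventionally labeled large (d = 0.8), the calculation implies that more than a quarter of such pairwise comparisons run opposite to the group-level difference.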
Applying general findings to specific individuals is a practice that has been challenged by both legal scholars (Faigman et al., 2014; Hart et al., 2007; Morse, 2006) and psychological researchers (Bolger et al., 2019; Fisher et al., 2018; Molenaar, 2004). Molenaar (2004) noted that group-level findings can be accurately applied to individuals only when a psychological process is ergodic. To qualify as ergodic, patterns of variability between individuals must align with patterns of variability within individuals, a criterion that is rarely met, or even examined, within the social sciences (Fisher et al., 2018). Fisher and colleagues (2018) cautioned that this problem seriously undermines the validity of applying social-scientific findings to individual people.
All this is to say that even when social-scientific findings are highly replicable and generalizable, they are still probabilistic; at best, they allow us to make guesses about individuals that are more informed than if we knew nothing about them. When we consider that many published effects are not replicable, and a vanishingly small proportion of those can be generalized to legal contexts, these informed guesses start to seem more like shots in the dark.
Do the problems not cancel out?
It might seem that these concerns are addressed by aggregating large bodies of evidence. Even if individual studies might have flaws, do these flaws not cancel out when many studies are considered collectively? This would indeed be the case if there were no systematic biases in published findings. However, publication is much more likely for some results (i.e., results showing an effect) than others (i.e., results not showing an effect). Rather than balancing out error, aggregating studies with systemic biases serves to compound those biases (van Elk et al., 2015). These kinds of concerns have led some to conclude that analyzing the results of multiple studies in tandem, particularly if these studies all come from the published literature, can be even more misleading than taking a single study as the final word (Scheel et al., 2021).
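A small simulation illustrates why aggregation does not rescue a selectively published literature. The sketch below rests on assumptions of my own choosing: the true effect is exactly zero, each study compares two groups of 20 with a standard deviation of 1, each study's estimate is drawn directly from its sampling distribution, and only significant results in the predicted direction are published. The function and parameter names are hypothetical.

```python
import random
from statistics import mean


def simulate_literature(true_effect=0.0, n_per_group=20, n_studies=2000, seed=7):
    """Return (mean of all estimates, mean of published estimates, number published)."""
    rng = random.Random(seed)
    # Standard error of a difference between two group means when SD = 1.
    se = (2 / n_per_group) ** 0.5
    all_estimates, published = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)  # observed mean difference
        all_estimates.append(estimate)
        if estimate / se > 1.96:  # assumed filter: significant and positive only
            published.append(estimate)
    return mean(all_estimates), mean(published), len(published)


all_mean, published_mean, n_published = simulate_literature()

print(f"Average effect across all studies:       {all_mean:.2f}")
print(f"Average effect across published studies: {published_mean:.2f}")
print(f"Studies published: {n_published} of 2000")
```

Averaging every study recovers an effect near zero, whereas averaging only the published studies yields a sizable positive effect. Adding more published studies would make that biased average more precise, not more accurate.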
Are these problems not solvable?
Some of the problems I have identified above can be, and are being, improved. For instance, in psychology, preregistration has gone from being essentially nonexistent to being encouraged by many high-profile journals, such as Nature Human Behaviour, Psychological Science, and Social Psychological and Personality Science. Data and materials are being made open to the public with much higher frequency than 10 years ago, allowing the research community to test the statistical robustness of reported findings (Kidwell et al., 2016). More than 280 journals now offer a “Registered Report” option, an article format introduced in 2013 that combats publication bias by evaluating articles before data collection so that assessments of quality are based on methodological rigor rather than results (Chambers et al., 2014). These changes provide grounds for optimism that the social sciences will become increasingly valuable and increasingly suited to solving real-world problems. Establishing culpability, however, is not just any real-world problem: It is a problem in which mistakes result in unjust imprisonment, and even execution. In the case of establishing culpability, the bar should be extraordinarily high. It is these lofty levels of certainty that are not, and may never be, achievable with social-scientific evidence.
An Example of Using Social Science to Establish (Lack of) Culpability
To argue that the legal process of determining culpability is undermined by the limitations of social-scientific research, it should first be clear that the legal system actually relies on that research. Morse (2006) has noted that this is, and indeed must be, the case. To elaborate on this point, I discuss an example of the application of social-scientific research to the question of culpability, an article by Steinberg and Scott (2003), and analyze the evidence within. Steinberg and Scott relied on the concept of penal proportionality to argue that adolescents deserve less severe punishment and are less blameworthy than adults because adolescents (a) have diminished decision-making capacity, (b) are more susceptible to coercion, and (c) are still undergoing character change. These claims were supported by a range of social-scientific research studies.
Steinberg and Scott’s (2003) article weighed heavily in the Roper v. Simmons (2005) decision to abolish the death penalty for individuals under the age of 18, which in turn set the stage for the Graham v. Florida (2010) decision to eliminate life without parole for individuals under the age of 18 except in cases of homicide and then the Miller v. Alabama (2012) decision to eliminate life without parole for individuals under the age of 18 even in cases of homicide (Steinberg, 2013). I wholeheartedly agree with Steinberg and Scott’s conclusion that adolescents should not be put to death. However, when social-scientific evidence is used to establish the lack of culpability of one group, it reinforces the culpability of those that fall outside of that group. If there are shortcomings in this evidence, then there are shortcomings in our ability to establish the culpability of adults.
Steinberg and Scott’s (2003) article has received praise from legal scholars—even those critical of knee-jerk reliance on social-scientific work—as “one of a series of modern publications that has been highly regarded” (Denno, 2006, p. 395). Despite the fact that this article has been touted as a model, I argue that the evidence within is still inadequate to inform decisions about culpability. Moreover, it is precisely because this article does a good job—relying on some of the best available empirical work to address inherently empirical questions—that its inadequacy should call into question the entire enterprise of using social-scientific research in this manner.
Replicability in Steinberg and Scott (2003)
The most direct way of determining whether a finding is replicable—to be confident that it was not a false-positive finding—is to conduct a replication study. Such studies are relatively rare (Duvendack et al., 2015; Makel et al., 2012), and thus the replicability of a published finding is often hard to determine. There are, however, some useful indicators. First, the characteristics of the literature from which a study is drawn can be informative. For instance, if a literature is biased toward publishing significant (vs. null) findings, this increases the likelihood that a given study is a false positive (Ioannidis, 2012). Second, the characteristics of the specific study can be revealing. For example, studies that were not preregistered, have small samples, and report p values close to the .05 cutoff for statistical significance are more likely to be false positives (Camerer et al., 2016; Forsell et al., 2019; Simmons et al., 2020). If there is uncertainty about whether a finding might be a false positive or, worse, good reason to suspect it might be, then it is not ready to be used in legal decision-making contexts.
An example from Steinberg and Scott (2003) illustrates reasons to be skeptical about the replicability of some of the findings used to bolster their case. In their article, they cited a study by Greene (1986) to support the claim that adolescents are less “future-oriented” than adults. This study had a sample with 20 participants per cell, measured “future-time perspective” in four different ways, found a statistically significant result for one of four reported analyses (p = .05), and yielded a pattern only partly consistent with the hypothesis (i.e., future-time perspective scores were highest among college sophomores but also higher among 9th graders compared with 12th graders). With small samples, a p value equal to .05, and no preregistration to limit flexibility in data analysis (the norm at the time but problematic nonetheless), there is good reason to doubt that this finding would replicate.
Of course, the weaknesses in this specific study do not apply to all studies, and there are times when we can be much more confident in the replicability of effects. However, in studies without preregistration—the majority of even the recent social-sciences literature—it is often challenging to distinguish signal from noise. Furthermore, even when we can be confident that a finding is replicable, we still have to attain the (much higher) standards of generalizability and inferential strength.
Generalizability in Steinberg and Scott (2003)
One of the three legs on which Steinberg and Scott’s (2003) argument stands is the assertion that adolescents are less able to resist social influence (or “peer pressure”) than adults. Much of the evidence bolstering this claim comes from Berndt (1979), who examined how 251 children from the 3rd, 6th, 9th, 11th, and 12th grades responded to hypothetical situations in which peers encouraged either antisocial, prosocial, or neutral behaviors. As an example, one of the 10 antisocial situations asked children to imagine the following: “You are with a couple of your best friends on Halloween. They’re going to soap windows, but you’re not sure whether you should or not. Your friends all say you should, because there’s no way you could get caught. What would you really do?” (Berndt, 1979, p. 610).
The children were then asked to indicate what they would do on a 6-point scale; 1 indicated that they were “absolutely certain” they would not follow their friends’ suggestion and 6 indicated that they were “absolutely certain” they would. The results showed a significant quadratic age trend: 9th graders gave the highest mean responses to this question and younger and older participants gave lower responses.
According to Berndt, “the results indicate that conformity to peers peaks during midadolescence” (p. 612). Likewise, Steinberg and Scott applied these findings in their own article: “Susceptibility to peer influence increases between childhood and early adolescence . . . , peaks around age 14, and declines slowly during the high school years” (p. 1012), and “if adolescents are more susceptible to hypothetical peer pressure than are adults (as noted earlier), it stands to reason that age differences in susceptibility to real peer pressure will be even more considerable” (p. 1014).
The problem with Steinberg and Scott’s interpretation—and it is in no way unique in this regard—is that it is almost completely unsubstantiated by the evidence provided. The Berndt (1979) study observed higher rates of self-reported predictions of conformity in response to hypothetical scenarios among 9th graders (compared with 3rd, 6th, 11th, and 12th graders) who were largely White and attended a high school in a “middle- and working-class city” (p. 609) in the late 1970s. For this finding to support the general claims made by Steinberg and Scott, we would have to make a litany of assumptions, including but not limited to the downward trend from 9th to 11th and 12th grades continuing into, and throughout, adulthood (Berndt did not sample adults). The responses would also have to indicate susceptibility to peer pressure (mean responses, even among 9th graders, all fell below 3, indicating that the children said they would not follow the suggestions of their peers). The 10 antisocial scenarios would have to be interchangeable with all possible instances of antisocial behavior, particularly criminal behavior. Participants’ self-report responses to hypothetical scenarios would have to be accurate representations of what they would actually do. Finally, this largely White sample of high school students—typical of the Western, educated, industrialized, rich, and democratic (WEIRD) samples that overwhelmingly dominate psychological research (Henrich et al., 2010)—would have to be representative of adolescents throughout the United States, including those who are involved with the justice system.
A critic might argue that I cherry-picked a problematic study to malign the generalizability of all social-scientific research. However, when one starts to consider the kind of study that would be informative in legal contexts—the kind of study that would substantiate all of the assumptions made above—it becomes clear that such a study is far beyond the reach of social science as it is currently conducted (Yarkoni, 2020). There is nothing uniquely problematic about Berndt’s (1979) work, and for this reason its limitations are emblematic of the limitations of the field.
Inferential strength in Steinberg and Scott (2003)
An important component of the legal system’s concept of culpability is “the connection between a bad act and a bad character” (Steinberg & Scott, 2003, p. 1015). With this in mind, Steinberg and Scott argued that adolescents are still in the process of character development and that, as a consequence, it is too early to attribute their actions to “bad character.” In support of this claim they cited one review article by Waterman (1982) as evidence that “the resolution of [the adolescent identity crisis], with the coherent integration of the various retained elements of identity into a developed self, does not occur until late adolescence or early adulthood” (p. 1014).
Waterman (1982) referenced a longitudinal study of college students as evidence that identity begins to solidify in late adolescence (Waterman & Goldman, 1976). This study relied on a classification system composed of four “ego-identity statuses” thought to track a normative developmental trajectory: A person without any particular commitments to goals, values, or beliefs (identity diffusion) subsequently forms those commitments (foreclosure), then experiences a crisis during which they question those commitments (moratorium), and finally develops a new, more considered set of commitments (identity achievement). The researchers interviewed 59 students when they were college freshmen and again when they were seniors. Across three domains—occupational choice, religious beliefs, and political beliefs—the researchers assessed the ego-identity status attained by each of these students.
Even the most casual consideration of generalizability would likely raise a number of red flags about this framework (what, for instance, does identity achievement look like?), exposing our inability to apply these labels in meaningful ways to other people, in other cultures, in other decades. Beyond these concerns, however, it remains clear that these group-level results do not offer much help when it comes to making inferences about individuals’ identity development on the basis of their age. For the domain of occupational choice, 22 seniors were said to have met the criteria for identity achievement, but another 19 were put in the foreclosure category and 11 were said to be still at the identity-diffusion stage. For political beliefs, only 14 participants were put in the identity-achievement category; another 25 participants were put in the identity-diffusion category. With so much variability, guessing that a participant had a fully developed identity because he or she was a senior would make you more likely to be wrong than right.
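The arithmetic behind this last claim can be checked with a short Python sketch. It is an illustration rather than part of the original analysis, and it assumes that all 59 interviewed students were classified in each domain as seniors; the text above reports only the category counts, so the denominator is an assumption. The names N_SENIORS, achieved, and hit_rate are hypothetical.

```python
# Counts of seniors classified as "identity achievement", as quoted in the text.
achieved = {"occupational choice": 22, "political beliefs": 14}

# ASSUMPTION: all 59 interviewed students received a classification as seniors.
N_SENIORS = 59


def hit_rate(n_achieved: int, n_total: int) -> float:
    """Share of correct guesses if every senior is assumed to have
    reached identity achievement."""
    return n_achieved / n_total


rates = {domain: hit_rate(n, N_SENIORS) for domain, n in achieved.items()}

for domain, rate in rates.items():
    print(f"{domain}: guess right {rate:.1%} of the time, wrong {1 - rate:.1%}")
```

Under that assumption, both rates fall below one half, which is consistent with the point that a guess based on class year alone would be wrong more often than right.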
Nevertheless, these results led Waterman (1982) to conclude: “It is during the college years that the greatest gains in identity formation appear to occur” (p. 346). Steinberg and Scott then applied this result to their investigation: “Antisocial behavior in adolescence is not usually indicative of bad character” (Steinberg & Scott, 2003, p. 1015). The problem is simply that these conclusions do not apply to everyone. And even if we relied on collections of studies, each with large samples and reliable and valid measures, we could still not be confident in these inferences because the relationships between things such as age and identity formation are likely to be weak regardless of how well we measure them. We are ultimately left operating within a criminal-justice system that routinely assumes that adults who break the law are “bad people” who will continue to commit “bad actions,” despite little to no inferential basis for such conclusions.
Erring on the side of caution in Steinberg and Scott (2003)
The distinctions between the rational capacities of juveniles and adults are blurry at best. If we are to conclude that the evidence showing juveniles to be less culpable than adults is suspect (as we should), then we should question why we afford leniency to juveniles and not adults. Moreover, this logic applies to any other blurry distinction: between mental illness and mental health, between abusive and nonabusive upbringings, between coercive and freely chosen actions, and so on. Given that social science cannot draw clear lines between the criminally responsible and the forgivable—an unreasonable expectation from the outset—it seems sensible to conclude, as Steinberg and Scott (2003) did, that “it would be prudent to err on the side of caution, especially when life and death decisions are concerned” (p. 1017).
Solution: Abandon Retribution
In baseball, there are times when even the most experienced and attentive umpires cannot tell whether the runner touched the base before being tagged by the defender’s glove. Some plays are just too ambiguous. For this reason, there is a rule to guide umpires in these situations: A tie goes to the runner. In other words, a runner must be given the benefit of the doubt and be called safe if an umpire cannot confidently say that the runner was out.
There is a clear analogue to this rule in the criminal-justice system: A defendant is considered innocent until proven guilty. It is not a coincidence that this rule gives the benefit of the doubt to the defendant and places the burden of proof on the prosecution. This principle reflects the system’s recognition that—in determining a person’s sentence—some mistakes are worse than others. Specifically, it is worse to mistakenly find an innocent person guilty than it is to find a guilty person innocent. This is why ambiguity about blameworthiness poses such a daunting challenge to the legal status quo. The criminal-justice system already commits to withholding retribution until culpability is established for certain. Thus, concluding that we are perpetually uncertain about culpability would entail that retribution should be withheld in all cases.
I have argued in favor of such a conclusion; I have attempted to make the case that social-scientific evidence is insufficient to provide us with certainty about culpability. As a result, I propose that we remain agnostic about culpability and remove retributivist considerations when responding to criminal actions, instead focusing solely on consequentialist considerations. This shift does not entail the elimination of outcomes that could be considered punitive. For instance, parents who have acted abusively could be separated from their children for the sake of the children’s protection, without any consideration of blameworthiness or proportional punishment. However, this shift would fundamentally alter the ethos of criminal sentencing from one of payback to one of repair.
One might object that switching to a consequentialist system does not make things any easier. Consider, for instance, the debate about using the Hare Psychopathy Checklist–Revised (Hare, 1991) to assess whether an individual is likely to cause violent harm to others in a prison context. Some argue that the checklist provides critical information for identifying people who could be a threat to the safety of others (Hare, 2003; Olver et al., 2020), whereas others contend that the checklist is inadequate and can contribute to unjustly severe sentencing of those considered high risk (DeMatteo et al., 2020). There is, however, an important distinction between consequentialist and retributivist goals. Applying social science to consequentialist ends is necessary if we want public systems that are responsible for justice and public safety. On the other hand, applying social science to retributivist ends takes the unnecessary risk most condemned by the criminal-justice system—the risk of unjust punishment. Although the two may be equally challenging, consequentialist goals are necessary to reduce harm, whereas retributivist goals are not. In fact, retributivist goals frequently serve to increase harm.
Fortunately, the legal system has always pursued consequentialist goals and has become practiced in prioritizing them when the current system of imaginary lines prompts the legal system to consider someone less blameworthy. Recall, for instance, the opening example of Daryl Atkins. The legal system has progressed to a point that people with intellectual disabilities are now treated through a more consequentialist, less retributivist lens, prioritizing community support and recognizing the harm of incarceration (Jones, 2007). There is no reason this approach—which weighs the well-being of both society and the person convicted of a crime—could not be extended to everyone.
Perhaps the most common objection to this idea is the concern that a nonretributive system would justify lenient punishments for people who seem to deserve worse. Greene and Cohen (2004) noted that one of the biggest obstacles to an effective consequentialist legal system is the fact that “retributivist principles have a powerful moral and political appeal” (p. 1783). This appeal, however, is not static. It has evolved in significant ways in the past, allowing us to appreciate mitigating factors—such as age and intellectual disability—that were not previously recognized. One consequence of the limitations of social science is that there will always be a host of mitigating factors that we cannot capture or do not even know to look for. The French proverb tout comprendre c’est tout pardonner (“to understand all is to forgive all”) highlights the humanity that emerges when we begin to appreciate the complex web of influences behind people’s actions.
Moving away from a retributivist legal system is not a novel idea. Greene and Cohen (2004) made a case for such a shift and offered an external argument, attacking the concept of culpability at its core. The argument I have made here is internal—challenging the implementation of culpability but not dismissing the notion entirely. The two types of arguments are not mutually exclusive. The more we come to recognize the conceptual problems with culpability, the more apparent the challenges of implementation become; conversely, the more we erode the notion that we can operationalize blameworthiness, the easier it is to imagine a system that abandons the notion entirely.
Beyond psychology and philosophy, abolitionist scholars and activists have spent decades detailing the ways in which the concept of penal proportionality fails us; prisons are designed to exact revenge without heeding the corrosive effect on individuals and societies. Angela Davis (2003) articulated the following alternative:
Positing decarceration as our overarching strategy, we would try to envision a continuum of alternatives to imprisonment—demilitarization of schools, revitalization of education at all levels, a health system that provides free physical and mental care to all, and a justice system based on reparation and reconciliation rather than retribution and vengeance. (p. 107)
Achieving this vision will require time and resources. Currently, however, the United States is investing staggering amounts of money in exacting retribution through incarceration. To support the largest prison population and highest per capita institutionalization rate in the world, the United States spends $182 billion on mass incarceration every year (Institute for Crime & Justice Policy Research, 2021; Wagner & Rabuy, 2017). Incarcerating a single individual costs approximately $35,000 per year, an amount that would comfortably cover housing and college tuition (Lewis & Lockwood, 2019; National Center for Education Statistics, 2021; U.S. Bureau of Labor Statistics, 2020). Instead, these resources could be used to support gradual decarceration and the development of social projects and community-based efforts that aim to prevent harm and expand nonretributive forms of justice (McLeod, 2015; Roberts, 2019). Abandoning retribution would ultimately bring us into better alignment with one of the most fundamental legal principles protecting individuals from unjust harm: the idea that we should err on the side of compassion.
