Advancing the Prediction of Prison Misconduct: Revalidating the Minnesota Severe and Frequent Estimate for Discipline (MnSafeD)

Abstract

In June 2024, the Minnesota Department of Corrections implemented the Minnesota Severe and Frequent Estimate for Discipline (MnSafeD), a fully-automated, gender-specific classification assessment that was designed to predict prison misconduct. Customized to Minnesota’s prison population, the MnSafeD was created to identify the relatively small group of residents disproportionately responsible for much of the misconduct and violence that take place in prison. This study revalidated the MnSafeD on the first 8,631 residents who were scored on the instrument. The results showed the MnSafeD performed very well overall, achieving an area under the curve (AUC) of 0.84. The instrument’s predictive accuracy was better for the men and for the post-intake reassessments (0.86). The results revealed there were no predictive disparities by race/ethnicity.

Keywords

risk assessment classification misconduct prison violence

Introduction

The majority of state prison systems in the U.S. use classification systems to assess residents for their risk of institutional misconduct (Duwe & Clark, 2025), which includes rule violations that run the gamut from disobeying orders and possessing “contraband” (e.g., alcohol and drugs) to assaults against staff and other residents (Camp et al., 2003). Prison misconduct not only increases the costs of incarceration (Lovell & Jemelka, 1996), but it also disrupts the safety and security of the environment for both residents and staff. Residents who experience victimization in prison have negative psychological outcomes (Johnson Listwan et al., 2010), while research has long shown that correctional officers experience a relatively high level of job stress that has been linked to a number of adverse mental and physical health effects (Brower, 2013).

Classification systems often contain an intake assessment, which focuses more on the resident’s prior criminal record because little is known about the institutional conduct for those with no history of incarceration, and a reclassification assessment that places more emphasis on the resident’s recent conduct during confinement (Austin, 2003). Combined, the intake assessment and reassessment are used to place residents at custody levels that are aligned with their risk for institutional misconduct. Higher-risk residents are typically assigned to more restrictive custody levels such as close or maximum, whereas lower-risk residents are placed at minimum or medium custody levels where they have fewer restrictions and greater freedom of movement within the facility.

Despite the widespread use of classification systems to assess the risk of institutional misconduct, there has been relatively little research to date on the performance of these instruments. Instead, much of the literature on risk assessment in corrections has focused on the prediction of recidivism, which is the most common outcome used for correctional populations. Of the extant research on institutional misconduct, most studies have examined the individual and situational characteristics associated with rule violations. Others, however, have focused on validating existing risk assessment tools or attempting to create new ones specifically designed to predict outcomes related to prison misconduct.

Literature Review

Individual and Situational Factors for Prison Misconduct

The two main perspectives that have been used to explain prison misconduct are importation and deprivation. Whereas importation argues that misconduct occurs as a result of the characteristics and experiences that residents bring with them into prison, deprivation holds that situational factors within the prison environment influence resident misconduct (Tewksbury et al., 2014). Given that classification assessments are applied to individuals, the findings from the importation literature are especially relevant to identifying individual-level characteristics that influence misconduct.

Prior research has shown that prison misconduct is a significant predictor of recidivism (Gendreau et al., 1996; Hamilton et al., 2021). Because both are forms of rule-violating behavior, prison misconduct and recidivism share many of the same risk and protective factors. As with recidivism, some of the strongest predictors of misconduct are static factors like criminal history, age, and race (Gendreau et al., 1997).

Research has shown that dynamic factors, or criminogenic needs, such as antisocial thinking, antisocial peers, education, employment, and family/domestic relationships affect recidivism risk, and the same is true for prison misconduct. For example, a recent study found that antisocial thinking, as measured by the Texas Christian University-Criminal Thinking Scales (TCU-CTS), is positively associated with prison misconduct (Duwe et al., 2023). A number of studies have shown that gang membership (i.e., identification as a member of a security threat group) is positively associated with rule violations (Gaes et al., 2002; Griffin & Hepburn, 2006; Tewksbury et al., 2014), which is consistent with the finding that antisocial peers increase the risk of prison misconduct (Gendreau et al., 1997). The study by Gendreau et al. (1997) also revealed that education, employment, and marital status are significant predictors of institutional misconduct. Although research has not demonstrated that childhood trauma is a criminogenic need, a recent study by Clark and Duwe (2024) showed that past exposure to adverse childhood experiences increased prison misconduct for males, but not for females. Likewise, despite not having a strong empirical relationship with recidivism, a history of mental illness has been cited as a risk factor for misconduct (Austin, 2003).

Just as research has shown that cognitive-behavioral, education, and employment programing reduce recidivism (Gendreau et al., 1996), studies have shown that such programing also reduces prison misconduct. Reale et al. (2025) recently found that a cognitive-behavioral intervention reduced misconduct, which is consistent with the conclusion drawn by French and Gendreau (2006) that cognitive-behavioral treatment programs are the most effective intervention for curbing disciplinary infractions. Education and employment programing have also been found to reduce misconduct (Duwe et al., 2015; Gover et al., 2008; Steiner & Wooldredge, 2014), although their effectiveness has been more modest and sometimes inconsistent. More generally, greater involvement in programing, which also entails less idle time, has been noted as important for reducing misconduct (Austin, 2003).

Prior Research on the Development and Validation of Classification Assessments

The last few decades of the twentieth century were marked not only by rapid growth in the implementation of classification assessments by state prison systems (Austin, 2003), but also by the increased use of actuarial tools (i.e., instruments developed on the basis of statistical methods) to assess recidivism risk. The emergence of the risk-needs-responsivity (RNR) model during this time shifted the focus of recidivism risk assessment beyond static factors to a consideration of criminogenic needs, which are dynamic risk factors that are susceptible to change (Andrews et al., 2006). As the use of risk and needs assessment instruments has proliferated since the 1990s, so has the body of research that has examined their development and evaluated their predictive performance. More recently, studies have investigated whether risk and needs assessment tools exacerbate the racial and ethnic disparities that have long pervaded the criminal justice system.

Even though the vast majority of U.S. prison systems have been using classification assessments since the latter twentieth century, no published studies have evaluated their performance, with the exception of the original MnSafeD validation (Duwe, 2020). Instead, several studies have focused on creating actuarial tools to predict violent or dangerous misconduct. Using data from nearly 10,000 residents in California’s prison system, Berk et al. (2006) found that random forests, a type of machine learning algorithm, performed better than several other supervised learning algorithms in predicting serious misconduct. Likewise, Cunningham and Sorensen (2006) used data on several thousand maximum security residents in Missouri’s prison system to create the Risk Assessment Scale for Prison (RASP), which was found to have strong predictive validity for violent misconduct.

More recently, two instruments were developed to assess the risk of restrictive housing placement. Helmus et al. (2019) developed and validated the Risk of Administrative Segregation Tool (RAST) on a Canadian prison population. Designed to be administered one time at intake, the static, six-item tool had a relatively high level of predictive validity in predicting administrative segregation placements within 2 years of admission. That same year, Labrecque and Smith (2019) used data from a U.S. prison population to create the Risk Assessment for Segregation Placement (RASP), which was also a static instrument designed to be administered one time at intake. The results showed the RASP performed well not only on the validation sample, but also when it was later revalidated in Oregon (Labrecque, 2022).

Present Study

To date, existing research has not validated a classification assessment instrument that has been implemented within a U.S. state prison system. While the Minnesota Severe and Frequent Estimate for Discipline (MnSafeD) was originally developed and validated in 2020 (Duwe, 2020), it was not implemented by the Minnesota Department of Corrections (MnDOC) until June 2024. This study extends the literature on institutional misconduct risk assessment by revalidating the MnSafeD, a gender-specific, fully-automated instrument that was designed to identify the small segment of the prison population that is disproportionately responsible for much of the misconduct that takes place. The present study uses multiple metrics to evaluate its predictive performance on the first 8,631 individuals scored on the instrument whose follow-up period ended before January 2025.

Consistent with recent risk assessment research, analyses are also performed to test for predictive disparities by race and ethnicity. The primary concern with a classification assessment that has predictive disparities is that it will lead to some individuals being overclassified on the basis of their race and/or ethnicity. Given how classification assessments are traditionally used, overclassified residents would be more likely to receive placement at higher-custody facilities that generally have more restrictions, less freedom of movement, and reduced access to programing.

Data and Method

Development and Validation of the MnSafeD

The MnSafeD was designed to be a gender-specific, fully-automated classification system customized to Minnesota’s prison population. The findings from prior research not only show that women tend to be overclassified when scored on the same recidivism risk assessment tools as those used for men (Van Voorhis, 2012), but also that facility-level factors can influence misconduct (Camp et al., 2003). To avoid overclassifying women, who are housed in a different facility than the men in Minnesota, the MnSafeD was designed to be gender-specific insofar as separate assessments were created for male and female residents. Moreover, because reassessments are a key design feature of classification systems (Austin, 2003), automating the scoring process has long been considered a goal, if not a best practice (Hardyman et al., 2002). Automating the scoring process not only increases reliability, which improves predictive validity (Duwe & Rocque, 2017), but it also expands assessment capacity (Duwe, 2024). Recent scholarship has demonstrated that customized, home-grown risk assessment instruments deliver strong predictive performance, especially in comparison to off-the-shelf tools (Duwe, 2024; Duwe & Rocque, 2018; Hamilton et al., 2022).

The outcome predicted by the MnSafeD is severe and/or frequent misconduct (SFM), which is defined as five or more discipline convictions, three or more discipline convictions resulting in a segregation penalty, or one or more violent misconducts against staff or other residents. Because individuals are assessed every 6 months, the MnSafeD was designed to estimate a resident’s likelihood of SFM over the next 6 months (i.e., until the next reassessment) or until the day of release for those with a release date less than 6 months away. The impetus for using this measure of misconduct derived from the observation that a relatively small segment of Minnesota’s prison population (about 10%) was responsible for a majority of the disciplinary infractions and all of the violent misconducts that take place, which dovetails with the well-known finding that a small group of prolific individuals are disproportionately responsible for much of the crime that occurs (Chaiken & Chaiken, 1984; Wolfgang et al., 1972; Wright & Rossi, 1986). From a risk management perspective, accurately predicting who will commit serious or violent misconduct and/or accumulate a large number of discipline convictions is critical. If the highest-risk residents can be proactively identified, it may be possible to improve the safety of correctional institutions by mitigating their risk for misconduct.

To develop and validate the prediction models for the MnSafeD, both data splitting (i.e., split-population) and k-fold validation methods were used. The male and female samples were split into training and test sets, with the training set comprised of individuals released from 2006 to 2009 and the test set including people released in 2010 and 2011. Consistent with the need to predict SFM in 6-month intervals, the dataset was further split into 6-month increments for both males and females. Due to attrition in the sample sizes, models were developed up to the 2-year mark for the women and up to the 4-year mark for the men.

Regularized logistic regression (RLR) was the classification algorithm used for all 12 models that were trained and tested (eight for males and four for females). To identify the items that were significant and robust predictors of SFM for each of the 12 models, a bootstrap variable selection method developed by Efron and Gong (1983) was applied to the dataset. RLR models were estimated in which predictors were added one at a time until no further single addition achieved significance level α = .10. Among the significant predictors, interaction terms were added to the models and tested to determine whether any had a statistically significant (p < .10) effect on SFM.

Bootstrap resampling was used to refine the selection of predictors for each model, and predictors were retained as long as they were statistically significant at the .05 level in at least 70% of the 1,000 bootstrap samples. After removing predictors that did not achieve statistical significance in at least 70% of the samples, another 1,000 bootstrap samples were estimated. This process was carried out for each of the 12 models. To identify the best RLR algorithm for each of the 12 prediction models, the area under the curve (AUC) was used. After identifying the parameters that yielded the highest AUC for each algorithm, predictive performance was then evaluated on the test sets.
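The retention rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used to build the MnSafeD: the function name `retained_predictors`, the input layout (a mapping from predictor name to one p-value per bootstrap sample), and the toy values are all hypothetical. In practice, each p-value would come from refitting the RLR model on one bootstrap sample.

```python
def retained_predictors(pvalues, alpha=0.05, min_share=0.70):
    """Keep predictors significant at `alpha` in at least `min_share` of samples.

    `pvalues` maps a predictor name to its p-values, one per bootstrap sample
    (a hypothetical layout; the source does not describe its data structures).
    """
    kept = []
    for name, values in pvalues.items():
        share = sum(p < alpha for p in values) / len(values)
        if share >= min_share:
            kept.append(name)
    return kept


# Toy input with 10 "bootstrap samples" instead of 1,000.
toy = {
    "age": [0.01] * 9 + [0.20],          # significant in 90% of samples
    "married": [0.01] * 6 + [0.30] * 4,  # significant in 60% of samples
}
print(retained_predictors(toy))  # ['age']
```

Under the two-pass procedure described above, the same rule would simply be applied again after the surviving predictors are re-estimated on a fresh set of bootstrap samples.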

Using multiple metrics, the results showed the models achieved a relatively high level of predictive performance. For example, the average AUC was 0.832 for the female models and 0.836 for the male models. The results also indicated that predictive performance generally improved after the initial classification assessment. The improved performance among the later assessments reflects the fact that these models were able to draw on more recent behavioral indicators such as misconduct and involvement in work and programing (Duwe, 2020).

Revising the MnSafeD

Following the initial development and validation of the MnSafeD (Duwe, 2020), several revisions were made prior to its debut in June 2024. First, the sample for the training and test sets was expanded to include everyone released from Minnesota prisons between 2006 and 2021. In doing so, the sample contained 10 additional years of more recent releases from prison. Moreover, the sample size increased from 39,355 releases to 102,562, of whom 91,942 were male and 10,620 were female.

Second, and perhaps more important, the design of the MnSafeD was simplified to help facilitate its implementation. Because it was created to be fully automated, information technology (IT) resources were required to implement the instrument. To help with this effort, the MnSafeD was simplified by paring down the number of statistical models within the instrument. For both the men and the women, the revised version of the MnSafeD contains the following models: (1) intake, (2) 6-month reassessment, and (3) recurring.

The intake models were designed to predict SFM from the day of admission until the 6-month mark (i.e., 6-month reclassification date) or release date for those with confinement periods less than 6 months. For residents with confinement periods greater than 6 months, the 6-month reassessment models were created to predict SFM from that date until the 12-month mark or release date for those released at some point between 6 and 12 months. For residents with confinement periods greater than 1 year, the recurring reassessment models were designed to predict SFM over the next 6 months (i.e., next reassessment date) or until the day of release for those released prior to the next reassessment date.
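The mapping from time served to model type, and the length of the prediction window, can be sketched as follows. The function names and the use of whole months are hypothetical, and assigning assessments that fall between the scheduled marks to the preceding model is an assumption rather than something the source specifies.

```python
def select_model(months_served):
    """Pick the model type from months served at the assessment date."""
    if months_served < 6:
        return "intake"
    if months_served < 12:
        return "6-month"
    return "recurring"


def follow_up_months(months_to_release, interval=6):
    """Prediction window: the next 6 months, or until release if that is sooner."""
    return min(interval, months_to_release)


print(select_model(0), follow_up_months(4))    # intake 4
print(select_model(18), follow_up_months(30))  # recurring 6
```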

The procedures used to create the six statistical models were the same as those described above. That is, data were split into training (2006–2016 releases) and test sets (2017–2021 releases), and k-fold validation methods were used to internally validate the models. RLR was the classification algorithm used, Efron and Gong’s (1983) bootstrap variable selection method was used to identify significant and robust predictors, and the AUC was the metric used to identify the models with the best performance. The predictive performance results were similar to what was reported by Duwe (2020).

The statistical models for the MnSafeD, which show the unstandardized coefficients for the statistically significant predictors, are presented in Table 1 for the women and Table 2 for the men. In general, there were more significant predictors identified for the men in comparison to the women. Most notably, involvement in a security threat group (STG), marital status, and idle assignments (i.e., lack of participation in programing and/or work assignments) predicted misconduct for the men but not for the women. The models also contain several interaction terms, which indicate the impact of some predictors on SFM was dependent on another item. For example, STG involvement predicted SFM in the 6-month reassessment model for the men, but only for those who had never been to prison before.

Table 1.

MnSafeD Prediction Models for Females.

Predictor	Description	Intake (B)	6-Month (B)	Recurring (B)
Age at assessment	Age (years) on day of assessment	−0.014		−0.022
Total criminal convictions	Number of criminal convictions	0.059
Age at assessment × total convictions	Interaction term for age and convictions	−0.001
Age × first prison admission	Interaction term for age and first prison admit		−0.026
Drug offense specialization/diversity	Specialize in drug offending	0.455
Person offense type	Index offense is a person/violent offense	0.507
Property offense type	Index offense is a property offense	0.296
Other offense type	Index offense is “other”	0.508
Prior misconducts	Number of prior prison misconducts	0.024
Prior segregation misconducts S/D	Specialize in segregation misconducts	−0.638
Historical serious and frequent misconduct	A history of serious/frequent prison misconduct	1.469
Prior misconducts × historical SFM	Interaction term for SFM and misconducts	−0.024
Mental health criteria	Number of mental health criteria	0.354
Secondary degree	Has earned a secondary degree	−0.112
Misconducts since admission	Number of misconducts since admission		0.129	0.038
Segregation misconducts since admission	Number of seg misconducts since admission		−0.086
SFM since admission	Number of SFM since admission		1.275	0.980
No misconducts	Number of assessment periods without misconduct			−0.584
Length of stay at assessment	Months remaining until release/next assessment	0.367	0.656	0.591
Constant		−3.354	−5.781	−3.991

Table 2.

MnSafeD Prediction Models for Males.

Predictor	Description	Intake (B)	6-Month (B)	Recurring (B)
Age at assessment	Age (years) on day of assessment	−0.041	−0.036	−0.035
Married	Married = 1; unmarried = 0	−0.322
Violent offense convictions	Number of violent offense convictions	0.040	0.070
Drug offense S/D	Specialize in drug offending	0.822
Property offense convictions	Number of property offense convictions	0.013
Robbery offense convictions	Number of robbery offense convictions	−0.061
Disorderly conduct offense convictions	Number of disorderly conduct convictions	0.126
Obstruct offense convictions	Number of obstruct offense convictions	0.054
DWI offense convictions	Number of DWI convictions	−0.114
Flee/escape offense convictions	Number of flee/escape offense convictions	0.076
Person offense type	Index offense is a person/violent offense	0.862	0.440
Sex offense type	Index offense is a sex offense	0.510
Drug offense type	Index offense is a drug offense	0.354
Property offense type	Index offense is a property offense	0.720	0.498
Other offense type	Index offense is “other”	0.679
Suicide concern	History of suicidal tendencies	0.512	0.374	0.272
Self-injury concern	History of self-injury	0.362
Security threat group (STG)	Active involvement in a STG	0.123		0.390
Secondary degree	Has earned a secondary degree	−0.253
Post-secondary degree	Has earned a post-secondary degree	−0.636	−0.234
Discharge/unsupervised release	No correctional supervision at release	0.849	0.337	0.627
Historical SFM	A history of serious/frequent prison misconduct	0.702		0.576
Prior misconducts	Number of prior prison misconducts	0.003	0.005
Prior segregation misconducts	Number of misconducts resulting in segregation		0.012
Prior segregation S/D	Specialize in segregation misconducts	−0.483
Threatening others misconduct	Number of threatening others misconducts	0.060
First prison admission × STG	Interaction term for STG and first prison admission		0.143
Idle assignments since admission	Number of idle assignments since admission		0.380	0.480
Misconducts since admission	Number of misconducts since admission		0.367	0.023
SFM since admission	Number of SFM since admission		1.132	0.834
Misconducts since admission × SFM	Interaction term for SFM and misconducts		−0.027
No misconducts since admission	Number of assessment periods without misconduct			−0.494
Length of stay at assessment	Months remaining until release/next assessment	0.637	0.406	0.440
Constant		−5.754	−4.301	−3.500

For both the men and the women, the intake models contain a number of prior criminal history predictors. Because it is necessary to assess individuals who have never been to prison before, prior criminal history provides a proxy measure for whether residents may be more likely to violate prison rules and regulations. The intake models, which predict the likelihood of meeting the criteria for SFM during the first 6 months in prison, also contain measures for offense type, prior history of misconduct, educational achievement, mental health, security threat group (STG) involvement (males only), marital status (males only), age at assessment, and length of time until reassessment or release (whichever comes first).

The 6-month reassessment models contain fewer measures of criminal offending, including index offense type and prior convictions. Instead, these models rely more on measures captured during the first 6 months in prison. In particular, the 6-month reassessment models contain a number of predictors relating to misconduct during the first 6 months in addition to an item that measures the number of times a resident was on idle status and, thus, not participating in programing or prison labor (males only). Age at assessment and length of the assessment period were also significant predictors.

The recurring models, which are applied once residents have been in prison for at least 12 months, do not contain any criminal history measures. Rather, these models rely exclusively on recent behavioral indicators that are more relevant to predicting misconduct. Moreover, both the female and male models include an item that measures whether residents have been free of discipline convictions during an assessment period. As a result, by abstaining from misconduct, residents can lower their MnSafeD score over time.
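As a rough illustration of how a logistic model of this kind turns item values into a probability, the sketch below applies the standard logistic transform to the Recurring column of Table 1 (female model). The coefficients are copied from that table, but the dictionary keys, the function name `sfm_probability`, and the item values are hypothetical. The source does not state how each item is coded or scaled, so the printed numbers should not be read as real MnSafeD estimates.

```python
import math

# Coefficients copied from the Recurring column of Table 1 (female model).
FEMALE_RECURRING = {
    "age": -0.022,
    "misconducts_since_admission": 0.038,
    "sfm_since_admission": 0.980,
    "periods_without_misconduct": -0.584,
    "length_of_stay": 0.591,
}
CONSTANT = -3.991


def sfm_probability(items, coefs=FEMALE_RECURRING, constant=CONSTANT):
    """Logistic transform of the linear predictor: 1 / (1 + exp(-(b0 + sum(b*x))))."""
    linear = constant + sum(coefs[name] * items[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-linear))


# Placeholder item values; the real coding and scaling are assumptions here.
resident = {
    "age": 30,
    "misconducts_since_admission": 4,
    "sfm_since_admission": 1,
    "periods_without_misconduct": 0,
    "length_of_stay": 1,
}
before = sfm_probability(resident)
after = sfm_probability({**resident, "periods_without_misconduct": 2})
print(round(before, 3), round(after, 3))  # the second value is lower
```

Because the coefficient on misconduct-free assessment periods is negative, adding such periods lowers the predicted probability, which is the mechanism by which residents can reduce their score over time.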

MnSafeD in Practice

All of the data used to populate the items on the MnSafeD are automatically extracted from the state’s criminal history repository and the Corrections Offender Management System (COMS), the MnDOC’s management information system. The MnSafeD was designed to automatically generate assessments on the day of intake and every 6 months thereafter. Yet, its design also allows MnDOC staff to manually generate new assessments, which generally take less than 10 seconds to produce, at any time during a person’s confinement period.

When the items within the MnSafeD have been populated with data, the statistical models produce a SFM probability over the next 6 months from the day the assessment is run. For individuals who will be released in less than 6 months from the assessment date, the MnSafeD generates a SFM probability from the assessment date until the day of release. An individual’s probability is used to generate a percentile ranking, which is their MnSafeD score. Due to variations in the SFM base rate across the three types of assessments (intake, 6-month, and recurring) by gender, the percentile ranking provides a more stable metric in comparison to the predicted probability.

The percentile rankings were based on the probability distributions from the test set for the men and women. To illustrate, let us assume a male has just been admitted to prison and, thus, receives a MnSafeD score based on the intake model. A probability of 20% would place this individual at the 89th percentile. Therefore, his MnSafeD score would be 89%. For women, a probability of 20% on the intake assessment would place an individual at the 62nd percentile, which would be their MnSafeD score.
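The conversion from a predicted probability to a percentile ranking can be sketched as below. The reference list is a made-up stand-in for a test-set probability distribution, and defining the percentile as the share of reference values at or below the probability is an assumption; the source does not spell out the exact convention.

```python
from bisect import bisect_right


def percentile_rank(probability, reference):
    """Share (0-100) of reference probabilities at or below `probability`."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, probability) / len(ordered)


# Hypothetical reference distribution standing in for a test-set distribution.
reference = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.18, 0.25, 0.40]
print(percentile_rank(0.20, reference))  # 80.0 with this toy reference
```

Using a separate reference distribution for each gender and model type would reproduce the pattern in the example above, where the same 20% probability maps to different percentiles for men and women.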

Revalidating the MnSafeD

The revalidation dataset consists of the first 8,631 residents (624 women and 8,007 men) in Minnesota’s prison system who were scored on the MnSafeD and whose follow-up period ended before January 2025. The predicted outcome measure is SFM, which was operationalized as five or more discipline convictions, three or more discipline convictions resulting in a segregation penalty, or one or more violent discipline convictions during the follow-up period. For example, if an individual had two discipline convictions during the follow-up period, but one of the convictions was for violent misconduct, then they would meet the SFM criteria. On the other hand, if a person had four non-violent discipline convictions during the follow-up period, only two of which resulted in a segregation penalty, then they would not meet the SFM criteria.
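The SFM criteria amount to a simple three-part rule, sketched here with the two examples from the paragraph above. The function and argument names are hypothetical.

```python
def meets_sfm(convictions, segregation_convictions, violent_convictions):
    """True if any of the three SFM criteria is met during the follow-up period."""
    return (
        convictions >= 5
        or segregation_convictions >= 3
        or violent_convictions >= 1
    )


# Two convictions, one of them violent: meets the criteria.
print(meets_sfm(convictions=2, segregation_convictions=0, violent_convictions=1))  # True
# Four non-violent convictions, two with a segregation penalty: does not.
print(meets_sfm(convictions=4, segregation_convictions=2, violent_convictions=0))  # False
```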

Consistent with prior research that has used multiple metrics to evaluate the predictive performance of risk assessment instruments (Duwe, 2019; Duwe & Rocque, 2017; Hamilton et al., 2016; Tollenaar & van der Heijden, 2013), this study used four different statistics to assess predictive validity. There are three main dimensions of predictive validity: 1) accuracy, 2) discrimination, and 3) calibration. For predictive accuracy, one of the more commonly-used metrics is accuracy (ACC), a threshold-based measure that looks at whether an assessment makes correct classification decisions. For example, if someone who met the SFM criteria had a predicted SFM probability less than 50%, then this individual would be incorrectly classified (i.e., false negative). Conversely, if this person had not met the SFM criteria, then they would be accurately classified (i.e., true negative). The ACC value ranges from 0% to 100%, and higher ACC values reflect greater accuracy in making correct classification decisions.
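A minimal sketch of the ACC calculation at the 50% threshold follows. The toy outcomes and probabilities are invented for illustration, and treating a probability of exactly 50% as a positive prediction is an assumption.

```python
def accuracy(outcomes, probabilities, threshold=0.5):
    """Share of cases where the thresholded prediction matches the outcome."""
    correct = sum(
        (p >= threshold) == bool(y) for y, p in zip(outcomes, probabilities)
    )
    return correct / len(outcomes)


# Toy data: 1 = met the SFM criteria, 0 = did not.
outcomes = [1, 0, 0, 1, 0]
probabilities = [0.7, 0.2, 0.6, 0.4, 0.1]
print(accuracy(outcomes, probabilities))  # 0.6 (one false positive, one false negative)
```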

The second dimension of predictive validity, discrimination, measures the degree to which an assessment separates—in this instance—those who engage in SFM from those who do not. This study used the AUC to evaluate the MnSafeD’s predictive discrimination. One of the most widely-used statistics to evaluate predictive performance, the AUC is relatively robust across different recidivism base rates and selection ratios (Smith, 1996). With values that range from 0 to 1, the AUC statistic is interpreted as the probability that a randomly selected individual who engaged in SFM had a higher score on the MnSafeD than a randomly selected resident who did not.
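The probability interpretation of the AUC can be computed directly by comparing every SFM case with every non-SFM case. The sketch below does this on invented toy data; counting tied scores as half a "win" is the usual convention and is assumed here. The pairwise loop is quadratic in the sample size, so it is meant for illustration rather than for datasets of several thousand assessments.

```python
def auc(outcomes, scores):
    """Probability that a random positive case outscores a random negative case."""
    pos = [s for y, s in zip(outcomes, scores) if y == 1]
    neg = [s for y, s in zip(outcomes, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: 1 = met the SFM criteria, 0 = did not.
outcomes = [1, 0, 0, 1, 0]
scores = [0.7, 0.2, 0.6, 0.4, 0.1]
print(round(auc(outcomes, scores), 3))  # 0.833 (5 of 6 pairs ordered correctly)
```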

Due to the relatively large sample size, especially for the men, nearly all AUC values were statistically significant at the .05 level. To assess the substantive importance of the results, this study relies on the guidelines provided by Rice and Harris (2005). Effect sizes are considered large if the AUC value is 0.714 or higher, medium if the value ranges from 0.639 to 0.713, small if the value ranges from 0.556 to 0.638, and negligible (or near random) if the value ranges from 0.500 to 0.555. In general, a risk assessment instrument is considered to have at least adequate predictive validity if the AUC value meets or exceeds 0.640.
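The effect size bands above can be expressed as a small lookup. The function name is hypothetical, and labeling values below 0.500 as negligible is an assumption, since the cited ranges start at 0.500.

```python
def auc_effect_size(auc_value):
    """Label an AUC using the Rice and Harris (2005) bands quoted above."""
    if auc_value >= 0.714:
        return "large"
    if auc_value >= 0.639:
        return "medium"
    if auc_value >= 0.556:
        return "small"
    return "negligible"


print(auc_effect_size(0.838))  # large
print(auc_effect_size(0.664))  # medium
```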

Calibration measures how well the predicted values from a model correspond with the observed outcome being predicted. Whereas predictive discrimination assesses relative risk, calibration taps into absolute risk. In order for a prediction instrument to make accurate absolute assessments of risk, the model’s predicted values must be calibrated with the observed SFM outcomes. With values that range from 0 to 1, root mean square error (RMSE) measures the square root of the average squared difference between observed SFM and predicted probabilities. The closer the RMSE value is to zero, the better the calibration.

In addition to these metrics, a consolidated metric was used to assess overall predictive performance. The SAR (squared error, accuracy, ROC [receiver operating characteristic]) is a combined measure of discrimination, accuracy, and calibration, and its formula is: (ACC + AUC + (1 – RMSE))/3 (Caruana et al., 2004). In previous recidivism prediction research that has used the SAR, values have ranged from 0.63 to 0.83 (Duwe, 2019; Hamilton et al., 2016; Tollenaar & van der Heijden, 2013).
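The RMSE and SAR calculations can be sketched as follows. The function names are hypothetical; the SAR inputs in the usage line are the overall (Total) AUC, ACC, and RMSE values reported in Table 4, which reproduce the overall SAR shown there.

```python
import math


def rmse(outcomes, probabilities):
    """Square root of the mean squared gap between outcomes (0/1) and probabilities."""
    squared = [(y - p) ** 2 for y, p in zip(outcomes, probabilities)]
    return math.sqrt(sum(squared) / len(squared))


def sar(acc, auc, rmse_value):
    """Combined metric: (ACC + AUC + (1 - RMSE)) / 3."""
    return (acc + auc + (1 - rmse_value)) / 3


print(rmse([1, 0], [0.5, 0.5]))  # 0.5
print(round(sar(acc=0.851, auc=0.838, rmse_value=0.348), 3))  # 0.78
```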

To test for the presence of predictive disparities by race/ethnicity, multiple logistic regression models were estimated (Flores et al., 2016; Hamilton et al., 2021). In doing so, this study determines whether race/ethnicity had an impact on the SFM outcomes independent of the MnSafeD score. If MnSafeD is balanced among racial/ethnic groups, then only the risk score will significantly predict the SFM outcome.
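One way to write a model of this kind is shown below. It is an assumed specification, since the source does not give the exact equation: $\mathrm{Score}_i$ denotes the MnSafeD score for assessment $i$, and $\mathrm{Group}_{ik}$ are hypothetical indicator variables for the racial/ethnic groups, with one group omitted as the reference category.

```latex
\operatorname{logit}\,\Pr(\mathrm{SFM}_i = 1)
  = \beta_0 + \beta_1\,\mathrm{Score}_i + \sum_{k=1}^{K} \gamma_k\,\mathrm{Group}_{ik}
```

Under this formulation, the instrument would be considered balanced when $\beta_1$ is statistically significant and none of the $\gamma_k$ coefficients are.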

Results

As shown in Table 3, 60% of the assessments were from the recurring model, 24% were from the intake model, and 16% were from the 6-month reassessment model. The results also show that 14% of the total sample met the criteria for SFM during the assessment follow-up period, which is a little higher than the reported rate in the original validation (Duwe, 2020). Women had a higher rate of SFM during the first 6 months following intake, whereas the men had a higher rate for the 6-month and recurring assessments.

Table 3.

Severe and Frequent Misconduct by Gender, Race/Ethnicity, and Model.

	Severe and frequent misconduct
Gender and Race/Ethnicity	Intake		6-Month		Recurring
	Rate (%)	N	Rate (%)	N	Rate (%)	N
Overall	12.1	2,046	12.1	1,419	16.0	5,166
Female	17.0	235	10.1	119	6.7	270
Black	22.6	31	40.0	10	3.6	56
Native American	18.3	71	20.0	25	12.5	56
Asian	14.3	7	0.0	4	0.0	7
Hispanic	17.7	17	0.0	11	16.7	18
White	14.7	109	4.4	69	4.5	133
Male	11.4	1,811	12.3	1,300	16.5	4,896
Black	14.6	686	19.7	473	22.1	2,061
Native American	18.9	196	16.2	117	27.4	340
Asian	4.6	44	3.0	33	2.7	149
Hispanic	10.8	83	9.8	51	14.8	263
White	7.4	802	6.7	626	10.5	2,083

The results in Table 3 also reveal disparities in SFM outcomes by race/ethnicity. For both the men and the women, residents identifying as Black and Native American had higher SFM rates than those identifying as Hispanic, Asian, and White. In fact, Asian residents generally had the lowest SFM rates, although the number of Asian women who received MnSafeD assessments was very small.

To evaluate the MnSafeD’s predictive performance, the AUC, ACC, RMSE, and SAR statistics were calculated, which are shown in Table 4. The results from all 8,631 assessments reveal the MnSafeD had an AUC of 0.84, which is well above the large effect size threshold reported by Rice and Harris (2005). In addition, the ACC value was 0.85, while the RMSE was 0.35. With SAR values ranging from 0.63 to 0.83 in prior corrections risk assessment research, the overall SAR value (0.78) is near the top end of this range.
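The SAR values can be recovered directly from the other three statistics. The short Python check below applies the SAR formula to the "Overall" row of Table 4; the function and variable names are illustrative, and the third block of columns is read here as the recurring model.

```python
def sar(auc, acc, rmse):
    """SAR = (ACC + AUC + (1 - RMSE)) / 3."""
    return (acc + auc + (1 - rmse)) / 3


# "Overall" row of Table 4: (AUC, ACC, RMSE, reported SAR) for each model.
table4_overall = {
    "Intake": (0.714, 0.879, 0.311, 0.761),
    "6-Month": (0.900, 0.910, 0.265, 0.848),
    "Recurring": (0.855, 0.854, 0.381, 0.776),
    "Total": (0.838, 0.851, 0.348, 0.780),
}

# Recompute the SAR from the three component statistics.
recomputed = {
    model: round(sar(a, c, r), 3)
    for model, (a, c, r, _) in table4_overall.items()
}

for model, (_, _, _, reported) in table4_overall.items():
    print(f"{model}: recomputed {recomputed[model]:.3f}, reported {reported:.3f}")
```

For the full sample, (0.851 + 0.838 + (1 – 0.348))/3 rounds to 0.780, which matches the reported value.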

Table 4.

MnSafeD Predictive Performance by Gender, Race, and Model.

Gender and Race/Ethnicity	Intake				6-Month				Recurring				Total
	AUC	ACC	RMSE	SAR	AUC	ACC	RMSE	SAR	AUC	ACC	RMSE	SAR	AUC	ACC	RMSE	SAR
Overall	0.714**	0.879	0.311	0.761	0.900**	0.910	0.265	0.848	0.855**	0.854	0.381	0.776	0.838**	0.851	0.348	0.780
Female	0.664**	0.823	0.371	0.705	0.756**	0.908	0.281	0.794	0.768**	0.815	0.383	0.733	0.709**	0.838	0.361	0.729
Black	0.756*	0.807	0.392	0.724	0.750	0.700	0.465	0.662	0.491	0.750	0.460	0.594	0.716*	0.763	0.440	0.680
Native Amer.	0.672	0.803	0.383	0.697	0.720	0.840	0.375	0.728	0.716	0.732	0.454	0.665	0.694*	0.783	0.410	0.689
Asian	0.833	0.857	0.344	0.782	N/A	1.000	0.187	N/A	N/A	1.000	0.032	N/A	0.882	0.944	0.232	0.865
Hispanic	0.607	0.875	0.362	0.707	N/A	0.909	0.173	N/A	0.956*	0.833	0.335	0.818	0.815*	0.867	0.315	0.789
White	0.627	0.830	0.359	0.699	0.638	0.971	0.202	0.802	0.808*	0.865	0.327	0.782	0.687**	0.881	0.316	0.751
Male	0.732**	0.886	0.303	0.772	0.912**	0.910	0.263	0.853	0.857**	0.825	0.381	0.767	0.847**	0.852	0.348	0.784
Black	0.684**	0.857	0.341	0.733	0.875**	0.863	0.326	0.804	0.823**	0.767	0.439	0.717	0.815**	0.800	0.404	0.737
Native Amer.	0.783**	0.816	0.366	0.744	0.902**	0.897	0.276	0.841	0.825**	0.771	0.438	0.719	0.835**	0.807	0.392	0.750
Asian	0.810	0.955	0.197	0.856	0.844	0.939	0.243	0.847	0.931**	0.926	0.243	0.871	0.896**	0.934	0.235	0.865
Hispanic	0.587	0.892	0.316	0.721	0.835**	0.942	0.247	0.843	0.865**	0.856	0.348	0.791	0.832**	0.874	0.330	0.792
White	0.722**	0.886	0.253	0.785	0.935**	0.944	0.205	0.891	0.868**	0.879	0.315	0.811	0.854**	0.901	0.285	0.823

Note. AUC = area under the curve; ACC = accuracy; RMSE = root mean squared error; SAR = squared error, accuracy, receiver operating characteristic.

*p < .05. **p < .01.

Notwithstanding this relatively high level of predictive performance overall, there were differences across gender, assessment type, and race/ethnicity. While the predictive accuracy and calibration statistics for women and men were similar, the AUC was much higher for the men. For instance, the AUC value for the 624 women assessed on the MnSafeD was 0.71, which was just below the large effect size threshold identified by Rice and Harris (2005). The overall AUC for the men was 0.85, a large difference that was also reflected in the SAR values for the women (0.73) and men (0.78).

The results also showed the two reassessment models performed much better than the intake assessment. The overall AUC for the intake assessment was 0.714, which meets—albeit barely—Rice and Harris’s (2005) large effect size threshold. However, the AUC value was 0.90 for the 6-month reassessment and 0.86 for the recurring assessment. While the 6-month reassessment had the smallest sample size (N = 1,419), an AUC value of that magnitude connotes a very high level of predictive discrimination. Indeed, the SAR value for this assessment (0.85) is greater than the top value reported in prior corrections risk assessment research.

The MnSafeD generally performed better for individuals identifying as Asian, although the overall sample size was less than 250. On the other hand, its performance was below the overall average for Native American individuals. For people identifying as Hispanic, Black, and White, the performance was mixed across gender and the type of predictive validity metric.

While the results from Table 4 indicate the MnSafeD has relatively high predictive accuracy overall, it is also important to examine whether its performance varies significantly by race and ethnicity. To this end, a total of 12 multiple logistic regression models were estimated across gender, race/ethnicity, and assessment type. The results from these models, which are presented in Table 5, reveal an absence of significant predictive disparities by race/ethnicity. None of the race/ethnicity covariates were statistically significant, while the MnSafeD score was the only statistically significant predictor in 10 of the 12 models. Overall, the findings suggest the MnSafeD has predictive parity by race/ethnicity.

Table 5.

MnSafeD Predictive Parity by Gender, Race/Ethnicity, and Model.

Gender, Race/Ethnicity and Overall	Total		Intake		6-Month		Recurring
	B	OR	B	OR	B	OR	B	OR
Overall
MnSafeD	4.894	133.553*	2.765	15.885**	9.399	12,082.017**	4.973	144.457**
Black	0.075	1.078	0.407	1.502	1.002	2.722	−0.032	0.968
Native American	0.630	1.877	0.211	1.235	2.808	16.578	0.361	1.435
Asian	−1.235	0.291	−2.037	0.130	1.688	5.409	−3.254	0.039
Hispanic	0.945	2.574	1.163	3.200	2.772	15.998	−0.182	0.834
Black by MnSafeD	0.157	1.171	−0.186	0.831	−0.696	0.498	0.178	1.195
NA by MnSafeD	−0.285	0.752	0.636	1.889	−2.757	0.063	−0.091	0.913
Asian by MnSafeD	0.507	1.661	2.196	8.986	−3.335	0.036	2.537	12.647
Hispanic by MnSafeD	−0.910	0.403	−1.497	0.224	−3.484	0.031	0.449	1.567
Constant	−5.241	0.005	−3.947	0.019**	−9.245	0.000**	−5.119	0.006**
Women
MnSafeD	2.094	8.114**	1.603	4.969	1.816	6.145	3.357	28.715*
Black	−0.151	0.860	−4.059	0.017	1.839	6.290	2.012	7.479
Native American	0.378	1.459	−0.204	0.816	0.564	1.757	0.974	2.649
Asian	−1.529	0.217	−1.641	0.194	−17.193	0.000	−16.314	0.000
Hispanic	−1.040	0.353	−0.026	0.974	−17.193	0.000	−28.633	0.000
Black by MnSafeD	0.363	1.437	5.131	169.145	0.940	2.559	−4.141	0.016
NA by MnSafeD	0.220	1.246	0.501	1.650	1.215	3.371	−0.463	0.629
Asian by MnSafeD	2.031	7.624	2.465	11.764	−1.816	0.163	−3.357	0.035
Hispanic by MnSafeD	1.625	5.078	−0.444	0.642	−1.816	0.163	32.174	93,997.500
Constant	−3.447	0.032**	−2.498	0.082**	−4.010	0.018**	−4.889	0.008**
Men
MnSafeD	5.230	186.717**	3.231	25.313**	11.306	81,332.423**	5.031	153.131**
Black	0.271	1.312	0.938	2.554	2.004	7.416	−0.068	0.934
Native American	0.340	1.404	−0.028	0.972	2.810	16.609	0.306	1.358
Asian	−2.618	0.073	−3.069	0.046	2.980	19.692	−3.226	0.040
Hispanic	0.680	1.973	1.535	4.642	4.492	89.288	0.004	1.004
Black by MnSafeD	−0.100	0.905	−0.787	0.455	−1.984	0.138	0.197	1.217
NA by MnSafeD	0.055	1.057	0.968	2.633	−2.882	0.056	0.025	1.025
Asian by MnSafeD	1.967	7.149	3.246	25.688	−4.638	0.010	2.446	11.540
Hispanic by MnSafeD	−0.608	0.545	−1.898	0.150	−5.359	0.005	0.197	1.217
Constant	−5.488	0.004**	−4.398	0.012**	−10.806	0.000*	−5.117	0.006**

Note. OR = odds ratio; MnSafeD = Minnesota Severe and Frequent Estimate for Discipline; NA = Native American.

*p < .05. **p < .01.
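To make the coefficients in Table 5 more concrete, the Python sketch below converts a coefficient (B) into an odds ratio and shows how the terms would combine into a predicted probability of SFM. It uses the Total column of the first (overall) block of the table. Two assumptions are made for illustration and are not confirmed by the text: White residents are the omitted reference group, and the MnSafeD score enters the model on a 0 to 1 scale. The function names are hypothetical.

```python
from math import exp

# B coefficients from the Total column, first (overall) block of Table 5.
CONSTANT = -5.241
B_SCORE = 4.894
GROUP_MAIN = {"Black": 0.075, "Native American": 0.630,
              "Asian": -1.235, "Hispanic": 0.945}
GROUP_BY_SCORE = {"Black": 0.157, "Native American": -0.285,
                  "Asian": 0.507, "Hispanic": -0.910}


def odds_ratio(b):
    """An odds ratio is the exponentiated logistic regression coefficient."""
    return exp(b)


def predicted_probability(score, group="White"):
    """Predicted probability of SFM for a given score and group.

    Assumes White is the reference category and score lies between 0 and 1.
    """
    logit = CONSTANT + B_SCORE * score
    if group != "White":
        logit += GROUP_MAIN[group] + GROUP_BY_SCORE[group] * score
    return 1 / (1 + exp(-logit))


# Close to the reported OR of 133.553; the gap reflects rounding of B.
print(round(odds_ratio(B_SCORE), 1))

for s in (0.1, 0.5, 0.9):
    print(s, round(predicted_probability(s), 3),
          round(predicted_probability(s, "Black"), 3))
```

The point estimates for the race/ethnicity terms still shift the predicted probabilities slightly in this sketch. The parity finding rests on those terms not being statistically distinguishable from zero, not on their being exactly zero.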

Discussion

The results from the MnSafeD revalidation reveal that it achieved a relatively high degree of accuracy in predicting SFM for residents in Minnesota’s prison system. Although studies on the performance of instruments designed to predict prison misconduct have been minimal, the MnSafeD’s overall AUC value (0.84) is greater than the level of predictive discrimination typically reported in research on recidivism risk assessment instruments. While the intake assessment had a large effect size overall, the reassessments (i.e., 6-month and recurring) performed much better, with AUC values exceeding 0.85. Likewise, although the MnSafeD’s performance was adequate for the women, it was substantially better for the men. The findings also indicate the MnSafeD had predictive neutrality by race and ethnicity.

The MnSafeD’s strong predictive performance is likely due to several factors. First, the MnSafeD was revalidated on a population similar to the one on which it was originally developed and validated, and prior research shows there are performance advantages associated with using a customized assessment (Duwe, 2024; Duwe & Rocque, 2018; Hamilton et al., 2022). Second, automated scoring eliminates inter-rater disagreement, and the increase in reliability leads to improved predictive accuracy. Third, rather than relying on one statistical model, or algorithm, applied at the time of intake, the use of multiple reassessment models yields better predictive performance. Indeed, the reassessment models achieved high levels of predictive accuracy by using more recent behavioral indicators and relatively few pre-prison measures such as criminal history. Finally, it is possible that operationalizing the outcome as severe and frequent misconduct, as opposed to any misconduct, may have helped enhance the instrument’s predictive performance. When the outcome was measured as any misconduct during the follow-up period, the AUC for the entire sample was 0.78. Prior studies using similar measures, such as violent misconduct (Berk et al., 2006; Cunningham & Sorensen, 2006) and misconduct resulting in segregation (Helmus et al., 2019; Labrecque, 2022; Labrecque & Smith, 2019), have reported relatively high levels of predictive validity.

The main limitation of this study is that the results reflect one state’s experience with implementing and revalidating a classification assessment system. Thus, the extent to which the findings generalize to other institutional corrections systems is unclear.

Implications for Research and Practice

Despite this limitation, the results from this study likely hold several implications for correctional research and practice. First, the findings provide further evidence that automated scoring and customization are critical design features that not only help increase reliability and validity, but are also important for establishing an efficient assessment process that can accommodate frequent reassessments.

Second, most of the residents in Minnesota’s prison system do not engage in any misconduct and, as shown above, only a small segment of the population has severe and/or frequent misconduct. These findings, which are consistent with what Austin (2003) reported more than two decades ago, may call into question how many different custody levels are needed within a prison system. Like many states, Minnesota has four custody levels—maximum, close, medium, and minimum. Yet, the pattern of misconduct observed in Minnesota suggests it may be possible to operate with fewer custody levels without jeopardizing the safety of residents and staff.

Third, reliable and valid classification systems can be used to improve institutional safety by not only identifying those at greatest risk of SFM, but also by taking steps to mitigate that risk. The conventional approach, at least within Minnesota, has been to place higher-risk residents in either close or maximum custody facilities, where programing resources are more limited. Moreover, to maximize the impact on recidivism, programing has generally been back-loaded toward the end of confinement. To produce better institutional and post-prison outcomes, however, higher-risk residents should be prioritized for programing at the time of intake, which would also be aligned with the risk principle from the RNR model (Andrews et al., 2006). After all, the findings from this study and prior research have shown that idle time elevates the risk for institutional misconduct (Austin, 2003; Duwe, 2020). Front-loading programing for those who are higher risk would likely decrease their chances of engaging in misconduct, which could, in turn, also lead to a more successful transition from prison to the community.

Finally, although the vast majority of state prison systems report using a classification assessment system (Duwe & Clark, 2025), there is a paucity of research on their performance. To this end, more studies are needed that evaluate the degree to which these systems are reliable and accurate in predicting misconduct. As with recent scholarship on recidivism risk assessment, future research should also examine whether classification assessments are achieving predictive neutrality across gender and race/ethnicity. Because classification systems influence the facilities where residents are placed and, by extension, the types and quantity of resources to which they may have access, determining whether individuals are being misclassified is paramount.

Footnotes

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Grant Duwe

Author Biography

Grant Duwe is the Director of Research and Evaluation for the Minnesota Department of Corrections, where he evaluates correctional programs, develops assessment instruments, and forecasts the state’s prison population. He is the author of two books and has more than 90 peer-reviewed publications on a wide variety of topics within corrections.

References

Andrews D. A., Bonta, Wormith S. J. (2006). The recent past and near future of risk and/or need assessment. Crime & Delinquency, 52(1), 7–27.

Austin (2003). Findings in prison classification and risk assessment. National Institute of Corrections.

Berk R. A., Kriegler, Baek (2006). Forecasting dangerous inmate misconduct: An application of ensemble statistical procedures. Journal of Quantitative Criminology, 22, 131–145. https://doi.org/10.1007/s10940-006-9005-z

Brower (2013). Correctional officer wellness and safety literature review. Office of Justice Programs, U.S. Department of Justice.

Camp S. D., Gaes G. G., Langan N. P., Saylor W. G. (2003). The influence of prisons on inmate misconduct: A multilevel investigation. Justice Quarterly, 20, 501–533.

Caruana, Niculescu-Mizil, Crew, Ksikes (2004, July 4–8). Ensemble selection from libraries of models [Conference session]. 21st International Conference on Machine Learning, Banff, Canada (pp. 1–12).

Chaiken M. R., Chaiken (1984). Offender types and public policy. Crime and Delinquency, 30, 195–226.

Clark, Duwe (2024). Sex differences in the effects of adverse childhood experiences on institutional misconduct among adults in prison. Journal of Interpersonal Violence, 40(1–2), 308–337.

Cunningham M. D., Sorensen J. R. (2006). Actuarial models for assessing prison violence risk: Revisions and extensions for the risk assessment scale for prison (RASP). Assessment, 13(3), 253–265. https://doi.org/10.1177/1073191106287791

Duwe (2019). Better practices in the development and validation of recidivism risk assessments: The Minnesota Sex Offender Screening Tool-4 (MnSOST-4). Criminal Justice Policy Review, 30(4), 538–564. https://doi.org/10.1177/0887403417718608

Duwe (2020). The development and validation of a classification system for severe and frequent misconduct. The Prison Journal, 100(2), 173–200. https://doi.org/10.1177/0032885519894587

Duwe (2024). Evaluating bias, shrinkage, and the home-field advantage: Results from a revalidation of the MnSTARR 2.0. Corrections: Policy, Practice and Research, 9(1), 20–42. https://doi.org/10.1080/23774657.2021.2011802

Duwe, Clark (2025). The gap between the ideal and reality in risk-needs-responsivity assessments: Results from a survey of U.S. state prison systems. Corrections: Policy, Practice and Research. Advance online publication. https://doi.org/10.1080/23774657.2025.2533132

Duwe, Clark, McNeeley (2023). Does criminal thinking predict prison misconduct? An evaluation of TCU’s Criminal Thinking Scales. Criminal Justice and Behavior, 50(6), 830–848. https://doi.org/10.1177/00938548231163111

Duwe, Hallett, Hays, Jang S. J., Johnson B. R. (2015). Bible college participation and prison misconduct: A preliminary analysis. Journal of Offender Rehabilitation, 54, 371–390.

Duwe, Rocque (2017). The effects of automating recidivism risk assessment on reliability, predictive validity, and return on investment (ROI). Criminology & Public Policy, 16, 235–269.

Duwe, Rocque (2018). The home-field advantage and the perils of professional judgment: Evaluating the performance of the Static-99R and the MnSOST-3 in predicting sexual recidivism. Law and Human Behavior, 42, 269–279.

Efron, Gong (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, 37, 36–48.

Flores A. W., Bechtel, Lowenkamp C. T. (2016). False positives, false negatives, and false analyses: A rejoinder to machine bias: There’s software used across the country to predict future criminals, and it’s biased against Blacks. Federal Probation, 80, 38–46.

French S. A., Gendreau (2006). Reducing prison misconducts: What works! Criminal Justice and Behavior, 33, 185–218.

Gaes G. G., Wallace, Gilman, Klein-Saffran, Suppa (2002). The influence of prison gang affiliation on violence and other prison misconduct. The Prison Journal, 82, 359–385.

Gendreau, Goggin C. E., Law M. A. (1997). Predicting prison misconduct. Criminal Justice and Behavior, 24, 414–431.

Gendreau, Little, Goggin (1996). A meta-analysis of the predictors of adult offender recidivism: What works! Criminology, 34, 575–607.

Gover A. R., Perez D. M., Jennings W. G. (2008). Gender differences in factors contributing to institutional misconduct. The Prison Journal, 88, 378–403.

Griffin, Hepburn (2006). The effect of gang affiliation on violent misconduct among inmates during the early years of confinement. Criminal Justice and Behavior, 33, 419–448.

Hamilton, Duwe, Kigerl, Gwinn, Langan, Dollar (2021). Tailoring to a mandate: The development and validation of the Prisoner Assessment Tool Targeting Estimated Risk and Needs (PATTERN). Justice Quarterly, 39(6), 1129–1155. https://doi.org/10.1080/07418825.2021.1906930

Hamilton, Kigerl, Campagna, Barnoski, Lee, Van Wormer, Block (2016). Designed to fit: The development and validation of the STRONG-R recidivism risk assessment. Criminal Justice and Behavior, 43(2), 230–263.

Hamilton, Kigerl, Kowalski (2022). Prediction is local: The benefits of risk assessment optimization. Justice Quarterly, 39, 722–744.

Hardyman P. L., Austin, Tulloch O. C. (2002). Revalidating external prison classification systems. National Institute of Corrections.

Helmus L. M., Johnson, Harris A. J. R. (2019). Developing and validating a tool to predict placements in administrative segregation: Predictive accuracy with inmates, including indigenous and female inmates. Psychology, Public Policy, and Law, 25(4), 284–302. https://doi.org/10.1037/law0000201

Johnson Listwan, Colvin, Hanley, Flannery (2010). Victimization, social support, and psychological well-being: A study of recently released prisoners. Criminal Justice and Behavior, 37, 1140–1159.

Labrecque R. M. (2022). Security threat management in prison: Revalidation and revision of the inmate risk assessment for segregation placement. The Prison Journal, 102(1), 47–63.

Labrecque R. M., Smith (2019). Reducing institutional disorder: Using the inmate risk assessment for segregation placement (RASP) to triage treatment services at the front-end of prison sentences. Crime and Delinquency, 65(1), 3–25. https://doi.org/10.1177/0011128717748946

Lovell, Jemelka (1996). When inmates misbehave: The costs of discipline. The Prison Journal, 76(2), 165–179. https://doi.org/10.1177/0032855596076002004

Reale K. S., Usman, Rodriquez (2024). An examination of prison-based programming and prison misconduct. Criminal Justice and Behavior, 52(3), 429–446. https://doi.org/10.1177/00938548241302474

Rice M. E., Harris G. T. (2005). Comparing effect sizes in follow-up studies: ROC Area, Cohen’s d, and r. Law and Human Behavior, 29(5), 615–620.

Smith (1996). The effects of base rate and cutoff point choice on commonly used measures of association and accuracy in recidivism research. Journal of Quantitative Criminology, 12, 83–111.

Steiner, Wooldredge (2014). Sex differences in the predictors of prisoner misconduct. Criminal Justice and Behavior, 41, 433–452.

Tewksbury, Connor D. P., Denney A. S. (2014). Disciplinary infractions behind bars: An exploration of importation and deprivation theories. Criminal Justice Review, 39, 201–218.

Tollenaar, van der Heijden P. G. M. (2013). Which method predicts recidivism best? A comparison of statistical, machine learning and data mining predictive methods. Journal of the Royal Statistical Society, Series A, 176(Part 2), 565–584.

Van Voorhis (2012). On behalf of women offenders: Women’s place in the science of evidence-based practice. Criminology & Public Policy, 11, 111.

Wolfgang M. E., Figlio R. M., Sellin (1972). Delinquency in a birth cohort. University of Chicago Press.

Wright J. D., Rossi P. H. (1986). Armed and considered dangerous: A survey of felons and their firearms. Aldine de Gruyter.