Reliability of Differential Item Functioning in Alcohol Use Disorder: Bayesian Meta-Analysis of Criteria Discrimination Estimates

Abstract

Numerous studies leverage item response theory (IRT) methods to examine measurement characteristics of alcohol use disorder (AUD) diagnostic criteria. Less work has examined the consistency of AUD IRT parameter estimates, an essential step for establishing measurement invariance, making statements about symptom diagnosticity, and validating the theoretical construct. A Bayesian meta-analysis of IRT discrimination values for AUD criteria across 33 independent samples (Total N = 321,998) revealed that overall consistency of AUD criteria discriminations was low (generalized intraclass correlation range = .105-.249). However, specific study characteristics accounted for substantial variability, suggesting that the unreliability is partially systematic. We replicated evidence of differential item functioning (DIF) via established factors (e.g., age, gender), but the magnitudes were small compared with DIF associated with assessment instrument. These results offer practical recommendations regarding which instruments to use when specific AUD criteria are of interest and which criteria are most sensitive when comparing demographic groups.

Keywords

alcohol use disorder Bayesian meta-analysis differential item functioning generalizability theory item response theory

The diagnosis and conceptualization of alcohol use disorder (AUD) has gone through substantial revision since the syndrome was first included among the personality disorders in earlier versions of the Diagnostic and Statistical Manual of Mental Disorders (i.e., DSM-I and DSM-II; American Psychiatric Association [APA], 1942, 1952).¹ In the current iteration of the DSM–Fifth edition (DSM-5; APA, 2013), AUD is diagnosed based on the presence of 2 or more of 11 different criteria that assess cognitive, behavioral, and physiological symptoms with severity being determined by the number of criteria endorsed. As statistical tools have become more advanced in recent decades, researchers have leveraged these advancements to further refine the field’s understanding of AUD. One such advancement has been the application of item response theory (IRT) to studying the AUD criteria.

Numerous IRT studies have been conducted on the AUD criteria set as instantiated in the 4th and 5th editions of the DSM (APA, 1994, 2013). The most common IRT model used in AUD research is the two-parameter logistic (2PL) model. The 2PL model estimates two parameters for each criterion based on a logistic model of (usually) binary endorsement: a threshold and discrimination parameter. Thresholds relate to how difficult it is to endorse a particular criterion. Consequently, many clinical measures are designed to include items/criteria that fall along a continuum of difficulty to ensure that the measure provides information about individuals who fall along various points of the latent spectrum (Reise & Waller, 2009). Within AUD research, studies have shown that certain AUD criteria like tolerance (having to drink a greater amount to achieve similar effects) tend to be relatively easy to endorse, while others such as withdrawal (insomnia/sweating/tremors/nausea within a few hours or days after stopping drinking) are more difficult (e.g., Martin et al., 2006).

In comparison, discrimination parameters relate to a criterion’s ability to distinguish between persons given their standing on the latent AUD spectrum. Higher discrimination values indicate that an item or criterion is better able to differentiate between persons at similar levels of the latent spectrum while lower values suggest that a criterion provides less information about individuals across varying levels of the latent spectrum. That is, a highly discriminating criterion is associated with a high probability of identifying individuals who will, versus will not, endorse the criterion within a narrow range along the latent spectrum. In contrast to criteria thresholds, past research suggests that both tolerance and withdrawal provide comparatively less information about latent AUD precision whereas interpersonal problems and giving up activities due to drinking have high discriminations (e.g., Martin et al., 2006; Saha et al., 2006). It is important to note that while thresholds often receive preferential attention given their association with symptom severity (Balsis et al., 2007; Cooper & Balsis, 2009; Langenbucher et al., 2004; Uebelacker et al., 2009), focused attention on any criterion given its relative severity is (arguably) only warranted to the extent that the corresponding discrimination is also high.

IRT studies of AUD criteria have addressed a range of research questions. For example, IRT-based results have shown that the AUD criteria set assesses a single latent dimension (e.g., Harford et al., 2009; Hasin et al., 2013), with some indication that it is most effective at identifying severe cases along the AUD continuum (Saha et al., 2006). Other studies suggest that AUD criteria are most informative for individuals along the moderate portion of the AUD latent spectrum (e.g., Borges et al., 2010). Despite the relatively large number of IRT studies on AUD criteria, less work has examined the overall consistency of the observed threshold and discrimination estimates across studies. In their recent meta-analysis of IRT threshold estimates, Lane et al. (2016) found notably low reliability in AUD criteria thresholds, and also found important sources of systematic variability in threshold values (e.g., measurement instruments and age group). The present study extends similar analyses to examine the between-study consistency of IRT-derived discrimination parameters, and examine competing hypotheses regarding the perceived consistency of AUD criteria discrimination values. Such analyses are critical for arguments asserting the generalizability of AUD phenomenology and diagnosis across demographic subpopulations.

IRT Analyses of AUD Criteria

An important topic within IRT analyses of the AUD criteria is the presence (or lack thereof) of differential item functioning (DIF). DIF analyses have tended to focus on whether the discrimination and/or threshold values for AUD criteria show variability across relevant sample characteristics (e.g., gender, age group, ethnicity, etc.). The presence of DIF for discrimination estimates may indicate that an AUD criterion’s ability to distinguish between individuals that fall along different points of the underlying AUD spectrum is dependent on factors that may differ across samples.

Important instances of DIF have been noted in the AUD IRT literature. For example, when examining the discriminations of AUD criteria in a sample of adolescents, the quit/cut down criterion showed relatively poor discrimination (Martin et al., 2006), consistent with other work that has found that adolescents with more severe alcohol abuse problems tend to not endorse trying to cut down while other adolescents with less severe problems do report attempts to limit their use (Chung & Martin, 2002).

In their analysis of nationally representative data from the National Epidemiological Survey of Alcohol and Related Conditions, Saha et al. (2006) found DIF across men and women, age groups, and ethnicity. With regard to discrimination values, the 24- to 44-year-old age group showed lower discrimination values compared with the ≥45 years old and 18- to 24-year-old age groups for the tolerance and hazardous use criteria. When considering ethnicity, African American participants showed lower discriminations for the tolerance, quit/cut down, and hazardous use criteria compared with White participants. By contrast, using another representative data set from the National Survey on Drug Use and Health across gender, age, and ethnicity, Harford et al. (2009) reported patterns not wholly consistent with the criteria identified by Saha et al. (2006; i.e., tolerance and hazardous use) while also reporting DIF across gender and age groups not identified by the those authors (e.g., giving up activities, role interference). When focusing on the specific AUD symptom of dependence across four studies, Witkiewitz et al. (2016) found that certain assessment items tied to alcohol dependence showed differential functioning across relevant sample characteristics. For example, the authors found that items related to physiological dependence and delirium tremens were less useful in assessing the degree of dependence severity in younger individuals and females, respectively.

Given that numerous studies have examined DIF with regard to discrimination estimates within studies, the consistency of DIF at a more global level can help inform future efforts in establishing whether AUD criteria discriminations from any given study are generalizable, and if not, what sources may be limiting the generalizability of the estimates.

The Current Study

Though IRT has been utilized to explore differences in threshold and/or discrimination values within the AUD criteria set, less work has examined the between-study consistency of the AUD criteria discriminations. Discrepancies in discrimination estimates across studies can provide important insights into diagnoses of AUD. Most important, in cases where the criteria discriminations are not comparable across different types of groups (e.g., gender, ethnicity, etc.), it suggests that the underlying composition of the latent AUD trait may differ. The implications of such findings would suggest that not only can the presentation of AUD look different across groups even when similar symptoms are endorsed, but that the underlying etiology may also differ.

The current study looks to more critically examine the overall consistency in AUD criterion discriminations across studies. Previous work has been conducted examining the consistency of threshold estimates (Lane et al., 2016) and the present study looks to apply the same analytical framework to examinations of discriminations. Specifically, we aim to test whether previously reported sources of DIF within samples (e.g., gender, ethnicity, age group) as well as commonly examined meta-analytic moderators (e.g., clinical vs. nonclinical samples, measurement instrument) emerge as systematic sources of discrimination variability. Given previous DIF findings regarding AUD discrimination estimates, we hypothesize that there will be notable sources of systematic variability in discrimination estimates, which will limit the likelihood that the AUD discrimination estimates will be consistent across studies. However, debates remain regarding whether the AUD criteria consistently distinguish between individuals at differing levels of AUD severity, and whether certain criteria are more discriminating than others (e.g., Martin et al., 2011; Martin et al., 2014; Sayette, 2016). The argument that the criteria consistently distinguish between individuals would lead to the prediction that the AUD criteria will be the primary source of variability in discrimination values across studies, especially if criteria differences reported by epidemiological investigations are to be considered generalizable (cf. Harford et al., 2009; Saha et al., 2006).

Method²

Literature Search

The present literature search incorporated all results returned from the initial publication conducted in May of 2015 (see Lane et al., 2016, for a full review). Briefly, the PubMed, Web of Science, ProQuest (including dissertations), and PsycINFO databases were searched with a time restriction of January 1977 to May 2015. The search terms entered were “item response theory” or “differential item functioning” crossed with “alcohol use disorder,” “alcohol abuse,” or “alcohol dependence.” Inclusion criteria for studies returned by these search terms were that the paper needed to report on discrimination and severity parameter estimates from IRT analyses that assessed DSM-III, DSM-III-R, DSM-IV, DSM-5, ICD-9, or ICD-10 AUD criteria, and that the results were derived from analyses of 2PL. In order to include recent studies that had used IRT analyses since the publication of the original meta-analysis, an updated search was conducted following the exact procedures outlined above with the exception that the search aimed to identify studies published between May of 2015 and May of 2018.

The updated search returned 74 articles for review. After removing duplicate articles (n = 30), a review of the remaining 44 articles identified 3 new articles to be added, resulting in a total of 33 nonoverlapping articles (with 52 samples providing discrimination estimates) included in our quantitative analyses. Though the number of included articles could be considered moderate, the total sample size (N = 321,998) for the present study was extremely large, even by meta-analytic standards. A flow chart of the literature search results is presented in Figure 1, which maps the path of study inclusion over the four phases of our systematic review: identification, screening, eligibility, and inclusion. Studies included in the meta-analysis are marked with an asterisk in the references.

Figure 1.

Flow diagram of updated literature search results.

Coded Information

In addition to the AUD criteria discrimination values, the information taken from each article included the study authors, year of publication, sample characteristics (i.e., mean age, clinical vs. population, gender ratio, etc.), sample size, AUD diagnostic instrument, diagnostic time frame, number of criteria assessed, and reporting metric for the discrimination values (unstandardized, standardized, IRT parameterized).³ The original data were coded by two independent raters and the raters showed near perfect agreement for the discrimination and severity values (intraclass correlations [ICCs] = .98-1.00) and showed good to excellent agreement for descriptive information taken from the studies (κ = .86-1.00). Due to only three unique studies being included from the updated literature search, the first and second author independently coded the three studies and results showed perfect agreement across all coded information.

For publications that examined DIF, we included discrimination estimates for each respective group in the analyses. We did not include aggregated estimates if such an analysis was conducted since they would be partially confounded with the individual DIF results. In this way, specific DIF identified in different publications was not explicitly examined, but was included and appropriately weighted to estimate meta-analytic evidence of DIF. We considered whether DIF in general across the criteria set existed, and, if so, for what criteria and factors was it robust across the extant literature.

Analyses

Our primary interest was in the overall reliability of the discrimination estimates for the AUD criteria set across studies. As such, we made use of GT (Brennan, 1992) to estimate generalized ICCs (Shrout & Fleiss, 1979) within a Bayesian framework. Two models were examined. The first model is expressed in Equation 1,

P_{c s} = μ + C_{c} + S_{s} + e_{c s}

(1)

where P_cs represents the discrimination estimate for criterion c from study s. µ represents the grand mean of all criteria discrimination estimates. C_c is the tendency for criterion c to be more or less discriminative across samples while S_s is the tendency for study s to produce higher or lower discriminations across the criteria set on average. The reliability of the estimated discrimination values for a fixed criterion set (i.e., C_c from Equation 1) for any randomly selected study can be examined using Equation 2:

R_{1 F} = \frac{σ_{C R I T E R I O N}^{2}}{σ_{C R I T E R I O N}^{2} + σ_{E R R O R}^{2}}

(2)

Equation 2 provides a reliability estimate between 0 and 1, with higher values indicating a greater level of reliability for the AUD criteria set. In the present context, reliability refers to the ability of the criteria set to consistently account for variability observed in discrimination values across studies, relative to other sources of variability. In the basic model, the other sources of variability are aggregated into either a study component or error component.

The basic model in Equation 1 was then expanded to estimate additional variance components for diagnostic instrument, diagnosis time frame, sample type, gender, and age along with their interactions with AUD criteria (Equation 3).

\begin{array}{l} P_{c s i t n m a g} = μ + C_{c} + S_{s} + I_{i} + T_{t} + N_{n} + M_{m} + A_{a} \\ + G_{g} + {(C I)}_{c i} + {(C T)}_{ct} + {(C N)}_{cn} + {(C M)}_{c m} \\ + {(C A)}_{c a} + e_{c s i t n m a g} \end{array}

(3)

In the expanded model P_csitnmag represents the discrimination parameter estimate for criterion, c, from study, s, where study was indexed by specific instrument (i; Seven categories: Alcohol Use Disorder and Associated Disabilities Interview Schedule [AUDADIS]; Composite International Diagnostic Interview [CIDI]; Semistructured Assessment for the Genetics of Alcoholism [SSAGA]; Substance Abuse and Mental Health Services Administration [SAMHSA]; Psychiatric Research Interview for Substance and Mental Disorders [PRISM]; Structured Clinical Interview for DSM-IV [SCID]; Other), AUD diagnostic time frame (t; Two categories: current, lifetime), sample composition (n; Four categories: clinical, healthy, mixed, population), gender distribution (m; Five Categories: exclusively men, primarily men, approximately equal men and women, primarily women, exclusively women), age distribution (a; 5 categories: <18 years, primarily 18-30 years, primarily, 30-50 years, >50 years, representative of the population 18 years and older), and being part of a group of studies that used the same or a partially overlapping sample (g). µ, C_c, and S_s are the same as in Equation 1, but now we include effects for instrument (I_i), diagnosis time frame (T_t), sample population (N_n), gender composition (M_m), age group (A_a), and being part of a group of studies that used overlapping samples (G_g). Two-way interaction terms are also included to examine additional sources of systematic variance in reliability. In the expanded model, the consistency of criterion discriminations for any randomly selected study that is not attributable to instrument, time frame, population, gender, or age can be examined using Equation 4.

R_{1 R} = \frac{σ_{C R I T E R I O N}^{2}}{(\begin{matrix} σ_{C R I T E R I O N}^{2} + σ_{C R I T E R I O N * I N S T R U M E N T}^{2} + \\ σ_{C R I T E R I O N * T I M E F R A M E}^{2} + σ_{C R I T E R I O N * C L I N I C A L}^{2} + \\ σ_{C R I T E R I O N * G E N D E R}^{2} + σ_{C R I T E R I O N * A G E}^{2} + σ_{E R R O R}^{2} \end{matrix})}

(4)

Similar to Equation 2 above, Equation 4 provides a reliability estimate between 0 and 1, with higher values indicating that the criteria set account for the most variability in discrimination estimates across studies, relative to other potential sources of variability (e.g., an interaction between a particular criterion and measurement instrument). The analyses described were all completed using a Bayesian approach. Sample sizes from each included analysis were used to weight sets of criteria discriminations to adjust for differences in precision when calculating met-analytic estimates and confidence intervals (DuMouchel, 1990).

Although the various benefits of Bayesian methods have been emphasized elsewhere (e.g., Kruschke & Liddell, 2018; Wagenmakers et al., 2016), we chose to rely on Bayesian analyses in the present study in order to test different beliefs regarding whether the AUD criteria set would be the primary source of variability in discrimination estimates across studies.⁴ We made use of the Savage-Dickey method (Wagenmakers et al., 2010), allowing for approximate Bayes factors (BF) to be computed. The Savage-Dickey method provides an approximate BF by comparing the prior and posterior distribution densities at a specified point estimate (typically zero, the null hypothesis). In turn, the BF describes how much more (or less) likely the point estimate is after having seen the data. Thus, Bayesian analysis allows for direct quantification of evidence for or against particular point hypotheses.

In the context of the present study, BFs were computed to test point hypotheses regarding the amount of variability in discrimination estimates that could be accounted for by the AUD criteria set. Specifically, we examined the influence of two different types of prior specifications for the σ2criterion parameter.⁵ First, we used moderately informed priors (σ2criterion ~ N [.60, .035]) that reflect some degree of confidence that the AUD criteria will be the primary source of variability in discrimination estimates before having seen the data. Next, we used strongly informed priors (σ2criterion ~ N [.60, .001]) that reflect very strong beliefs that the AUD criteria set will be the primary source of variability in discrimination estimates. By comparing the likelihoods of σ2criterion parameter estimates under the prior and posterior distributions at the point estimate of .60, BFs can be computed to quantify the degree of evidence in favor of (or against) beliefs that the AUD criteria set will account for a moderate to large degree of variability (i.e., σ2criterion = .60) in discrimination estimates across studies.

Results

Figure 2 displays the plotted IRT discrimination estimates for raw and standardized values. The criteria are plotted in ascending order according to their median discrimination value across studies (Table 1). Both the raw and standardized discrimination values show notable degrees of systematic unreliability in that, (1) there is marked variability within each criterion category across studies, and (2) there are multiple patterns of opposing change (larger/longer, cut down, role interference, time spent, interpersonal problems) across the different continuous study series. However, the standardized data show an increasing linear trend in discrimination values, suggesting some degree of reliability in the criteria. The differences between the unstandardized and standardized results is a consequence of differing within-study means and variances, which are eliminated when standardized data are used due to all discrimination estimates being placed on the same scale. The choice to use standardized results may be justifiable if the between-study differences in scaling is viewed as nuisance variance as opposed to being indicative of meaningful study-specific variability. Therefore, although we primarily focus on unstandardized values for interpretation, we present results for both unstandardized and standardized discrimination values in Table 2.⁶

Figure 2.

Unstandardized (a) and standardized (b) discrimination estimates for AUD criteria.

Table 1.

Descriptive Information for Included Studies.

Article	Sample^a	Instrument^b	Sample size	Time frame	# Criteria
Casey et al. (2012); Lane and Sher (2015)^c	NESARC Wave 2	AUDADIS-IV	22,177^g	Current	11
Dawson et al. (2010) ^d	NESARC Wave 1	AUDADIS-IV	26,946^g	Current	11
Saha et al. (2006) ^d	NESARC Wave 1	AUDADIS-IV	22,526^h	Current	10
Saha et al. (2007) ^d	NESARC Wave 1	AUDADIS-IV	20,846ⁱ	Current	11
Shmulewitz et al. (2010) ^d	Israeli households	AUDADIS-IV	1,066	Current	11
Shmulewitz et al. (2010) ^d			1,160	Lifetime	11
Keyes et al. (2011)	NLAES	AUDADIS-IV	18,352ⁱ	Current	12
Preuss et al. (2014)	WHO/ISBRA	AUDADIS-based	711	Lifetime	11
	Australia		104
	Brazil		212
	Canada		227
	Finland		86
	Japan		82
Mewton et al. (2011a); Proudfoot et al. (2006)	NSMHWB	CIDI V2.0	7,746	Current	11
Mewton et al. (2011b)	NSMHWB	CIDI V2.0	853^j	Current	11
McCutcheon et al. (2011)	COGA	SSAGA	8,605	Lifetime	9
	Non-DUI Men		3,056
	Non-DUI Women		3,894
	DUI Men		1,330
	DUI Women		325
Beseler et al. (2010)	College students	Survey-specific	353	Current	10^o
Hagman (2017) ^c	College Students	Brief DSM-5 AUD Assessment	923^g	Current	11
Hagman and Cohn (2011)	College students	CIDI-SAM	396	Current	11
Ehlke et al. (2012)	NSDUH 2009	SAMHSA	4,605^k	Current	11
Kuerbis et al. (2013b)	NSDUH 2009	SAMHSA	3,412^l	Current	11
Hagman and Cohn (2013)	NSDUH 2009	SAMHSA	3,806^m	Current	11^p
Rose et al. (2012)	NSDUH 2002-2008	SAMHSA	9,356ⁿ	Current	11
Harford et al. (2009)	NSDUH 2002-2005	SAMHSA	133,231	Current	11
	Men, Age 12-17 years		11,651
	Men, Age 18-25 years		27,377
	Men, Age 26+ years		25,872
	Women, Age 12-17 years		12,304
	Women, Age 18-25 years		29,331
	Women, Age 26+ years		26,696
Srisurapanont et al. (2012)	Thai-NMH Survey	MINI-Thai	3,718	Current	7
	Men		3,174
	Women		544
	Adolescents		272
	Adults		3,446
Duncan et al. (2011)	MOAFTS	SSAGA	2,835	Lifetime	11
	Women, Age 18-20		1,158
	Women, Age 21-25		1,677
Derringer et al. (2013)	MTFS & SAGE	SSAGA	6,597	Lifetime	7
Gilder et al. (2011) ^e	American Indians	SSAGA	530	Lifetime	10
Gelhorn et al. (2008)	Mixed Adolescents^f	CIDI-SAM	5,587	Lifetime	11
Caetano et al. (2016) ^c	Puerto Rican adults	CIDI-Spanish	1,107^g	Lifetime	11
Castaldelli-Maia et al. (2015) ^c	Brazilian adults	CIDI-Portuguese	936ⁱ	Current	11
Bond et al. (2012); Borges et al., (2010, 2011); Cherpitel et al. (2010)	ED patients	CIDI V1.0	3,191	Current	12
	Argentina		662
	Mexico		547
	Poland		1,098
	USA		884
Hasin et al. (2012) ^d	Clinical	PRISM	543	Current	11
Langenbucher et al. (2004)	Clinical	CIDI-SAM	372	Lifetime	9
Wu et al. (2009)	Clinical	DSM-IV checklist	462	Current	7
Wu et al. (2012)	Clinical	DSM-IV checklist	671	Current	7
Martin et al. (2006)	Clinical adolescents	SCID	464	Lifetime	11
Edwards et al. (2013)	VATSPSUD	SCID	7,454	Lifetime	11
Kuerbis et al. (2013a)	SARD	SCID	461	Lifetime	11

NESARC = National Epidemiological Study on Alcohol and Related Conditions; NLAES = National Longitudinal Alcohol Epidemiologic Study; WHO/ISBRA = World Health Organization/International Society on Biomedical Research Collaborative Study; NSMHWB = National Survey of Mental Health and Well-Being (Australia); COGA = Collaborative Study on the Genetics of Alcoholism; NSDUH = National Survey on Drug Use and Health; Thai-NMH Survey = Thai National Mental Health Survey; MOAFTS = Missouri Adolescent Female Twin Study; MTFS = Minnesota Twin Family Study; SAGE = Study of Addiction: Genes and Environment; ED = Emergency Department, VATSPSUD = Virginia Adult Twin Study of Psychiatric and Substance Use Disorders; SARD = Substance Abuse Research Demonstration. ^bAUDADIS-IV = Alcohol Use Disorder and Associated Disabilities Interview Schedule–IV; CIDI = Composite International Diagnostic Interview; SSAGA = Semistructured Assessment for the Genetics of Alcoholism; SAMHSA = Substance Abuse and Mental Health Services Administration; SAM = Substance Abuse Module; MINI-Thai = Mini International Neuropsychiatric Inventory, Thai module; PRISM = Psychiatric Research Interview for Substance and Mental Disorders; SCID = Structured Clinical Interview for DSM-IV. ^cAdditional studies included in the updated analysis. ^dWe could not confirm the reported metric for the item response theory (IRT) parameters with the authors, but based on the description and software used an IRT parameterization seemed likely. Differential item functioning analysis estimates could also not be acquired, so the aggregate results were used. ^eWe selected the authors’ “once per month” binge drinking criteria for comparison of the IRT thresholds. Using the other criteria resulted in trivially different associations. ^fCombination of community, adjudicated, and clinical individuals. ^gPast year drinkers. ^h≥12 Drinks in past year and ever drank 5+ drinks on ≥1 occasion. ⁱ≥12 Drinks in past year. ^jYoung adult (18-24 years) subsample only. ^kCollege students. ^lAge 50+ years. ^mNoncollege, Age 18-25 years. ⁿAdolescent and young adult drinkers (12-21yrs) only. ^oAuthors created a combined measure of interpersonal and legal problems criteria. ^pTolerance severity not reported.

Table 2.

Bayesian Variance Parameter Estimates for Basic and Expanded Models Using Unstandardized and Standardized Discrimination Values.

Parameter	Basic model (Equation 1)		Expanded model (Equation 3)
Parameter	Median estimate (SD)	95% HDI	Median estimate (SD)	95% HDI
Unstandardized estimates
$σ_{C R I T E R I O N}^{2}$	0.066 (0.043)	[0.021, 0.156]	0.031 (0.036)	[0.002, 0.107]
$σ_{C R I T E R I O N}^{2}$	0.419 (0.099)	[0.322, 0.746]	0.217 (0.086)	[0.073, 0.393]
$σ_{C R I T E R I O N}^{2}$			0.105 (0.365)	[0.001, 0.803]
$σ_{C R I T E R I O N}^{2}$			0.109 (174.3)^a	[0.001, 2.57]
$σ_{C R I T E R I O N}^{2}$			0.081 (0.617)	[0.002, 0.754]
$σ_{C R I T E R I O N}^{2}$			0.035 (0.419)	[0.002, 0.448]
$σ_{C R I T E R I O N}^{2}$			0.108 (0.527)	[0.002, 0.787]
$σ_{C R I T E R I O N}^{2}$			0.096 (0.104)	[0.002, 0.322]
$σ_{C R I T E R I O N}^{2}$			0.073 (0.024)	[0.034, 0.124]
$σ_{C R I T E R I O N}^{2}$			0.017 (0.015)	[0.002, 0.050]
$σ_{C R I T E R I O N}^{2}$			0.012 (0.010)	[0.002, 0.035]
$σ_{C R I T E R I O N}^{2}$			0.008 (0.006)	[0.002, 0.022]
$σ_{C R I T E R I O N}^{2}$			0.006 (0.004)	[0.002, 0.015]
$σ_{C R I T E R I O N}^{2}$	0.199 (0.013)	[0.174, 0.225]	0.105 (0.011)	[0.119, 0.161]
ICC
Estimate (95% HDI)	0.249 [0.105, 0.448]		0.105 [0.010, 0.294]
Standardized estimates
$σ_{C R I T E R I O N}^{2}$	0.438 (0.299)	[0.135, 1.089]	0.189 (0.171)	[0.006, 0.543]
$σ_{C R I T E R I O N}^{2}$			0.202 (0.059)	[0.106, 0.327]
$σ_{C R I T E R I O N}^{2}$			0.021 (0.032)	[0.002, 0.085]
$σ_{C R I T E R I O N}^{2}$			0.036 (0.031)	[0.003, 0.104]
$σ_{C R I T E R I O N}^{2}$			0.019 (0.017)	[0.002, 0.055]
$σ_{C R I T E R I O N}^{2}$			0.016 (0.013)	[0.002, 0.044]
$σ_{C R I T E R I O N}^{2}$	0.582 (0.037)	[0.514, 0.656]	0.399 (0.029)	[0.345, 0.458]
ICC
Estimate [95% HDI]	0.430 [0.232, 0.676]		0.209 [0.011, 0.439]

Note. Estimates for Model 1 are based on 25,000 Markov Chain Monte Carlo (MCMC) iterations, while estimates for Model 2 are based on 75,000 MCMC iterations; all posterior distributions for the model estimates are inverse gamma distributions given that they are variance parameters. 95% HDI = 95% highest density interval.

The observed SD for the time frame main effect is a result of notable outliers produced over the course of MCMC sampling. After removing outliers (i.e., values greater than 50), the $σ_{C R I T E R I O N}^{2}$ SD = 2.48 and is based on 74,809 iterations.

The unstandardized results for the basic model show an estimated ICC of .249 (95% HDI⁷ [0.105, 0.448]) when only considering two potential sources of variability in discrimination values: criteria and study-specific variance. The results for the expanded model show even lower reliability (ICC = .105, 95% HDI [0.010, 0.294]). The results for the expanded model show that the primary source of systematic unreliability in discrimination estimates was the criterion × instrument interaction (σ² = .073, 95% HDI [0.034, 0.124]).

Further exploration of the criterion × instrument parameter showed that the types of measures that contributed significant variability to discrimination estimates (for certain criteria) were the AUDADIS, SSAGA, PRISM, SCID, CIDI, and SAMSHA, with the latter three measures showing significant effects for multiple criteria. Compared with the average discrimination estimate across all studies, studies that used the AUDADIS produced a higher discrimination value for social problems (b = .27, HDI [0.01, 0.55])⁸ while the SSAGA produced a higher discrimination value for role interference (b = .31, HDI [0.04, 0.59]). The PRISM produced a lower discrimination for the hazardous use criterion (b = −.61, HDI [−1.09, −0.21]). The SCID produced higher discrimination values for two criteria: hazardous use (b = .37, HDI [0.05, 0.73]) and larger/longer (b = .40, HDI [0.07, 0.76]). The CIDI produced higher discrimination estimates than the average study for both the give up activities (b = .31, HDI [0.04, 0.59]) and time spent (b = .36, HDI [0.09, 0.63]) criteria, but a lower discrimination estimate for the role interference criterion (b = −.30, HDI [−0.56, −0.04]). The SAMSHA produced higher discrimination estimates for both the role interference (b = .28, HDI [0.004, 0.56]) and social problems (b = .32, HDI [0.04, 0.60]) criteria, but a lower estimate for time spent (b = −.34, HDI [−0.62, −0.07]). Table 3 provides the posterior estimates for all criterion × instrument interactions (also see online supplementary Figure S1).

Table 3.

Posterior Median Discrimination Values for Instrument × Criterion Interactions.

	CRAV	CUTDOWN	GIVEUP	HAZUSE	HPROB	LEGAL	LRG/LONG	ROLINT	SPROB	TIME	TOL	WITD
AUDADIS	.02	−.09	.10	−.03	.11	−.12	−.03	.11	.27	−.12	−.16	−.16
CIDI	−.05	−.02	.31	−.05	.05	.01	.12	−.30	−.20	.36	−.11	−.05
SSAGA	−.12	−.09	−.04	−.17	−.06	.02	−.19	.38	.06	−.03	−.14	.28
SAMHSA	.01	−.18	.00	.18	.19	.02	−.21	.28	.32	−.34	−.15	−.15
PRISM	.00	.13	.13	−.61	.13	.00	.23	.11	−.14	.06	−.01	.17
SCID	.00	.02	−.07	.37	−.26	−.26	.40	−.27	−.25	.40	.15	−.08
OTHER	−.01	.07	−.03	−.19	−.03	.17	−.16	−.29	.11	.09	.05	.04

Note. Bolded effects indicate that the 95% credible interval for the point estimate does not include zero; AUDADIS-IV = Alcohol Use Disorder and Associated Disabilities Interview Schedule–IV; CIDI = Composite International Diagnostic Interview; SSAGA = Semistructured Assessment for the Genetics of Alcoholism; SAMHSA = Substance Abuse and Mental Health Services Administration; PRISM = Psychiatric Research Interview for Substance and Mental Disorders; SCID = Structured Clinical Interview for DSM-IV.

Despite expanding the number of variance parameters, there remained a main effect due to study-specific effects (σ² = .217, 95% HDI [0.073, 0.393]), indicating that there was notable within-study variability in discrimination estimates that was not attributable to the criteria or other sources of variability. It should be noted, however, that for some model parameters there was little variability, which limited the opportunity to examine meaningful differences. For example, many of the included samples were made up of equal parts men and women (n = 26; 50%). With regard to age, only six samples (11.54%) were composed of adolescents and only one sample was made up of elderly adults (1.92%; see online supplementary Figure S1).

Influence of σ²_CRITERION Prior Distributions

Figure 3⁹ shows the prior and posterior distributions for the basic and expanded models that made use of moderately informed priors on the σ2criterion parameter. In both models, the estimate of .60 for the σ2criterion parameter is less likely after the data are observed. In the basic model, the ratio of densities suggests strong evidence that a value of .60 is unlikely given the data (BF₁₀ ≈ .09). Specifically, a value of .60 is about 11 times less likely after having seen the data (i.e., 1/.09 = 11.11). Similar results were observed for the expanded model. In the expanded model, the ratio of densities also suggests strong evidence that a value of .60 is unlikely given the data (BF₁₀ ≈ .10). Unsurprisingly, the basic and expanded models with moderately informed priors produced higher estimates for both the σ2criterion and ICC median estimates due to the influence of the priors on the σ2criterion parameter (Unstandardized Basic Model: σ2criterion = .15; ICC = .43; Unstandardized Expanded Model: σ2criterion = .14; ICC = .34).

Figure 3.

Effects of $σ_{C R I T E R I O N}^{2}$ priors on $σ_{C R I T E R I O N}^{2}$ posterior distributions for basic and expanded model.

The results for the strongly informed priors for the basic and expanded models yielded much different results. The BFs for the basic and expanded models were BF₁₀ ≈ .97 and BF₁₀ ≈ .90, respectively. Both BFs which indicate that a hypothesis of σ2criterion = .60 is just as likely after having seen the data as it was before having seen the data.¹⁰ Thus, the posterior distributions for both models were effectively uninfluenced by any observed data and instead determined almost entirely by the strong priors on σ2criterion.

Discussion

The present results show that for any randomly selected study, the reliability of AUD criteria discrimination estimates is low, though not trivial, ranging from .11 to .43 depending on the model and scaling. The results also show that diagnostic instrument was the largest source of systematic unreliability in the average discrimination estimates. Other, more commonly investigated factors, such as gender, age, diagnostic status, and diagnostic timeframe were estimated to contribute to some degree to DIF across criteria discriminations, but these effects were less than 10% to 25% the magnitude of the instrument effect.

The lack of consistency in discrimination estimates is not inherently problematic in cases where all criteria discriminations are high, which was the case in general (see Figure 2a). Yet, there remained significant variability in discrimination estimates across criteria that was attributable to study-specific effects in the expanded model, indicating that criteria, and consequently diagnostic, precision can still be improved. If such factors are acknowledged in advance then they could reasonably be argued as true signal to be counted toward reliability (Lane et al., 2016); in which the reported estimates could be altered for the expanded models and the corresponding range indicating fair consistency (ICC range = .578-.603). In addition, if we consider the entirety of the AUD IRT literature as a whole, averaging across them (k = 52) as in the classic ICC(3, k) case (Shrout & Fleiss, 1979), we estimate a range of .865 to .941, indicating substantial reliability. These are hypothetical estimates that indicate stability in the AUD criteria on average; but caution should be taken before interpreting them too optimistically, as most researchers can only conduct a single study at a time and do not have the luxury of pooling results across 52 samples. Indeed, the BFs derived from comparing the likelihood of specific point estimates of σ2criterion showed that the present data are inconsistent with a belief that the AUD criteria would be a reliable source of discrimination variability for any given study. When contrasting the results of the moderately informed priors with the strongly informed priors, one can see just how strongly one must believe in AUD criteria reliability to maintain that belief. Specifically, the belief must be strong enough so as to ignore all the present data. Though the results from the basic and expanded models with strong priors are not incorrect per se, the models necessitate a position that appears untenable—one must ignore the data made available in the present study.

Practically speaking, the present findings for discrimination estimates suggest that which AUD criteria are better or worse at differentiating between individuals given their standing on the latent AUD spectrum are unlikely to be consistent from study to study. When considering the generally low reliability of discrimination estimates alongside the similarly low reliability observed for threshold estimates (Lane et al., 2016), the current results underscore the fact that discrimination/threshold estimates based on AUD criteria from a given study are unlikely to generalize to a separate IRT study of AUD criteria. These results have important implications for conceptualizations of AUD as a clinical construct. They also likely extend to other latent variable models of psychopathology to the extent to which consistency of factor structure is overlooked and otherwise judged by factor numeracy and/or item/factor specificity (Watts et al., 2020).

Implications for Clinical Assessment of AUD

One of the more important implications of the present study is that the choice of AUD assessment instrument may lead to substantively different conclusions regarding AUD diagnoses. For example, the SAMSHA and SSAGA instruments produced higher discrimination estimates for the role interference criterion compared with other measures. In turn, these measures assign greater weight to role interference as an indicator of the underlying AUD spectrum than do other measures. On the other hand, the CIDI produced a lower discrimination estimate for the role interference criterion. Ultimately, the results suggest that what may seem like an inconsequential decision (i.e., which assessment to choose for AUD) can result in important consequences for any given study’s understanding of the underlying disorder. The results displayed in Table 3 also offer information to clinicians and researchers about the strengths and/or limitations of particular AUD instruments in relation to their abilities to discriminate between individuals at various levels of the AUD latent spectrum. The data do not indicate that one particular measure is particularly better or worse than other measures. Instead, the results show that depending on what researchers or clinicians may be most interested in, some measures may be better suited for their interests compared with others.

For example, if a researcher is interested in ensuring that the criteria of tolerance and withdrawal reliably discriminate between individuals at differing levels of the AUD spectrum, the present results show that there are no meaningful differences across measures in assessing the discriminatory ability of tolerance and withdrawal criteria. Thus, the researcher can be relatively confident that the assessment instrument they choose will not result in systemic bias with regard to those particular criteria. However, in cases where a researcher is interested in reliably assessing role interference as a means to discriminate between individuals, the data have clear suggestions. The SSAGA and SAMSHA both provide relatively higher discrimination estimates for the role interference criterion, while the CIDI provides relatively lower estimates.

Last, the present results provide important information that can inform both past and future IRT studies on AUD, as well as other disorders. As previously noted, there have been inconsistent findings regarding which AUD criteria have displayed DIF when examining discrimination estimates (e.g., Saha et al., 2006; Harford et al., 2009). Our results suggest that systematic sources of variability (e.g., measurement) may help explain why discrimination values from a given study do not generalize to another study. This point is particularly important given the influence IRT studies have had on refining the conceptualization and diagnosis of AUD. For example, IRT studies provided strong evidence that the AUD criteria assessed a single latent continuum, leading to the distinction between abuse and dependence criteria being eliminated in the DSM-5 (Hasin et al., 2013). Furthermore, IRT studies are frequently cited as reasons for eliminating the “legal trouble” criterion from the AUD criteria set, after such studies showed that the “legal trouble” criterion showed to be uninformative regarding an individual’s standing on the latent AUD spectrum relative to the other AUD criteria (Saha et al., 2006). Thus, though the utility of IRT analyses is clear, our results highlight that more work can be done to more broadly examine the generalizability of IRT findings from a given study. One important step to take would be to directly model the factors that may influence estimation variability to potentially increase knowledge about the generalizability of the findings.

Implications for Clinical Assessment More Broadly

Though we have focused on AUD criteria in the present article, we believe that the analytical framework we utilized can also be extended to other clinical constructs to examine important clinical assessment questions. For example, researchers have argued that emotion dysregulation ought to be considered the core feature of borderline personality disorder (BPD; Glenn & Klonsky, 2009). As such, clinicians and researchers are likely to want to ensure that BPD assessment instruments contain emotion dysregulation items/criteria that reliably distinguish between individuals at varying levels of the BPD spectrum. Leveraging the current methods to explore these empirical questions may provide important information for clinicians and researchers seeking to use instruments that reliably assess core features of various disorders. Furthermore, the current methods may be useful to future IRT work on diagnostic criteria given that interest in replicability issues within clinical psychology research continues to grow (Tackett et al., 2017; Tackett et al., 2019).

It is also worthwhile to contrast the present results with previous analyses examining AUD criteria threshold estimates (Lane et al., 2016). Though both the discrimination and threshold analyses found 12 and 13 random effects, respectively, (out of 84; 12 criteria × 7 instruments) that were noteworthy when examining the criteria × instrument interaction, there was only one overlapping effect. For example, in the original meta-analysis of threshold estimates, 5 of the 13 significant effects were found for the AUDADIS while only one was found for the AUDADIS with regard to discrimination estimates, and for a separate criterion. The only overlap observed was for the SAMHSA and the time spent criterion—both threshold and discrimination estimates were lower than in the average study. Taken together, the combined meta-analytic results suggest that both threshold and discrimination estimates are in part a function of the AUD diagnostic instrument used, and the effects of diagnostic instruments span a diverse range of AUD criteria.

Limitations

There are some important limitations in the present study. First, there are other potential sources of variability in discrimination estimates that were not explored. For example, ethnicity has been examined as a source of DIF, with previous work finding evidence of DIF across different ethnic groups (e.g., Saha et al., 2006). However, there was little variability in ethnic makeup in the included samples and other options (e.g., a dichotomous comparison of White vs. non-White samples) would likely mask important differences across ethnic groups (Sue, 2001). An additional source of variability not explicitly accounted for is study quality. Our analyses included individual study sample as a random factor, which spans large-scale epidemiological studies through more idiosyncratic investigations of smaller clinical groups. The analyses also included research group as a random factor, which would capture whether some groups may produce higher quality, or “more significant,” results than others, or in higher impact journals. We did not code factors of study quality, and the variables we coded were not meant to be indicators as such. The present results should be considered with these caveats in mind.

Second, of the parameters included in the expanded models, some showed little variability (e.g., gender and age). Thus, despite their inclusion, the estimates for these parameters should be interpreted with caution. Though our results are generally consistent with DIF across gender and age as reported in previous investigations (e.g., larger/longer, withdrawal; see online supplementary Figure S2), as more data become available, it may be possible to more fully and precisely explore the influence of such factors on discrimination estimates.

Conclusion

There is generally low, but nontrivial consistency in AUD criteria discriminations within individual studies. This may underlie why DIF for various factors across individual criteria has been inconsistent. Nevertheless, averaging across all available data, there appears to be consistent structure in the discriminations of the set of AUD criteria. In other words, when one averages across studies, the AUD criteria do produce a consistent structure with respect to the overall variability observed in discrimination estimates, suggesting consistent rank-order stability in discriminations among the criteria. Thus, at the average level, the criteria are consistent in their abilities to differentiate between persons at varying levels along the AUD spectrum. This represents a silver lining in that an identifiable core structure is recoverable; but our primary analyses illustrate that such results cannot be expected in individual studies. It also ignores the considerable heterogeneity that exists in the latent AUD construct (Lane et al., 2016). We still found evidence of DIF as a function of established factors (e.g., age, gender), providing meta-analytic support on average, but the magnitudes were small in comparison with the effect of instrument. Thus, consideration of the assessment instrument, in addition to more common factors, remains a critical open issue regarding the inferred comparability of AUD diagnosis and severity across individuals and contexts.

Supplemental Material

sj-pdf-1-asm-10.1177_1073191120986613 – Supplemental material for Reliability of Differential Item Functioning in Alcohol Use Disorder: Bayesian Meta-Analysis of Criteria Discrimination Estimates

Supplemental material, sj-pdf-1-asm-10.1177_1073191120986613 for Reliability of Differential Item Functioning in Alcohol Use Disorder: Bayesian Meta-Analysis of Criteria Discrimination Estimates by Colin E. Vize and Sean P. Lane in Assessment

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received the following financial support for the research, authorship, and/or publication of this article: Sean Lane was supported through funding provided by the National Institues of Health (Grant R01AA027264).

ORCID iD

Colin E. Vize

Supplemental Material

Supplemental material for this article is available online.

Notes

References

American Psychiatric Association. (1942). Diagnostic and statistical manual for mental disorders (1st ed.). Author.

American Psychiatric Association. (1952). Diagnostic and statistical manual for mental disorders (2nd ed.). Author.

American Psychiatric Association. (1994). Diagnostic and statistical manual for mental disorders (4th ed.). Author.

American Psychiatric Association. (2013). Diagnostic and statistical manual for mental disorders (5th ed.). Author. https://doi.org/10.1176/appi.books.9780890425596

Balsis

Gleason

M. E.

Woods

C. M.

Oltmanns

T. F.

(2007). An item response theory analysis of DSM-IV personality disorder criteria across younger and older age groups. Psychology and Aging, 22(1), 171-185. https://doi.org/10.1037/0882-7974.22.1.171

*Beseler

C. L.

Taylor

L. A.

Leeman

R. F.

(2010). An item-response theory analysis of DSM-IV alcohol-use disorder criteria and “binge” drinking in undergraduates. Journal of Studies on Alcohol and Drugs, 71(3), 418-423.

*Bond

Cherpitel

C. J.

Borges

Cremonte

Moskalewicz

Swiatkiewicz

(2012). Scaling properties of the combined ICD-10 dependence and harms criteria and comparisons with DSM-5 alcohol use disorder criteria among patients in the emergency department. Journal of Studies on Alcohol and Drugs, 73(2), 328-336.

*Borges

Cherpitel

C. J.

Bond

Cremonte

Moskalewicz

Swiatkiewicz

(2011). Threshold and optimal cut-points for alcohol use disorders among patients in the emergency department. Alcoholism: Clinical & Experimental Research, 35(7), 1270-1276.

Borges

Bond

Cherpitel

C. J.

Cremonte

Moskalewicz

G. S.

Maritza

(2010). The dimensionality of alcohol use disorders and alcohol consumption in a cross-national perspective. Addiction, 105(2), 240-254. https://doi.org/10.1111/j.1360-0443.2009.02778.x

10.

Brennan

R. L.

(1992). Generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27-34.

11.

*Caetano

Vaeth

P. A.

Mills

Canino

(2016). Employment status, depression, drinking, and alcohol use disorder in Puerto Rico. Alcoholism: Clinical and Experimental Research, 40(4), 806-815.

12.

*Casey

Adamson

Shevlin

McKinney

(2012). The role of craving in AUDs: Dimensionality and differential functioning in the DSM-5. Drug and Alcohol Dependence, 125(1-2), 75-80.

13.

*Castaldelli-Maia

J. M.

Wang

Y. P.

Borges

Silveira

C. M.

Siu

E. R.

Viana

M. C.

Andrade

A. G.

Martins

S. S.

Andrade

L. H.

(2015). Investigating dimensionality and measurement bias of DSM-5 alcohol use disorder in a representative sample of the largest metropolitan area in South America. Drug and alcohol dependence, 152, 123-130.

14.

*Cherpitel

C. J.

Borges

Bond

Cremonte

Moskalewicz

Swiatkiewicz

(2010). Performance of a craving criterion in DSM alcohol use disorders. Journal of Studies on Alcohol and Drugs, 71(5), 674-684.

15.

Chung

Martin

C. S.

(2002). Concurrent and discriminant validity of DSM-IV symptoms of impaired control over alcohol consumption in adolescents. Alcoholism: Clinical & Experimental Research, 26(4), 485-492. https://doi.org/10.1111/j.1530-0277.2002.tb02565.x

16.

Cooper

L. D.

Balsis

(2009). When less is more: How fewer diagnostic criteria can indicate greater severity. Psychological Assessment, 21(3), 285–293.

17.

*Dawson

D. A.

Saha

T. D.

Grant

B. F.

(2010). A multidimensional assessment of the validity and utility of alcohol use disorder severity as determined by item response theory models. Drug and Alcohol Dependence, 107(1), 31-38.

18.

*Derringer

Krueger

R. F.

Dick

D. M.

Agrawal

Bucholz

K. K.

Foroud

Grucza

R. A.

Hesselbrock

M. N.

Hesselbrock

Kramer

Nurnberger

J. I.

Jr. Schuckit

Bierut

L. J.

Iacono

W. G.

McGue

(2013). Measurement invariance of DSM-IV alcohol, marijuana and cocaine dependence between community-sampled and clinically over-selected studies. Addiction, 108(10), 1767-1776.

19.

*Duncan

A. E.

Agrawal

Bucholz

K. K.

Sartor

C. E.

Madden

P. A.

Heath

A. C.

(2011). Deconstructing the architecture of alcohol abuse and dependence symptoms in a community sample of late adolescent and emerging adult women: An item response approach. Drug and Alcohol Dependence, 116(1-3), 222-227.

20.

DuMouchel

(1990). Bayesian meta-analysis. Statistical Methodology in the Pharmaceutical Sciences, 10(4), 509-529. https://doi.org/10.1177/096228020101000404

21.

*Edwards

A. C.

Gillespie

N. A.

Aggen

S. H.

Kendler

K. S.

(2013). Assessment of a modified DSM-5 diagnosis of alcohol use disorder in a genetically informative population. Alcoholism: Clinical and Experimental Research, 37(3), 443-451.

22.

*Ehlke

S. J.

Hagman

B. T.

Cohn

A. M.

(2012). Modeling the dimensionality of DSM-IV alcohol use disorder criteria in a nationally representative sample of college students. Substance Use & Misuse, 47(10), 1073-1085.

23.

Fernádez-i-Marín

(2016). ggmcmc: Analysis of MCMC samples and Bayesian inference. Journal of Statistical Software, 70(9), 1-20. https://doi.org/10.18637/jss.v070.i09

24.

*Gelhorn

Hartman

Sakai

Stallings

Young

Rhee

Corley

Hewitt

Hopfer

Crowley

(2008). Toward DSM-V: An item response theory analysis of the diagnostic process for DSM-IV alcohol abuse and dependence in adolescents. Journal of the American Academy of Child & Adolescent Psychiatry, 47(11), 1329-1339.

25.

*Gilder

D. A.

Gizer

I. R.

Ehlers

C. L.

(2011). Item response theory analysis of binge drinking and its relationship to lifetime alcohol use disorder symptom severity in an American Indian community sample. Alcoholism: Clinical & Experimental Research, 35(5), 984-995.

26.

Glenn

C. R.

Klonsky

E. D.

(2009). Emotion dysregulation as a core feature of borderline personality disorder. Journal of Personality Disorders, 23(1), 20-28.

27.

*Hagman

B. T.

(2017). Development and psychometric analysis of the Brief DSM–5 Alcohol Use Disorder Diagnostic Assessment: Towards effective diagnosis in college students. Psychology of Addictive Behaviors, 31(7), 797–806. https://doi.org/10.1037/adb0000320

28.

*Hagman

B.T.

Cohn

A.M.

(2011). Toward DSM-V: Mapping the alcohol use disorder continuum in college students. Drug and Alcohol Dependence, 118(2-3), 202–208.

29.

*Hagman

B. T.

Cohn

A. M.

(2013). Using latent variable techniques to understand DSM-IV alcohol use disorder criteria functioning. American Journal of Health Behavior, 37(4), 565–574.

30.

*Harford

T. C.

Hsiao-ye

Faden

V. B.

Chen

C. M.

(2009). The dimensionality of DSM-IV alcohol use disorders among adolescents and adult drinkers and symptom patterns by age, gender, and race/ethnicity. Alcoholism: Clinical & Experimental Research, 33(5), 868-878. https://doi.org/10.1111/j.1530-0277.2009.00910.x

31.

*Hasin

D. S.

Fenton

M. C.

Beseler

Park

J. Y.

Wall

M. M.

(2012). Analyses related to the development of DSM-5 criteria for substance use related disorders: 2. Proposed DSM-5 criteria for alcohol, cannabis, cocaine and heroin disorders in 663 substance abuse patients. Drug and Alcohol Dependence, 122(1-2), 28-37.

32.

Hasin

D. S.

O’Brien

C. P.

Auriacombe

Borges

Bucholz

Budney

Compton

W. M.

Crowley

Ling

Petry

N. M.

Schuckit

Grant

B. F.

(2013). DSM-5 criteria for substance use disorders: Recommendations and rationale. American Journal of Psychiatry, 170(8), 834-851. https://doi.org/10.1176/appi.ajp.2013.12060782

33.

*Keyes

K. M.

Krueger

R. F.

Grant

B. F.

Hasin

D. S.

(2011). Alcohol craving and the dimensionality of alcohol disorders. Psychological Medicine, 41(3), 629-640.

34.

Kruschke

J. K.

Liddell

T. M.

(2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and planning from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1), 178-206. https://doi.org/10.3758/s13423-016-1221-4

35.

*Kuerbis

A.N.

Hagman

B.T.

Morgenstern

(2013a). Alcohol use disorders among substance dependent women on temporary assistance with needy families: More information for diagnostic modifications for DSM-5. The American Journal on Addictions, 22(4), 402-410.

36.

*Kuerbis

A. N.

Hagman

B. T.

Sacco

(2013b). Functioning of alcohol use disorders criteria among middle-aged and older adults: Implications for DSM-5. Substance Use & Misuse, 48(4), 309-322.

37.

Lane

S. P.

Steinley

Sher

K. J.

(2016). Meta-analysis of DSM alcohol use disorder criteria severities: Structural consistency is only “skin deep.” Psychological Medicine, 46(8), 1769-1784. https://doi.org/10.1017/S0033291716000404

38.

*Lane

S. P.

Sher

K. J.

(2015). Limits of current approaches to diagnosis severity based on criterion counts: An example with DSM-5 alcohol use disorder. Clinical Psychological Science, 3(6), 819-835.

39.

*Langenbucher

J. W.

Labouvie

Martin

C. S.

Sanjuan

P. M.

Bavly

Kirisci

Chung

(2004). An application of item response theory analysis to alcohol, cannabis, and cocaine criteria in DSM-IV. Journal of Abnormal Psychology, 113(1), 72-80.

40.

*Martin

C. S.

Chung

Kirisci

Langenbucher

J. W.

(2006). Item response theory analysis of diagnostic criteria for alcohol and cannabis use disorders in adolescents: Implications for DSM-V. Journal of Abnormal Psychology, 115(4), 807-814. https://doi.org/10.1037/0021-843X.115.4.807

41.

Martin

C. S.

Langenbucher

J. W.

Chung

Kenneth

Louis

(2014). Truth or consequences in the diagnosis of substance use disorders. Addiction, 109(11), 1773-1778. https://doi.org/10.1111/add.12615

42.

Martin

C. S.

Sher

K. J.

Chung

(2011). Hazardous use should not be a diagnostic criterion for substance use disorders in DSM-5. Journal of Studies on Alcohol and Drugs, 72(4), 685-686. https://doi.org/10.15288/jsad.2011.72.685

43.

*McCutcheon

V. V.

Agrawal

Heath

A. C.

Edenberg

H. J.

Hesselbrock

V. M.

Schuckit

M. A.

Kramer

J. R.

Bucholz

K. K.

(2011). Functioning of alcohol use disorder criteria among men and women with arrests for driving under the influence of alcohol. Alcoholism: Clinical & Experimental Research, 35(11), 1985-1993.

44.

*Mewton

Slade

McBride

Grove

Teesson

(2011a). An evaluation of the proposed SM-5 alcohol use disorder criteria using Australian national data. Addiction, 106(5), 941-950.

45.

*Mewton

Teesson

Slade

Cottler

(2011b). Psychometric performance of DSM-IV alcohol use disorders in young adulthood: Evidence from an Australian general population sample. Journal of Studies on Alcohol and Drugs, 72(5), 811-822.

46.

Moher

Liberati

Tetzlaff

Altman

D. G.

Group

T. P.

(2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLOS MEDICINE, 6(7), e1000097. https://doi.org/10.1371/journal.pmed.1000097

47.

O’Brien

(2011). Addiction and dependence in DSM-V. Addiction, 106(5), 866-867. https://doi.org/10.1111/j.1360-0443.2010.03144.x

48.

*Preuss

U. W.

Watzke

Wurst

F. M.

(2014). Dimensionality and stages of severity of DSM-5 criteria in an international sample of alcohol-consuming individuals. Psychological Medicine, 44(15), 3303-3314.

49.

*Proudfoot

Baillie

A. J.

Teeson

(2006). The structure of alcohol dependence in the community. Drug and Alcohol Dependence, 81(1), 21-26.

50.

R Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/

51.

Reise

S. P.

Waller

N. G.

(2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.

52.

*Rose

J. S.

Lee

C. T.

Selya

A. S.

Dierker

L. C.

(2012). DSM-IV alcohol abuse and dependence criteria characteristics for recent onset adolescent drinkers. Drug and Alcohol Dependence, 124(1-2), 88-94.

53.

*Saha

T. D.

Chou

S. P.

Grant

B. F.

(2006). Toward an alcohol use disorder continuum using item response theory: Results from the National Epidemiologic Survey on alcohol and related conditions. Psychological Medicine, 36(7), 931-941. https://doi.org/10.1017/S003329170600746X

54.

*Saha

T. D.

Stinson

F. S.

Grant

B. F.

(2007). The role of alcohol consumption in future classifications of alcohol use disorders. Drug and Alcohol Dependence, 89(1), 82-92.

55.

SAS Institute Inc. (2013). SAS/STAT User’s Guide, Version 9.4. Cary, NC: SAS Institute Inc.

56.

Sayette

M. A.

(2016). The role of craving in substance use disorders: Theoretical and methodological issues. Annual Review of Clinical Psychology, 12, 407-433. https://doi.org/10.1146/annurev-clinpsy-021815-093351

57.

*Shmulewitz

Keyes

K.M.

Beseler

Aharonovich

Aivadyan

Spivak

Hasin

(2010). The dimensionality of alcohol use disorders: Results from Israel. Drug and Alcohol Dependence, 111(1-2), 146-154.

58.

Shrout

P. E.

Fleiss

J. L.

(1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428. https://doi.org/10.1037/0033-2909.86.2.420

59.

*Srisurapanont

Kittiratanapaiboon

Likhitsathian

Kongsuk

Suttajit

Junsirimongkol

(2012). Patterns of alcohol dependence in Thai drinkers: A differential item functioning analysis of gender and age bias. Addictive Behaviors, 37(12), 173-178.

60.

Sue

D. W.

(2001). Multidimensional facets of cultural competence. The Counseling Psychologist, 29(6), 790-821. https://doi.org/10.1177/0011000001296002

61.

Tackett

J. L.

Brandes

C. M.

King

K. M.

Markon

K. E.

(2019). Psychology’s replication crisis and clinical psychological science. Annual Review of Clinical Psychology, 15, 579-604. https://doi.org/10.1146/annurev-clinpsy-050718-095710

62.

Tackett

J. L.

Lilienfeld

S. O.

Patrick

C. J.

Johnson

S. L.

Krueger

R. F.

Miller

J. D.

Oltmanns

T. F.

Shrout

P. E.

(2017). It’s time to broaden the replicability conversation: Thoughts for and from clinical psychological science. Perspectives on Psychological Science, 12(5), 742-756. https://doi.org/10.1177/1745691617690042

63.

Uebelacker

L. A.

Strong

Weinstock

L. M.

Miller

I. W.

(2009). Use of item response theory to understand differential functioning of DSM-IV major depression symptoms by race, ethnicity and gender. Psychological medicine, 39(4), 591-601.

64.

Wagenmakers

E. J.

Lodewyckx

Kuriyal

Grasman

(2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage-Dickey method. Cognitive Psychology, 60(3), 158-189. https://doi.org/10.1016/j.cogpsych.2009.12.001

65.

Wagenmakers

E. J.

Morey

R. D.

Lee

M. D.

(2016). Bayesian benefits for the pragmatic researcher. Current Directions in Psychological Science, 25(3), 169-176. https://doi.org/10.1177/0963721416643289

66.

Watts

A. L.

Lane

S. P.

Bonifay

Steinley

Meyer

F. A. C.

(2020). Building theories on top of, and not independent of, statistical models: The case of the p-factor. Psychological Inquiry. https://doi.org/10.31234/osf.io/3vsey

67.

Wickham

(2009). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-0-387-98141-3

68.

Witkiewitz

Hallgren

K. A.

O’Sickey

A. J.

Roos

C. R.

Maisto

S. A.

(2016). Reproducibility and differential item functioning of the alcohol dependence syndrome construct across four alcohol treatment studies: An integrative data analysis. Drug and Alcohol Dependence, 158, 86-9.

69.

*Wu

L. T.

Blazer

D. G.

Woody

G. E.

Burchett

Yang

Pan

J. J.

Ling

(2012). Alcohol and drug dependence symptom items as brief screeners for substance use disorders: Results from the Clinical Trials Network. Journal of Psychiatric Research, 46(3), 360-369.

70.

*Wu

L. T.

Pan

J. J.

Blazer

D. G.

Tau

Stitzer

M. L.

Brooner

R. K.

Woody

G. E.

Patkar

A. A.

Blaine

J. D.

(2009). An item response theory modeling of alcohol and marijuana dependences: A National Drug Abuse Treatment Clinical Trials Network study. Journal of Studies on Alcohol and Drugs, 70, 414–425.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.58 MB

Reliability of Differential Item Functioning in Alcohol Use Disorder: Bayesian Meta-Analysis of Criteria Discrimination Estimates

Abstract

Keywords

IRT Analyses of AUD Criteria

The Current Study

Method 2

Literature Search

Coded Information

Analyses

Results

Influence of σ2CRITERION Prior Distributions

Discussion

Implications for Clinical Assessment of AUD

Implications for Clinical Assessment More Broadly

Limitations

Conclusion

Supplemental Material

sj-pdf-1-asm-10.1177_1073191120986613 – Supplemental material for Reliability of Differential Item Functioning in Alcohol Use Disorder: Bayesian Meta-Analysis of Criteria Discrimination Estimates

Footnotes

Declaration of Conflicting Interests

Funding

ORCID iD

Supplemental Material

Notes

References

Supplementary Material

Method²

Influence of σ²_CRITERION Prior Distributions