Bayesian Approaches to Assessing the Parallel Lines Assumption in Cumulative Ordered Logit Models

Abstract

We first review existing literature on cumulative logit models along with various ways to test the parallel lines assumption. Building on the traditional frequentist framework, we introduce a method of Bayesian assessment of null values to provide an alternative way to examine the parallel lines assumption using highest density intervals and regions of practical equivalence. Second, we propose a new hyperparameter cumulative logit model that can improve upon existing ones in addressing several challenges where traditional modeling techniques fail. We use two empirical examples from health research to showcase the Bayesian approaches.

Keywords

Bayesian, parallel lines assumption, ordered regression models

Social science and health researchers often work with ordinal measures as response variables and thus routinely rely on ordered regression models for analyses (Agresti 2010; Cox 1995; Fullerton 2009; Fullerton and Xu 2016; Greene and Hensher 2012; Imai, King, and Lau 2008; Long 1997; Powers and Xie 2009; Williams 2006; Winship and Mare 1984). Among the many forms of ordered regression models, the parallel cumulative logit model is perhaps the most commonly used—it is the default model for ordered responses in most statistical software and receives extensive treatment in statistics textbooks (Agresti 2010; Greene and Hensher 2012; McCullagh and Nelder 1989). A key assumption of the parallel cumulative logit model is the parallel lines (PL; or proportional odds) assumption in which coefficients are constrained to be equal across cut-point equations.

Statistical texts recommend testing the PL assumption and, in the event of a significant test, relaxing the PL assumption for a subset or all of the variables in a model by allowing their coefficients to vary (either freely or by a proportionality constant) across cut-point equations (Agresti 2010; Fullerton and Xu 2016; Greene and Hensher 2012; Long 1997; Williams 2006, 2016). In practice, statistical tests, particularly omnibus tests for all of the covariates, often indicate that the PL assumption is violated. Allowing coefficients to vary, however, comes at a substantial cost to parsimony. This cost in parsimony may be necessary when coefficients meaningfully vary across cut-point equations, but the standard tests for the PL assumption may reject the assumption in cases where the variation in coefficients is of little practical consequence.

Bayesian approaches to fitting ordered logit models and evaluating the PL assumption can help researchers better evaluate the trade-offs between relaxing the PL assumption and maintaining model parsimony. This article outlines two Bayesian approaches to tackling the PL assumption. The first Bayesian approach draws on advantages of using highest density intervals (HDIs) and regions of practical equivalence (ROPEs; Kruschke 2015; Kruschke and Liddell 2017) to examine whether differences in coefficients across cut-point equations fall within acceptable intervals as opposed to on a point, such as zero. The second Bayesian approach simply views the PL assumption as unrealistic and specifies a Bayesian hyperparameter cumulative logit model that allows for variation in coefficients across cut-point equations but maintains a higher order degree of homogeneity in that variation. Taken together, these two Bayesian approaches provide a flexible alternative to deal with the PL assumption in ordered logit models.

The Classical Parallel Cumulative Logit Model

The classical parallel cumulative logit model, also known as the proportional odds model, is built upon binary regression models that first emerged in the early 1930s and 1940s (Bliss 1934a, 1934b; Finney 1947). A series of papers from the 1940s to 1970s established important theoretical properties and the empirical foundations for applying the parallel cumulative logit model (Finney 1947; McKelvey and Zavoina 1975). It was not until the 1980s and the work of McCullagh (1980) and Winship and Mare (1984), however, that this model became well-established in the toolbox of social science practitioners.

The parallel cumulative logit model can be derived using a latent variable approach (Greene and Hensher 2012; Long 1997; McCullagh and Nelder 1989; Powers and Xie 2009). Suppose we have a latent continuous variable, $y^{*}$ , with a structural model, $y^{*} = x β + ε$ , linking a matrix of observed predictors, x, and a vector of structural parameters, $β$ , to $y^{*}$ , the latent continuous variable. Instead of directly measuring the latent continuous variable, we observe an ordinal response variable with values $y = 1, 2, ..., l, ..., L$ , and it is safe to assume that $y = l \Leftrightarrow τ_{l - 1} < y^{*} < τ_{l}$ , in which $τ$ 's are unknown cut-point (auxiliary) parameters to be estimated. The probability that the observed variable y equals a specific value l conditional on x is given by:

\begin{array}{l} P (y = l | x) \\ = P (y^{*} < τ_{l}) - P (y^{*} < τ_{l - 1}) \\ = F (τ_{l} - x β) - F (τ_{l - 1} - x β), \end{array}

where $τ_{0}$ and $τ_{L}$ can be set to $- \infty$ and $+ \infty$ , respectively, such that equations hold consistently across different cut points; $F (\cdot)$ is a generic cumulative distribution function. For the parallel cumulative logit model, when the standard logistic distribution is used, we have

\begin{array}{l} F (τ_{l} - x β) = Λ (τ_{l} - x β) \\ = exp (τ_{l} - x β) / (1 + exp (τ_{l} - x β)) \\ = 1 / (1 + exp (x β - τ_{l})) . \end{array}

Other distribution functions are also possible, such as the probit (standard normal) or complementary log-log. Since the standard logistic distribution is the most widely used function for an ordered regression model, we focus on logit models in this article, but note that most of our discussions also pertain to models using alternative functions.
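As a minimal sketch of the category probabilities defined above, the following computes $P (y = l | x)$ from a set of cut points and a linear predictor under the logit link. The function names and all numeric values are illustrative assumptions and are not taken from the empirical examples.

```python
import math

def logistic_cdf(z):
    """Standard logistic CDF, Lambda(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(x, beta, cutpoints):
    """P(y = l | x) for l = 1..L under the parallel cumulative logit model.

    cutpoints holds tau_1 < ... < tau_{L-1}; tau_0 = -inf and tau_L = +inf
    are handled by fixing the cumulative probabilities at 0 and 1.
    """
    xb = sum(xk * bk for xk, bk in zip(x, beta))
    cum = [0.0] + [logistic_cdf(t - xb) for t in cutpoints] + [1.0]
    return [cum[l] - cum[l - 1] for l in range(1, len(cum))]

# Hypothetical values: two predictors, four response categories.
probs = category_probs(x=[1.0, 0.5], beta=[0.4, -0.2], cutpoints=[-1.0, 0.5, 2.0])
print([round(p, 3) for p in probs])
```

Because the same $β$ enters every cumulative probability, the differences between adjacent cumulative probabilities are positive whenever the cut points are ordered.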

The parallel cumulative logit model is also known as the proportional odds model because if we have two x vectors, x₁ and x₂, then the two cumulative odds, $P (y \leq l | x_{1}) / (1 - P (y \leq l | x_{1}))$ and $P (y \leq l | x_{2}) / (1 - P (y \leq l | x_{2}))$ , are proportional to one another by a factor of

\frac{P (y \leq l | x_{1}) / (1 - P (y \leq l | x_{1}))}{P (y \leq l | x_{2}) / (1 - P (y \leq l | x_{2}))} = exp ((x_{2} - x_{1}) β),

based on equations (1) and (2) (McCullagh and Nelder 1989). In equation (3), $β$ stays the same regardless of the value of l. Graphically, imposing the proportional odds assumption in ordered regression models forces the distribution curves (e.g., $P (y \leq l)$ ) to shift in parallel, hence the alternative label, the “parallel lines” assumption. This is the more generic term for ordered regression models that have the same slope vector across different equations, be it cut-point, stage, or adjacent odds (Fullerton and Xu 2016; see Figure 1).

Figure 1.

Ordered regression models with the parallel line assumption.
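The proportionality property can be checked numerically. The sketch below, which uses hypothetical slopes, cut points, and covariate vectors, computes the ratio of cumulative odds at every cut point and compares it with $exp ((x_{2} - x_{1}) β)$.

```python
import math

def cumulative_odds(x, beta, tau):
    """Odds of y <= l versus y > l under the parallel cumulative logit model."""
    xb = sum(xk * bk for xk, bk in zip(x, beta))
    p = 1.0 / (1.0 + math.exp(xb - tau))
    return p / (1.0 - p)

beta = [0.4, -0.2]            # hypothetical slopes
cutpoints = [-1.0, 0.5, 2.0]  # hypothetical tau_1..tau_3
x1, x2 = [1.0, 0.5], [2.0, -1.0]

# The ratio of odds should not depend on which cut point is used.
ratios = [cumulative_odds(x1, beta, t) / cumulative_odds(x2, beta, t) for t in cutpoints]
expected = math.exp(sum((b - a) * bk for a, b, bk in zip(x1, x2, beta)))
print(ratios, expected)
```

The ratio is the same at each cut point because the cut-point parameter cancels, which is exactly what the PL assumption imposes.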

The parallel cumulative logit regression provides a parsimonious model for ordered responses. The PL assumption that underlies it, however, is strong and hardly realistic, and thus is often violated in practice. The PL assumption is typically tested for one coefficient or a subset of coefficients (e.g., a group of dummy variables for a nominal-level variable) at a time, or for all coefficients simultaneously (i.e., an omnibus test), with the null hypothesis H₀: $β_{l k} = β_{k}$ for regressor k and cut point l. For a single predictor, the deviation of just one coefficient from the others can be sufficient evidence to reject the null hypothesis.

Nonparallel and Partial Cumulative Logit Models

When the PL assumption is violated, researchers may turn to the nonparallel cumulative ordered regression model (also known as the generalized ordered logit model). Instead of having the same set of coefficients across cut-point equations, the nonparallel cumulative ordered regression model allows the coefficients to vary freely. In place of equation (1), we have

P (y = l | x) = Λ (τ_{l} - x β_{l}) - Λ (τ_{l - 1} - x β_{l - 1}),

where $β_{l}$ denotes a level-specific vector of coefficients that vary across different cut-point equations. Compared with the parallel cumulative logit model, the nonparallel cumulative logit model lies at the other end of the spectrum of model complexity. For an ordered response with L levels and K predictors, the nonparallel cumulative logit model has (L − 1) × (K + 1) coefficients to be estimated as compared with K + (L − 1) coefficients in the parallel cumulative logit model. Usually, the more parameters there are in a model the better the predictive power, but they come at the expense of parsimony and run the risk of having negative predicted probabilities (Fullerton 2009; Greene and Hensher 2012; Williams 2006, 2016).
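The parameter counts and the risk of negative predicted probabilities can be illustrated with a short sketch. The slopes and cut points below are hypothetical and deliberately extreme; the counts use L = 4 and K = 7, which match the dimensions of the self-rated health example.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def n_parameters(L, K, parallel):
    """Parameter counts as given in the text: cut points plus slopes."""
    return K + (L - 1) if parallel else (L - 1) * (K + 1)

def nonparallel_probs(x, betas, cutpoints):
    """P(y = l | x) with a separate slope vector for each cut-point equation."""
    cum = [0.0]
    for tau, beta in zip(cutpoints, betas):
        xb = sum(xk * bk for xk, bk in zip(x, beta))
        cum.append(logistic_cdf(tau - xb))
    cum.append(1.0)
    return [cum[l] - cum[l - 1] for l in range(1, len(cum))]

# Hypothetical single predictor whose slope differs sharply across equations.
betas = [[-1.0], [1.0]]
cutpoints = [0.0, 0.5]
probs_at_1 = nonparallel_probs([1.0], betas, cutpoints)
print(n_parameters(L=4, K=7, parallel=True), n_parameters(L=4, K=7, parallel=False))
print(probs_at_1)
```

When the cumulative curves cross, as they do here at x = 1, the implied probability of the middle category is negative even though the probabilities still sum to one.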

The partial cumulative logit model compromises between the parallel and nonparallel models by setting some of the coefficients to vary across equations and others to remain constant (see Fullerton [2009] and Fullerton and Xu [2016] for additional alternative model specifications). The model is given by:

P (y = l | x) = Λ (τ_{l} - x β_{l} - z γ) - Λ (τ_{l - 1} - x β_{l - 1} - z γ),

where $β_{l}$ is a vector of coefficients that vary freely and $γ$ is a vector of coefficients held constant across cut-point equations. Researchers typically rely on post-hoc tests, such as the Brant test or stepwise Wald tests, to decide on the specification of a partial cumulative logit model (Fullerton and Xu 2016; Williams 2006). Since the choice of a partial model over its alternatives is often ad hoc and usually lacks solid theoretical grounding, this article focuses on the parallel and nonparallel models only, without loss of generality.

Testing the Parallel Lines Assumption

A number of statistical tests have been developed to test the PL assumption. The most well-known include the Brant, Wald, likelihood ratio (LR), and score (or Rao or Lagrange multiplier) tests (Brant 1990; Buse 1982; Engle 1984; Fullerton and Xu 2016; Greene and Hensher 2012; Long 1997; Powers and Xie 2009). To summarize, the Wald test compares parameter estimates from the full model with those from a restricted (null) one, normalized by the standard errors of the estimates from the full model. For a set of constraints imposed on the parameters, such as $H_{0} : T β = r$ , the Wald test statistic is computed as follows:

W = (T {\hat{β}}_{F} - r) ′ {(T \hat{V} ({\hat{β}}_{F}) T ′)}^{- 1} (T {\hat{β}}_{F} - r) ∼ χ_{Q}^{2},

where ${\hat{β}}_{F}$ is a vector of maximum likelihood (ML) estimates from the full model; T is a transformation matrix; r is a vector of constants defining the Q independent constraints; and $\hat{V} ({\hat{β}}_{F})$ is the estimated covariance matrix of the ML estimates, ${\hat{β}}_{F}$ (see Buse 1982, Long 1997, and Powers and Xie 2009 for more accessible details). The Brant test is a simple extension and loose form of the Wald test, designed specifically to test the PL assumption. The Brant test compares coefficients from a parallel cumulative logit model with those from a set of $L - 1$ separate binary regressions, where the binary response variables are coded as 1 if $y < l$ and 0 otherwise, and then constructs a $χ^{2}$ test statistic accordingly (see Brant [1990] for technical details and Long and Freese [1997] as well as Fullerton and Xu [2016] for more accessible information).
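For a single constraint (Q = 1), the Wald statistic reduces to a scalar expression, which the sketch below computes for the hypothesis that one predictor's slope is equal in two cut-point equations. The estimates and covariance matrix are hypothetical, and the function name is an assumption made for illustration.

```python
import math

def wald_single_constraint(t, beta_hat, cov, r=0.0):
    """Wald statistic and p-value for one linear constraint t'beta = r (Q = 1).

    With a single constraint, T V T' is a scalar, so no matrix inverse is needed,
    and the chi-square(1) tail probability equals erfc(sqrt(W / 2)).
    """
    k = len(beta_hat)
    estimate = sum(t[i] * beta_hat[i] for i in range(k)) - r
    variance = sum(t[i] * cov[i][j] * t[j] for i in range(k) for j in range(k))
    w = estimate ** 2 / variance
    return w, math.erfc(math.sqrt(w / 2.0))

# Hypothetical estimates of one predictor's slope in two cut-point equations.
beta_hat = [-0.05, -0.13]
cov = [[0.0006, 0.0003],
       [0.0003, 0.0006]]
w, p_value = wald_single_constraint([1.0, -1.0], beta_hat, cov)
print(round(w, 3), round(p_value, 4))
```

The contrast vector [1, -1] plays the role of T, so the statistic is the squared difference in slopes divided by the variance of that difference. Tests with Q > 1 would require a matrix inverse, which this sketch omits.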

The LR test, on the other hand, compares the log-likelihoods from the full and restricted models. The test statistic has the following form:

LR = 2 ({LL}_{F} - {LL}_{R}) \sim χ_{Q}^{2},

where ${LL}_{F}$ and ${LL}_{R}$ are the log-likelihoods for the full and restricted models, respectively. Under the null hypothesis, the test statistic asymptotically follows a χ² distribution with Q degrees of freedom, where Q is the number of constraints.
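A minimal sketch of the LR computation follows. The log-likelihood values are hypothetical; Q = 14 is chosen because it would correspond to freeing seven slopes across three cut-point equations (24 versus 10 parameters). To avoid external libraries, the tail probability uses a closed form that holds only for even Q, which is a limitation of this sketch rather than of the test.

```python
import math

def lr_test(ll_full, ll_restricted, q):
    """LR statistic 2(LL_F - LL_R) and its chi-square(q) p-value.

    The tail probability uses the closed form available for even q only:
    exp(-x/2) * sum_{j < q/2} (x/2)^j / j!.
    """
    if q <= 0 or q % 2 != 0:
        raise ValueError("this sketch handles positive even q only")
    lr = 2.0 * (ll_full - ll_restricted)
    half = lr / 2.0
    tail = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(q // 2))
    return lr, tail

# Hypothetical log-likelihoods for the nonparallel (full) and parallel (restricted) models.
lr, p_value = lr_test(ll_full=-1480.0, ll_restricted=-1495.0, q=14)
print(lr, round(p_value, 4))
```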

The score test is based on the gradient of the log-likelihood function of the restricted model normalized by its variance, and the test statistic is constructed as:

g ({\hat{β}}_{R}) ′ I {({\hat{β}}_{R})}^{- 1} g ({\hat{β}}_{R}) \sim χ_{Q}^{2},

where $g ({\hat{β}}_{R})$ is the score function (gradient) and $I ({\hat{β}}_{R})$ is the negative expectation of the Hessian matrix of the log-likelihood (or Fisher’s information matrix), all estimated from the restricted model.

In this context, the full model may correspond to the nonparallel cumulative logit model in which the PL assumption is relaxed for all predictors, and a restricted model could refer to any restricted version of the full model, such as a partial or a parallel cumulative logit model where the PL assumption holds for some or all predictors. Although the three tests are asymptotically equivalent, they have somewhat different properties in finite samples, and their results can diverge (see Fullerton and Xu 2016:132, for additional details and a comparison of the tests).

In addition to these classical null hypothesis significance testing (NHST) procedures, it is possible to compare the fit of models that maintain, relax, or partially relax the PL assumption using an array of R² measures (e.g., McFadden’s pseudo R², count R², and Lacy’s R²) or model selection criteria (e.g., the Akaike information criterion [AIC] or the Bayesian information criterion [BIC]; Akaike 1974; Fullerton and Xu 2016; Greene and Hensher 2012; Lacy 2006; McFadden 1973; Raftery 1995; Schwarz 1978). In practice, however, given the lack of statistical distributions or agreement about which, if any, of the R² measures is best, researchers typically rely on one of the previously mentioned statistical tests for assessing the PL assumption.

Limitations with NHST of the PL Assumption

As Meehl (1967, 1997), Edwards and Berry (2010), and Kruschke (2017) cogently argue, null hypotheses in the social sciences are often false a priori, so the likelihood of rejecting them is generally high by design. When it comes to the PL assumption, first and foremost, the idea (or hypothesis) that the slopes for all predictors are equal across all cut-point equations is simply not realistic. In other words, it does not take a statistical test to realize that satisfying all of these conditions simultaneously is hardly possible. Second, if the coefficients for one or multiple predictors are statistically different based on an omnibus or semi-omnibus test, and we want to detect the exact source of that difference, we may have to conduct a series of tests or some form of simultaneous hypothesis testing; the former may increase the likelihood of a false positive, and the latter needs some form of adjustment or correction to be valid, which in many cases can be overly conservative (e.g., the Bonferroni correction). Third, when testing the equivalence of two slopes across equations indexed by l and m for a predictor $x_{k}$ , such as $H_{0} : β_{l k} = β_{m k}$ , conceptually we do not allow even a minor deviation of the difference from zero. Practically, however, statistically significant differences can be empirically trivial, producing no substantial difference in most post-estimation analyses such as predictions of, or effects on, the ordinal response. The classical NHST framework has a few other problems as well. For example, we can only fail to reject, but never accept, the null hypothesis (Gelman, Hill, and Yajima 2012; Kruschke 2011, 2013, 2015); and what NHST gives us is the probability of observing the data given a true null, $P (D | H_{0})$ , whereas what we really want is the probability of the null given the data, $P (H_{0} | D)$ (Fullerton and Xu 2016; Kruschke and Liddell 2017).

Bayesian Approach to Cumulative Logit Models

A Bayesian approach to fitting cumulative logit models helps address some of the limitations of traditional methods to assess the PL assumption discussed previously. In this section, we specify the Bayesian parallel and nonparallel cumulative logit models, illustrate how to use both HDIs and ROPEs in evaluating the PL assumption, and propose a new Bayesian cumulative logit model with hyperparameters that provides an alternative approach to addressing variation in the coefficients across cut-point equations.

The fundamental equation within a Bayesian framework is given by:

p (θ | D) \propto L (D | θ) p (θ),

where $p (θ | D)$ is the posterior distribution of the parameter vector $θ$ , $L (D | θ)$ is the likelihood given θ, and p(θ) is the prior probability distribution of θ. Although in some cases closed form solutions (e.g., using conjugate priors) for the posterior distribution are available, typically researchers rely on various numerical simulation methods to obtain the posterior distribution (Gelman et al. 2014; Kruschke 2015). Once the posterior is obtained, one can make probabilistic statements about the parameters of interest, since the posterior captures the full distribution of the parameters (Kruschke 2015).

A Bayesian parallel cumulative logit model can be specified as follows. First the likelihood conditional on the model parameters is given by:

L (y | β, τ_{l}) = \prod_{i = 1}^{N} (Λ (τ_{l} - x_{i} β) - Λ (τ_{l - 1} - x_{i} β)),

in which $β$ is a vector of coefficients that do not vary across the L − 1 equations, $τ_{l}$ is the cut-point parameter for equation l, and $Λ$ is the standard logistic distribution function, as in the typical frequentist framework; the product runs over the N observations, with l denoting the category observed for each. Many options for priors are available, but the most popular and well-accepted strategy with Bayesian ordered logit models, as with most regression-type models, is to assign weakly informative priors to both coefficients and scales such that unreasonable parameter values are excluded, but the priors are not so strong as to rule out even remotely sensible values (Gelman 2006; Seaman, Seaman, and Stamey 2012; Stan Development Team 2017). Following such advice, we use normal priors with variances of 10, with means of 0 for the coefficients and $l + 0.5$ for the cut points,

\begin{array}{l} p (β) \sim N (0, 10) \forall β \\ p (τ_{l}) \sim N (l + 0.5, 10) \forall τ_{l} . \end{array}

As usual, the posterior distribution of the model parameters is given by:

p (β, τ_{l} | y) \propto L (y | β, τ_{l}) p (β, τ_{l}),

where $p (β, τ_{l}) = p (β) p (τ_{l})$ . A Bayesian nonparallel cumulative logit model is a straightforward extension that permits β’s to vary across cut-point equations, so for the likelihood we have

L (y | β_{l}, τ_{l}) = \prod_{i = 1}^{N} (Λ (τ_{l} - x_{i} β_{l}) - Λ (τ_{l - 1} - x_{i} β_{l - 1})),

and the prior distributions for $β_{l}$ 's across all cut-point equations are set to be

p (β_{l}) \sim N (0, 10) \forall β_{l},

with everything else remaining the same as in the parallel cumulative logit model.
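To make the parallel specification concrete, the sketch below evaluates the unnormalized log posterior of the Bayesian parallel cumulative logit model as the log-likelihood plus the log-priors, using the prior means and the variance of 10 stated above. The toy data, the function names, and the explicit ordering check on the cut points are assumptions of this sketch; a sampler such as Stan would handle these details internally.

```python
import math

PRIOR_VAR = 10.0  # prior variance stated in the text

def log_normal_pdf(value, mean, var):
    """Log density of N(mean, var) at value."""
    return -0.5 * math.log(2.0 * math.pi * var) - (value - mean) ** 2 / (2.0 * var)

def logistic_cdf(z):
    """Standard logistic CDF, with the infinite cut points handled explicitly."""
    if z == math.inf:
        return 1.0
    if z == -math.inf:
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

def log_posterior(beta, cutpoints, X, y):
    """Unnormalized log posterior; y is coded 1..L, cutpoints are tau_1..tau_{L-1}."""
    # Assumption of this sketch: cut points must be strictly increasing.
    if any(a >= b for a, b in zip(cutpoints, cutpoints[1:])):
        return -math.inf
    taus = [-math.inf] + list(cutpoints) + [math.inf]
    total = 0.0
    for x_i, y_i in zip(X, y):
        xb = sum(xk * bk for xk, bk in zip(x_i, beta))
        total += math.log(logistic_cdf(taus[y_i] - xb) - logistic_cdf(taus[y_i - 1] - xb))
    total += sum(log_normal_pdf(b, 0.0, PRIOR_VAR) for b in beta)
    total += sum(log_normal_pdf(t, l + 0.5, PRIOR_VAR)
                 for l, t in enumerate(cutpoints, start=1))
    return total

# Hypothetical toy data: one predictor, three response categories.
X = [[-1.0], [0.0], [0.5], [2.0]]
y = [1, 2, 2, 3]
value = log_posterior(beta=[0.8], cutpoints=[-1.0, 1.0], X=X, y=y)
print(value)
```

The nonparallel version would differ only in using an equation-specific slope vector for each of the two cumulative probabilities.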

As discussed in Fullerton and Xu (2016), there are multiple ways to test the PL assumption in ordered regression models within a Bayesian framework. One method is to examine a central interval of the posterior of the quantity of interest (e.g., the difference between two coefficients), for example, the 95 percent credible interval. Here one simply finds the lower and upper bounds of the middle 95 percent of the posterior distribution, leaving equal probability in the two tails. Based on decision theory, a better alternative is to find the interval that contains 95 percent of the posterior and within which every point has a higher density than any point outside it, hence the highest density interval (HDI) (Kruschke 2015; O’Hagan 2010).

As argued in the previous section, it is unlikely that differences in coefficients across cut-point equations will be exactly zero even in theory, and thus it is common to reject the PL assumption for even minor deviations given a sufficient amount of data. One approach to address this issue is to allow for small differences in coefficients by defining a region of practical equivalence (ROPE), an interval rather than a point null value, for a Bayesian assessment of the PL assumption (Kruschke 2015; Kruschke and Liddell 2017). It can partly address the problems associated with NHST and largely alleviate the problem of inflated false positives arising from simultaneous and/or sequential testing of the PL assumption (Kruschke 2015; Kruschke and Liddell 2017). Instead of asking whether the bulk of the posterior distribution contains the null value (e.g., zero), the ROPE + HDI approach uses an interval, for example, [−0.5, 0.5], that could come from scholarly consensus and is viewed as practically equivalent to the parameter value, usually 0, specified in the null.

Based on Kruschke (2015, 2017) and his colleagues (Carlin and Louis 2009; Hobbs and Carlin 2008), if the HDI (usually the 95 percent HDI), the most representative range of the posterior, has no overlap whatsoever with the chosen ROPE for a parameter value, then the null hypothesis can be readily rejected; that is, the null value is not credible. If the HDI, however, falls completely within the ROPE, then the null value can be accepted, whereas under the frequentist framework, one can only fail to reject the null. When there is some overlap between the ROPE and the HDI, one has to withhold the decision since there is not enough empirical evidence. The choice of the radius of a ROPE, of course, can affect the Bayesian assessment of null values. In some cases, prior scholarship may provide guidelines about an appropriate interval for the ROPE. In the absence of such guidelines, there are two recommended methods to define the ROPE interval. One is to examine the proportion (area) of overlap between the posterior and the ROPE as a function of the radius chosen for the ROPE so as to make this decision process open and transparent (Kruschke and Liddell 2017). A second method, again suggested by Kruschke (2015), is to set the ROPE interval to be $\pm 0.1 S_{y} / (4 S_{x})$ ; that is, a tenth of a standard deviation in y relative to a change of four standard deviations in x, which covers almost the full range of x.
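The three-way decision rule can be sketched directly from posterior draws of a coefficient difference. The draws below are simulated stand-ins rather than output from the models in this article, the ROPE of [−0.5, 0.5] is the illustrative interval used in the text, and the HDI routine assumes a unimodal posterior.

```python
import math
import random

def hdi(draws, mass=0.95):
    """Narrowest interval containing `mass` of the draws (unimodal posterior assumed)."""
    s = sorted(draws)
    n_in = math.ceil(mass * len(s))
    starts = range(len(s) - n_in + 1)
    best = min(starts, key=lambda i: s[i + n_in - 1] - s[i])
    return s[best], s[best + n_in - 1]

def rope_decision(draws, rope=(-0.5, 0.5), mass=0.95):
    """Apply the HDI + ROPE rule to posterior draws of a coefficient difference."""
    lo, hi = hdi(draws, mass)
    if hi < rope[0] or lo > rope[1]:
        return "reject"      # HDI entirely outside the ROPE
    if rope[0] <= lo and hi <= rope[1]:
        return "accept"      # HDI entirely inside the ROPE
    return "undecided"       # partial overlap

# Simulated stand-ins for posterior draws of beta_lk - beta_mk (not real output).
rng = random.Random(0)
near_zero = [rng.gauss(0.05, 0.1) for _ in range(4000)]
far_away = [rng.gauss(1.5, 0.2) for _ in range(4000)]
straddling = [rng.gauss(0.5, 0.2) for _ in range(4000)]
print(rope_decision(near_zero), rope_decision(far_away), rope_decision(straddling))
```

Rerunning `rope_decision` over a grid of ROPE radii is one way to trace how the decision depends on the chosen radius.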

Note that while the ROPE + HDI method alleviates some of the problems previously described, it still stays within the conceptual framework under which coefficients are generally viewed as equal or unequal across equations. Next, we move beyond such conceptual limits to postulate that coefficients across different equations are allowed to be different to begin with, but since the data generation processes are usually presumed to be quite similar across different cut-point equations for the cumulative logit model, we also propose to maintain a higher order degree of homogeneity in that variation.

Hyperparameter Cumulative Logit Model

Considerable efforts have been made to adjudicate among parallel, partial, and nonparallel models (Fullerton and Xu 2016; Williams 2006). The parallel model has a clear advantage in parsimony, whereas the nonparallel model usually provides flexibility in parameterization, may increase predictive accuracy, and can accommodate nominality as opposed to strict ordinality of response variables. There are also other lesser known types, such as partial models, proportionality constraints, and stereotype models (Fullerton and Xu 2016, 2018). Despite the variety of these models, methods to select among them generally stay within the traditional NHST paradigm. Coefficients for the same predictors are either statistically not different from one another or they are different. In addition, existing literature does not distinguish between empirical and statistical significance; that is, two coefficients could be statistically different from one another, but they may produce trivial difference in predictions.

Because the likelihood that all coefficients for the same predictors are equivalent across the board is usually rather low, especially given a large sample size, the null hypothesis is often rejected by an omnibus test. Conversely, allowing coefficients to vary freely across cut-point equations creates inefficiency and uncertainty about the ordinality of response variables. A common practice is then to let the data inform analysts by testing the PL assumption in an omnibus and then a pairwise (sometimes stepwise) manner, the latter of which can potentially compromise the power of the test (Gelman et al. 2012). Even with the Bayesian assessment of the PL assumption introduced in the previous section, some typical issues described here (e.g., allowing some coefficients but not others to differ in certain ways) remain unresolved.

In this article, we propose a new cumulative logit model in which the coefficients across cut-point equations do not have to be completely different, nor do they have to be exactly the same in the population. Instead, they are allowed to share the same population mean and variance (hyperparameters). This parameterization addresses several challenges that are usually associated with the traditional NHST framework, both conceptually and empirically. First and foremost, this new model recognizes that the PL assumption is simply unrealistic and this view is incorporated into the model setup. This model also acknowledges that there may exist some similarities across different pairs of logit comparisons; that is, the coefficients do not have to be drastically different, given that the response variable is ordinal rather than nominal and that the cut-point equations possibly follow similar data generation processes.

So the second approach we outline is the specification of a Bayesian cumulative logit model with hyperparameters that allow for the parameters of the same predictors to come out of the same distributions. This approach permits a degree of variation in the coefficients across cut-point equations but circumscribes these coefficients to share the same mean and variance. Similar to the shrinkage estimators in a multilevel model that uses partial pooling to weigh and thereby improve upon two classical estimates (i.e., no pooling and complete pooling), this hyperparameter cumulative logit model represents a synthesis of a nonparallel model that allows coefficients for each predictor to vary across all pairs of cut-point equations and a parallel model that imposes the restriction that the coefficients are equal across cut-point equations. Conceptually, this new model uses information from all cut-point equations for the same predictors to inform the estimation of specific coefficients so that parameter estimates are pulled toward the shared mean of the coefficient distributions. This is particularly useful when part of the model is unstable or even nonestimable as other parts of the model (i.e., data used to estimate other coefficients of the same predictors) can facilitate the estimation. In addition, this model simultaneously estimates cut points, coefficients, and their hyperparameters, all of which can be used for post-estimation analyses, interpretations, and predictions.

The Bayesian hyperparameter nonparallel cumulative logit model builds on the Bayesian nonparallel cumulative logit model specified previously in equations (13) and (14) by postulating that the coefficients of the same predictor come from the same distribution (e.g., normal) with unknown mean and variance (hyperparameters) and then specifying hyper-priors for these hyperparameters. For this new model, equation (14) is modified as follows,

p (β_{l}) \sim N (μ, σ) \forall β_{l},

and then hyper-priors are added to the model:

\begin{array}{l} p (μ) \sim N (0, 10) \forall μ \\ p (σ) \sim Γ (1.0E-3, 1.0E-3) \forall σ, \end{array}

with the mean of the coefficients following a normal distribution and the standard deviation following a Gamma distribution; the parameters of these hyper-priors are given in parentheses.
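The hierarchical prior for one predictor can be written out as a log density, as in the sketch below. Two points are assumptions of this sketch rather than statements from the text: the second argument of $N (μ, σ)$ is treated as a standard deviation, and the Gamma hyper-prior is read in its shape and rate parameterization. The coefficient values passed in are hypothetical.

```python
import math

def log_normal_pdf(value, mean, var):
    """Log density of N(mean, var) at value."""
    return -0.5 * math.log(2.0 * math.pi * var) - (value - mean) ** 2 / (2.0 * var)

def log_gamma_pdf(value, shape, rate):
    """Log density of a Gamma(shape, rate) distribution at value > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(value) - rate * value)

def log_hierarchical_prior(betas_k, mu_k, sigma_k):
    """Log prior for one predictor's coefficients across cut-point equations.

    betas_k: the L - 1 equation-specific coefficients of predictor k.
    mu_k, sigma_k: the shared mean and standard deviation (hyperparameters).
    """
    if sigma_k <= 0:
        return -math.inf
    total = sum(log_normal_pdf(b, mu_k, sigma_k ** 2) for b in betas_k)
    total += log_normal_pdf(mu_k, 0.0, 10.0)          # hyper-prior on the mean (variance 10)
    total += log_gamma_pdf(sigma_k, 1.0e-3, 1.0e-3)   # hyper-prior on the SD (assumed shape, rate)
    return total

# Hypothetical coefficients: tightly clustered versus widely spread around the same mean.
clustered = log_hierarchical_prior([-0.1, -0.1, -0.1], mu_k=-0.1, sigma_k=0.05)
spread = log_hierarchical_prior([-0.3, -0.1, 0.1], mu_k=-0.1, sigma_k=0.05)
print(clustered, spread)
```

For a given σ, coefficients that sit close to the shared mean receive a higher prior density than widely spread ones, which is the mechanism that pulls equation-specific estimates toward one another.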

Empirical Examples

The first empirical example examines sociodemographic predictors of self-rated health using data from the 2012 General Social Survey (Smith et al. 2019). Self-rated health is measured on an ordinal scale from one to four, denoting excellent, good, fair, and poor health, respectively. Predictors include age (in years), gender (male = 1 and female = 0), marital status (married = 1 and otherwise = 0), race (black and other race/ethnicity with white as the reference category), education (in years), and imputed log of family income. We use list-wise deletion to deal with missing data, which results in an estimation sample of 1,300.

For this example we use Stan Version 2.9 with the No-U-Turn Sampler (NUTS), which adaptively tunes Hamiltonian Monte Carlo, to simulate the posterior (Carpenter et al. 2017; Gelman et al. 2014; Stan Development Team 2017). The numbers of tuning, warm-up, and saved steps are set to 2,000, 5,000, and 10,000, respectively, and three chains are used (Gelman et al. 2014; Hoffman and Gelman 2014; Kruschke 2015).

Table 1 reports descriptive statistics and estimates from both a standard parallel cumulative logit model (panel 1) and a Bayesian parallel cumulative logit model (panel 2) predicting self-rated health. Consistent with theoretical expectations, we see that age is associated with worse health, while education, income, and being married are associated with better health. We also note that the Bayesian model returns parameter estimates quite similar to those from the standard parallel cumulative logit model, except that the cut points are all shifted by the same constant because income is demeaned in the Bayesian model for illustrative purposes.

Table 1.

Parallel Cumulative Logit Regression of Self-reported Health.

Parameters	MLE			Bayesian			Descriptive statistics	
Parameters	Coefficient	2.50 Percent	97.50 Percent	Mean	2.50 Percent	97.50 Percent	Mean	SD
Self-rated health							2.072	0.854
Age	0.020***	0.014	0.026	0.020	0.014	0.026	48.200	17.432
Gender	0.011	−0.197	0.218	0.011	−0.198	0.219	0.444	0.479
Married	−0.294*	−0.519	−0.068	−0.294	−0.518	−0.069	0.468	0.499
Black	0.068	−0.221	0.357	0.068	−0.222	0.356	0.152	0.359
Other	0.044	−0.302	0.390	0.043	−0.304	0.390	0.102	0.303
Education	−0.092***	−0.130	−0.054	−0.093	−0.131	−0.054	13.513	3.151
Impinc	−0.274***	−0.386	−0.162	−0.275	−0.386	−0.164	9.860	1.155
Cut point 1	−4.230***	−5.257	−3.203	−1.535	−2.193	−0.879
Cut point 2	−2.068***	−3.073	−1.062	0.634	−0.020	1.291
Cut point 3	−0.221	−1.229	0.788	2.490	1.814	3.174

Note: N = 1,300. Impinc = imputed income.

*p < .05. **p < .01. ***p < .001.

Online Appendix 1-A (which can be found at http://smr.sagepub.com/supplemental/) reports various omnibus tests of the PL assumption for all predictors. All tests indicate that the PL assumption is violated for at least one of the predictors. Results from individual Brant tests (Online Appendix 1-B, which can be found at http://smr.sagepub.com/supplemental/) show that education and the black binary indicator of race fail to satisfy the PL assumption.

To further examine the source of the violation, we focus on testing the PL assumption for the coefficients of black and education. We first run a nonparallel cumulative logit model and then use this model as our baseline to conduct NHST (except for the Brant test, which uses the parallel model). The left panel of Table 2 contains results from the nonparallel cumulative logit model, and the results for testing the PL assumption for black and education are in Online Appendices 1-C, D, and E (which can be found at http://smr.sagepub.com/supplemental/). For both variables, we first examine whether their coefficients differ across cut-point equations in a pairwise manner and then jointly. With the significance level set to the traditional 0.05 level, the tests for education consistently indicate a clear violation of the PL assumption, but there is some discrepancy regarding the black indicator variable. When we compare the black coefficients from the second (excellent and good vs. fair and poor) and third (excellent, good, and fair vs. poor) equations, the Wald, LR, and score tests all agree that the coefficients are pairwise different; they also agree that the difference in the black coefficients between the first (excellent vs. the rest) and second equations is not statistically significant. But when we test whether the black coefficients are equal to one another as a group, the three tests diverge, with the score test rejecting, the LR test marginally not rejecting, and the Wald test not rejecting the PL assumption. This example shows that results from the three tests may contradict one another in nonasymptotic cases, and usually the score and especially the LR test appear to be more conservative than the Wald test as indicated by prior scholarship (Engle 1984; Fullerton and Xu 2016).
Examining the various model fit statistics in Online Appendix 1-F (which can be found at http://smr.sagepub.com/supplemental/), we find no convincing evidence that either the parallel or the nonparallel cumulative logit model is clearly better than the other.
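The pairwise comparisons described above can be illustrated with a minimal sketch of a one-degree-of-freedom Wald test for the equality of two slopes. This is not the authors' code: the function name `wald_equal_coefs` and the numeric inputs are hypothetical, and the variances and covariance would normally be read from the estimated covariance matrix of the nonparallel model.

```python
import math

def wald_equal_coefs(b1, b2, var1, var2, cov12=0.0):
    """Wald chi-square test (1 df) of H0: b1 == b2.

    var1, var2, cov12 are the sampling variances and covariance of the
    two estimates (assumed to come from the fitted model).
    """
    diff = b1 - b2
    var_diff = var1 + var2 - 2.0 * cov12
    if var_diff <= 0:
        raise ValueError("variance of the difference must be positive")
    w = diff * diff / var_diff
    # For a chi-square variable with 1 df, P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value

# Hypothetical inputs for illustration only (not estimates from the article).
w, p = wald_equal_coefs(0.2, -0.6, 0.04, 0.12, cov12=0.0)
print(f"Wald chi2 = {w:.2f}, p = {p:.4f}")
```

The LR and score tests require refitting or evaluating the constrained model and are therefore not sketched here.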

Table 2.

Nonparallel Cumulative Logit Regression of Self-reported Health.

Parameters	MLE			Bayesian
Parameters	Coefficient	2.50 Percent	97.50 Percent	Mean	2.50 Percent	97.50 Percent
Equation y > 1
Age	0.016***	0.008	0.023	0.016	0.008	0.024
Gender	−0.149	−0.403	0.106	−0.144	−0.397	0.109
Married	−0.410**	−0.688	−0.132	−0.403	−0.678	−0.128
Black	0.231	−0.156	0.618	0.244	−0.138	0.640
Other	0.086	−0.338	0.510	0.099	−0.323	0.521
Education	−0.054*	−0.101	−0.008	−0.057	−0.104	−0.012
Impinc	−0.193**	−0.336	−0.051	−0.204	−0.346	−0.063
Equation y > 2
Age	0.022***	0.015	0.030	0.023	0.015	0.030
Gender	0.147	−0.113	0.407	0.146	−0.116	0.406
Married	−0.211	−0.492	0.070	−0.215	−0.495	0.067
Black	0.069	−0.285	0.422	0.060	−0.295	0.412
Other	0.038	−0.398	0.474	0.032	−0.409	0.467
Education	−0.130***	−0.177	−0.083	−0.129	−0.175	−0.083
Impinc	−0.310***	−0.445	−0.175	−0.313	−0.446	−0.180
Equation y > 3
Age	0.026***	0.013	0.038	0.026	0.014	0.039
Gender	0.415	−0.044	0.874	0.404	−0.060	0.873
Married	−0.182	−0.680	0.316	−0.212	−0.713	0.282
Black	−0.659	−1.402	0.084	−0.704	−1.498	0.014
Other	−0.439	−1.277	0.398	−0.507	−1.408	0.294
Education	−0.108**	−0.185	−0.032	−0.110	−0.186	−0.035
Impinc	−0.458***	−0.666	−0.250	−0.443	−0.638	−0.232
Cut point 1	3.184***	1.892	4.476	−1.310	−2.108	−0.528
Cut point 2	2.664***	1.460	3.868	0.426	−0.363	1.213
Cut point 3	1.723	−0.220	3.665	2.800	1.478	4.137

Note: N = 1,300. Impinc = imputed income.

*p < .05. **p < .01. ***p < .001.

The right panel of Table 2 reports the estimates from a Bayesian nonparallel cumulative logit model that serves as the basis for assessing the PL assumption. Again, it is not surprising that the estimates are quite similar to those produced by our ML estimation presented in the left panel of Table 2, given that weakly informative priors are used. To assess the PL assumption, we use both HDIs and ROPEs. When using some version of the credible interval (either central or highest density) alone, we simply look at whether the interval contains zero. For ROPEs, there are only some general guidelines for an assessment of null values, as discussed in the previous section. To reiterate, suppose the ROPE we set for the difference between two slopes runs from −0.5 to 0.5 (i.e., a difference of 0.5 is not practically different from zero). If this region falls completely outside the 95 percent HDI, we can safely reject the null. If, however, the ROPE completely contains the 95 percent HDI, we accept the null. When the ROPE and HDI only partially overlap, we have to withhold our decision since there is not enough empirical evidence.
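The three-way decision rule just described can be sketched in a few lines. The sketch assumes that posterior draws of a slope difference are available as a plain list; the function names `hdi` and `rope_decision` are hypothetical, and the intervals in the usage lines are invented for illustration with the ROPE of (−0.5, 0.5) mentioned above.

```python
import math

def hdi(samples, mass=0.95):
    """Return the narrowest interval holding `mass` of the posterior draws."""
    xs = sorted(samples)
    n = len(xs)
    k = math.ceil(mass * n)  # number of draws the interval must cover
    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

def rope_decision(hdi_low, hdi_high, radius):
    """Compare an HDI with a ROPE of (-radius, radius) around zero."""
    if hdi_low > radius or hdi_high < -radius:
        return "reject"    # ROPE falls completely outside the HDI
    if -radius <= hdi_low and hdi_high <= radius:
        return "accept"    # ROPE completely contains the HDI
    return "withhold"      # partial overlap: not enough evidence

# Hypothetical HDIs evaluated against a ROPE radius of 0.5.
print(rope_decision(0.6, 1.4, 0.5))    # reject
print(rope_decision(-0.3, 0.4, 0.5))   # accept
print(rope_decision(-0.2, 0.9, 0.5))   # withhold
```

In practice one would pass the output of `hdi(draws)` to `rope_decision`, where `draws` holds the posterior differences between two cut-point equations' slopes.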

Figures 2 and 3 provide graphical displays of the results from our Bayesian inferential analysis of the PL assumption for the black and education coefficients, respectively. The left panel of both figures displays the posterior distribution and 95 percent HDI for a given ROPE radius, and the right panel graphs the proportion (area) of the posterior within specified ROPEs as a function of the ROPE radius. Based on HDIs or simply credible intervals, the null value for two pairwise slope comparisons (the second and third graphs in Figure 2, that is, Eq1Black-Eq3Black and Eq2Black-Eq3Black) can be rejected since zero is outside the 95 percent HDI. But this conclusion can be misleading or premature because of, for example, the inconsistency between statistical and empirical significance, or it can be insufficient since sometimes we want to know whether we can accept the null value. The following discussion focuses first on the left panels of the two figures.

Figure 2.

Testing black coefficients equivalence using regions of practical equivalence.

Figure 3.

Testing education coefficients equivalence using regions of practical equivalence.

Following the $\pm 0.1 S_{y} / 4 S_{x}$ rule suggested by Kruschke (2017),¹ we instead use a ROPE radius of 0.2 (−0.2, 0.2), roughly the smallest standard deviation of all three slope parameters in this case, to assess the null value and test the hypothesis. To illustrate, the first graph in the left panel of Figure 2 displays the posterior distribution of the difference between the black coefficients from the first and second equations, shown as a histogram. The bold black horizontal line at the bottom marks the two limits of the 95 percent HDI of this posterior, in this case, −0.234 and 0.62. The two dotted vertical lines correspond to the lower (−0.2) and upper (0.2) bounds of the chosen ROPE. Based on our calculation, 50 percent of the posterior lies within the chosen ROPE. Thus, we have to withhold our decision as to whether these two coefficients are statistically different. In fact, for the black coefficients from the three cut-point equations, we have to withhold our decision since the chosen ROPE ( $\pm 0.2$ ) overlaps with the 95 percent HDIs in all three graphs of the left panel in Figure 2, although the extent of the overlap varies with each comparison.

Similarly, Figure 3 provides a graphical display of the results from the Bayesian assessment of the PL assumption for the education coefficients. Since it makes more sense to consider how the ROPE is related to the HDI in Bayesian inference, we focus on findings using a ROPE radius of 0.02. The education slopes in the first and second equations can be regarded as different since the 95 percent HDI is completely outside the specified ROPE. For the other two pairs of parameters, we have to suspend our judgment.

A different ROPE could, of course, lead to different conclusions concerning the validity of the PL assumption. For example, had we used a somewhat smaller ROPE radius, we would reject the null value for the difference between the black coefficients from the first and third equations (3 percent in ROPE with a radius of 0.2). The change would have to be more substantial for the comparison of the black coefficients between the second and third equations (5 percent in ROPE with a radius of 0.2) and for the comparison between the first and second equations (50 percent in ROPE with a radius of 0.2).

To investigate how our results vary with different ROPE radii, we also graph the proportion (area) of the posterior within specified ROPEs as a function of the ROPE radius. These graphs are placed to the right of their corresponding radius-specific posterior distributions, and the two sets of graphs generally agree with one another. If the curve (usually its low end) is steep, the proportion of the posterior in the ROPE is quite sensitive to the chosen radius, and accordingly we are less likely to reject the null. For example, although we probably have to withhold our decision about the equivalence among the black coefficients given the chosen ROPE, we feel safer inferring that the black coefficients from the first and third equations (the second graph in the right panel of Figure 2) are probabilistically less similar than those in the other pairwise comparisons, since the proportion of the posterior in the ROPE for the difference between the two coefficients stays low (<.05) until the radius reaches 0.3. By contrast, the other two curves, especially the one comparing the black coefficients between the first and second equations (the first graph in the right panel of Figure 2), rise quickly, suggesting an almost infinitesimal likelihood that the ROPE'd null value can be rejected. The pattern becomes even clearer in relative terms for the comparison of the two education coefficients between the first and second equations (the two graphs in the first row of Figure 3). Unlike the black-and-white decision rule under the NHST framework, the whole process of Bayesian assessment of null values recognizes uncertainty and is more transparent.
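The curves in the right panels can be approximated from posterior draws by computing, for each candidate radius, the share of draws that fall inside the ROPE. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the draws are simulated from a normal distribution rather than taken from the fitted model.

```python
import random

def prop_in_rope(diffs, radius):
    """Share of posterior draws of a slope difference inside (-radius, radius)."""
    inside = sum(1 for d in diffs if -radius <= d <= radius)
    return inside / len(diffs)

def rope_curve(diffs, radii):
    """Proportion of the posterior in the ROPE as a function of the radius."""
    return [(r, prop_in_rope(diffs, r)) for r in radii]

# Simulated (hypothetical) posterior draws of a slope difference.
rng = random.Random(2024)
draws = [rng.gauss(0.2, 0.22) for _ in range(20000)]

for radius, share in rope_curve(draws, [0.1, 0.2, 0.3, 0.5, 1.0]):
    print(f"radius = {radius:.1f}  proportion in ROPE = {share:.3f}")
```

Because the ROPE grows with the radius, the resulting curve is nondecreasing; a curve that stays near zero over a wide range of radii corresponds to the pattern described for the first and third equations.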

As illustrated, a Bayesian assessment of the PL assumption has a few advantages over the traditional frequentist NHST. First, it allows one not only to reject but also to accept the null. Second, it avoids the ambivalent and even confusing application of confidence intervals; instead, we can provide a full probabilistic interpretation of our results. Third, the use of ROPE + HDI reduces the odds of false positives in sequential testing.

Estimation of the Hyperparameter Cumulative Logit Model

In this example, we use the same data from the previous sections to fit a hyperparameter nonparallel cumulative logit model (see equations [15] and [16]). This model adds the assumption that the slopes for the same predictor come from the same (normal) distribution. Table 3 presents Bayesian estimates of the hyperparameters, the means of slopes and cut points, and their 95 percent credible intervals from the posterior. Comparing the estimated means of slopes from the hyperparameter model with those from the parallel model, we find that the two sets of coefficients are quite similar, in that all “statistically significant” coefficients are close to each other in magnitude and exactly the same in direction. It is also of note that the mean estimates of statistically “credible” slopes and their credible intervals from the Bayesian hyperparameter nonparallel model (Table 3) are pooled closer to one another (hence the shrinkage estimators) than those from the Bayesian nonparallel model (Table 2) without hyperparameters or shrinkage.
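The intuition behind the shrinkage can be shown with a simplified normal-normal calculation. This is not the authors' model or their Stan code: it assumes a single slope with a known sampling standard error and treats the hyperparameter mean `mu` and scale `sigma` as known, whereas the actual model estimates them jointly with all other parameters. All names and numbers are hypothetical.

```python
def shrink(b, se, mu, sigma):
    """Conditional posterior mean of a slope under a normal prior N(mu, sigma^2).

    b and se are an equation-specific estimate and its standard error;
    the result is a precision-weighted average of b and mu.
    """
    w_data = 1.0 / se ** 2      # precision of the equation-specific estimate
    w_prior = 1.0 / sigma ** 2  # precision implied by the hyperparameters
    return (w_data * b + w_prior * mu) / (w_data + w_prior)

# Hypothetical slopes for one predictor across three cut-point equations.
estimates = [0.25, 0.05, -0.65]
std_errors = [0.20, 0.18, 0.38]
mu, sigma = 0.05, 0.20          # hypothetical hyperparameter values

for b, se in zip(estimates, std_errors):
    print(f"raw = {b:+.2f}  shrunk = {shrink(b, se, mu, sigma):+.3f}")
```

The least precisely estimated slope is pulled furthest toward the common mean, which is the qualitative pattern the text describes when comparing Table 3 with Table 2.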

Table 3.

Bayesian Hyperparameter Nonparallel Cumulative Logit Regression of Self-rated Health.

Parameters	Mean	2.50 Percent	97.50 Percent
Hyperparameters
Age	0.020	0.013	0.029
Gender	0.034	−0.205	0.301
Married	−0.300	−0.523	−0.068
Black	0.067	−0.357	0.375
Other	0.016	−0.394	0.373
Education	−0.099	−0.203	0.002
Impinc	−0.281	−0.411	−0.159
Equation y > 1
Age	0.019	0.012	0.026
Gender	−0.011	−0.238	0.209
Married	−0.324	−0.536	−0.084
Black	0.118	−0.208	0.434
Other	0.037	−0.328	0.374
Education	−0.060	−0.105	−0.009
Impinc	−0.270	−0.384	−0.155
Equation y > 2
Age	0.021	0.015	0.027
Gender	0.057	−0.148	0.304
Married	−0.297	−0.518	−0.066
Black	0.088	−0.231	0.348
Other	0.036	−0.306	0.372
Education	−0.122	−0.169	−0.071
Impinc	−0.283	−0.404	−0.175
Equation y > 3
Age	0.022	0.015	0.031
Gender	0.065	−0.167	0.449
Married	−0.293	−0.526	−0.020
Black	−0.009	−0.675	0.327
Other	0.005	−0.400	0.370
Education	−0.112	−0.175	−0.056
Impinc	−0.292	−0.452	−0.177
Cut point 1	−1.112	−1.837	−0.231
Cut point 2	0.331	−0.383	1.064
Cut point 3	2.384	1.465	3.343

Note: N = 1,300. Impinc = imputed income.

Figures 4 and 5 plot predicted probabilities of y = 1, 2, 3, and 4 and their credible intervals from our estimation of the Bayesian parallel cumulative logit and the hyperparameter models, respectively. To compute these predictions and their credible intervals, we hold all other variables at their grand means while varying the values of log of family income. A few findings are worth noting. First, predictions across the two models are similar for one (y = 2) of the four response levels of self-rated health but are different for the other three. Second, the credible intervals for the hyperparameter model are clearly narrower than those of the parallel model for three of the four graphs (y = 1, 3, and 4), indicating greater precision.
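As a rough illustration of how such predicted probabilities can be obtained from a single set of parameter values, the sketch below converts cut points and equation-specific slopes into category probabilities. It assumes the parameterization P(y > m | x) = logit⁻¹(x′β_m − τ_m), which matches the "Equation y > m" layout of the tables but should be checked against the model equations given earlier in the article. The function names and inputs are hypothetical; credible intervals would come from repeating the calculation over posterior draws.

```python
import math

def inv_logit(z):
    """Logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(x, betas, cuts):
    """Probabilities of y = 1..M from a (non)parallel cumulative logit.

    x     : covariate values
    betas : one slope vector per cut-point equation (M - 1 vectors)
    cuts  : M - 1 cut points
    Assumes P(y > m | x) = inv_logit(x'beta_m - tau_m).
    """
    exceed = [
        inv_logit(sum(b * xi for b, xi in zip(beta, x)) - tau)
        for beta, tau in zip(betas, cuts)
    ]
    bounds = [1.0] + exceed + [0.0]
    probs = [bounds[m] - bounds[m + 1] for m in range(len(bounds) - 1)]
    # Nonparallel slopes can make the cumulative curves cross.
    if min(probs) < 0:
        raise ValueError("negative predicted probability: equations cross at x")
    return probs

# Hypothetical values: one covariate, four response levels.
probs = category_probs([0.0], [[0.5], [0.5], [0.5]], [-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

For a parallel model, the same slope vector is simply repeated for every equation.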

Figure 4.

Predicted probabilities of four levels of self-rated health with credible intervals from the Bayesian nonparallel cumulative logit model.

Figure 5.

Predicted probabilities of four levels of self-rated health with credible intervals from the Bayesian hyperparameter nonparallel cumulative logit model.

The Watanabe–Akaike (or, more popularly, the widely applicable) information criterion (WAIC; Vehtari, Gelman, and Gabry 2017; Watanabe 2010) is also used to evaluate and compare fitted Bayesian models. WAIC is calculated as

$$-2\,{\hat{elppd}}_{WAIC} = -2\,(\hat{lppd} - {\hat{p}}_{WAIC}),$$

where $\hat{lppd}$ is the estimated log pointwise predictive density, and ${\hat{p}}_{WAIC}$ is the estimated effective number of parameters (Gelman et al. 2014:166-81; Vehtari, Gelman, and Gabry 2017). Because WAIC is fully Bayesian and invariant to parameterization, recent scholarship has preferred WAIC over AIC and DIC in measuring predictive accuracy (Gelman et al. 2014:174). Based on WAIC, the hyperparameter model fares best among the three Bayesian models ( ${WAIC}_{nonparallel} = 3{,}016.6$ , ${WAIC}_{parallel} = 3{,}021.4$ , and ${WAIC}_{hyperparameter} = 3{,}014.7$ ), although the advantage is modest. Another popular measure of predictive accuracy, leave-one-out cross-validation, gives very similar results and identical substantive findings.
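The calculation can be sketched from a matrix of pointwise log-likelihoods, following the definitions in Vehtari, Gelman, and Gabry (2017): $\hat{lppd}$ sums the log of the posterior mean likelihood of each observation, and ${\hat{p}}_{WAIC}$ sums the posterior variance of each observation's log-likelihood. The function names and the toy input are hypothetical; in applied work a dedicated package would normally be used.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def sample_variance(values):
    """Unbiased sample variance (requires at least two values)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def waic(log_lik):
    """WAIC from log_lik[s][i], the log-likelihood of observation i at draw s."""
    n_obs = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        lppd += log_mean_exp(column)        # log pointwise predictive density
        p_waic += sample_variance(column)   # effective number of parameters
    return -2.0 * (lppd - p_waic), lppd, p_waic

# Toy input: 3 posterior draws by 2 observations (hypothetical values).
toy = [[-1.0, -2.1], [-1.2, -1.9], [-0.9, -2.0]]
value, lppd, p_waic = waic(toy)
print(f"WAIC = {value:.3f}, lppd = {lppd:.3f}, p_WAIC = {p_waic:.3f}")
```

Lower values indicate better expected out-of-sample predictive accuracy, which is how the three WAIC values above are compared.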

Second Empirical Example

In this second example, we use data from the Asian sample of the National Latino and Asian American Survey (Alegría et al. 2004) to illustrate how the hyperparameter nonparallel cumulative logit model can be used where traditional methods fail. In this example, we use an eight-level ordinal response to classify different types of body weight normalized by height based on standards set by the World Health Organization for body mass index (BMI). These eight levels are severe thinness (<16), moderate thinness (16–16.99), mild thinness (17–18.49), normal range (18.5–24.99), preobese (25–29.99), obese class I (30–34.99), obese class II (35–39.99), and obese class III (≥40). Age (in years), gender (male = 1), ethnicity (Filipino, Vietnamese, and other Asians with Chinese as the reference category), nativity (U.S. born = 1), marital status (married = 1), education (in years), and log of income (log of household income in dollars) are used as predictors. For this example, we first fit the traditional parallel cumulative logit model using maximum likelihood estimation (MLE), then the same model using Bayesian estimation, and last a Bayesian hyperparameter nonparallel cumulative logit model.
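The eight-level response can be constructed from continuous BMI values with a simple lookup on the thresholds listed above. The sketch assumes half-open intervals (for example, 16 ≤ BMI < 17 for moderate thinness), which is one reading of the reported ranges; the names `BMI_CUTS`, `BMI_LABELS`, and `bmi_category` are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of levels 2 through 8, taken from the ranges in the text.
BMI_CUTS = [16.0, 17.0, 18.5, 25.0, 30.0, 35.0, 40.0]
BMI_LABELS = [
    "severe thinness",
    "moderate thinness",
    "mild thinness",
    "normal range",
    "preobese",
    "obese class I",
    "obese class II",
    "obese class III",
]

def bmi_category(bmi):
    """Return (level, label), with level running from 1 to 8."""
    level = bisect_right(BMI_CUTS, bmi) + 1
    return level, BMI_LABELS[level - 1]

for value in (15.2, 16.0, 22.4, 27.5, 41.0):
    print(value, bmi_category(value))
```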

Table 4 presents results from the traditional frequentist parallel cumulative logit model using MLE. Age, being male vs. female, and being married vs. other types of marital status all increase the odds of having a heavier body weight adjusted by height. Being Filipino or other Asian as opposed to Chinese increases one’s odds of falling into a higher BMI category, whereas being Vietnamese vs. Chinese decreases the odds. Interestingly, neither education nor income is associated with BMI categories at conventional significance levels in this sample, though the signs of the coefficients are in the expected direction. Not surprisingly, our Bayesian estimation using the same data and likelihood function with weakly informative priors yields very similar results.

Table 4.

Parallel Cumulative Logit Regression of Body Mass Index (BMI) Categories.

Parameters	MLE			Bayesian			Descriptive statistics
Parameters	Coefficient	2.50 Percent	97.50 Percent	Mean	2.50 Percent	97.50 Percent	Mean	SD
BMI category							4.397	0.932
Age	0.013***	0.006	0.019	0.013	0.006	0.019	41.216	14.750
Male	0.766***	0.588	0.945	0.768	0.590	0.947	0.475	0.499
U.S. born	0.771***	0.546	0.997	0.774	0.547	1.000	0.216	0.412
Filipino	0.993***	0.753	1.232	0.996	0.757	1.240	0.242	0.428
Vietnamese	−0.353**	−0.611	−0.095	−0.353	−0.610	−0.094	0.248	0.432
Other Asian	0.613***	0.364	0.861	0.615	0.366	0.865	0.223	0.416
Married	0.258*	0.050	0.466	0.260	0.052	0.469	0.657	0.475
Education	−0.023	−0.050	0.005	−0.023	−0.050	0.005	13.637	3.429
Log income	0.026	−0.011	0.063	0.026	−0.011	0.063	10.349	2.510
Cut point 1	−3.887***	−4.698	−3.076	−3.966	−4.797	−3.169
Cut point 2	−3.005***	−3.703	−2.306	−3.029	−3.731	−2.345
Cut point 3	−1.490***	−2.118	−0.861	−1.494	−2.119	−0.878
Cut point 4	2.222***	1.596	2.848	2.230	1.612	2.852
Cut point 5	4.051***	3.405	4.696	4.064	3.429	4.706
Cut point 6	5.191***	4.516	5.867	5.215	4.554	5.889
Cut point 7	5.849***	5.136	6.562	5.894	5.199	6.607

Note: N = 2,086.

*p < .05. **p < .01. ***p < .001.

While fitting a parallel cumulative logit model is not difficult, fitting a nonparallel cumulative logit model runs into numerical difficulties using either the MLE or Bayesian approach, due in large part to the large number of response levels and the highly skewed distribution of the response variable. Therefore, a Bayesian hyperparameter nonparallel cumulative logit model (see equations [15] and [16]) becomes a natural candidate if we want to move beyond a simple parallel cumulative logit model when a nonparallel model is not numerically estimable. Because fitting a Bayesian hyperparameter nonparallel cumulative logit model is computationally expensive, especially when we have an unusually large number of response levels, we need to revise a few initial MCMC simulation settings. We increase both the adapt and tune-in steps to 5,000, with 285,000 steps (a total of 300,000 iterations) to be saved with three chains. Following Gelman (2006), we use the half-Cauchy distribution as the prior for the scale hyperparameters of slopes in the model (see equation [16]), and results are presented in Table 5.²

Table 5.

Bayesian Estimation of Hyperparameter Nonparallel Cumulative Logit Regression of Body Mass Index Categories.

Parameters	Mean	2.50 Percent	97.50 Percent
Hyperparameters
Age	0.015	0.004	0.028
Male	0.493	−0.305	1.192
U.S. born	0.856	0.344	1.426
Filipino	0.999	0.723	1.308
Vietnamese	−0.361	−0.735	−0.004
Other Asian	0.585	0.300	0.864
Married	0.247	−0.009	0.502
Education	−0.005	−0.062	0.100
Log income	0.027	−0.021	0.074
Equation y > 1
Age	0.020	0.005	0.432
Male	0.905	0.005	1.915
U.S. born	0.839	0.024	1.609
Filipino	0.983	0.669	1.311
Vietnamese	−0.413	−0.925	0.045
Other Asian	0.607	0.204	1.081
Married	0.270	−0.053	0.635
Education	0.028	−0.040	0.120
Log income	0.022	−0.070	0.075
Equation y > 2
Age	0.018	0.005	0.360
Male	1.236	0.518	2.015
U.S. born	0.621	−0.099	1.272
Filipino	1.002	0.659	1.395
Vietnamese	−0.288	−0.703	0.208
Other Asian	0.605	0.247	1.148
Married	0.216	−0.145	0.504
Education	0.006	−0.046	0.076
Log income	0.025	−0.047	0.077
Equation y > 3
Age	0.025	0.011	0.435
Male	1.228	0.812	1.659
U.S. born	0.560	0.081	1.060
Filipino	1.014	0.711	1.358
Vietnamese	−0.251	−0.589	0.132
Other Asian	0.556	0.238	0.864
Married	0.301	0.059	0.608
Education	−0.030	−0.073	0.014
Log income	0.023	−0.030	0.066
Equation y > 4
Age	0.011	0.004	0.185
Male	0.807	0.620	1.009
U.S. born	0.683	0.442	0.921
Filipino	0.977	0.736	1.207
Vietnamese	−0.458	−0.753	−0.193
Other Asian	0.610	0.372	0.848
Married	0.245	0.039	0.450
Education	−0.024	−0.051	0.005
Log income	0.029	−0.007	0.068
Equation y > 5
Age	0.008	−0.003	0.189
Male	0.104	−0.192	0.418
U.S. born	1.104	0.781	1.566
Filipino	0.976	0.699	1.245
Vietnamese	−0.472	−0.884	−0.110
Other Asian	0.627	0.332	0.933
Married	0.228	−0.046	0.481
Education	−0.016	−0.052	0.029
Log income	0.026	−0.020	0.068
Equation y > 6
Age	0.013	0.000	0.254
Male	−0.300	−0.838	0.218
U.S. born	1.286	0.802	1.761
Filipino	1.024	0.712	1.369
Vietnamese	−0.247	−0.678	0.275
Other Asian	0.570	0.207	0.877
Married	0.226	−0.083	0.510
Education	−0.008	−0.056	0.047
Log income	0.024	−0.027	0.071
Equation y > 7
Age	0.011	−0.004	0.254
Male	−0.422	−1.152	0.231
U.S. born	0.767	0.192	1.321
Filipino	1.009	0.679	1.388
Vietnamese	−0.366	−0.893	0.152
Other Asian	0.591	0.222	0.881
Married	0.254	−0.046	0.569
Education	0.003	−0.049	0.072
Log income	0.029	−0.020	0.084
Cut point 1	−3.353	−4.692	−2.139
Cut point 2	−2.442	−3.433	−1.413
Cut point 3	−1.019	−1.947	0.046
Cut point 4	2.137	1.519	2.777
Cut point 5	3.669	2.837	4.655
Cut point 6	5.051	4.063	6.073
Cut point 7	5.706	4.539	2.816

Note: N = 2,086.

The hyperparameter means of slopes from the hyperparameter model are similar to those from both the MLE and the Bayesian parallel model for some predictors but much less so for others: the estimates for Filipino, Vietnamese, other Asians, married, and log of income in the three models are quite close to one another, those for age and U.S. born are somewhat different, and those for male and education are dissimilar. It happens that in this case, coefficients that are similar across models are all statistically significant, whereas those that are dissimilar (i.e., male and education) tend to be statistically insignificant across all models. Based on WAIC, the Bayesian hyperparameter nonparallel cumulative logit model (WAIC = 4,559.8) again clearly outperforms the Bayesian parallel cumulative logit model (WAIC = 4,610.7). With estimates from the hyperparameter model, we can use hyperparameter means, as we do in a parallel model, to make predictions; alternatively, we could use estimates from the cut-point equations to conduct postestimation analyses, just as in a nonparallel model. We can even use the hyperparameter means for a subset of the predictors and estimates from cut-point equations for other variables, as in a partial model, for further analyses. This example clearly illustrates the utility of a hyperparameter model when the nonparallel model is numerically unestimable using either maximum likelihood or the Bayesian estimation method.

Practical Considerations

Despite several major advantages of using Bayesian estimation as illustrated in our article, there is one temporary limiting feature of this unorthodox statistical paradigm: computational cost. In this article, we ran three types of nonparallel Bayesian ordinal regression models using two different data sets. The amount of time required for such analyses depends on several factors beyond mere model complexity, including sample size and the numbers of adapt, burn-in, and final sampling steps. In addition, the elapsed time can vary drastically from chain to chain, especially with the MCMC used in Stan. When the results for this article were obtained, it took about one hour to estimate a Bayesian parallel cumulative logit model and three hours to estimate the Bayesian nonparallel cumulative logit model, both with three chains, compared with a few seconds using the frequentist MLE in a typical statistical environment (e.g., R, SAS, or Stata) on a typical personal computer (Intel i7-3770 CPU 3.4 GHz and 24 GB RAM). For the Bayesian hyperparameter nonparallel cumulative logit model, it took us about six hours to reach convergence, and its frequentist counterpart is not yet available for estimation. There are various ways to speed up the process, for example, using standardized variables and certain types of priors, such as half-Cauchy, for scale parameters. Given the high computational cost, Bayesian estimation of cumulative ordered logit models is usually preferred when one intends to use informative priors and/or complex reparameterization. Of course, concern about some fundamental flaws of NHST and an embrace of the whole new Bayesian inferential framework are another important consideration.

Conclusion

This article outlines Bayesian approaches to address several long-standing issues related to the PL assumption in ordered regression models. First, this article introduces a Bayesian assessment of null values as a useful alternative for testing the PL assumption. While the traditional NHST framework has reached a high level of sophistication, it suffers from several major limitations. First and foremost, testing the PL assumption under the NHST framework simply recreates part or all of Meehl’s paradox in that the null here is inherently untenable; it is unrealistic to presume that all slopes for the same predictor are exactly the same across different cut-point equations, and even a minor deviation may lead to the rejection of the PL assumption. Second, one may have to go through a stepwise procedure to find the source of the violation of the PL assumption, which may compromise the power of the test and inflate false alarms. Third, the traditional NHST framework allows one only to fail to reject, but not to accept, the null. Last but not least, under the old paradigm, one can only calculate the probability of observing the data given that the null is true, whereas what is often sought is the probability that the null is true given the data. The use of Bayesian assessment of null values through HDIs + ROPEs can alleviate or avoid some of these problems.

A second major contribution of this article is to propose a Bayesian hyperparameter nonparallel cumulative logit model, providing a fresh perspective on how to deal with possible violations of the PL assumption. Instead of having to choose between parallel and nonparallel slopes, or some partial version in between, this new model allows slopes to differ somewhat while assuming that they come from the same distribution governed by hyperparameters. Our empirical examples show promising results for this new model.

Supplemental Material

Supplemental Material, Supplemental_Material for Bayesian Approaches to Assessing the Parallel Lines Assumption in Cumulative Ordered Logit Models by Jun Xu, Shawn G. Bauldry and Andrew S. Fullerton in Sociological Methods & Research

Footnotes

Authors’ Note

An earlier version of this article was presented at the 2017 ASA Methodology Midyear Meeting in Chicago. Usual disclaimers apply.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Jun Xu

Supplemental Material

Supplementary material for this article is available online.

Notes

References

Agresti, Alan. 2010. Analysis of Ordinal Categorical Data. Hoboken, NJ: John Wiley & Sons.

Akaike, Hirotugu. 1974. “A New Look at the Statistical Model Identification.” IEEE Transactions on Automatic Control AC-19(6):716–23.

Alegría, M., D. Takeuchi, G. Canino, N. Duan, P. Shrout, X.-L. Meng, W. Vega, N. Zane, D. Vila, M. Woo, M. Vera, P. Guarnaccia, S. Aguilar-Gaxiola, S. Sue, J. Escobar, K.-M. Lin, and F. Gong. 2004. “Considering Context, Place and Culture: The National Latino and Asian American Study.” International Journal of Methods in Psychiatric Research 13:208–220.

Bliss, C. I. 1934a. “The Method of Probits.” Science 79(2037):38–39.

Bliss, C. I. 1934b. “The Method of Probits: A Correction.” Science 79(2053):409–410.

Brant, Rollin. 1990. “Assessing Proportionality in the Proportional Odds Model for Ordinal Logistic Regression.” Biometrics 46(4):1171–78.

Buse, A. 1982. “The Likelihood Ratio, Wald, and Lagrange Multiplier Tests: An Expository Note.” The American Statistician 36(3):153–57.

Carlin, Bradley P., and Thomas A. Louis. 2009. Bayesian Methods for Data Analysis. Boca Raton, FL: CRC Press of Taylor & Francis.

Carpenter, Bob, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76(1):1–32.

Cox, Christopher. 1995. “Location-scale Cumulative Odds Models for Ordinal Data: A Generalized Non-linear Model Approach.” Statistics in Medicine 14:1191–203.

Engle, Robert F. 1984. “Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics.” Pp. 775–826 in Handbook of Econometrics, II, edited by Z. Griliches and M. D. Intriligator. Amsterdam, the Netherlands: Elsevier.

Finney, D. J. 1947. Probit Analysis: A Statistical Treatment of the Sigmoid Response Curve. Cambridge, England: Cambridge University Press.

Fullerton, Andrew S. 2009. “A Conceptual Framework for Ordered Logistic Regression Models.” Sociological Methods & Research 38(2):306–347.

Fullerton, Andrew S., and Jun Xu. 2016. Ordered Regression Models: Parallel, Partial, and Non-parallel Alternatives, edited by J. Gill, S. G. Heeringa, J. S. Long, T. A. B. Snijders, and W. J. v. d. Linden. New York: Chapman & Hall/CRC.

Fullerton, Andrew S., and Jun Xu. 2018. “Constrained and Unconstrained Partial Adjacent Category Logit Models for Ordinal Response Variables.” Sociological Methods & Research 47(2):169–206.

Gelman, Andrew. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models.” Bayesian Analysis 1(3):515–34.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Boca Raton, FL: CRC Press.

Gelman, Andrew, Jennifer Hill, and Masanao Yajima. 2012. “Why We (Usually) Don’t Have to Worry About Multiple Comparisons.” Journal of Research on Educational Effectiveness 5:189–211.

Greene, William H., and David A. Hensher. 2012. Modeling Ordered Choices. New York: Cambridge University Press.

Hobbs, Brian P., and Bradley P. Carlin. 2008. “Practical Bayesian Design and Analysis for Drug and Device Clinical Trials.” Journal of Biopharmaceutical Statistics 18(1):54–80.

Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15:1593–623.

Imai, Kosuke, Gary King, and Olivia Lau. 2008. “Toward a Common Framework for Statistical Analysis and Development.” Journal of Computational and Graphical Statistics 17(4):1–22.

Kruschke, John K. 2011. Doing Bayesian Data Analysis: A Tutorial with R and Bugs. Burlington, MA: Elsevier.

Kruschke, John K. 2013. “Bayesian Estimation Supersedes the T Test.” Journal of Experimental Psychology: General 142(2):573–603.

Kruschke, John K. 2015. Doing Bayesian Data Analysis. San Diego, CA: Academic Press.

Kruschke, John K., and Torrin M. Liddell. 2017. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic Bulletin & Review: 1–29. doi: 10.3758/s13423-016-1221-4.

Lacy, Michael G. 2006. “An Explained Variation Measure for Ordinal Response Models with Comparisons to Other Ordinal R2 Measures.” Sociological Methods & Research 34(4):469–520.

Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.

McCullagh, Peter. 1980. “Regression Models for Ordinal Data.” Journal of the Royal Statistical Society. Series B (Methodological) 42(2):109–42.

McCullagh, Peter, and John Nelder. 1989. Generalized Linear Models. London, England: Chapman and Hall.

McFadden, Daniel L. 1973. “Conditional Logit Analysis of Qualitative Choice Behavior.” Pp. 105–142 in Frontiers in Econometrics, edited by P. Zarembka. New York: Academic Press.

McKelvey, Richard D., and William Zavoina. 1975. “A Statistical Model for the Analysis of Ordinal Level Dependent Variables.” The Journal of Mathematical Sociology 4(1):103–120. doi: 10.1080/0022250X.1975.9989847.

Meehl, Paul E. 1967. “Theory-Testing in Psychology and Physics: A Methodological Paradox.” Philosophy of Science 34(2):103–15.

Meehl, Paul E. 1997. “The Problem Is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions.” Pp. 395–425 in What If There Were No Significance Tests, edited by L. L. Harlow, S. A. Mulaik, and J. H. Steiger. Mahwah, NJ: Erlbaum.

O’Hagan, Anthony. 2010. Kendall’s Advanced Theory of Statistics 2B. New York: Wiley.

Powers, Daniel, and Yu Xie. 2009. Statistical Methods for Categorical Data Analysis. St. Louis, MO: Elsevier Science.

Raftery, Adrian E. 1995. “Bayesian Model Selection in Social Research.” Sociological Methodology 25:111–63.

Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” Annals of Statistics 6(2):461–64.

Seaman, John W., III, John W. Seaman Jr., and James D. Stamey. 2012. “Hidden Dangers of Specifying Noninformative Priors.” The American Statistician 66(2):77–84. doi: 10.1080/00031305.2012.695938.

Smith, Tom W., Michael Davern, Jeremy Freese, and Stephen L. Morgan. 2019. General Social Surveys (1972-2018) [machine-readable data file]. NORC ed. Chicago, IL: NORC (principal investigator: Tom W. Smith; co-principal investigators: Michael Davern, Jeremy Freese, and Stephen L. Morgan; sponsored by National Science Foundation).

Stan Development Team. 2017. “Stan Modeling Language Users Guide and Reference Manual, Version 2.17.0.” Retrieved April 11, 2017 (http://mc-stan.org).

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-one-out Cross-validation and WAIC.” Statistics and Computing 27(5):1413–32. doi: 10.1007/s11222-016-9696-4.

Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11:3571–94.

Williams, Richard. 2006. “Generalized Ordered Logit/Partial Proportional Odds Models for Ordinal Dependent Variables.” The Stata Journal 6(1):58–82.

Williams, Richard. 2016. “Understanding and Interpreting Generalized Ordered Logit Model.” Journal of Mathematical Sociology 40(1):7–20.

Winship, Christopher, and Robert Mare. 1984. “Regression Models with Ordinal Variables.” American Sociological Review 49(4):512–25.
