An Additional Measure of Overall Effect Size for Logistic Regression Models

Abstract

Users of logistic regression models often need to describe the overall predictive strength, or effect size, of the model’s predictors. Analogs of R ² have been developed, but none of these measures are interpretable on the same scale as effects of individual predictors. Furthermore, R ² analogs are not invariant to the base rate (overall proportion of successes), making it difficult to compare effect sizes across data sets. The authors propose a measure of overall effect size that is interpretable on the same scale as effects of individual predictors and is invariant to the base rate. They explore the properties of the overall odds ratio and illustrate its use through an example. They also provide interpretive guidance and illustrate how statistical software can be used to compute the proposed measure.

Keywords

logistic regression, overall effect size, overall odds ratio

Introduction

In this article, we propose an overall effect size measure for multiple logistic regression (MLOGR) models. We first discuss common measures of overall effect size: classical R ² applied to multiple linear regression (MLR) and R ² analogs applied to other generalized linear models (GLMs). We then discuss how the population variance of the model’s multiple linear predictor (MLP) represents overall effect size for GLMs and how, for MLOGR, the population overall odds ratio is a simple function of the variance of the MLP. Next, we show how to estimate the variance of the MLP (and overall odds ratio) using the sample variance of the MLP. Because this variance estimator (and overall odds ratio estimator) is generally biased upward, we propose a method to correct the overestimation problem. Then, through a simulation study, we explore the properties of the overall odds ratio estimator. An example is presented where the overall odds ratio is used to measure overall effect size of models for college retention. We provide practical guidance, including effect size interpretation and a SAS macro, for researchers who wish to use the proposed measure. We then summarize our findings and discuss directions for further research.

Measures of Overall Effect Size: Classical R ² and R ² Analogs

Under the fixed effects MLR model, E(Y) = η where $η = β_{0} + \sum_{k = 1}^{p} x_{k} β_{k}$ , η is the MLP, β₀ is an intercept parameter, (β₁, β₂, . . . β _p ) are a set of regression coefficients, and (x ₁, x ₂, . . . x _p) are observed values of predictor variables. In MLR analysis, R ² is a useful measure of how well the model’s predictor variables explain the outcome. This classical version of R ², also known as the coefficient of determination, is defined as the proportion of variation in the outcome that is explained by a set of predictor variables; hence, $R^{2} = \frac{\hat{var} (\hat{η})}{\hat{var} (Y)}$ . Classical R ² ranges from 0 to 1; as R ² increases, so does the model’s predictive strength. The multiple correlation coefficient, R, is simply the square root of R ² and is also equal to the correlation of the model’s fitted and observed values. When the observed outcome data, y, have been standardized (mean zero and standard deviation one), $R^{2} = \hat{var} (\hat{η})$ and R is the estimated change in E(Y) associated with a unit increase in the weighted sum of the predictors, where the weights (regression coefficients) reflect each predictor’s contribution. When the predictor variables and outcome variable have been standardized, R is interpretable on the same scale as the individual regression coefficients; in this sense, R represents the model’s overall regression coefficient. Given their clear interpretations and the need to describe a model’s overall predictive strength, R ² and R have gained acceptance as standard measures of the overall effect size for MLR models.
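As a quick numerical check of the variance-ratio definition, the following sketch fits an MLR model by least squares and confirms that the ratio of the sample variance of the fitted values to the sample variance of y equals the squared correlation of fitted and observed values. The simulated data, the seed, and the function name classical_r2 are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def classical_r2(X, y):
    """Classical R^2 computed as var(fitted values) / var(y) for least squares with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])     # add the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # least squares estimates
    fitted = design @ coef                             # fitted linear predictor
    return np.var(fitted, ddof=1) / np.var(y, ddof=1), fitted

# Hypothetical data: two standard-normal predictors and a normal error term.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=200)

r2, fitted = classical_r2(X, y)
r = np.corrcoef(fitted, y)[0, 1]  # multiple correlation coefficient R
```

Here r ** 2 and r2 agree up to floating-point error, which is the identity the paragraph relies on.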

MLOGR is used across all disciplines to relate a set of predictor variables, (x ₁, x ₂, . . . x _p), to a dichotomous outcome, Y. Under the fixed effects MLOGR model with a 0,1 coding for Y, E(Y) = {1 + exp(–η)}^–1 where $η = β_{0} + \sum_{k = 1}^{p} x_{k} β_{k}$ , η is the MLP, β₀ is an intercept parameter, and (β₁, β₂, . . . β_p) are a set of regression coefficients. Because the form of the MLP is the same as that used in MLR, MLOGR can be thought of as an extension of MLR for dichotomous outcomes.

Users of MLOGR have the same need as users of MLR to describe the overall effect size of models with multiple predictors. To meet this need, R ² analogs have been developed that share some properties with classical R ². Cox and Snell (1989) described a generalization of R ² for any linear model as a function of the intercept, sample size n, and likelihood of the fitted model. This R ² analog, which Menard (2000) referred to as the “unadjusted geometric mean square improvement” and called $R_{M}^{2}$ , simplifies to classical R ² for the special case of MLR but does not range from 0 to 1 in general. For models involving discrete data, such as MLOGR, $R_{M}^{2}$ achieves a maximum that is less than one. Nagelkerke (1991) proposed an adjusted version that is equal to $R_{M}^{2}$ divided by its maximum, hence ranging from zero to one. Another R ² analog that has been used in MLOGR is referred to by Menard (2000) as the “ordinary least squares” R ² (see Agresti, 1990), which we call $R_{O}^{2}$ . Similar to classical R ², $R_{O}^{2}$ is equal to the squared correlation of the model’s fitted and observed values. The “log likelihood ratio” R ² (which we call $R_{L}^{2}$ ), described by McFadden (1974), is a measure that is reported in common software packages, including SPSS and Stata. Aldrich and Nelson (1984) proposed an R ² analog that is a variant of the contingency coefficient; we call this analog $R_{C}^{2}$ . Menard (2002) provided an outstanding review of these R ² analogs as well as a few others.

Menard (2000) examined five of the R ² analogs applicable to MLOGR models, including $R_{M}^{2}$ (adjusted and unadjusted versions), $R_{O}^{2}$ , $R_{C}^{2}$ , and $R_{L}^{2}$ . Of these five measures, only $R_{L}^{2}$ appeared to have little correlation with the overall proportion of successes (the base rate)—the other R ² analogs had substantial correlations (all greater than .50) with the base rate for a sample of seven MLOGR models where the base rates varied from .01 to approximately .50. Because of this dependence on the base rate, it is not meaningful to use most R ² analogs to compare the predictive strength of two MLOGR models when the base rates differ.

Menard (2002) suggested that $R_{L}^{2}$ is the most appropriate analog for MLOGR. Shtatland, Kleinman, and Cain (2002) agreed with Menard’s suggestion. Both authors cited $R_{L}^{2}$ ’s interpretability and relative invariance to the base rate as reasons for using $R_{L}^{2}$ over other analogs. This measure is defined as $R_{L}^{2} = 1 - \frac{ln (L_{M})}{ln (L_{0})}$ where L _M is the likelihood function for the model containing all of the predictors and L ₀ is the likelihood function for the model containing only the intercept. Menard (2000) wrote that $R_{L}^{2}$ has “the most intuitively reasonable interpretation as a proportional reduction in error measure, parallel to classical R ²” (p. 24). So, the general consensus in the literature is that $R_{L}^{2}$ should be regarded as the standard R ² measure for MLOGR due to its relative invariance to the base rate and interpretability. Still, unlike classical R ², $R_{L}^{2}$ leaves users without a measure of overall effect size that is intuitive and interpretable on the same scale as the effects of individual predictors. With a classical R ² value of .49 (R =.70) for example, we can say “the model’s predictors explain 49% of the variation in the outcome” and “the mean of the outcome increases by .70 standard deviations for each one standard deviation increase in the linear predictor.” With $R_{L}^{2}$ , we are not sure of the impact of the model’s predictors on the outcome. For example, what is the meaning of $R_{L}^{2}$ = .10 in terms of changes in probability or odds? In the following section, we introduce the population variance of the MLP (PVMLP) as a measure of overall effect size for GLMs that is invariant to base rates. We describe how this measure can be converted to an overall odds ratio, which is a more intuitive measure of overall effect size.
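To make the definition of $R_{L}^{2}$ concrete, the sketch below computes it from a 0/1 outcome vector and a vector of fitted probabilities. It assumes the intercept-only model assigns every case the observed base rate (its maximum likelihood fit); the function name mcfadden_r2 and the toy data are hypothetical.

```python
import numpy as np

def mcfadden_r2(y, p_hat):
    """R_L^2 = 1 - ln(L_M) / ln(L_0) for a 0/1 outcome and fitted probabilities."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    # Log likelihood of the model containing all predictors.
    ll_model = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    # Log likelihood of the intercept-only model: every case gets the base rate.
    base_rate = y.mean()
    ll_null = np.sum(y * np.log(base_rate) + (1 - y) * np.log(1 - base_rate))
    return 1 - ll_model / ll_null

y = [0, 0, 1, 1]                                        # toy outcome (assumption)
r2_informative = mcfadden_r2(y, [0.1, 0.2, 0.8, 0.9])   # informative fitted probabilities
r2_null = mcfadden_r2(y, [0.5, 0.5, 0.5, 0.5])          # no better than the base rate
```

When the fitted probabilities equal the base rate, the two log likelihoods coincide and the measure is zero; it grows toward one as the fitted probabilities approach the observed outcomes.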

The Population Variance of the Multiple Linear Predictor

Classical R ² is defined as the estimated proportion of variation in the outcome that is explained by the model’s predictor variables. Let ρ² represent the proportion of variation in the population’s outcomes that is explained by the MLR model. Hence, $ρ^{2} = \frac{var (η)}{var (Y)}$ where var(η) is the population variance of the MLP. When var(Y) =1, ρ² = var(η) can be written in terms of the regression parameters and variances and covariances of predictor variables as follows: $ρ^{2} = var (η) = var (β_{0} + \sum_{k = 1}^{p} x_{k} β_{k}) = \sum_{k = 1}^{p} β_{k}^{2} var (x_{k}) + \sum_{k = 1}^{p - 1} \sum_{l = k + 1}^{p} 2 β_{k} β_{l} cov (x_{k}, x_{l})$ . When the predictor variables have been standardized, $ρ^{2} = \sum_{k = 1}^{p} β_{k}^{2} + \sum_{k = 1}^{p - 1} \sum_{l = k + 1}^{p} 2 β_{k} β_{l} corr (x_{k}, x_{l})$ . In this case, each regression parameter represents the change in E(Y) for each standard deviation change in x. For any constant k, because E{y|η = k + var^1/2(η)} – E{y|η = k} = k + var^1/2(η) – k = var^1/2(η) = ρ, ρ can be interpreted as the overall regression coefficient: it represents the expected (mean) change in Y for each standard deviation change in η, the weighted sum of the model’s predictors where the weights are given by the regression coefficients.
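The variance expansion used above can be written compactly as a quadratic form. The following restatement is a sketch; the vector and matrix symbols are notation assumed here for illustration and do not appear in the original.

```latex
% Assumed notation: \boldsymbol{\beta} = (\beta_1, \dots, \beta_p)^T and
% \Sigma_x is the covariance matrix of the predictors (x_1, \dots, x_p).
\begin{aligned}
var(\eta) &= var\Big(\beta_0 + \sum_{k=1}^{p} x_k \beta_k\Big)
           = \boldsymbol{\beta}^{T} \Sigma_x \boldsymbol{\beta} \\
          &= \sum_{k=1}^{p} \beta_k^{2}\, var(x_k)
           + \sum_{k=1}^{p-1} \sum_{l=k+1}^{p} 2 \beta_k \beta_l\, cov(x_k, x_l).
\end{aligned}
% With p = 2 standardized predictors this reduces to
% var(\eta) = \beta_1^2 + \beta_2^2 + 2 \beta_1 \beta_2\, corr(x_1, x_2).
```

The intercept drops out because it is a constant, and each pair of predictors contributes one cross-product term.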

GLMs are applied to outcome variables whose distribution is in the exponential family, taking the form exp{(yθ – b(θ))/a(φ) + c(y, φ)} (McCullagh & Nelder, 1989, p. 28). Given this functional form, the mean of the outcome variable Y is b′(θ) and the variance is b″(θ)a(φ) (McCullagh & Nelder, 1989, p. 29). Hence, a(φ) is the component of the variance that is independent of the mean and the quantity $\frac{var (η)}{a (φ)}$ can be described generally as the ratio of variation in Y explained by the GLM and total variation that is independent of the mean. For the case of MLR, var(Y) = a(φ) (i.e., the variance is completely independent of the mean) and so $ρ^{2} = \frac{var (η)}{a (φ)}$ . Because a(φ) =1 for MLOGR and some other GLMs (e.g., Poisson regression), a parameter analogous to ρ² is defined as var(η). This parameter, the population variance of the multiple linear predictor (PVMLP), measures the predictive strength of any GLM and is analogous to ρ² when a(φ) = 1.

var (η) = \sum_{k = 1}^{p} β_{k}^{2} + \sum_{k = 1}^{p - 1} \sum_{l = k + 1}^{p} 2 β_{k} β_{l} corr (x_{k}, x_{l}) .

(1)

In Equation 1 and for the remainder of this article, we assume that the predictor variables are standardized. Similar to ρ, var^1/2(η) can be interpreted as an overall regression coefficient for GLMs. For example, for the case of MLOGR, each regression parameter represents the change in the log odds of success for each standard deviation increase in x. And, for any constant k, because var^1/2(η) = ln {p ₁(1 – p ₁)^–1} – ln {p ₀(1 – p ₀)^–1} where p ₁ = {1 + exp(–η₁)}^–1, p ₀ = {1 + exp(–η₀)}^–1, η₁ = k + var^1/2(η), and η₀ = k; var^1/2(η) represents the change in the log odds of success for each standard deviation increase in the MLP, η. Generally, regression coefficients from MLOGR are difficult to interpret because they represent the change in the log odds of success. The preferred logistic regression effect size measure for individual predictors seems to vary across disciplines. Marginal effects and elasticities (Greene, 2002) are often used in economics and other disciplines to express effect sizes of individual predictors. More commonly, in education, psychology, and many other areas, it is recommended that the odds ratios associated with individual predictors, given by exp(β), are reported (Peng, Lee, & Ingersoll, 2002). Accordingly, when reporting overall effect size, it makes sense to report an overall odds ratio. For MLOGR, exp {var^1/2(η)} represents the population overall odds ratio (OOR) because it is the odds ratio associated with each standard deviation increase in η. Because the odds ratio is commonly used and is well understood as an effect size measure, the OOR leads to a more intuitive understanding of effect size than that offered by R ² analogs. In the section that follows, we introduce the estimator for the PVMLP, the sample variance of the MLP (SVMLP).
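As a worked illustration of how the population OOR follows from Equation 1, the sketch below evaluates exp{var^1/2(η)} for standardized predictors that share a common pairwise correlation. The function names and the choice of an equicorrelation structure are assumptions made for the example.

```python
import numpy as np

def equicorrelation(p, rho):
    """p x p correlation matrix with every off-diagonal entry equal to rho."""
    return np.full((p, p), rho) + (1 - rho) * np.eye(p)

def population_oor(beta, corr):
    """Overall odds ratio exp{sqrt(var(eta))} for standardized predictors."""
    beta = np.asarray(beta, dtype=float)
    corr = np.asarray(corr, dtype=float)
    var_eta = beta @ corr @ beta  # Equation 1 written as a quadratic form
    return np.exp(np.sqrt(var_eta))

# Each predictor has an odds ratio of 2 (beta = ln 2) and pairwise correlation .50.
oor_p2 = population_oor([np.log(2)] * 2, equicorrelation(2, 0.5))
oor_p5 = population_oor([np.log(2)] * 5, equicorrelation(5, 0.5))
```

Under these assumptions the two values are approximately 3.32 and 14.65, which coincide with the target parameters quoted in the simulation results later in the article.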

The Sample Variance of the Multiple Linear Predictor

An estimator for var(η) (the PVMLP) can be obtained by substituting sample estimates for the population parameters in Equation 1 as follows: $\hat{var} (η) = \sum_{k = 1}^{p} {\hat{β}}_{k}^{2} + \sum_{k = 1}^{p - 1} \sum_{l = k + 1}^{p} 2 {\hat{β}}_{k} {\hat{β}}_{l} r_{k l}$ (2), where (β̂ ₁, β̂ ₂, . . . β̂_p) are the maximum likelihood estimates of the regression parameters and r _kl is the sample correlation of the kth and lth predictor variables.

Classical R ² is a special case of the SVMLP; it is given by Equation 2 when y is scaled to have variance of one. Analogous to R, ${\hat{var}}^{1 / 2} (η)$ is the estimate of the overall regression coefficient for GLMs with a(φ) =1. Using Equation 2, the SVMLP can be calculated with the estimated regression coefficients and pairwise correlations of predictor variables. Alternatively, the SVMLP is simply the sample variance of the fitted values of the MLP, η̂.

For the single predictor logistic regression model, $exp {{\hat{var}}^{1 / 2} (η)}$ is equal to the estimated odds ratio associated with a one standard deviation increase in the predictor variable. For MLOGR, $exp {{\hat{var}}^{1 / 2} (η)}$ is the estimated odds ratio associated with a one standard deviation increase in the weighted sum of the model predictors, where the weights (estimated regression coefficients) reflect each predictor’s relative importance. In this case, $exp {{\hat{var}}^{1 / 2} (η)}$ represents the model’s estimated overall odds ratio. If an MLOGR user is accustomed to interpreting the regression coefficients (as opposed to odds ratios), they might choose to report ${\hat{var}}^{1 / 2} (η)$ instead of $exp {{\hat{var}}^{1 / 2} (η)}$ (the estimated OOR). Table 1 gives the interpretations of selected effect size measures when the outcome has a normal or Bernoulli distribution and standard link functions are used. Importantly, neither the OOR nor the SVMLP can be interpreted as measures of proportional reduction in error, unlike R ² and $R_{L}^{2}$ .
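The two routes to the SVMLP described above (Equation 2, or the sample variance of the fitted MLP) can be compared numerically. The sketch below uses hypothetical coefficient estimates rather than a fitted model, and simulated predictors standardized with the n – 1 divisor; all names and values are assumptions for illustration.

```python
import numpy as np

def svmlp_from_coefficients(beta_hat, X_std):
    """Equation 2: coefficients combined with the sample correlation matrix."""
    r = np.corrcoef(X_std, rowvar=False)
    return beta_hat @ r @ beta_hat

def svmlp_from_fitted(beta_hat, X_std, intercept=0.0):
    """Sample variance of the fitted values of the multiple linear predictor."""
    eta_hat = intercept + X_std @ beta_hat
    return np.var(eta_hat, ddof=1)

# Hypothetical correlated predictors, standardized to mean 0 and SD 1.
rng = np.random.default_rng(7)
raw = rng.normal(size=(300, 3))
raw[:, 1] += 0.6 * raw[:, 0]
X_std = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

beta_hat = np.array([0.40, -0.25, 0.10])  # hypothetical estimates, not from the article
v_eq2 = svmlp_from_coefficients(beta_hat, X_std)
v_fit = svmlp_from_fitted(beta_hat, X_std, intercept=-1.2)
estimated_oor = np.exp(np.sqrt(v_fit))    # estimated overall odds ratio
```

The two variance values agree up to rounding, and the intercept has no influence because adding a constant does not change a variance.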

Bias Correction for the SVMLP

Generally, the expected value of $\hat{var} (η)$ (the SVMLP) is greater than var(η) (the PVMLP). The SVMLP’s overestimation problem, resulting from the fact that estimated regression models optimally fit the sample with which they are estimated, is well known to users of MLR and classical R ². This bias is most prominent when the number of predictors is large relative to n. A simple adjustment to R ² corrects much of this overestimation problem (Draper & Smith, 1998, p. 140; Wherry, 1931):

R_{a d j}^{2} = 1 - \frac{(n - 1) (1 - R^{2})}{n - p - 1} .

In this expression, p represents the number of regression coefficients, not including the intercept. The overestimation problem has also been documented for R ² analogs, and Liao and McGee (2003) proposed adjustment methods for $R_{L}^{2}$ and $R_{O}^{2}$ .
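The adjustment is simple to compute directly. This short sketch evaluates the formula for the same R ² under a small and a large sample; the function name adjusted_r2 and the example values are illustrative assumptions.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (n - 1)(1 - R^2) / (n - p - 1); p excludes the intercept."""
    if n - p - 1 <= 0:
        raise ValueError("n must exceed p + 1 for the adjustment to be defined")
    return 1 - (n - 1) * (1 - r2) / (n - p - 1)

# Same unadjusted R^2 of .49 with 5 predictors, at two sample sizes.
adj_small = adjusted_r2(0.49, n=30, p=5)    # noticeable shrinkage
adj_large = adjusted_r2(0.49, n=3000, p=5)  # almost no shrinkage
```

The shrinkage is largest when p is large relative to n, which mirrors the pattern of overestimation described for the SVMLP.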

To quantify the SVMLP’s bias, we first consider the expected value of the SVMLP for the special case of uncorrelated predictor variables (r _kl = 0 for all k ≠ l): $E \{\hat{var} (η)\} = E (\sum_{k = 1}^{p} {\hat{β}}_{k}^{2} + \sum_{k = 1}^{p - 1} \sum_{l = k + 1}^{p} 2 {\hat{β}}_{k} {\hat{β}}_{l} r_{k l}) = \sum_{k = 1}^{p} E ({\hat{β}}_{k}^{2}) + 2 \sum_{k = 1}^{p - 1} \sum_{l = k + 1}^{p} E ({\hat{β}}_{k} {\hat{β}}_{l} r_{k l})$ where $E ({\hat{β}}_{k}^{2}) = var ({\hat{β}}_{k}) + E^{2} ({\hat{β}}_{k})$ . In general, $E ({\hat{β}}_{k}) \neq β_{k}$ ; that is, maximum likelihood estimators of regression coefficients from nonlinear regression models are biased (McCullagh & Nelder, 1989). Cordeiro and McCullagh (1991) proposed an estimator of this bias to order n ^–1 as $\hat{bias} (\hat{β}) = - \frac{1}{2 φ} (X^{T} W X)^{- 1} X^{T} Z F \underline{1}$ where φ is the dispersion parameter (φ =1 for logistic regression), X is the design matrix (with p +1 columns, including the intercept), W and F are n ×n diagonal matrices whose elements depend on the distribution of the outcome and the GLM link function used, Z = diag {X(X ^T WX)^–1 X ^T } , and 1 is an n ×1 vector of 1s. For the case of MLOGR, W _i,i ={1+ exp(η̂ _i)}^–2 exp(η̂ _i) and F _i,i ={1+ exp(η̂ _i)}^–4 {exp(η̂ _i)– exp(3η̂ _i)} (see Cordeiro & McCullagh, 1991, for more general formulas).
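The bias expression above translates almost line for line into matrix code. The following is a minimal sketch for the logistic case with φ = 1, using the W and F elements given in the text; the function name and the example inputs (a simulated design matrix and a linear predictor built from hypothetical coefficients rather than a fitted model) are assumptions.

```python
import numpy as np

def cordeiro_mccullagh_bias(X, eta_hat):
    """Order 1/n bias of logistic ML coefficients: -(1/2) (X'WX)^-1 X' Z F 1.

    X includes the intercept column; eta_hat is the fitted linear predictor.
    """
    e = np.exp(eta_hat)
    w = e / (1 + e) ** 2                          # diagonal of W
    f = (e - np.exp(3 * eta_hat)) / (1 + e) ** 4  # diagonal of F
    xtwx_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    z = np.sum((X @ xtwx_inv) * X, axis=1)        # diagonal of X (X'WX)^-1 X'
    return -0.5 * xtwx_inv @ X.T @ (z * f)

# Hypothetical inputs for illustration only.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
eta_hat = X @ np.array([-0.5, 0.8, 0.3])
bias = cordeiro_mccullagh_bias(X, eta_hat)        # one entry per column of X
```

Because W, F, and Z are diagonal, they are stored as vectors and multiplied elementwise, which avoids forming n × n matrices.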

Using Cordeiro and McCullagh’s (1991) bias approximation, we have $E ({\hat{β}}_{k}) ≅ β_{k} + \hat{bias} ({\hat{β}}_{k})$ and $E \{\hat{var} (η)\} ≅ \sum_{k = 1}^{p} \{var ({\hat{β}}_{k}) + E^{2} ({\hat{β}}_{k})\} + 2 \sum_{k = 1}^{p - 1} \sum_{l = k + 1}^{p} E ({\hat{β}}_{k} {\hat{β}}_{l} r_{k l}) = \sum_{k = 1}^{p} [var ({\hat{β}}_{k}) + \{β_{k} + \hat{bias} ({\hat{β}}_{k})\}^{2}] + 2 \sum_{k = 1}^{p - 1} \sum_{l = k + 1}^{p} E ({\hat{β}}_{k} {\hat{β}}_{l} r_{k l})$ .

Now, because we have assumed that the predictor variables are uncorrelated, we make the simplifying assumption that $\sum_{l = k + 1}^{p} E ({\hat{β}}_{k} {\hat{β}}_{l} r_{k l}) = 0$ for all values of k. Therefore, $E \{\hat{var} (η)\} ≅ \sum_{k = 1}^{p} [var ({\hat{β}}_{k}) + \{β_{k} + \hat{bias} ({\hat{β}}_{k})\}^{2}] = \sum_{k = 1}^{p} \{var ({\hat{β}}_{k}) + β_{k}^{2} + 2 β_{k} \hat{bias} ({\hat{β}}_{k}) + {\hat{bias}}^{2} ({\hat{β}}_{k})\}$ . Because the PVMLP is $\sum_{k = 1}^{p} β_{k}^{2}$ , the bias in estimating this parameter is approximately equal to $\sum_{k = 1}^{p} \{var ({\hat{β}}_{k}) + 2 β_{k} \hat{bias} ({\hat{β}}_{k}) + {\hat{bias}}^{2} ({\hat{β}}_{k})\}$ . As expected, each term in this expression approaches 0 as n approaches infinity. Our proposed estimator for the bias is then obtained by replacing the unknown parameters with sample estimates:

\hat{E} \{\hat{var} (η) - var (η)\} = \sum_{k = 1}^{p} \{\hat{var} ({\hat{β}}_{k}) + 2 {\hat{β}}_{k} \hat{bias} ({\hat{β}}_{k}) + {\hat{bias}}^{2} ({\hat{β}}_{k})\} .

(3)

This bias estimator gives rise to a method of adjusting the SVMLP to correct for bias:

In developing our bias estimator, we assumed that the predictor variables were uncorrelated. Therefore, when estimating the bias, the columns of the design matrix X (the matrix should have p columns—one corresponding to each predictor variable) should be orthogonalized so that the pairwise correlations of predictors are forced to zero. This can be done via Gram-Schmidt orthogonalization (most matrix algebra textbooks describe this procedure; for example, see Horn & Johnson, 1990, pp. 15–16). The result of the orthogonalization is that a linear transformation will be applied to the predictor variables in such a way that the pairwise correlations of the transformed variables will be 0.

Standardize the transformed predictor variables.

Fit the MLOGR model using the orthogonalized and standardized predictor variables and obtain estimated regression coefficients ${\hat{β}}_{1}^{*}, {\hat{β}}_{2}^{*}, \dots, {\hat{β}}_{p}^{*}$ and estimated variances of the estimated regression coefficients $\hat{var} ({\hat{β}}_{1}^{*}), \hat{var} ({\hat{β}}_{2}^{*}), \dots, \hat{var} ({\hat{β}}_{p}^{*})$ . Unlike the SVMLP, the regression coefficients associated with individual predictors are not invariant to linear transformations of the predictor variables. Therefore, interpretations of individual effects should be based on regression coefficients obtained before the orthogonalizing transformations.

Compute the unadjusted SVMLP as $\hat{var} (η) = \sum_{k = 1}^{p} {\hat{β}}_{k}^{2} + \sum_{k = 1}^{p - 1} \sum_{l = k + 1}^{p} 2 {\hat{β}}_{k} {\hat{β}}_{l} r_{k l}$ or equivalently, as $\hat{var} (η) = \sum_{k = 1}^{p} {\hat{β}}_{k}^{* 2}$ . Then, compute the adjusted SVMLP as $\hat{var} (η)_{a d j} = \sum_{k = 1}^{p} \{{\hat{β}}_{k}^{* 2} - \hat{var} ({\hat{β}}_{k}^{*}) - 2 {\hat{β}}_{k}^{*} \hat{bias} ({\hat{β}}_{k}^{*}) - {\hat{bias}}^{2} ({\hat{β}}_{k}^{*})\}$ .

When the number of predictors is large relative to n, it’s possible for the bias estimate to be greater than the unadjusted SVMLP so that the adjusted SVMLP is less than zero. In this case, the adjusted SVMLP should be set to zero, suggesting no overall effect of the predictors.
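The adjustment steps described above can be sketched end to end as follows. This is an illustrative reading of the procedure, not the authors' SAS macro: it uses a QR decomposition in place of explicit Gram-Schmidt orthogonalization (both yield orthogonal columns spanning the same space), a plain Newton-Raphson fit, and simulated data. All function and variable names are assumptions.

```python
import numpy as np

def fit_logistic(Z, y, n_iter=25):
    """Newton-Raphson ML fit; returns coefficients (intercept first), inverse information, design, fitted p."""
    design = np.column_stack([np.ones(len(y)), Z])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-design @ beta))
        w = p * (1 - p)
        info_inv = np.linalg.inv(design.T @ (w[:, None] * design))
        beta = beta + info_inv @ design.T @ (y - p)
    p = 1 / (1 + np.exp(-design @ beta))
    w = p * (1 - p)
    info_inv = np.linalg.inv(design.T @ (w[:, None] * design))
    return beta, info_inv, design, p

def adjusted_svmlp(X, y):
    # Orthogonalize the centered predictors (QR stands in for Gram-Schmidt).
    q, _ = np.linalg.qr(X - X.mean(axis=0))
    # Standardize the transformed predictors.
    Z = q / q.std(axis=0, ddof=1)
    # Fit the model and collect coefficient variances.
    beta, info_inv, design, p = fit_logistic(Z, y)
    var_beta = np.diag(info_inv)
    # Cordeiro-McCullagh bias with phi = 1; f equals {exp(eta) - exp(3 eta)} / {1 + exp(eta)}^4.
    w = p * (1 - p)
    f = w * (1 - 2 * p)
    z = np.sum((design @ info_inv) * design, axis=1)
    bias = -0.5 * info_inv @ design.T @ (z * f)
    # Unadjusted and adjusted SVMLP over the p slopes (intercept excluded).
    b, v, d = beta[1:], var_beta[1:], bias[1:]
    unadjusted = np.sum(b ** 2)
    adjusted = max(0.0, np.sum(b ** 2 - v - 2 * b * d - d ** 2))  # truncate at zero
    return unadjusted, adjusted, Z, beta

# Hypothetical data: two correlated predictors and a moderate effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:, 1] += 0.5 * X[:, 0]
eta = -0.5 + 0.7 * X[:, 0] + 0.4 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-eta))).astype(float)

unadjusted, adjusted, Z, beta = adjusted_svmlp(X, y)
oor_unadjusted = np.exp(np.sqrt(unadjusted))
oor_adjusted = np.exp(np.sqrt(adjusted))
```

Because the transformed predictors are uncorrelated with unit variance, the sum of squared coefficients equals the sample variance of the fitted MLP, so the unadjusted value is the same SVMLP that would be obtained from the original predictors.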

Our adjustment method is designed to nullify the bias of the SVMLP. This adjustment will not necessarily eliminate the bias of effect size measures that are functions of the SVMLP. For example, $E \{\hat{var} (η)\} = var (η)$ does not imply that $E [exp \{{\hat{var}}^{1 / 2} (η)\}] = exp \{{var}^{1 / 2} (η)\}$ (the OOR estimator may be biased, even if the SVMLP is unbiased). Still, we expect that the bias of the OOR estimator will be substantially reduced by first nullifying the bias of the SVMLP.

Behavior of Overall Odds Ratio Estimator

Through a simulation study, we examined the behavior of the OOR estimators (adjusted and unadjusted) for MLOGR models. For notational simplicity, we label the OOR parameter and statistic as θ and θ̂ where θ = exp {var^1/2(η)} and $\hat{θ} = exp \{{\hat{var}}^{1 / 2} (η)\}$ . Across simulations, we varied the following factors: sample size n, number of predictors (not including the intercept) p, values of regression parameters (intercept β₀ and effects of individual predictors β), and pairwise correlations of predictor variables ρ. We considered three levels of sample size: n =50, 200, and 10,000; two levels for number of predictors: p =2 and 5; three levels of β₀: –2, 0, and 2 (base rates of approximately .12, .50, and .88); two levels of β: 0 (no association) and ln (2) (odds ratio of 2 corresponding to each predictor); and two levels of ρ: 0 and .50. By varying these factors, we observed the behavior of the OOR estimators under a wide variety of conditions, which should reveal their important features. For each simulation, the predictor variables were normally distributed with mean 0 and standard deviation 1. Data sets were created under each simulation condition with pseudo-random number generation using SAS/IML software (SAS, 2004a). The MLOGR models were fit using maximum likelihood estimation. A full factorial design would have entailed 72 conditions. Because of MLOGR’s instability when the sample size is small and the base rate is close to 0 or 1, we considered only 4 conditions where n =50. Hence, we varied the simulation conditions by using a partial factorial design with a total of 52 conditions. To conserve space, we do not list the simulation results for each condition but only those discussed in the forthcoming sections.
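For readers who want to reproduce the flavor of one simulation condition, the sketch below generates equicorrelated standard-normal predictors and a Bernoulli outcome from the logistic model. It uses NumPy rather than the SAS/IML software used in the study, and the function name, seed, and sample size are assumptions chosen for illustration.

```python
import numpy as np

def simulate_condition(n, p, beta0, beta, rho, rng):
    """Draw predictors with pairwise correlation rho and a 0/1 outcome from the MLOGR model."""
    corr = np.full((p, p), rho) + (1 - rho) * np.eye(p)  # equicorrelation matrix
    X = rng.multivariate_normal(np.zeros(p), corr, size=n)
    eta = beta0 + X @ np.full(p, beta)                    # multiple linear predictor
    y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
    return X, y

# One null-effect condition: beta0 = -2, beta = 0, rho = .50 (large n to show the base rate).
rng = np.random.default_rng(2024)
X, y = simulate_condition(n=200_000, p=2, beta0=-2.0, beta=0.0, rho=0.5, rng=rng)
base_rate = y.mean()
sample_rho = np.corrcoef(X, rowvar=False)[0, 1]
```

With β = 0 the base rate is governed by the intercept alone, so it settles near {1 + exp(2)}^–1, or about .12, in line with the approximate base rates listed above.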

For each simulation condition, we monitored statistics across 1,000 simulated data sets including the overall base rate (y̿) and the mean and median of the unadjusted OOR estimates ( $\bar{\hat{θ}}$ and θ̂⁵⁰, respectively) and adjusted OOR estimates ( ${\bar{\hat{θ}}}_{a d j}$ and ${\hat{θ}}_{a d j}^{50}$ , respectively). We also monitored the square root of the mean of the squared differences of the OOR estimates and target parameter, namely,

RMSE = \{\frac{1}{1,000} \sum_{s = 1}^{1,000} (θ - {\hat{θ}}_{s})^{2}\}^{1 / 2} and ADJ.RMSE = \{\frac{1}{1,000} \sum_{s = 1}^{1,000} (θ - {\hat{θ}}_{a d j, s})^{2}\}^{1 / 2} .
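The RMSE defined above is the square root of the average squared distance between the estimates and the target parameter. A minimal sketch, with a hypothetical function name and toy inputs:

```python
import numpy as np

def rmse(theta, estimates):
    """Root mean squared error of a set of estimates around the target parameter theta."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((theta - estimates) ** 2))

# Toy illustration: two estimates that each miss a target of 2.0 by exactly 1.0.
example = rmse(2.0, [1.0, 3.0])
```

The same function applies to the adjusted estimates; only the vector of estimates changes.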

In addition to studying the behavior of the OOR estimators, we also studied their relationship to the preferred R ² analog, the “log likelihood ratio” R ² ( $R_{L}^{2}$ ), as well as the “ordinary least squares” R ² ( $R_{O}^{2}$ ), which is known to vary with the base rate. For each simulation condition, we calculated the means of the two R ² analogs ( ${\bar{R}}_{O}^{2}$ and ${\bar{R}}_{L}^{2}$ ) and the correlation between the unadjusted OOR estimates and each R ² analog, $r_{\hat{θ}, R_{O}^{2}}$ and $r_{\hat{θ}, R_{L}^{2}}$ .

Table 2 reports the simulation results examining the behavior of the adjusted and unadjusted OOR estimators and examining the relationship of the unadjusted OOR estimator with the R ² analogs. We see that the unadjusted OOR estimator is severely biased when the sample size is small. The bias is most severe for n =50 (compare $\bar{\hat{θ}}$ to θ for conditions 5 and 7 and for conditions 4, 6, and 8), and the relative bias appears to be larger when the predictors are correlated (see $\bar{\hat{θ}}$ for conditions 4 and 5). The adjustment method corrects for much of the bias. For example, we see in condition 1 that the mean of the unadjusted OOR estimator is 3.62, the mean of the adjusted OOR estimator is 3.29, and the target parameter is 3.32. The reduction in bias is also apparent when comparing the root mean square errors (RMSEs). The adjusted RMSE is consistently smaller than the unadjusted RMSE, with the most striking improvements seen for smaller sample sizes and larger effect sizes.

The OOR estimators are especially imprecise with small sample size, large number of predictors, and large effect sizes. In condition 13, the population OOR is 14.65, indicating that the odds of success increase by a factor of 14.65 for each standard deviation increase in the MLP. The sample size is just 200, and so we see considerable bias in the unadjusted OOR estimator. The mean of the adjusted OOR is 15.49 and the adjusted RMSE is 13.46, suggesting that the adjusted OOR has little bias but is imprecise when the sample size is small.

The median of the adjusted OOR estimator tends to be smaller than the population OOR when the sample size is small and the effect size is large (see ${\hat{θ}}_{a d j}^{50}$ for conditions 12 and 13). In this case, the adjustment method reduces bias but also shrinks the estimated OOR in such a way that it is usually less than the target parameter. This does not appear to be a problem when the sample size is large or when the effect sizes are null (see ${\hat{θ}}_{a d j}^{50}$ for conditions 10 and 14). When the effect sizes are null, the adjusted OOR estimator will usually take the value 1.0, indicating no effect. The unadjusted OOR estimator, on the other hand, is always greater than 1.0 and may take large values, even when the effect sizes are null. For example, in condition 3 (n =50, β =0), the median of the unadjusted OOR estimates is 1.43 while the target parameter is 1.0.

The adjusted and unadjusted OOR estimators tend to be less biased as the base rate approaches .5. Conditions 11, 14, and 15 are the same except for the intercept β₀—which varies as −2, 0, and 2. We see that the RMSE is the smallest for β₀ = 0 (.41 compared to .73 and .74), which corresponds to a base rate of approximately .5. Similarly, the adjusted RMSE is the smallest for β₀ = 0 (.18 compared to .30 and .31).

From Table 2, we see that the unadjusted OOR estimator has a strong linear relationship with the R ² analogs within each simulation condition. The exception to this is condition 13, where the sample size is small and the true overall effect size is extremely large. In this case, the estimated OOR values are likely to vary wildly (see the large RMSE and adjusted RMSE) while the R ² analogs are constrained to the [0, 1] interval. Typically, the correlations of the OOR estimates and R ² analogs are greater than .9. The correlations with the OOR estimator are consistently higher for $R_{L}^{2}$ relative to $R_{O}^{2}$ . Furthermore, the OOR estimator is highly correlated with the R ² analogs across simulation conditions: Over the 52 simulation conditions, the correlation of the mean of the OOR estimates and the mean of the $R_{O}^{2}$ estimates was .92. The correlation of the mean of the OOR estimates and the mean of the $R_{L}^{2}$ estimates was .93.

While the OOR is invariant to the base rate, our simulations confirm Menard’s (2000) finding that $R_{O}^{2}$ depends on the base rate. Conditions 2, 9, and 10 vary only with respect to the intercept β₀—which varies as –2, 0, and 2. We see that the mean of $R_{O}^{2}$ varies as .167, .223, and .166 while the mean of the adjusted OOR estimator stays almost constant at 3.32, 3.33, and 3.32. We also see that the mean of $R_{L}^{2}$ varies with the base rate as .169, .178, and .169. In this article’s Discussion, we will more directly address $R_{L}^{2}$ and its dependence on the base rate.

Example: Measuring the Incremental Validity of Psychosocial Factors in Predicting College Retention

Robbins, Allen, Casillas, Peterson, and Le (2006) studied the degree to which measures of psychosocial factors predicted first-year college outcomes. They conducted a large-scale study where 48 institutions and 14,464 incoming college students were sampled. The goal of the study was to measure the incremental predictive validity of the Student Readiness Inventory (SRI), an assessment given to students that measures certain psychosocial factors that may be related to success in college. Robbins et al. measured the incremental predictive validity of the SRI by modeling college retention with SRI scores after first adjusting for postsecondary institution, high school grade point average (GPA), and standardized achievement test scores. The SRI consists of 10 individual scales, each measuring a psychosocial factor possibly related to success in college: Academic Discipline, Academic Self-Confidence, Commitment to College, Communication Skills, Emotional Control, Goal Striving, General Determination, Study Skills, Social Activity, and Social Connection.

A series of MLOGR models was fit that measured the incremental validity of the SRI for predicting college retention. The final model in the series contained certain SRI scale scores, standardized achievement test scores, high school GPA, and postsecondary institution as predictors. Each prior model was nested within the final model so that the increase in the model’s overall effect size could be assessed for each group of predictors. In Table 3, we present overall effect size estimates for the nested MLOGR models. The sample represented in Table 3 includes students in the SRI national validity study who enrolled at 4-year institutions (see Robbins et al., 2006, for details). The sample sizes for the models were 6,817 for retention after the first semester and 7,554 for retention after the first year. In Table 3, we present the estimated OOR (unadjusted and adjusted), $R_{O}^{2}$ , and $R_{L}^{2}$ .

For retention after the first semester, the base rate (overall proportion of students who dropped out) was .10; for retention after the first year, the base rate was .27. Although the adjusted OOR is nearly the same for retention after the first semester and retention after the first year (2.13 and 2.09, respectively), we see that the $R_{O}^{2}$ values are quite different (.057 and .089, respectively). As previously discussed, it is misleading to use $R_{O}^{2}$ to compare models where the base rates are different. For this example, we see that $R_{O}^{2}$ increases with the base rate but that the increase does not indicate an increase in the model’s overall effect size. Similar to the OOR, $R_{L}^{2}$ is about the same for the two outcomes (.080 and .079, respectively). Unlike the OOR, $R_{L}^{2}$ does not allow us to assess the model’s effect size in terms of a common measure.

For retention after the first year, the estimated OOR of the final model was 2.14. Therefore, the odds of a student returning after the first year are estimated to be 2.14 times higher for each one standard deviation increase in the model’s MLP. The model used 29 predictor variables (23 for institution, 1 for each additional predictor), not including the intercept, so there was concern for an inflated OOR estimate. However, due to the large sample size (n =7,554), we see that the adjusted OOR is not drastically different from the unadjusted OOR. We find that the model’s overall effect size is improved with the addition of SRI scores, as the unadjusted and adjusted OOR increase by 0.13.

For MLOGR, the OOR estimate can be compared to the standardized odds ratios of the individual predictors. In Table 4, we see that the odds of retention after the first year increase with increases in ACT Composite Score, high school GPA, and the SRI Academic Discipline, Commitment to College, and Social Connection scales. The odds of retention decrease with increases in Social Activity. The predictor institution is a nominal variable with 24 categories (24 4-year institutions are represented in this sample). It’s cumbersome to inspect the individual odds ratios associated with each category (institution) relative to a reference category. However, the OOR estimate allows us to see the overall impact of using institution as a predictor variable: From Table 3, we see that the OOR estimate was 1.76 when institution was the only effect in the model. This implies that institutional variation in retention rates explains much of the overall variation in retention rates.

Practical Guidelines for Using the Overall Odds Ratio

Qualitative Effect Size Interpretations

What value of OOR represents a “small,” “medium,” or “large” overall effect size? Cohen (1988) suggested qualitative descriptors of effect sizes that are commonly used in the behavioral sciences. For example, for r (Pearson’s correlation), a value of .10 is labeled small, .30 is medium, and .50 is large (Cohen, 1988, pp. 79–81). Rosenthal (1996) and Valentine and Cooper (2003) discussed the dangers of using these qualitative descriptors of effect size. Both authors cited the fact that large effect sizes are common in the physical sciences but rare in the behavioral sciences, and it is therefore misleading to have universal criteria for effect sizes. Rosenthal wrote, “Qualitative descriptors, to be useful, should be grounded in the context of social science where associations, in a strict mathematical sense, tend to be weak” (p. 42). Furthermore, Rosenthal suggested that qualitative interpretation “should be tempered and adjusted by sub-field and context” (p. 43). Following Rosenthal’s advice, we should compare the adjusted OOR estimates obtained for our example of MLOGR for college retention (2.13 and 2.09) to OOR values obtained in previous empirical studies of college retention. Of course, this is not yet possible as the OOR has not been used previously.

Even with the aforementioned limitations of effect size benchmarks, we’d like to have some idea of what represents a “small,” “medium,” or “large” OOR. In an attempt to establish these benchmarks, we first related the standardized odds ratio to ρ (the correlation coefficient). Then, by extension, we related the OOR to R ². Our approach, described in detail below, was to find the odds ratio that results when a continuous variable Y has correlation ρ with a predictor X and is transformed to a dichotomous variable $Y^{*}$ at some cutoff point c.

First, we assume that random variables X and Y have a bivariate normal distribution, with means of zero and standard deviations of one. Then, let $Y^{*} = 1$ if Y ≥ c and $Y^{*} = 0$ otherwise. Considering $Y^{*}$ as the dichotomous outcome variable and X the predictor variable, the standardized odds ratio (contrasting X = 1 to X = 0) is

$OR = \frac{Pr (Y \geq c | X = 1)}{1 - Pr (Y \geq c | X = 1)} \left[ \frac{Pr (Y \geq c | X = 0)}{1 - Pr (Y \geq c | X = 0)} \right]^{- 1}$

where

$Pr (Y \geq c | X = x) = Pr \left( Z \geq \frac{c - ρ x}{\sqrt{1 - ρ^{2}}} \right) = 1 - Φ \left( \frac{c - ρ x}{\sqrt{1 - ρ^{2}}} \right)$

and where Φ is the normal c.d.f. and ρ is the correlation of X and Y. Hence, the odds ratio is related to the correlation as in Equation 4.

$OR = \frac{\left[ 1 - Φ \left( \frac{c - ρ}{\sqrt{1 - ρ^{2}}} \right) \right] Φ \left( \frac{c}{\sqrt{1 - ρ^{2}}} \right)}{Φ \left( \frac{c - ρ}{\sqrt{1 - ρ^{2}}} \right) \left[ 1 - Φ \left( \frac{c}{\sqrt{1 - ρ^{2}}} \right) \right]}$ (4)

For example, if c = 0 (i.e., the mean of $Y^{*}$ is .5), then

$OR = \frac{1 - Φ \left( \frac{- ρ}{\sqrt{1 - ρ^{2}}} \right)}{Φ \left( \frac{- ρ}{\sqrt{1 - ρ^{2}}} \right)}$

represents the odds ratio corresponding to a standard deviation increase in X, relative to X = 0. So, in this case, r = .10 (“small” effect size, Cohen, 1988, p. 79) would relate to

$OR = \frac{1 - Φ \left( \frac{- .10}{\sqrt{1 - {.10}^{2}}} \right)}{Φ \left( \frac{- .10}{\sqrt{1 - {.10}^{2}}} \right)} = \frac{1 - Φ (- .1005)}{Φ (- .1005)} = \frac{1 - .46}{.46} = 1.174$.

Similarly, r = .30 (“medium” effect size) would relate to OR = 1.656, and r = .50 (“large” effect size) would relate to OR = 2.548. Cohen (1988) suggested that R ² (R) values of .0196 (.1400), .1304 (.3612), and .2593 (.5092) represent “small,” “medium,” and “large” effect sizes for MLR (pp. 413–414). By substituting Cohen’s R benchmarks for ρ in Equation 4, we can obtain effect size benchmarks for the OOR. For the case of c = 0, the OOR benchmarks are 1.253 (“small”), 1.863 (“medium”), and 2.609 (“large”). In Table 5, we give the OOR benchmarks corresponding to other values of c (i.e., other values of the base rate).
Unfortunately, as seen in Table 5, the OOR benchmarks vary considerably with c. Hence, our proposed method of relating the OOR to R ² results in effect size benchmarks that change with the base rate. In the example, we obtained an adjusted OOR estimate of 2.09 for retention after the first year, with a base rate of .27. Using Equation 4 with c = .60 (base rate = .27), the effect size benchmarks are 1.264, 1.889, and 2.634. Hence, using this method of relating the OOR to R ², we would conclude that the OOR of 2.09 lies between a “medium” and “large” effect size.
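
As a rough numerical check of Equation 4, the following Python sketch evaluates the odds ratio for a given ρ and cutoff c under the bivariate normal setup described above. The function names `normal_cdf` and `odds_ratio_from_correlation` are illustrative choices and are not part of the SAS macro in the appendix.

```python
import math


def normal_cdf(z):
    """Standard normal c.d.f. (Phi), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def odds_ratio_from_correlation(rho, c=0.0):
    """Odds ratio of Equation 4 for correlation rho and cutoff c.

    Assumes X and Y are standard bivariate normal and Y* = 1 when Y >= c.
    """
    scale = math.sqrt(1.0 - rho ** 2)
    p1 = 1.0 - normal_cdf((c - rho) / scale)  # Pr(Y >= c | X = 1)
    p0 = 1.0 - normal_cdf(c / scale)          # Pr(Y >= c | X = 0)
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


# Cohen's R benchmarks (small, medium, large) evaluated at two cutoffs.
for c in (0.0, 0.60):
    values = [odds_ratio_from_correlation(r, c) for r in (0.1400, 0.3612, 0.5092)]
    print(c, [round(v, 3) for v in values])
```

Up to rounding, the values printed for c = 0 and c = .60 should agree with the benchmarks quoted in the text. Evaluating the function over a grid of c values is one way to see how the benchmarks move with the base rate.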

It should be stressed that different methods of relating the OOR to established effect size measures such as r or d may lead to different benchmarks. Because of this complexity and the fact that our proposed benchmarks depend on the base rate, we urge users to avoid using qualitative effect size interpretations for the OOR. Instead, we recommend only comparing OOR values in the context of the field of study. Further work is needed to establish OOR effect size benchmarks that are invariant to the base rate and related in a meaningful way to existing benchmarks.

Calculating the Adjusted and Unadjusted Overall Odds Ratio Using Statistical Software

The SAS LOGISTIC procedure for SAS version 9.1 (SAS, 2004b) does not report $R_{L}^{2}$ as part of its standard output. But, as shown in Figure 1, $R_{L}^{2}$ can easily be computed as a function of the –2 Log L “with intercept only” and “with intercept and covariates” statistics. The unadjusted OOR is a function of the sample variance of the fitted values of the MLP (SVMLP), which is a function of the fitted probabilities of success. Because the fitted probabilities are available on request from the LOGISTIC procedure, the unadjusted OOR can be calculated as shown in Part 1 of the appendix. Alternatively, the unadjusted OOR can be calculated based on MLOGR parameter estimates and sample correlations (see Equation 2).
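
For readers working outside SAS, the two calculations described above might be sketched in Python as follows. This is an illustration under stated assumptions, not the appendix code: it assumes $R_{L}^{2}$ is the proportional reduction in –2 Log L, that the unadjusted OOR is the exponential of the square root of the SVMLP, and that the SVMLP uses an n – 1 denominator. The function names are hypothetical.

```python
import math
import statistics


def r_squared_l(neg2_log_l_intercept_only, neg2_log_l_with_covariates):
    """R_L^2, assumed here to be the proportional reduction in -2 Log L."""
    return 1.0 - neg2_log_l_with_covariates / neg2_log_l_intercept_only


def unadjusted_oor(fitted_probabilities):
    """Unadjusted OOR from the fitted probabilities of success.

    Each fitted probability is mapped back to the MLP with the logit. The
    SVMLP is the sample variance of those values (n - 1 denominator, an
    assumption), and the OOR is taken as exp(sqrt(SVMLP)).
    """
    mlp = [math.log(p / (1.0 - p)) for p in fitted_probabilities]
    svmlp = statistics.variance(mlp)
    return math.exp(math.sqrt(svmlp))


# Hypothetical inputs, for illustration only (not values from the study).
print(r_squared_l(1000.0, 920.0))
print(unadjusted_oor([0.20, 0.35, 0.50, 0.65, 0.80]))
```

The fitted probabilities would come from whatever logistic regression routine is in use. Only the logit transformation and the variance step are specific to the OOR.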

Calculation of the adjusted SVMLP and adjusted OOR is not as straightforward because we must implement the bias-correcting steps described earlier. In Part 2 of the appendix, we provide a SAS macro that calculates the adjusted versions of the SVMLP and OOR as well as the unadjusted versions and $R_{L}^{2}$ . This macro is available as a text file on request from the first author. To call this macro, the user must specify the name of the data set containing the outcome and predictor data, the name of the dichotomous outcome variable, and the names of the predictor variables; the program will then output the desired statistics. Although we have only provided SAS code and examples for computing the OOR, other brands of statistical software or generic programming languages could also be adapted to calculate these measures.

Discussion

R ² analogs for MLOGR are not interpretable on the same scale as the effects of individual predictors. For example, users of MLOGR cannot convert an R ² analog into an odds ratio in the same way that users of MLR can convert classical R ² into a standardized regression coefficient. Users of MLOGR can use the OOR to measure the overall predictive strength, or effect size, of the model’s predictors. Similar to R applied to MLR, the OOR is interpretable on the same scale as individual effects.

When applied to MLOGR, R ² analogs are dependent on the base rate so that meaningful comparisons of model effect size cannot be made if the base rates are different. On the other hand, because Equation 1 does not involve β₀, we know that the OOR is invariant to the base rate. Among the R ² analogs, $R_{L}^{2}$ appears to be the least dependent on the base rate. Unlike for the OOR, we do not have a closed form expression for $ρ_{L}^{2}$, the parameter estimated by $R_{L}^{2}$.

Hence, we conducted a simulation study to try to better understand the level of dependence of $R_{L}^{2}$ on the base rate. We generated data according to the MLOGR model with n = 100,000 and two uncorrelated predictors with β₁ = β₂ = 1. Using Equation 1, this implies that the PVMLP is 2 and the true OOR is therefore 4.11 ($e^{\sqrt{2}} ≈ 4.11$). Across 15 simulation conditions, we varied the intercept (β₀) to achieve base rates ranging from approximately .005 to .50. For each condition, we generated 10 data sets. For each data set, we computed $R_{L}^{2}$, $R_{O}^{2}$, and θ̂ (the unadjusted OOR estimate). For each condition, we calculated the mean of each measure: ${\bar{R}}_{L}^{2}$, ${\bar{R}}_{O}^{2}$, and $\bar{\hat{θ}}$. In Figure 2, we plot the values of ${\bar{R}}_{L}^{2}$ and ${\bar{R}}_{O}^{2}$ against one another and against the base rate. As expected, ${\bar{R}}_{O}^{2}$ clearly increases with the base rate. To a much lesser degree, we also see that ${\bar{R}}_{L}^{2}$ varies with the base rate. It appears that when the base rate is close to 0 (or 1), ${\bar{R}}_{L}^{2}$ is quite dependent on the base rate but that this dependence wanes with more moderate base rates. The results of this simulation suggest that $R_{L}^{2}$, although still preferred over $R_{O}^{2}$, is not invariant to the base rate and is especially problematic when the base rate is extreme.

In Figure 3, we plot the simulated values of $\bar{\hat{θ}}$ against the base rate. As expected, the values do not appear to vary with the base rate but rather scatter about the overall odds ratio parameter (θ = 4.11).

In our example, we showed that the overall effect size of a set of predictors was roughly the same for two different outcomes: retention after the first semester and retention after the first year. Because the base rates for the two retention outcomes were quite different (.10 vs. .27), $R_{O}^{2}$ was greater for the outcome with the higher base rate. On the other hand, the OOR estimate suggested that the overall effect size of the predictors was nearly identical for the two outcomes.

The overall effect size of a model’s predictors cannot be assessed by simply inspecting the estimated regression coefficients. As seen in Equation 1, the PVMLP is a function of regression coefficients but also depends on the inter-correlations of the predictor variables. If the regression coefficients are all positive (or all negative), the overall effect size will be greater if the predictors are positively correlated. This is clearly seen when we inspect the values in Table 2. For example, if we have five uncorrelated predictor variables, each with an odds ratio of 2, the OOR is 4.71 (condition 12). However, if the five predictor variables all have pairwise correlations of .5, the OOR is 14.65 (condition 13). Because of this complexity, simple inspection of estimated regression coefficients is not a good approach for assessing a model’s overall effect size.
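
The dependence on the inter-correlations can be illustrated with a short Python sketch. It assumes standardized predictors, takes the PVMLP to be the double sum of β_j β_k ρ_jk over all pairs of predictors (with ρ_jj = 1), and takes the OOR to be the exponential of the square root of the PVMLP. These forms are assumptions chosen to be consistent with the values quoted above, and the function names are illustrative.

```python
import math


def overall_odds_ratio(betas, correlations):
    """OOR = exp(sqrt(PVMLP)) for standardized predictors (assumed form).

    PVMLP is computed as sum_j sum_k beta_j * beta_k * rho_jk, where rho_jk
    is the correlation between predictors j and k (rho_jj = 1).
    """
    p = len(betas)
    pvmlp = sum(
        betas[j] * betas[k] * correlations[j][k]
        for j in range(p)
        for k in range(p)
    )
    return math.exp(math.sqrt(pvmlp))


def equicorrelation(p, rho):
    """p x p correlation matrix with every off-diagonal entry equal to rho."""
    return [[1.0 if j == k else rho for k in range(p)] for j in range(p)]


beta = math.log(2.0)  # each predictor has a standardized odds ratio of 2
print(round(overall_odds_ratio([beta] * 5, equicorrelation(5, 0.0)), 2))
print(round(overall_odds_ratio([beta] * 5, equicorrelation(5, 0.5)), 2))
```

With five predictors, the two calls correspond to the uncorrelated case and the case with pairwise correlations of .5, so the results can be compared with the OOR values of 4.71 and 14.65 cited from Table 2.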

The SVMLP does not allow for direct comparisons of models when the outcomes have different probability distributions. For example, it would not be meaningful to compare the SVMLP from MLOGR with the SVMLP from MLR (classical R ²) because the two measures are on different scales. R ² analogs are generally on the same zero to one scale as classical R ², so users may be tempted to compare models’ effect sizes even when the outcomes have different probability distributions. But, this comparison is not valid because the variance of some nonnormal outcomes (e.g., Bernoulli) is tied directly to the mean and the meaning of “proportion of variance explained” depends on the probability distribution of the outcome.

Generally, the SVMLP is biased upward and may need correction. Although eliminating the bias of the SVMLP does not necessarily eliminate the bias of the OOR estimator (which is a function of the SVMLP), our simulation study showed that our proposed adjustment of the SVMLP eliminates most of the bias of the OOR estimator. Interestingly, the bias of the OOR estimator increases with overall effect size. Generally, the bias is more severe when the sample size is small relative to the number of predictors. There are certainly other methods available for adjusting the OOR estimator. For example, Sapra (2002) demonstrated how the jackknife procedure (Efron, 1982) can be applied to estimate regression coefficients for the Probit model; the same general approach could be used to derive a bias-corrected OOR estimate. Future work might identify a bias estimator with better properties than our proposed estimator.

When the sample size is large relative to the number of predictors, there is little difference between the adjusted and unadjusted SVMLP, so adjustment is not necessary.

In this work, we have restricted our study to special cases of MLOGR. Further work is needed to understand how the SVMLP could be used as the basis for measuring overall effect size for other GLMs. More work is also needed to develop a method for approximating standard errors and confidence intervals for the SVMLP and OOR so that users know the precision of the estimate of the target parameter. Further research could also explore whether the SVMLP could be extended to mixed-effect models. So far, we have only considered fixed-effect models, where the MLP has the form $η = β_{0} + \sum_{k = 1}^{p} x_{k} β_{k}$ and $(x_{1}, x_{2}, \ldots, x_{p})$ are observed predictor variables. To extend to mixed-effect models, we would need to consider models of the form $η = β_{0} + \sum_{k = 1}^{p} x_{k} β_{k} + \sum_{r = 1}^{q} z_{r} b_{r}$ where $(b_{1}, b_{2}, \ldots, b_{q})$ are unobserved random effects. For fixed-effect models, we considered the PVMLP as the parameter of interest; perhaps a similar parameter could be defined for mixed-effect models.

In conclusion, we suggest that users of MLOGR report the estimated OOR to describe the model’s overall effect size in terms of the odds ratio, which is a meaningful and generally accepted effect size metric for dichotomous outcomes. The OOR should supplement $R_{L}^{2}$ , which is a proportional reduction in error measure and appears to have the most desirable properties among the competing R ² analogs. By adopting the OOR, users of MLOGR can compare the effect sizes of models with different base rates and have a better intuitive grasp of how well the model’s predictors jointly predict the outcome of interest.

Footnotes

Appendix

Figures and Tables

Acknowledgements

The authors would like to thank Nancy Petersen, Justine Radunzel, and Richard Sawyer of ACT, Inc. for their reviews and suggested improvements of the article. Also, we thank Steve Robbins of ACT, Inc. for the college retention data used in the example.

References
