Weakly Informative Prior for Point Estimation of Covariance Matrices in Hierarchical Models

Abstract

When fitting hierarchical regression models, maximum likelihood (ML) estimation has computational (and, for some users, philosophical) advantages compared to full Bayesian inference, but when the number of groups is small, estimates of the covariance matrix (Σ) of group-level varying coefficients are often degenerate. One can do better, even from a purely point estimation perspective, by using a prior distribution or penalty function. In this article, we use Bayes modal estimation to obtain positive definite covariance matrix estimates. We recommend a class of Wishart (not inverse-Wishart) priors for Σ with a default choice of hyperparameters, that is, the degrees of freedom are set equal to the number of varying coefficients plus 2, and the scale matrix is the identity matrix multiplied by a value that is large relative to the scale of the problem. This prior is equivalent to independent gamma priors for the eigenvalues of Σ with shape parameter 1.5 and rate parameter close to 0. It is also equivalent to independent gamma priors for the variances with the same hyperparameters multiplied by a function of the correlation coefficients. With this default prior, the posterior mode for Σ is always strictly positive definite. Furthermore, the resulting uncertainty for the fixed coefficients is less underestimated than under classical ML or restricted maximum likelihood estimation. We also suggest an extension of our method that can be used when stronger prior information is available for some of the variances or correlations.

Keywords

Bayes modal estimation penalized likelihood estimation variance estimation Heywood case mixed-effects model multilevel model

Hierarchical or mixed-effects regression models are increasingly popular in applied statistics and can be viewed as Bayesian at the following two levels: A prior distribution is assigned to the varying coefficients, and the parameters of that prior distribution themselves are given a hyperprior. The family of models can be written in general terms as follows: Data are in groups j = 1, …, J. For each group j, there is a response vector y _j and two data matrices, X_j and Z_j , that have fixed and varying coefficients, respectively. The data model is $p (y_{j} | X_{j} β + Z_{j} b_{j})$ , where β is the vector of fixed coefficients and b _j is the vector of regression coefficients that varies by group. The vectors b _j are modeled as independent draws from a prior distribution, p(b _j ), given some hyperparameters. We shall assume a normal model for the varying coefficients, so that b _j ~ N(0, Σ). The model could also include a nonzero mean vector or a group-level regression structure for the hyperprior distribution, but these can be folded into the fixed coefficients in the data model without loss of generality.

There is a rich literature on full Bayesian inference for hierarchical regressions. There is also an empirical Bayes version in which the hyperparameters (in this case, Σ) are estimated via maximum likelihood (ML) and then inference for the coefficients is performed conditional on the estimated Σ. From the Bayesian perspective, the empirical Bayes approach is suboptimal, both because it avoids the use of any prior information on Σ and because it understates posterior uncertainty. From a pragmatic perspective, however, we recognize that the point estimation approach has two advantages that give it great appeal to many users. First, existing software such as lme4 in R and various commands in Stata allow such models to be fit fast and reliably for moderate-sized data sets, whereas software for Markov chain Monte Carlo simulation for full Bayes inference is not yet so immediately practical. Second, the non-Bayesian motivation behind point estimation is attractive to practitioners who want the benefits of partial pooling and hierarchical modeling without needing to specify prior information or fully buy into the Bayesian paradigm.

The subject of this article is the use of Bayesian ideas and methods to produce better inferences for hierarchical models via better point estimates of the hyperparameters. In that sense, this work falls into a long tradition of Bayesian tools used for practical non-Bayesian inferences (e.g., Agresti & Coull, 1998). Bayes modal (BM) estimation (or penalized likelihood) has also been used to obtain more stable estimates in item response theory (e.g., Mislevy, 1986; Swaminathan & Gifford, 1985; Tsutakawa & Lin, 1986) and to avoid boundary estimates (or logit parameters tending to ±∞) in log-linear models (Galindo-Garre, Vermunt, & Bergsma, 2004), logistic regression (Gelman, Jakulin, Pittau, & Su, 2008), varying-intercept models with constant coefficients (Chung, Rabe-Hesketh, Dorie, Gelman, & Liu, 2013), random-effects meta-analysis models (Chung, Rabe-Hesketh, & Choi, 2013), and latent class analysis (Galindo-Garre & Vermunt, 2006; Maris, 1999). Such an approach has also been used to obtain nondegenerate covariance matrices in factor analysis (Martin & McDonald, 1975), in finite mixtures of normal densities (Ciuperca, Ridolfi, & Idier, 2003; Vermunt & Magidson, 2005), and in multivariate regression (Warton, 2008). In varying intercept models, the Stein loss function (Srivastava & Kubokawa, 1999) and an extension of MANOVA estimation (Amemiya, 1985) have been used for obtaining nonnegative definite covariance estimators.

The key problem solved by our method is the tendency of ML estimates of Σ to be degenerate, that is, on the border of positive definiteness, which corresponds to zero variance or perfect correlation among some linear combinations of the parameters. When the ML estimate of a hierarchical covariance matrix is degenerate, this often arises from a likelihood that is nearly flat in the relevant dimension and just happens to have a maximum at the boundary.

Our solution is a class of weakly informative prior densities for Σ that go to zero on the boundary as Σ becomes degenerate, thus ensuring that the posterior mode (i.e., the maximum penalized likelihood estimate) is always nondegenerate. We recommend a class of Wishart priors with a default choice of hyperparameters, that is, the degrees of freedom is the dimension of b _j plus 2 and the scale matrix is the identity matrix multiplied by a large enough number. This prior can be expressed as a product of gamma(1.5, θ) priors on the eigenvalues of Σ or as a product of gamma(1.5, θ) priors on variances of the varying effects with rate parameter θ → 0 and a function of the correlations (a beta prior in the two-dimensional case). In the varying-intercept model (Chung, Rabe-Hesketh, Dorie, et al., 2013) and random-effects meta-analysis model (Chung, Rabe-Hesketh, & Choi, 2013), the gamma(1.5, θ) prior successfully avoids boundary estimates while producing estimates that are consistent with the data. We show that this is also true for the default Wishart prior proposed in this article for general varying coefficient models.

In a simulation study and an education example presented later, the default Wishart prior always gives nondegenerate estimates of Σ (in particular, nonperfect correlation coefficients) without decreasing the log likelihood substantially. The BM estimators of the standard deviations and correlations using the default Wishart prior have better statistical properties than the (restricted) ML estimators.

When prior information is available for specific standard deviations or correlations, additional penalty functions may be included. Specifically, if the prior most plausible value for a standard deviation or correlation parameter is σ* or ρ*, respectively, then we propose multiplying the Wishart prior by the gamma(2, 2/σ*) or N(ρ*, .25²) densities. This assigns more prior probability around the preferred values while exploiting the property of the Wishart prior that it ensures that the estimates remain positive definite.

The outline of the article is as follows. First, we illustrate the boundary estimation problems encountered in ML estimation of hierarchical variance and covariance parameters. Then, we introduce the default Wishart prior for Σ and investigate its properties. Next, additional penalty functions are proposed that incorporate further prior knowledge for some of the parameters. Finally, our method is applied to an example from education research and simulated data.

Boundary Estimation Problem

Consider the varying-coefficients model,

y_{i j} = x_{i j}^{T} β + z_{i j}^{T} b_{j} + ϵ_{i j}, i = 1, \dots, n_{j}, j = 1, \dots, J,

where y_ij is the response variable for unit i in group j, x _ij is a p-dimensional covariate vector with constant (or fixed) coefficients β, z _ij is a d-dimensional covariate vector with varying coefficients b _j ~ N(0, Σ), and $ε_{i j} ~ N (0, σ_{ε}^{2})$ is a residual for each observation. We further assume that b _j and ε _ij are independent of each other and of the covariates (and suppress conditioning on covariates throughout the paper).

Non-Bayesian Point Estimation

For each j, $y_{j} = {(y_{1 j}, \dots, y_{n_{j} j})}^{T} ~ N (X_{j} β, V_{j})$ , where X_j is an n_j × p matrix with $x_{i j}^{T}$ in the ith row, $V_{j} = Z_{j} Σ Z_{j}^{T} + σ_{ε}^{2} I$ , and Z_j is an n_j × d matrix with $z_{i j}^{T}$ in the ith row. The log-likelihood function is given by:

log p (y | β, Σ, σ_{ε}^{2}) = - \frac{1}{2} [\sum_{j = 1}^{J} log | V_{j} | + \sum_{j} {(y_{j} - X_{j} β)}^{T} V_{j}^{- 1} (y_{j} - X_{j} β)],

where the constant term, −(N/2)log(2π), has been dropped and N = n_1 + … + n_J is the total number of observations. The ML estimator is obtained by maximizing the log-likelihood function.

It is known that the ML estimator of the covariance matrix is biased for finite samples (Lehmann & Casella, 1998), and an often-preferred option is restricted maximum likelihood (REML; Patterson & Thompson, 1971), as it takes into account the degrees of freedom for the fixed coefficients β. Harville (1974) showed that the REML estimator can be derived by specifying flat prior distributions for β, marginalizing over β, and maximizing the marginal (or restricted) likelihood with respect to Σ and $σ_{ε}^{2}$ . The restricted log-likelihood function is given by:

\begin{aligned} log p_{R} (y | Σ, σ_{ε}^{2}) = - \frac{1}{2} [log |\sum_{j = 1}^{J} X_{j}^{T} V_{j}^{- 1} X_{j}| + \sum_{j = 1}^{J} log | V_{j} | \\ + \sum_{j = 1}^{J} {(y_{j} - X_{j} \hat{β})}^{T} V_{j}^{- 1} (y_{j} - X_{j} \hat{β})], \end{aligned}

up to a constant, where

\hat{β} = {(\sum_{j = 1}^{J} X_{j}^{T} V_{j}^{- 1} X_{j})}^{- 1} (\sum_{j = 1}^{J} X_{j}^{T} V_{j}^{- 1} y_{j}) .

Singular Estimates of Σ using ML and REML

ML and REML often yield singular (i.e., not positive definite) estimates of Σ, that is, estimates on the boundary of the parameter space. This boundary includes the cases where some varying coefficients have zero variance or a varying coefficient is a linear combination of the other varying coefficients.

We present two simulation studies to demonstrate how often singular estimates of Σ occur in the varying-coefficients model. In the first study, we consider a model with two-dimensional varying coefficients, that is, a varying intercept b _0j and a varying slope b _1j. We set the group size to n = 10 and the number of groups to J = 5 or 10. A covariate that varies within group only was generated from N(0, 1) and group-mean centered. The varying coefficients (b _0j, b _1j) were generated from N(0, σ² I ₂) with σ = 0.25, 0.5, 0.75, 1. Setting the correlation to 0 corresponds to the best-case scenario in the sense of being furthest from the boundary. The within-group variance $σ_{ε}^{2}$ was set to 1 and the fixed coefficients β₀ and β₁ were set to 0. For each of 1,000 random samples of data from the model, we obtained ML and REML estimates using lmer (Bates & Maechler, 2010) in R.
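The data-generating step of this first study can be sketched as follows. This is only an illustration in Python using the standard library; the function name `simulate_dataset` and its arguments are hypothetical, and the fitting step (ML or REML via lmer) is not reproduced here.

```python
import random

def simulate_dataset(J=5, n=10, sigma=0.5, sigma_eps=1.0, seed=0):
    """Simulate one data set with a varying intercept and a varying slope.

    Returns a list of (group, z, y) rows. The fixed coefficients are zero,
    as in the simulation design described above.
    """
    rng = random.Random(seed)
    rows = []
    for j in range(J):
        # Within-group covariate drawn from N(0, 1), then group-mean centered.
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z_bar = sum(z) / n
        z = [zi - z_bar for zi in z]
        # Varying intercept and slope drawn independently (correlation 0).
        b0 = rng.gauss(0.0, sigma)
        b1 = rng.gauss(0.0, sigma)
        for zi in z:
            y = b0 + b1 * zi + rng.gauss(0.0, sigma_eps)
            rows.append((j, zi, y))
    return rows

# One replicate of the hardest condition: J = 5 groups, sigma = 0.25.
data = simulate_dataset(J=5, n=10, sigma=0.25)
```

Repeating this with different seeds and fitting the model to each replicate would give the kind of proportions of singular estimates reported for this study.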

Figure 1a shows the proportion of ML estimates of Σ on the boundary for the two-dimensional case. For J = 5 groups, 87% of the ML estimates are singular when σ = 0.25 and the proportion decreases as σ increases but remains as high as 72% when σ = 1. For J = 10 groups, the proportions are smaller than those for J = 5 but still, in more than 40% of the simulations, the likelihood is maximized at a singular $\hat{Σ}$ . The REML estimator yields smaller proportions of singular estimates with a similar trend (not shown). For J = 5, 79% and 64% of the REML estimates are singular when σ = 0.25 and σ = 1, respectively. For J = 10, the proportion is reduced to 69% and 35% when σ = 0.25 and σ = 1, respectively.

Figure 1.

Proportion of data sets, out of 1,000, where the maximum likelihood (ML) estimate of the covariance matrix is singular. (a) Two-dimensional case: When σ = .25, 87% of the ML estimates are singular for J = 5. As σ and J increase, the proportion decreases but is greater than 40% for the conditions considered. (b) Two to five dimensions σ = 1: As the dimension of Σ increases, there is a rapid increase in the probability of the estimate being degenerate.

Our second simulation study considers various dimensions, from d = 2 to d = 5, each time with a varying intercept and d − 1 varying slopes for n = 10 and J = 5 or 10. The d − 1 covariates were independently drawn from N(0, 1) and centered at their group means as in the previous simulation. The varying coefficients b _j were drawn from N(0, I_d ) and $σ_{ε}^{2}$ was set to 1. Figure 1b presents the proportion of replicates where the ML estimate $\hat{Σ}$ is singular. As the number of dimensions increases, this proportion increases rapidly, exceeding 95% with five varying coefficients for both J = 5 and J = 10. For REML, the proportions of singular estimates are slightly lower than for ML but follow a similar pattern and exceed 35% across all simulation conditions.

In some contexts, singular estimates of the covariance matrix are acceptable or considered as an indication of structural misspecification of the model. In the varying-intercept model, a negative group-level variance estimate is sometimes permitted if the model is viewed as a marginal model for the responses, given the covariates where only the sum of the group-level and within-group variance must be positive (Verbeke & Molenberghs, 2000, pp. 52–53). In factor analysis and structural equation models, a negative variance estimate, called a Heywood case, is sometimes interpreted as model misspecification, especially if the null hypothesis that the variance is nonnegative can be rejected (Kolenikov & Bollen, 2012). However, this article takes a hierarchical perspective of the multilevel linear model, where the intercepts and slopes vary due to omitted group-level variables. Therefore, the variances of the varying coefficients must be nonnegative, and perfect correlations among linear combinations of varying coefficients are regarded as unrealistic.

Weakly Informative Wishart Prior for Σ

We propose posterior modal estimation with a prior on Σ, implicitly assuming uniform priors for the other parameters. With a prior p(Σ), the log-posterior function can be written as follows:

log p (β, Σ, σ_{ϵ} | y) = log p (y | β, Σ, σ_{ϵ}) + log p (Σ) + c,

and we find the mode of $log p (β, Σ, σ_{ϵ} | y)$ . This approach can also be viewed as maximum penalized likelihood estimation where $log p (Σ)$ is a penalty function. We consider a family of Wishart (not inverse-Wishart) densities for the prior on Σ. The Wishart density function on Σ with hyperparameters $ν$ and Ψ is defined by:

p (Σ) = \frac{| Σ |^{(ν - d - 1) / 2} exp [- \frac{1}{2} t r (Ψ^{- 1} Σ)]}{2^{ν d / 2} | Ψ |^{ν / 2} Γ_{d} (ν / 2)}, ν > d - 1, Ψ > 0,

where $Γ_{d} (ν / 2) = π^{d (d - 1) / 4} \prod_{j = 1}^{d} Γ (ν / 2 + (1 - j) / 2)$ , $ν$ is the degrees of freedom, and Ψ is a scale matrix with $E (Σ) = ν Ψ$ .

If we set Ψ to be a diagonal matrix (1/2θ)I_d , the Wishart density of Σ in Equation 5 can be written as:

p (Σ) = \frac{θ^{d ν / 2}}{Γ_{d} (ν / 2)} | Σ |^{(ν - d - 1) / 2} exp (- θ t r (Σ))

= \frac{θ^{d ν / 2}}{Γ_{d} (ν / 2)} \prod_{r = 1}^{d} λ_{r}^{(ν - d - 1) / 2} exp (- θ λ_{r})

\propto \prod_{r = 1}^{d} g (λ_{r} |\frac{ν - d + 1}{2}, θ),

where λ₁,…, λ _d are the eigenvalues of Σ and $g (x | α, θ)$ is the gamma(α, θ) density with shape parameter α and rate parameter θ, $g (x | α, θ) = \frac{θ^{α}}{Γ (α)} x^{α - 1} exp (- x θ)$ . In the previous equations, note that we do not transform the density of Σ to the density of eigenvalues, but just rewrite Equation 5 as a function of eigenvalues without including a Jacobian term.
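As a numerical check of the eigenvalue form, the sketch below evaluates the unnormalized log prior for a 2 × 2 matrix in two ways: from the determinant and trace, and from the eigenvalues. The function names and the example matrix are made up for illustration; the closed-form eigenvalues are specific to the 2 × 2 symmetric case.

```python
import math

def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius

def log_kernel_matrix(a, b, c, nu, theta):
    """log of |Sigma|^{(nu - d - 1)/2} exp(-theta tr(Sigma)) with d = 2."""
    d = 2
    det = a * c - b * b
    return (nu - d - 1) / 2 * math.log(det) - theta * (a + c)

def log_kernel_eigen(a, b, c, nu, theta):
    """The same quantity written as a sum over the eigenvalues of Sigma."""
    d = 2
    return sum((nu - d - 1) / 2 * math.log(lam) - theta * lam
               for lam in eigenvalues_2x2(a, b, c))

# Hypothetical covariance matrix [[2.0, 0.6], [0.6, 1.0]], default nu = d + 2 = 4.
a, b, c = 2.0, 0.6, 1.0
lhs = log_kernel_matrix(a, b, c, nu=4, theta=1e-4)
rhs = log_kernel_eigen(a, b, c, nu=4, theta=1e-4)
```

The two values agree up to floating-point error because the determinant is the product of the eigenvalues and the trace is their sum.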

As a default choice, we propose $ν = d + 2$ and θ → 0. In practice, we can choose a sufficiently small number for θ, for example, θ = 10⁻⁴ or 10⁻⁵. If these two values of θ lead to almost the same parameter estimates, we can consider the choice of θ to be sufficiently close to the limit 0. In order to avoid dependency on the scale of the response variable, we can also use an improper prior $| Σ |^{(ν - d - 1) / 2}$ , which is the same as the Wishart prior up to a constant in the limit θ → 0 (Chung, Rabe-Hesketh, Dorie, et al., 2013). This prior is proportional to independent gamma(1.5, θ) densities of the eigenvalues as observed in Equation 6. If Σ is a diagonal matrix, this prior implies gamma(1.5, θ) priors on the diagonal elements of Σ, which are equivalent to gamma(2, θ) priors on the standard deviations when θ → 0. If Σ is not diagonal, we obtain gamma(1.5, θ) priors on the variances and a function of the correlations.

The advantage of this family of density functions is that they equal zero at the boundary, so the BM or penalized likelihood estimate for Σ will never be degenerate, but the densities move away from zero when Σ moves off the boundary, so that the posterior mode can be arbitrarily close to degeneracy if this is what the data demand. In contrast, various other families of models do not have these properties, making them less desirable when used for the purpose of BM point estimation. The inverse-Wishart family of densities, one of the most commonly used priors for Σ in full Bayesian inference, is also zero at the boundary. However, it tends to assign an excessive penalty near the boundary because it is a function of Σ⁻¹ and $| Σ |^{- 1}$ while the Wishart density is a function of Σ and $| Σ |$ .

Alternative choices of $ν$ and θ can be considered but $ν$ and θ larger than the default choice will make the prior more informative. This behavior might be preferable if a plausible value of Σ is available. In the next section, we suggest including additional prior information about any specific standard deviation by multiplying the default prior by an additional penalty function, which can be viewed as a special case of the Wishart prior with larger $ν$ and θ.

Priors on the covariance matrix in the varying-coefficients model have been investigated by several authors in the context of full Bayesian modeling. Daniels and Kass (1999) investigated nonconjugate Bayesian estimation of covariance matrices in hierarchical models including an inverse-Wishart prior on covariance matrices with unknown scale and degrees of freedom and a normal prior on Fisher’s z-transformed correlations. Barnard, McCulloch and Meng (2000) decomposed $Σ = D i a g (s) R D i a g (s)$ where s is a vector of standard deviations and R is the correlation matrix, which is assigned marginal or jointly uniform priors. O’Malley and Zaslavsky (2005) propose a scaled inverse Wishart, a decomposition similar to that of Barnard, McCulloch, and Meng (2000) except that the central matrix R itself has an inverse-Wishart distribution rather than being constrained to be a correlation matrix. Our approach is different from these others in being explicitly intended not for full Bayes inference but as a tool to obtain positive definite posterior modal estimates. As such, our concerns are different from those involved in constructing traditional Bayesian priors.

Unlike posterior mean estimation, BM estimation does not involve simulation and is computationally as efficient as ML estimation. By modifying existing ML estimation procedures, gllamm (Rabe-Hesketh, Skrondal, & Pickles, 2005) in Stata and lmer (Bates & Maechler, 2010) in R, we have developed software to find the maximum of the penalized likelihood. The modified gllamm is available from www.gllamm.org and blmer, the modified lmer function, can be found in the blme package available from the Comprehensive R Archive Network.

Varying-Intercept Models: d = 1

The varying-intercept model is a special case of the model in Equation 1 with d = 1, given by:

y_{i j} = x_{i j}^{T} β + b_{j} + ϵ_{i j},

where $b_{j} ~ N (0, σ_{b}^{2})$ and $ϵ_{i j} ~ N (0, σ_{ϵ}^{2})$ . The Wishart prior in Equation 6 is equivalent to a gamma(ν/2,θ) prior on $σ_{b}^{2}$ . With the default choice of hyperparameters, $ν = 3 (= d + 2)$ and θ → 0, the Wishart prior coincides with a gamma(1.5, θ) prior on $σ_{b}^{2}$ .

When θ → 0, the gamma(1.5, θ) prior on $σ_{b}^{2}$ has a density function proportional to σ _b , which is also proportional to the gamma(2, θ) prior on σ _b . The gamma(2, θ) prior on σ _b is recommended as a weakly informative prior for avoiding estimates of σ _b equal to zero in the varying-intercept model (Chung, Rabe-Hesketh, Dorie, et al., 2013) and in random-effects meta-analysis models (Chung, Rabe-Hesketh, & Choi, 2013). Since the gamma(2, θ) prior is 0 at σ _b = 0, the posterior density is also 0 at σ _b = 0 and thus the posterior mode of σ _b is always strictly positive. In addition, since the gamma density has a positive constant derivative at σ _b = 0, the gamma(2, θ) density increases linearly at zero. It follows that the profile likelihood of σ _b (maximized over all the other parameters) dominates the posterior density of σ _b if the likelihood is strongly curved near σ _b = 0. That is, the prior does not rule out positive values near zero if they are supported by the likelihood. Chung, Rabe-Hesketh, Dorie, et al. (2013) show that the posterior mode is approximately one standard error away from zero when the ML estimate of σ _b is zero. Finally, the estimator behaves reasonably well in terms of mean squared error of parameter estimates and coverage of confidence intervals for fixed parameters.
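The effect of the penalty can be illustrated with a deliberately simplified sketch. It assumes a balanced, intercept-only model with σ_ε known and the intercept fixed at the grand mean, so that the likelihood of σ_b depends only on the group means; the data values and function names are made up, and the grid search stands in for a proper optimizer. This is not the estimation procedure used in the article's software.

```python
import math

def log_lik(sigma_b, group_means, n, sigma_eps=1.0):
    """Log-likelihood of sigma_b (up to a constant) based on the group means."""
    J = len(group_means)
    mu = sum(group_means) / J                  # intercept fixed at the grand mean
    v = sigma_b ** 2 + sigma_eps ** 2 / n      # variance of each group mean
    ss = sum((m - mu) ** 2 for m in group_means)
    return -0.5 * (J * math.log(v) + ss / v)

def log_post(sigma_b, group_means, n, sigma_eps=1.0):
    """Log-likelihood plus log(sigma_b), the theta -> 0 limit of the gamma(2, theta) prior."""
    return log_lik(sigma_b, group_means, n, sigma_eps) + math.log(sigma_b)

group_means = [-0.2, -0.1, 0.0, 0.1, 0.2]     # made-up data with little between-group spread
n = 10                                        # observations per group
grid = [k / 1000 for k in range(0, 2001)]     # candidate values of sigma_b in [0, 2]

ml = max(grid, key=lambda s: log_lik(s, group_means, n))
bm = max(grid[1:], key=lambda s: log_post(s, group_means, n))  # skip 0, where log(0) is undefined
```

With these made-up group means the likelihood is maximized at σ_b = 0, whereas the penalized criterion is maximized at a strictly positive value, because the log σ_b term tends to −∞ at zero.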

In the context of small area estimation, strictly positive group-level variance estimators have been proposed for the Fay and Herriot (1979) model, a varying-intercept model for aggregated group-level data and known heterogeneous within-group variances. Adjustment for density maximization (Li & Lahiri, 2010; Morris, 2006; Morris & Tang, 2011) applies a penalty term $π (σ_{b}^{2}) = {(σ_{b}^{2})}^{c}$ to the likelihood, and this approach turns out to be equivalent to posterior modal estimation with a gamma(α, θ) prior on σ _b with α = 2c + 1 and θ → 0. Therefore, for this specific varying-intercept model, our estimator shares the properties of adjustment for density maximization, such as predictions of the group means being minimax for mean squared-error loss when the within-group variances are equal and $c \leq 1$ (Morris & Tang, 2011).

Varying-Intercept and Varying-Slope Models: d = 2

When d = 2, the model includes a varying intercept and a varying slope of one covariate, written as:

y_{i j} = x_{i j}^{T} β + b_{0 j} + b_{1 j} z_{i j} + ε_{i j},

where $(b_{0 j}, b_{1 j}) ~ N (0, Σ)$ and $ε_{i j} ~ N (0, σ_{ε}^{2})$ .

As shown in Equation 6, with the default choice ν = d + 2, the Wishart density can be written as a product of gamma(1.5, θ) densities on the eigenvalues λ₁ and λ₂. For the bivariate case, we can also express the default prior as a function of the variances ( $σ_{1}^{2}$ and $σ_{2}^{2}$ ) and the correlation (ρ) between the two varying effects b _0j and b _1j, given by:

p (Σ) \propto | Σ |^{1 / 2} = σ_{1} σ_{2} \sqrt{1 - ρ^{2}} .

This expression implies that Wishart(4, (1/2θ)I_d ) with θ → 0 is equivalent to the joint density of independent gamma(1.5, θ) priors on both $σ_{1}^{2}$ and $σ_{2}^{2}$ , and a beta(1.5, 1.5) prior on (ρ + 1)/2.
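The step from the determinant to the beta prior can be made explicit. Writing u = (ρ + 1)/2, so that ρ = 2u − 1 (the symbol u is introduced here only for illustration), the correlation factor becomes:

```latex
\sqrt{1 - ρ^{2}} = \sqrt{(1 + ρ) (1 - ρ)} = \sqrt{2 u \cdot 2 (1 - u)} = 2 u^{1 / 2} {(1 - u)}^{1 / 2} \propto u^{1.5 - 1} {(1 - u)}^{1.5 - 1},
```

which is the kernel of the beta(1.5, 1.5) density in u. Likewise, $σ_{1} σ_{2} = {(σ_{1}^{2})}^{1 / 2} {(σ_{2}^{2})}^{1 / 2}$ matches the θ → 0 limit of the gamma(1.5, θ) kernel in each variance.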

Since the beta(1.5, 1.5) prior on (ρ + 1)/2 is zero at the boundaries ρ = ±1, the posterior mode of Σ cannot be attained at any matrices with perfect correlation. In addition, the beta(1.5, 1.5) density function increases rapidly as ρ approaches 0 from ±1 and so does not rule out values close to ±1. The left panel of Figure 2 shows the beta(1.5, 1.5) density on (ρ + 1)/2. Whereas gamma(2, θ) increases linearly at 0, the slopes of beta(1.5, 1.5) at ±1 are ±∞. Therefore, compared to the gamma(2, θ) prior for σ₁ and σ₂, the beta(1.5, 1.5) for ρ is less informative with lower penalties on the values around the boundaries.

Figure 2.

Conditional density of ρ _ij with Wishart (d + 2, (1/2θ)I) on Σ, θ = 10⁻⁴, where the other parameters are randomly generated from the Wishart distribution for 20 replicates. When d = 2, the conditional density is beta(1.5, 1.5), but for larger d, the curves are more scattered and the supports of the densities become narrower.

The beta priors have been used to avoid boundary estimates of the probability parameter p of the binomial distribution. When the sample proportion is 0 or 1, the traditional Wald confidence interval for p degenerates to the point estimate. To avoid such boundary estimates, Agresti and Coull (1998) specified a beta(2, 2) prior on p. The posterior mean of p then is the sample proportion after adding two successes and two failures to the data. Compared with the beta(2, 2), the beta(1.5, 1.5) tends to assign less penalty at the boundaries and so is less informative.
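The arithmetic behind the beta(2, 2) adjustment is short enough to sketch. The function name and the example counts are made up for illustration.

```python
def posterior_mean_beta(successes, trials, a=2.0, b=2.0):
    """Posterior mean of a binomial proportion under a beta(a, b) prior."""
    return (successes + a) / (trials + a + b)

# Zero successes in ten trials: the sample proportion sits on the boundary,
# while the posterior mean under beta(2, 2) is (0 + 2) / (10 + 4).
p_hat = 0 / 10
p_adj = posterior_mean_beta(0, 10)
```

Here the sample proportion is exactly 0, while the adjusted estimate is 2/14, the proportion obtained after adding two successes and two failures.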

Higher Dimensional Case: d ≥ 3

Similar to the case d = 2, the default prior for d ≥ 3 can be written as a product of σ_r, r = 1,…, d and a function of ρ _rs , the correlation between the rth and sth varying effects (0 < r < s, s = 2,…, d). For example with d = 3, the Wishart(5, (1/2θ)I ₃) prior with θ → 0 can be written as:

p (Σ) \propto | Σ |^{1 / 2} \propto σ_{1} σ_{2} σ_{3} \sqrt{1 - ρ_{12}^{2} - ρ_{23}^{2} - ρ_{13}^{2} + 2 ρ_{12} ρ_{23} ρ_{13}} .

This is a product of gamma(1.5, θ) priors on the variances and a function of the correlations. This function depends on the squares of the correlations, as in the two-dimensional case (Equation 7), but also contains the product of three correlations, which comes from the constraint $| Σ | > 0$ that defines the support of Wishart distributions. Because of this constraint, the Wishart prior automatically restricts the posterior mode of Σ to be strictly positive definite.

The graphs in Figure 2 show the conditional densities of ρ₁₂ when Σ follows the Wishart(d + 2,(1/2θ)I_d ), θ = 10⁻⁴. The curves are the density of ρ₁₂ conditional on the other parameter values (standard deviations and the other correlations) that are randomly generated from Wishart(d + 2,(1/2θ)I_d ) with 20 replicates. When d = 2, the correlation follows beta(1.5, 1.5) as discussed previously. When d = 3, the curves have distinct supports, defined by $1 - {(ρ_{12})}^{2} - {(ρ_{23}^{0})}^{2} - {(ρ_{13}^{0})}^{2} + 2 ρ_{12} ρ_{23}^{0} ρ_{13}^{0} > 0$ where $ρ_{13}^{0}$ and $ρ_{23}^{0}$ for each replicate are given by randomly generated Σ. The curves for d = 5 are more scattered and the supports of the densities tend to be narrower than for d = 2 and 3 due to more restrictions required for the higher dimensional Σ to be positive definite.
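For d = 3 the support of ρ₁₂ given the other two correlations follows from solving the determinant condition as a quadratic in ρ₁₂. The sketch below does this; the function names are hypothetical and the input values 0.5 and 0.5 are chosen only as an example.

```python
import math

def corr_det_3d(r12, r13, r23):
    """Determinant of the 3 x 3 correlation matrix."""
    return 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23

def support_r12(r13, r23):
    """Open interval of r12 values for which the determinant is positive.

    The determinant is a downward-opening quadratic in r12 with roots
    r13 * r23 +/- sqrt((1 - r13^2) * (1 - r23^2)).
    """
    center = r13 * r23
    half_width = math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    return center - half_width, center + half_width

lo, hi = support_r12(0.5, 0.5)
```

With both of the other correlations equal to 0.5, the interval runs from −0.5 to 1, so strongly negative values of ρ₁₂ are excluded by positive definiteness alone.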

The marginal prior densities of ρ _rs are displayed in Figure 3 for d = 2, 5, and 10. With 10,000 replicates, d-dimensional matrices were randomly generated from the Wishart(d + 2,(1/2θ)I) with θ = 10⁻⁴ and 10,000 d(d − 1)/2 correlation coefficients were used to construct the histograms. For d = 2 (left), the distribution of the correlation coefficient matches the beta(1.5, 1.5) density, shown as a solid curve. As d increases, the marginal prior density of ρ _rs becomes more concentrated around zero because of the positive definiteness of Σ.
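The d = 2 case can be reproduced approximately with a small Monte Carlo sketch. It relies on the fact that, for integer degrees of freedom, a Wishart draw is a sum of outer products of independent normal vectors, and that the correlation does not depend on the constant multiplying the identity scale matrix, so standard normal draws suffice. The function name, seed, and number of replicates are arbitrary choices for illustration.

```python
import math
import random

def wishart_correlation(rng, nu=4):
    """Correlation implied by one draw from a 2 x 2 Wishart(nu, c * I)."""
    s11 = s22 = s12 = 0.0
    for _ in range(nu):
        # One N(0, I) vector; its outer product is added to the Wishart draw.
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        s11 += x1 * x1
        s22 += x2 * x2
        s12 += x1 * x2
    return s12 / math.sqrt(s11 * s22)

rng = random.Random(1)
draws = [wishart_correlation(rng) for _ in range(20000)]

# Moments of (rho + 1) / 2, to compare with beta(1.5, 1.5): mean 0.5, sd 0.25.
u = [(r + 1) / 2 for r in draws]
mean_u = sum(u) / len(u)
sd_u = math.sqrt(sum((x - mean_u) ** 2 for x in u) / len(u))
```

The simulated mean and standard deviation of (ρ + 1)/2 should be close to 0.5 and 0.25, the moments of the beta(1.5, 1.5) distribution.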

Figure 3.

Marginal density of ρ _rs with Wishart (d + 2,(1/2θ)I), θ = 10⁻⁴. When d = 2, the marginal density of ρ is equivalent to beta(1.5, 1.5) on (ρ + 1)/2 (solid curve). As d increases, the marginal density has more mass around 0 due to the positive semidefinite constraint of the covariance matrix.

Incorporating Additional Prior Information

In the previous section, we suggested the Wishart(d + 2,(1/2θ)I) with θ → 0 as a default prior when no other information is available. If a researcher has additional prior knowledge about any specific standard deviations or correlations, he or she might want to adjust the prior to incorporate such information. In this section, we suggest multiplying the Wishart prior by functions of the parameters on which we have information. Because the Wishart density ensures that Σ is positive definite, we can choose the functions for the other parameters to be intuitive and easy to specify without regard for the parameter space.

If σ* is a plausible value for σ _r , then the gamma(2,2/σ*) density is recommended as a penalty. Recall that the default Wishart prior is proportional to gamma(2,θ) priors with θ → 0 on each standard deviation, multiplied by a function of the correlations. When the gamma(2,2/σ*) density of σ is multiplied by the Wishart, the part including σ _r becomes $σ_{r}^{2} exp (- 2 σ_{r} / σ^{*})$ . This is proportional to the gamma(3,2/σ*) density that has its mode at σ _r = σ*. The gamma prior with shape parameter greater than two assigns more penalty near zero than for shape parameter equal to two. Therefore, we have a more informative prior with mode at σ*.
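The location of the mode can be checked directly by differentiating the logarithm of the σ _r part of the combined prior:

```latex
\frac{d}{d σ_{r}} [2 log σ_{r} - 2 σ_{r} / σ^{*}] = \frac{2}{σ_{r}} - \frac{2}{σ^{*}} = 0 \Rightarrow σ_{r} = σ^{*} .
```

The second derivative, $- 2 / σ_{r}^{2}$ , is negative, so this stationary point is a maximum.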

If any specific correlation ρ _rs is believed to be close to ρ*, we can incorporate this prior information by multiplying the default Wishart prior by a N(ρ*,τ²) density. As usual, the scale parameter τ can be chosen depending on the prior uncertainty regarding ρ _rs . A possible default choice is τ = .25 because it is the standard deviation of the beta(1.5,1.5) distribution. Figure 4 displays the shape of conditional prior densities of ρ₁₂ with additional normal priors in the three-dimensional case. When ρ₁₃ and ρ₂₃ are fixed at zero (left), the default Wishart(5,(1/2θ)I ₃) prior (solid curve) is fairly flat. In order to incorporate the prior information, for example, ρ* = −.5, the Wishart is multiplied by the N(−.5, .25²) density, and then the prior mode moves toward −.5 (dashed curve). When ρ₁₃ and ρ₂₃ are .5 (middle and right), the support of the Wishart for ρ₁₂ is on [−0.5, 1] because of the constraint of positive definiteness. When our prior value is on the boundary, ρ* = −.5 (middle), the Wishart multiplied by the N(−.5, .25²) density is skewed toward −.5, but still enforces positive definiteness. When the prior value is inside the support, ρ* = .5, the resulting density is less skewed (right).
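The case with ρ₁₃ = ρ₂₃ = .5 and ρ* = −.5 can be sketched numerically. The code below evaluates the logarithm of the conditional prior kernel of ρ₁₂ (the correlation factor of the default Wishart prior times the normal penalty) on a grid and locates its mode. The function name and grid are illustrative assumptions, and normalizing constants are ignored.

```python
import math

def log_prior_r12(r12, r13=0.5, r23=0.5, rho_star=-0.5, tau=0.25):
    """Log conditional prior kernel of r12: Wishart correlation factor plus normal penalty."""
    det = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    if det <= 0:
        return -math.inf                        # outside the positive definite region
    wishart_part = 0.5 * math.log(det)          # from |Sigma|^{1/2}, correlations only
    normal_part = -0.5 * ((r12 - rho_star) / tau) ** 2
    return wishart_part + normal_part

grid = [k / 1000 for k in range(-999, 1000)]    # candidate r12 values in (-1, 1)
mode = max(grid, key=log_prior_r12)
```

Even though the normal penalty is centered on the boundary value −.5, the mode of the product lies strictly inside the support, because the Wishart factor is zero at ρ₁₂ = −.5.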

Figure 4.

Conditional prior density of ρ₁₂ with additional N(−.5, .25²) (left and middle) and N(.5, .25²) (right) densities multiplying the default Wishart prior. The Wishart prior is on three-dimensional Σ and ρ₁₃ and ρ₂₃ are fixed as 0 (left) and .5 (middle and right). The additional normal penalty makes the prior density skewed toward the prior value, but still enforces positive definiteness.

The default prior for ρ in the two-dimensional case is beta(1.5, 1.5), and so it would seem natural to use the beta family for ρ _rs when constructing an additional penalty. However, the parameters of the normal distribution are more intuitive because they represent the prior mean (and mode) and variance. In addition, since the positive definiteness of $\hat{Σ}$ is already guaranteed by the Wishart prior, estimates of Σ remain positive definite regardless of the type of additional penalties that multiply the Wishart prior. Furthermore, computation is no problem in any case; including any closed-form prior density adds essentially no cost to the optimization.

Example: A Varying Intercept, Varying Slope Model in Education Research

We illustrate our approach using a study of Heller et al. (2007) on the effects of the Mathematics Pathways and Pitfalls (MPP) teacher professional development program on mathematics learning for students at different levels of English language proficiency. Half of the 36 teachers were randomized to MPP and the other half to the control condition. Teachers randomized to the MPP condition were taught how to use the materials and then substituted MPP for part of their mathematics curriculum during the 2003–2004 school year, while control teachers used their regular mathematics curriculum. All students received an MPP test as a pretest before the lessons and took the same test after the lessons as a posttest.

Posttest scores are regressed on the mean-centered pretest scores, an indicator for treatment group (1 for MPP and 0 for control), English language learner (ELL) status (1 for ELL and 0 for non-ELL), and the Treatment × ELL interaction term. A varying intercept and a varying slope for ELL status are included to allow for the cluster-randomized design. The model can be written as follows:

y_{i j} = x_{i j}^{T} β + b_{0 j} + b_{1 j} z_{i j} + ε_{i j},

where y_ij is the posttest score for the ith student of the jth teacher, x_ij is the covariate vector that includes the mean-centered pretest score, the treatment group indicator, ELL status, and the interaction between ELL status and treatment, and z_ij is ELL status. As usual, we assume $(b_{0j}, b_{1j}) \sim N(0, Σ)$ and $ε_{ij} \sim N(0, σ_{ε}^{2})$. After dropping observations with missing values on any of the variables, data were available on 755 students and J = 36 teachers, with between 12 and 27 students per teacher. We fit the models by ML and REML using lmer in the lme4 package and by BM using blmer in the blme package.

Table 1 presents ML, REML, and BM estimates with the default Wishart(4, (1/2θ)I_d) prior with θ = 10⁻⁴. Both ML and REML estimates of the correlation between b_0j and b_1j are −1. This implies an unrealistic perfect correlation between the teacher-level slopes and intercepts. The BM estimate of ρ is −.32, and the standard deviation estimate of the varying slope for ELL status increases from 0.71 for ML and 0.48 for REML to 3.64, a change that is within the uncertainty implied by the asymptotic standard error of 2.1 (ML) or 2.2 (REML) for that parameter. The standard deviation of the varying intercept stays similar for ML, REML, and BM.

Table 1.

Parameter estimates for education example

	ML	REML	BM
Fixed effect
Intercept	32.39 (2.01)	32.40 (2.07)	32.31 (2.11)
Pretest	0.56 (0.06)	0.56 (0.06)	0.56 (0.06)
Treatment	12.84 (3.15)	12.81 (3.24)	13.01 (3.30)
ELL	−2.46 (2.73)	−2.54 (2.77)	−2.66 (3.17)
ELL × Treatment	1.00 (4.13)	1.24 (4.19)	1.56 (4.84)
Varying effect (group: teacher)
Intercept SD	8.31 (1.18)	8.62 (1.25)	8.50 (1.22)
ELL SD	0.71 (2.09)	0.48 (2.18)	3.64 (2.50)
Correlation	−1.00 (2.93)	−1.00 (0.00)	−0.32 (0.22)
Residual variance	226.5	227.3	226.3
Log likelihood	−3,153.7	−3,153.8	−3,154.2

Note. ELL = English language learner; SD = standard deviation; ML = maximum likelihood; REML = restricted maximum likelihood; BM = Bayes Modal. The ML and REML estimates imply perfect correlation between the varying intercept and varying slope, whereas BM produces more reasonable estimates. The log likelihood stays almost the same among the three methods. We present results here to more decimal places than would be recommended in practice in order to display the sometimes-small differences between the different estimates.
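To see why the default prior rules out the boundary estimates in the ML and REML columns, it can help to write the log prior density explicitly. The following is a minimal sketch, assuming the standard Wishart density, so that for degrees of freedom d + 2 and scale (1/2θ)I the log density is 0.5 log|Σ| − θ tr(Σ) up to an additive constant. The function name is hypothetical, and the inputs are the rounded estimates from Table 1.

```python
import math

def wishart_log_penalty(sd1, sd2, rho, theta=1e-4):
    """Log density, up to an additive constant, of a Wishart(d + 2, (1/(2*theta)) I)
    prior at the 2x2 covariance matrix defined by (sd1, sd2, rho)."""
    det = (sd1 * sd2) ** 2 * (1.0 - rho ** 2)  # |Sigma|
    trace = sd1 ** 2 + sd2 ** 2                # tr(Sigma)
    if det <= 0.0:
        return -math.inf  # a singular Sigma has zero prior density
    return 0.5 * math.log(det) - theta * trace

pen_bm = wishart_log_penalty(8.50, 3.64, -0.32)        # BM estimates
pen_near = wishart_log_penalty(8.31, 0.71, -0.999999)  # close to the ML boundary
pen_ml = wishart_log_penalty(8.31, 0.71, -1.0)         # ML estimates, rho = -1
print(pen_bm, pen_near, pen_ml)
```

Because the penalty tends to minus infinity as Σ approaches singularity while the log likelihood remains finite, a maximizer of their sum cannot sit on the boundary. With θ this small, the trace term contributes very little at the scale of these estimates.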

The fixed coefficient estimates are similar across estimation methods. The coefficient for the interaction term between ELL and treatment changes the most among all the fixed coefficients, but the differences are negligible considering that the standard errors of the interaction term are greater than 4. The standard errors of the fixed coefficient estimates of Treatment, ELL, and Treatment by ELL are larger for BM than for ML or REML, suggesting that ML and REML underestimate the uncertainty.

The log likelihood at the BM estimates differs from the maximum by less than 1. Figure 5 shows the profile likelihood of ρ (profiling out all the other parameters) divided by its maximum. Although the maximum is attained at ρ = −1, the profile likelihood is very flat, so that even its minimum (at ρ = 1) is only 8% below the maximum. Therefore, all the values of ρ, including ρ = −0.32, are well supported by the data. As is typical in such settings, there is nothing special about the point estimate on the boundary, and it would be inappropriate for a researcher to use that estimate. Our BM approach gives a default procedure that allows a classical statistician to avoid the inappropriate degenerate estimate. A full Bayes approach using real prior information would do better, but our BM approach takes us a bit in the right direction and has the advantage of being fast and easy to implement.

Figure 5.

Profile likelihood of ρ. The maximum likelihood estimate of ρ is −1 but the likelihood has very little information. Therefore, the Bayes modal estimate of −0.3 is also well supported by the data.

When a researcher is interested in comparing teacher-specific effects, b_0j and b_1j can be predicted by their conditional posterior means (or modes), given the estimates of the model parameters and the data (called empirical Bayes prediction or best linear unbiased prediction).

In Figure 6, scatter plots of empirical Bayes predictions of b_1j versus b_0j are displayed with the proportion of ELL students of each teacher represented by the gray scale, that is, black indicates all the students are ELL and white indicates none are ELL. The sizes of the squares are proportional to the numbers of students for each teacher. For ML (left), due to the estimate $\hat{ρ} = -1$, the slopes b_1j are predicted perfectly linearly by the intercepts b_0j. In contrast, BM (right) shows more reasonable predictions for the varying slopes and intercepts. In addition, we can observe that 18 (out of 36) white squares with a gray border fall perfectly on a line; these are teachers without any ELL students in their classes. Four black squares (only three visible due to overlap) correspond to teachers with only ELL students. The 18 groups without ELL students and the 4 groups with only ELL students do not provide any information about the slope variance and intercept variance, respectively, and none of the 22 groups provide information about the correlation between the varying slope and the intercept. This lack of information could be one of the reasons we obtain the boundary estimates using ML and REML. As the group size increases (i.e., the square increases) and the proportion of ELL students increases (i.e., the square gets darker), the empirical Bayes predictions tend to be less shrunken toward the line formed by the white squares.

Figure 6.

Empirical Bayes predictions of varying effects. The size of each square represents n_j for the jth teacher. The ratio of English language learner (ELL) students for each teacher is shown on a gray scale, that is, black indicates all the students are ELL and white indicates none are ELL. For maximum likelihood, b_1j are predicted perfectly linearly in b_0j. On the right graph, Bayes modal estimation shows more reasonable predictions for the varying slopes and intercepts.
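The two linear patterns described for Figure 6 follow from the form of the empirical Bayes predictor in a linear mixed model. A minimal sketch is given below, assuming the usual conditional mean Σ Z_jᵀ (Z_j Σ Z_jᵀ + σ_ε² I)⁻¹ r_j, rewritten as Σ (Z_jᵀ Z_j Σ + σ_ε² I)⁻¹ Z_jᵀ r_j so that only a 2 × 2 system is solved. The residuals r_j = y_j − X_j β̂ and the ELL indicators are hypothetical, the function name is ours, and the residual entry of Table 1 is treated as a variance (an assumption).

```python
def eb_predict(resid, ell, sd0, sd1, rho, resid_var):
    """Empirical Bayes prediction of (b_0j, b_1j) for one teacher.

    resid : residuals y_ij - x_ij' beta_hat for the teacher's students
    ell   : 0/1 ELL indicators, i.e. the second column of Z_j = [1, z_ij]
    """
    n, m = len(resid), sum(ell)          # class size, number of ELL students
    s00, s11 = sd0 ** 2, sd1 ** 2
    s01 = rho * sd0 * sd1
    # A = Z'Z Sigma + resid_var * I, with Z'Z = [[n, m], [m, m]] for binary z
    a00 = n * s00 + m * s01 + resid_var
    a01 = n * s01 + m * s11
    a10 = m * s00 + m * s01
    a11 = m * s01 + m * s11 + resid_var
    det = a00 * a11 - a01 * a10
    # u = Z' r
    u0 = sum(resid)
    u1 = sum(r * z for r, z in zip(resid, ell))
    # w = A^{-1} u, then b = Sigma w
    w0 = (a11 * u0 - a01 * u1) / det
    w1 = (-a10 * u0 + a00 * u1) / det
    return s00 * w0 + s01 * w1, s01 * w0 + s11 * w1

# Hypothetical residuals for one mixed class and one class without ELL students.
resid = [4.0, -7.5, 12.0, 3.5, -2.0, 9.0]
mixed = [0, 1, 1, 0, 0, 1]
no_ell = [0, 0, 0, 0, 0, 0]

b_ml = eb_predict(resid, mixed, 8.31, 0.71, -1.00, 226.5)   # ML column of Table 1
b_bm = eb_predict(resid, no_ell, 8.50, 3.64, -0.32, 226.3)  # BM column of Table 1
print(b_ml, b_bm)
```

With ρ̂ = −1 the fitted Σ has rank one, so every predicted pair satisfies b_1j = −(σ̂₂/σ̂₁) b_0j regardless of the data. For a teacher with no ELL students the data carry no information about b_1j, and the prediction falls on the line b_1j = (ρ̂ σ̂₂/σ̂₁) b_0j, which is the line formed by the white squares.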

Using the fitted covariance matrix, we can calculate the marginal variances and correlations of the posttest score given ELL status. The variance of the posttest scores for ELL students is $\mathrm{Var}(y_{ij} \mid z_{ij} = 1) = σ_{1}^{2} + σ_{2}^{2} + 2σ_{12} + σ_{ε}^{2}$ and, similarly, the variance for non-ELL students is $\mathrm{Var}(y_{ij} \mid z_{ij} = 0) = σ_{1}^{2} + σ_{ε}^{2}$. The covariance between the posttest scores of two students of the same teacher is $\mathrm{Cov}(y_{ij}, y_{i^{*}j} \mid z_{ij} = 1, z_{i^{*}j} = 1) = σ_{1}^{2} + σ_{2}^{2} + 2σ_{12}$ if both students are ELL, $\mathrm{Cov}(y_{ij}, y_{i^{*}j} \mid z_{ij} = 1, z_{i^{*}j} = 0) = σ_{1}^{2} + σ_{12}$ if one student is ELL, and $\mathrm{Cov}(y_{ij}, y_{i^{*}j} \mid z_{ij} = 0, z_{i^{*}j} = 0) = σ_{1}^{2}$ if neither student is ELL.

Table 2 shows these model-implied marginal standard deviations and correlations with estimates from ML and BM substituted for the parameters. These standard deviation and correlation estimates are remarkably similar, which also explains why the log likelihood evaluated at the BM estimates is not much smaller than that evaluated at the ML estimates.

Table 2.

Marginal standard deviations and correlations of posttest scores given ELL status

	ML	BM
SD of ELL student	16.86	17.06
SD of non-ELL student	17.19	17.26
Correlation of (ELL, ELL)	0.20	0.22
Correlation of (ELL, non-ELL)	0.22	0.21
Correlation of (non-ELL, non-ELL)	0.23	0.24

Note. ELL = English language learner; BM = Bayes modal; SD = standard deviation; ML = maximum likelihood. These values do not differ much between ML and BM although the slope standard deviation estimate and correlation estimate increased notably from ML to BM.
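The entries of Table 2 can be approximately recomputed from the rounded estimates in Table 1 using the variance and covariance expressions above. The sketch below is illustrative and assumes that the residual entry of Table 1 (226.5 for ML, 226.3 for BM) is on the variance scale; the function name is hypothetical. Small discrepancies from Table 2 are expected because the inputs are rounded.

```python
import math

def marginal_summary(sd1, sd2, rho, resid_var):
    """Model-implied marginal SDs and within-teacher correlations of posttest scores."""
    s11, s22 = sd1 ** 2, sd2 ** 2
    s12 = rho * sd1 * sd2
    var_ell = s11 + s22 + 2.0 * s12 + resid_var  # Var(y | z = 1)
    var_non = s11 + resid_var                    # Var(y | z = 0)
    return {
        "sd_ell": math.sqrt(var_ell),
        "sd_non": math.sqrt(var_non),
        "corr_ell_ell": (s11 + s22 + 2.0 * s12) / var_ell,
        "corr_ell_non": (s11 + s12) / math.sqrt(var_ell * var_non),
        "corr_non_non": s11 / var_non,
    }

ml = marginal_summary(8.31, 0.71, -1.00, 226.5)  # ML column of Table 1
bm = marginal_summary(8.50, 3.64, -0.32, 226.3)  # BM column of Table 1
for name in ml:
    print(name, round(ml[name], 2), round(bm[name], 2))
```

Although the slope standard deviation and the correlation differ sharply between ML and BM, the terms σ₂² and 2σ₁₂ largely offset each other, which is why the implied marginal moments barely move.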

Simulation

We simulated data from the varying coefficient model as described in the preliminary simulation for Figure 1 but with only one covariate. We explored different values of the correlation ρ (0, 0.225, 0.450, 0.675, and 0.900), setting σ₁ = σ₂ = 0.5, a moderate value. With 1,000 replicated samples generated with J = 5 and n = 30, we estimated the bias and root mean squared error (RMSE) for σ₁, σ₂, and ρ. For ML and REML, the bias and RMSE of $\hat{ρ}$ are based on the replicates that generate legitimate estimates, that is, those in which neither $\hat{σ}_{1}$ nor $\hat{σ}_{2}$ is zero; a zero estimate occurred in 1.2% of the replicates for ML and 0.9% of the replicates for REML. For BM estimation, we assigned a Wishart(4, (1/2θ)I) prior on Σ with θ = 10⁻⁴.

Figure 7 shows the proportion of boundary estimates of ρ (where $1 - |\hat{ρ}| < 10^{-5}$). When ρ is 0, 21% of the ML estimates and 17% of the REML estimates have perfect correlations. As ρ increases, the proportion of $\hat{ρ}$ on the boundary also increases and reaches 60% for ML and 51% for REML when ρ = .9. The BM method does not produce any boundary estimates of ρ for any of the simulation conditions.

Figure 7.

Proportion of maximum likelihood (ML) and restricted maximum likelihood (REML) estimates of ρ that are on the boundary. When ρ = 0, 21% of the ML estimates and 17% of the REML estimates are ±1. As ρ increases, the proportion of estimates on the boundary, $\hat{ρ}$ equal to ±1, also increases and reaches 60% for ML and 51% for REML when ρ = .9.
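The summaries reported for the simulation can be computed from the replicate-level estimates as in the following sketch. It applies the boundary criterion 1 − |ρ̂| < 10⁻⁵ and drops replicates with a zero standard deviation estimate before computing bias and RMSE, as described for ML and REML. The function name and the five replicates are hypothetical and serve only to show the bookkeeping; they are not simulation results.

```python
import math

def summarize_rho(replicates, rho_true, tol=1e-5):
    """Boundary proportion, bias, and RMSE of rho-hat over replicates.

    replicates : list of (sd1_hat, sd2_hat, rho_hat) tuples; rho_hat is
                 ignored when either SD estimate is zero, because the
                 correlation is then undefined.
    """
    valid = [rho for sd1, sd2, rho in replicates if sd1 > 0.0 and sd2 > 0.0]
    n = len(valid)
    boundary = sum(1 for rho in valid if 1.0 - abs(rho) < tol) / n
    bias = sum(valid) / n - rho_true
    rmse = math.sqrt(sum((rho - rho_true) ** 2 for rho in valid) / n)
    return boundary, bias, rmse

# Hypothetical replicate estimates (not taken from the simulation).
reps = [
    (0.61, 0.40, -1.0),
    (0.48, 0.55, 0.20),
    (0.52, 0.00, 0.00),  # zero SD estimate: excluded from the rho summaries
    (0.39, 0.47, 0.50),
    (0.57, 0.62, 1.0),
    (0.44, 0.51, 0.45),
]
boundary, bias, rmse = summarize_rho(reps, rho_true=0.45)
print(boundary, bias, rmse)
```

Whether boundary replicates are counted relative to all replicates or only to those with nonzero standard deviation estimates is a design choice; the sketch uses the latter.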

In spite of the absence of boundary estimates, the log likelihood is not reduced substantially by using BM estimation. Investigating the difference in deviances $(= 2[\log L(\hat{Σ}_{ML}) - \log L(\hat{Σ}_{BM})])$ for all the replicates, the BM method never reduces the log likelihood by more than 2.2 from the maximum.

Figure 8 summarizes the estimated bias and RMSE of $\hat{ρ}$, $\hat{σ}_{1}$, and $\hat{σ}_{2}$. When ρ = 0, the estimated bias of $\hat{ρ}$ is almost zero for all three methods.

Figure 8.

Bias and root mean squared error of ${\hat{σ}}_{1}$ , ${\hat{σ}}_{2}$ , and $\hat{ρ}$ with J = 5 and n = 30 of the varying-coefficient model with σ₁ = σ₂ = .5 and ρ in the grid. In our simulation, with ρ set to various positive values, the bias values are all negative, so we display absolute values to make the graphs easier to read given the convention that high values of bias are bad. Bayes modal (BM) has higher bias for ρ (i.e., shrinking the estimate toward 0) compared to maximum likelihood (ML) and restricted maximum likelihood (REML), but the RMSE is smaller for BM. For both σ₁ and σ₂, BM has smaller bias and RMSE than ML and REML.

ML, REML, and BM all have some bias in estimating ρ, with BM having the most bias (i.e., the most shrinkage toward 0), as would be expected given the regularization from the Wishart prior that squeezes $\hat{ρ}$ toward zero as seen in the shape of the prior density for ρ with d = 2 in Figure 3. However, BM gives the smallest estimated RMSE of $\hat{ρ}$ . The estimated bias of ${\hat{σ}}_{1}$ and ${\hat{σ}}_{2}$ is similar across the different values of ρ for all the estimation methods. The BM estimates of the standard deviations are less biased and have smaller estimated RMSE than ML and REML.

The coverage of 95% confidence intervals for β₀ and β₁ does not change much with ρ. The average coverage of the BM confidence intervals is .940 for β₀ and .943 for β₁. The coverage for REML is about the same as that for BM, whereas ML shows slightly lower coverage with averages of .935 for β₀ and .937 for β₁.

Conclusion

For the hierarchical regression model, particularly with several varying coefficients, degenerate covariance matrix estimates do not have a practical interpretation. Unfortunately, such boundary estimates commonly arise in ML estimation because there is often little information on these parameters when there is only a moderate number of groups. In addition, when $\hat{Σ}$ is singular, underestimated standard errors of the fixed coefficients make the researcher overconfident about the effect of the covariates. When a boundary estimate is attained but no prior information is available for Σ, the BM estimator using the default Wishart prior is recommended because it ensures a strictly positive definite $\hat{Σ}$ and is weakly informative at the same time. The modified gllamm from www.gllamm.org for Stata and the blme package for R allow practitioners to apply our method in a straightforward way.

In varying-slope models, changing the location and scale of the covariates that have varying slopes implies that Σ must change to produce an equivalent model. For example, for longitudinal data, we might want to transform the time variable to have a value 0 at the initial time point. In this case, subtracting a constant from the covariate changes the variance of the varying intercepts and the correlation between intercepts and slopes. Although ML and REML will yield equivalent models after linearly transforming the covariate, this is no longer true for BM estimation, which pulls the correlation toward 0. When using Bayesian regularization in this setting, it therefore becomes more important to choose meaningful centering points for the covariates with varying coefficients.
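The dependence of Σ on the centering point can be made concrete. If the covariate z is replaced by z − c, the equivalent model has intercept b₀ + c b₁ and the same slope b₁, so the new covariance matrix is AΣAᵀ with A = [[1, c], [0, 1]]. The sketch below works this out for the 2 × 2 case; the function name and the numerical inputs are hypothetical.

```python
import math

def shift_covariate(sd0, sd1, rho, c):
    """(sd0, sd1, rho) of the equivalent model after replacing z by z - c.

    The new intercept is b0 + c * b1; the slope b1 is unchanged.
    """
    cov = rho * sd0 * sd1
    var0_new = sd0 ** 2 + 2.0 * c * cov + c ** 2 * sd1 ** 2  # Var(b0 + c * b1)
    cov_new = cov + c * sd1 ** 2                             # Cov(b0 + c * b1, b1)
    sd0_new = math.sqrt(var0_new)
    return sd0_new, sd1, cov_new / (sd0_new * sd1)

# Hypothetical values: uncorrelated effects become correlated after shifting by c = 2.
sd0_new, sd1_new, rho_new = shift_covariate(1.0, 0.5, 0.0, 2.0)
print(sd0_new, sd1_new, rho_new)
```

Because the prior pulls the correlation toward 0 in whichever parameterization is fitted, two centerings that are equivalent under ML or REML are regularized toward different covariance matrices under BM.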

Footnotes

Authors’ Note

The data from Math Pathways and Pitfalls Lessons on Students Mathematics Achievement study are Copyright ©2011 by WestEd. All rights reserved. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and may not reflect the views, findings, or opinions of the National Science Foundation or WestEd.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences (R305D100017), the National Science Foundation (SES-1323977), and the Army Grant (W911NF-14-1-0020). This data set is based upon work supported by the National Science Foundation under Grant No. 9911374 along with materials developed by WestEd.

References

Agresti & Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52, 119–126.

Amemiya (1985). What should be done when an estimated between-group covariance matrix is not nonnegative definite? The American Statistician, 39, 112–117.

Barnard, McCulloch, & Meng (2000). Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, 10, 1281–1312.

Bates & Maechler (2010). lme4: Linear mixed-effects models using S4 classes [Computer software manual]. Retrieved from http://CRAN.R-project.org/package=lme4 (R package version 0.999375-37)

Chung, Rabe-Hesketh, & Choi, I.-H. (2013). Avoiding zero between-study variance estimates in random-effects meta-analysis. Statistics in Medicine, 32, 4071–4089.

Chung, Rabe-Hesketh, Dorie, Gelman, & Liu (2013). A nondegenerate penalized likelihood estimator for variance parameters in multilevel models. Psychometrika, 78, 685–709.

Ciuperca, Ridolfi, & Idier (2003). Penalized maximum likelihood estimator for normal mixtures. Scandinavian Journal of Statistics, 30, 45–59.

Daniels, M. J., & Kass, R. E. (1999). Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association, 94, 1254–1263.

Fay, R. E., & Herriot, R. A. (1979). Estimates of income for small places: An application of James-Stein procedures to census data. Journal of the American Statistical Association, 74, 269–277.

Galindo-Garre & Vermunt (2006). Avoiding boundary estimates in latent class analysis by Bayesian posterior mode estimation. Behaviormetrika, 33, 43–59.

Galindo-Garre, Vermunt, & Bergsma (2004). Bayesian posterior mode estimation of logit parameters with small samples. Sociological Methods & Research, 33, 88–117.

Gelman, Jakulin, Pittau, M. G., Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2, 1360–1383.

Harville, D. A. (1974). Bayesian inference for variance components using only error contrasts. Biometrika, 61, 383–385.

Kolenikov & Bollen, K. A. (2012). Testing negative error variances: Is a Heywood case a symptom of misspecification? Sociological Methods & Research, 41, 124–167.

Lehmann, E. L., & Casella (1998). Theory of point estimation. New York, NY: Springer.

Lahiri (2010). An adjusted maximum likelihood method for solving small area estimation problems. Journal of Multivariate Analysis, 101, 882–892.

Maris (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187–212.

Martin, J. K., & McDonald, R. P. (1975). Bayesian estimation in unrestricted factor analysis: A treatment for Heywood cases. Psychometrika, 40, 505–517.

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.

Morris (2006). Mixed model prediction and small area estimation (with discussions). Test, 15, 72–76.

Morris & Tang (2011). Estimating random effects via adjustment for density maximization. Statistical Science, 26, 271–287.

O’Malley, A. J., & Zaslavsky, A. M. (2005). Cluster-level covariance analysis for survey data with structured nonresponse (Tech. Rep.). Boston, MA: Department of Health Care Policy, Harvard Medical School.

Patterson, H. D., & Thompson (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545–554.

Rabe-Hesketh, Skrondal, & Pickles (2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics, 128, 301–323.

Srivastava & Kubokawa (1999). Improved nonnegative estimation of multivariate components of variance. Annals of Statistics, 27, 2008–2032.

Swaminathan & Gifford, J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50, 349–364.

Tsutakawa, R. K., & Lin, H. Y. (1986). Bayesian estimation of item response curves. Psychometrika, 51, 251–267.

Verbeke & Molenberghs (2000). Linear mixed models for longitudinal data. New York, NY: Springer.

Vermunt & Magidson (2005). Technical guide for Latent Gold 4.0: Basic and advanced (Tech. Rep.). Belmont, MA: Statistical Innovations.

Warton, D. I. (2008). Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association, 103, 340–349.