Abstract
A concrete example is used to illustrate maximum likelihood estimation of a structural equation model with two unknown parameters. The fitting function is found for the example, as are the vector of first-order partial derivatives, the matrix of second-order partial derivatives, and the estimates obtained from each iteration of the Newton-Raphson algorithm. The goal is to provide a concrete illustration to help those learning structural equation modeling bridge the gap between the verbal descriptions of estimation procedures and the mathematical definition of these procedures provided in the technical literature.
Keywords
Concrete examples are often used in the teaching and learning of mathematics and are advocated at virtually all levels of mathematics education, from K-12 instruction, as evidenced by the standards of the National Council of Teacher’s of Mathematics (National Council of Teachers of Mathematics, 1989) to acquisition of higher level mathematics skills. One can easily extend the reasoning behind the need to provide students with concrete examples and representations beyond the purely mathematical realm into the world of statistics. For basic statistical concepts, like the standard deviation, worked-out examples using a definitional formula are provided in most introductory texts, and additional examples can be readily generated. For more advanced analyses, like multiple regression, canonical correlation, and multivariate analysis of variance, worked-out examples can be found in Pedhazur (1997), Tatsuoka and Lohnes (1988), and Stevens (2002), respectively. As statistical analyses get more complicated, the availability of concrete examples diminishes.
In the area of structural equation modeling, it is relatively difficult to find or generate concrete examples illustrating parameter estimation algorithms. Such examples, however, are valuable in helping learners bridge the gap between the verbal descriptions of estimation procedures and the mathematical definition of these procedures provided in the technical literature. We have found the example illustrating maximum likelihood estimation of a single parameter model given by Bollen (1989) to be useful to learners. Our purpose is to provide a more complex example for learners who benefit from concrete illustrations. We have chosen a two-parameter model for the example. Having multiple parameters enables us to maintain vectors and matrices in the computations, whereas limiting our model to two parameters keeps the computations relatively simple. Maximum likelihood estimation is illustrated because this is the method most commonly used to estimate structural equation models and the first method learners typically try to understand.
Context for the Example
In this example, we consider a single latent exogenous variable, ξ, being used to predict a single latent endogenous variable, η, both of which have single-indicator variables (X and Y, respectively). The model is shown in a path diagram in Figure 1, where ovals indicate the latent variables of interest, rectangles indicate observed variables used as indicators of the latent variables, circles represent errors, single-headed arrows represent effects, and double-headed arrows originating and ending on the same variable indicate variances.
For illustrative purposes, we have assumed known values for five of the model parameters and unknown values for the other two. The measurement error (δ) of indicator variable X is assumed to have a variance of 2, and the measurement error (ɛ) of indicator variable Y is assumed to have a variance of 4. Other parameters that we have assumed a value for include the variance of latent variable ξ to be 8 and the coefficients from both latent variables ξ and η to their single indicators variables, X and Y respectively, to be 1. The two parameters that we are estimating in this example are the coefficient from latent variable ξ to latent variable η, represented by gamma (γ) and the variance of the disturbance of the endogenous latent variable η, represented by psi (ψ).
This model can be mathematically represented by a series of matrices. Using the eight-matrix notation that over the years has been made familiar by the LISREL (Linear Structural RELationships) software package (Jöreskog & Sörbom, 2004), the model parameters can be represented by the following series of matrices:
where Λ x contains the coefficient from ξ to its indicator variable X, Λ y contains the coefficient from η to its indicator variable Y, θδ contains the variance in the measurement error of X, θɛ contains the variance in the measurement error of Y, Φ contains the variance in ξ, β becomes zero because this model has no effects between latent endogenous variables, Ψ contains the variance in the disturbance of η, and Γ contains the coefficient from ξ to η.
Implied Covariance Matrix
If a structural equation model was properly specified, and if we knew all the parameter values, we could compute the model’s implication for the covariance between all pairs of variables. Put another way, the model and parameter values, θ, imply a covariance matrix, ∑(θ). The general form of the implied covariance matrix for structural equation modeling has been discussed (Hayduk, 1987) and derived (Bollen, 1989) elsewhere, and is given by
where I is the identity matrix. By substituting the parameters from our example, this general equation can be simplified, yielding
Observed Covariance Matrix
Assume the observed covariance matrix is
where 20 is the observed variance of X, 10 is the observed variance of Y, and 5 is the observed covariance between X and Y.
Maximum Likelihood Fitting Function
When we estimate parameter values for a structural equation model, we are trying to pick values for the parameters that make the implied covariance matrix as close as possible to the observed covariance matrix, or in other words, we try to minimize a discrepancy function between the model implied covariance matrix and the observed covariance matrix. There are a variety of different ways to define the discrepancy function, and thus alternative ways to estimate the parameters. The most common estimation method, maximum likelihood, is based on a normality assumption that allows the likelihood of the observed covariance matrix to be considered for a given set of parameter values. Different sets of values are considered, and then the set that maximizes the likelihood of the observed covariance matrix is selected. It can be shown that the values that maximize the likelihood of the data are the values that minimize the maximum likelihood fitting function,
where p + q is the number of variables, S is the observed covariance matrix, and ∑(θ) is the implied covariance matrix. Bollen (1989) provides detailed derivation of the likelihood function and shows its relationship to the fitting function. The fitting function can be simplified for our example by substituting our implied and observed covariance matrices,
This equation can be simplified to Equation 6 and further simplified to Equation 7 through matrix algebra (see Hayduk, 1987, or Bollen, 1989, for details about inverses, determinants, traces, and multiplication of matrices in the context of structural equation modeling). The minimization process is unaffected by the constants in the function [log(175) and 2], thus these could be dropped from the equation.
Minimizing the Fitting Function
Numerical methods are used to minimize the fitting function. The algorithm begins with an initial guess for each of the parameters to be estimated. These guesses become the start values, which are then adjusted by considering how the fitting function is changing at these particular values. The adjusted values are then considered and also adjusted. The iterative process of making adjustments continues until it appears that the minimum of the fitting function has been reached. There are a variety of algorithms for minimizing the fitting function (e.g., Newton-Raphson, Fletcher-Powell, Gauss-Newton, Levenberg-Marquardt). The algorithms differ from each other in the details of how the adjustments, or steps, in the process are defined. We will illustrate the Newton-Raphson algorithm because it is relatively straightforward, but complex enough to facilitate thinking about the first- and second-order partial derivatives of the fitting function. A step in the Newton-Raphson algorithm is generally defined by
where the vector of parameter estimates at iteration i + 1, θ̂(
i
+1), involves an adjustment of the parameter estimates from the previous iteration, θ̂(
i
). The size of the adjustment depends on the gradient,
The gradient contains the partial slopes for the surface of the fitting function. As we get closer to the minimum, these slopes approach zero. The Hessian matrix contains information about the rate the partial slopes are changing. Using the gradient and inverse Hessian to adjust the estimates allow relatively large adjustments to be made when we are far from the minimum, and finer and finer adjustments to be made as we get closer to the minimum.
To illustrate the use of the Newton-Raphson algorithm in this example, we must first find the first-order and second-order partial derivatives of the fitting function. In the next section, we will find the first-order partial derivatives of the fitting function for our example, and in the following section, we will find the second-order partial derivatives. Once the partial derivatives have been found, we will return to the Newton-Raphson algorithm and illustrate its application to our example.
Vector of First-Order Partial Derivatives of F ML
An element of the vector of first-order partial derivatives is found by finding the derivative of the fitting function with respect to a particular parameter. For our example, the gradient, or vector of first-order partial derivatives, is
The process of finding the first-order partial derivative of the fitting function with respect to γ for our example is illustrated in Equations 10 through 12. Equation 10 shows that we intend to take the derivative of the fitting function. Equation 11 shows the result of taking the derivative with respect toγ, and Equation 12 shows the result after simplifying the expression.
We then find the first-order partial derivative of the fitting function with respect to ψ for our example, which is illustrated in Equations 13, 14, and 15.
Matrix of Second-Order Partial Derivatives of F ML
The matrix of second-order partial derivatives of F ML , the Hessian matrix, for our example is given by
Note that each element of the Hessian matrix is found by taking the derivative of one of the first-order partial derivatives with respect to one of the parameters.
The first element of the first row of this matrix is obtained by taking the derivative of
The second element of the second row is obtained by taking the derivative of
The off-diagonal elements can be found by either taking the derivative of
Results of Each Iteration
Now that we have solved for the first-order and second-order partial derivatives, we can illustrate the use of the Newton-Raphson algorithm for our example, where each step is defined by
To start the estimation process, we need some initial estimates for γ and ψ. There are alternative ways of doing this, ranging from subjective estimates based on the researcher’s understanding of the model to noniterative estimation approaches like the instrumental variables method (e.g., Bollen, 1989). For simplicity, we chose to provide subjective estimates for the start values, γ̂(1) = .5 and ψ̂(1) = 5. By substituting these values into γ̂(
i
), ψ̂(
i
),
which leads to values of 0.6101465 for γ̂(2) and 7.7471531 for ψ̂(2). These estimates are then substituted into γ̂(
i
)
, ψ̂(
i
),
The estimates are now γ̂(3) = 0.6229589 and ψ̂(3) = 10.53068.
To conserve space, the results of the iterations are summarized in Table 1. In this table, we list the estimates of γ and ψ, the elements of the gradient, and the value of the maximum likelihood fitting function for each iteration. Note that the changes in γ and ψ are relatively large during the first few iterations and get smaller as the iterations continue. Also notice the elements of the gradient and the maximum likelihood fitting function are getting smaller as the iterations increase. For this example, the elements of the gradient become zero on the seventh iteration. In addition, the value for the fitting function becomes zero, which implies the estimates of 0.625 for γ and 12.875 for ψ lead to an implied covariance matrix, ∑(θ), that does not differ from the sample covariance matrix, S.
Alternative Estimation Methods
For this small example, we could have arrived at estimates for γ and ψ, by equating the implied covariance matrix to the sample covariance matrix,
Separate equations can be written for each element, leading to two equations containing two unknowns,
Solving Equation 30 for γ leads to 0.625, and substituting 0.625 for γ in Equation 31 and solving for ψ gives 12.875.
In more complex situations, there are typically more equations than unknowns. Attempts to solve the equations for the unknowns will often lead to multiple estimates for the same unknown because sampling error keeps the elements of S from being identical to the elements of ∑(θ). It is these situations that motivate more complex estimation methods. Different methods (e.g., maximum likelihood, weighted least squares) are based on different assumptions (e.g., maximum likelihood assumes multivariate normality) and will often lead to different estimates, so care should be taken to choose an estimation method appropriate for the application. Finally, it should be noted that iterative methods will often not reach a fitting function value of zero. We say that we have met the convergence criterion when all elements of the gradient become less than some set value (e.g., .000001), which implies the estimates are changing trivially from one iteration to the next.
Closing Comments
For the purpose of illustrating maximum likelihood estimation in the context of structural equation modeling it was useful to focus on a relatively simple model. As models and data become more complex, researchers should be aware that estimation does not always proceed as smoothly as it did with this example. If estimation problems are encountered, it may be useful to consider the following strategies: (a) changing the start values, (b) changing the estimation algorithm, (c) increasing the maximum number of iterations, (d) screening the data for outliers and other anomalies, (e) using boundary constraints, (f) rescaling any variable with a variance many orders of magnitude larger than the variances of the other variables, (g) checking the identification status of the model, (h) checking the consistency between the model specified in the structural equation modeling software and the path diagram, and (i) reexamining the consistency between the path diagram and theoretical expectations. For details on the problems that can be encountered, potential solutions, and limitations to these potential solutions, see Chen, Bollen, Paxton, Curran, and Kirby (2001); Wothke (1993); and Bentler and Chou (1987).
Footnotes
JOHN M. FERRON is Professor of Educational Measurement and Research at the University of South Florida, 4202 East Fowler Ave. EDU 162, Tampa, FL 33620;
MELINDA R. HESS is Director of the Center for Research, Evaluation, Assessment, and Measurement at the University of South Florida, 4202 East Fowler Ave. EDU 162, Tampa, FL 33620;
