Two-step analysis of hierarchical data

Abstract

In this article, we describe the package twostep, a bundle of programs to perform analyses of hierarchical data applying the two-step approach. We consider a two-level data setup in which “microlevel” units are nested within “macrolevel” units. One-step models (which can be fit using, for example, mixed) are the most common approach to modeling two-level data. The two-step approach is an alternative in which parameters associated with microlevel and macrolevel predictors are estimated separately for each level. It can be used as an alternative to one-step models if the estimand is a cross-level interaction. We also show how the two-step approach usefully complements one-step approaches by providing exploratory data analysis, descriptive graphs, and regression diagnostics.

Keywords

st0745, twostep, hierarchical model, mixed model, multilevel analysis, two-step modeling, two-stage regression, estimated dependent variable regression, EDV, exploratory data analysis, EDA, cross-level interaction

1 Introduction

Datasets in which lower-level units are nested within higher-level units are frequent in many disciplines. Prime examples in the social sciences are data in which students are clustered within schools, or respondents of international surveys are clustered within countries. In medical research, we find cluster randomized trials, in which, for example, individuals are assigned to receive a control or intervention condition within health clinics or villages (Hayes and Moulton 2017). Observational studies of animals clustered by habitats or cells in organisms are further examples. In linguistics, words are nested within corpora. Further examples from various disciplines could be easily listed.

Data containing several levels are typically termed “hierarchical data” or “multilevel data”. In this article, we focus exclusively on the two-level case, which is arguably the most common. In what follows, we use the term “microlevel” to refer to the first (lowest) level and “macrolevel” to refer to the second level within which first-level units are nested. Correspondingly, microlevel covariates are those that differ across first-level units, and macrolevel covariates are those that vary across second-level units and are common to clusters of first-level units.

The analysis of hierarchical data poses challenges, primarily for the estimation of standard errors but also for the estimation of the effects of macrolevel covariates on a microlevel outcome or of interactions between microlevel and macrolevel covariates (cross-level interactions [CLIs]). The two-step approach discussed here is primarily targeted at studying such higher-level effects and particularly CLIs.

In general, four approaches to modeling hierarchical data can be distinguished (Bryan and Jenkins 2016a, 4–5): regression analysis with cluster–robust standard errors, fixed macrolevel intercept terms, mixed models with macrolevel random effects, and separate analyses on the microlevel for each unit of the macrolevel with subsequent analysis of their results. This article focuses on the last approach, which is commonly referred to as the two-step approach. The two-step approach can be used for hierarchical data with two levels and is particularly useful for data with many observations at the microlevel and a rather limited number of observations at the macrolevel; see section 2 for merits of the two-step approach.

The implementation of the two-step approach in Stata is very simple, as can be shown by the following example using data from a cross-nationally comparative European survey in which individual respondents are nested within countries (discussed further in section 4). After loading the data,

we run a regression for each country (the macrolevel units) separately and write the regression coefficients to the active data frame. In our example, we regress the respondents’ general life satisfaction on household income, sex, and age:

In the second step, selected regression coefficients of these regression models are further analyzed. The standard case is to study their associations with macrolevel covariates. Continuing with our example, we merge a file containing macrolevel variables to the estimated country-level regression coefficients and study how the effect of household income on life satisfaction covaries with a country’s gross domestic product (that is, the CLI of household income and gross domestic product [GDP]):

Typical extensions to this approach involve more formal refinements of the second step’s regression models, exploratory data analysis (EDA) techniques concerning both steps’ analyses, meaningful graphical displays of the results, or the diagnosis of the assumptions of other approaches to study hierarchical data. This article presents the new command twostep, which allows users to perform all of these tasks using a single command. Section 2 presents the basic idea behind the two-step approach. The syntax of the command is outlined in section 3, while section 4 illustrates the many features of twostep using a European dataset in which individual survey respondents are nested within countries.

2 The two-step approach in context

Applying the two-step approach to hierarchical data serves various intentions. Setting up regression models—known as hierarchical or multilevel models—is probably the most important objective researchers have in mind when facing hierarchical data. Accordingly, section 2.1 discusses to what extent the two-step approach can offer an alternative to other approaches that are commonly applied to model hierarchical data. However, the two-step approach can also be used to facilitate refinement of hierarchical models by providing tools for further data exploration, visual representation of results, and conducting diagnostic checks. These applications of the two-step approach are discussed in section 2.2. Section 2.3 documents the methods and formulas for twostep’s regression models.

2.1 Modeling hierarchical data

The defining characteristic of hierarchical data is its nested data structure. In the following, we consider the case where microlevel units i are nested within macrolevel units j; that is, we consider a two-level structure. Given this data structure, a typical (linear) model is

y_{ij} = β_{0j} + x_{ij} β + ϵ_{ij} \qquad (1)

β_{0j} = γ_{00} + z_{j} γ + u_{0j} \qquad (2)

where y_ij is a continuous outcome variable and x refers to a row vector containing (observed) microlevel characteristics of units i = 1,…, N_j nested in the macrolevel units j = 1,…, J. z refers to a row vector containing (observed) variables holding macrolevel characteristics. ϵ_ij and u_0j refer to unobserved effects at the microlevel and macrolevel, respectively. β and γ are column vectors containing the regression parameters to be estimated.

By inserting (2) into (1), we can also express this model as

y_{i j} = γ_{00} + x_{i j} β + z_{j} γ + u_{0 j} + ϵ_{i j}

This model is sometimes called a direct context effect (DCE) model (Heisig, Schaeffer, and Giesecke 2017) because it directly relates macrolevel characteristics to levels of y_ij.

The DCE model allows y_ij to depend on x and z, but it does not contain interaction terms of x and z. To allow such “CLIs”, we can extend the DCE model to

y_{ij} = β_{0j} + β_{kj} x_{kij} + x_{\tilde{k}ij} β_{\tilde{k}} + ϵ_{ij} \qquad (3)

β_{0j} = γ_{00} + z_{j} γ_{0} + u_{0j} \qquad (4)

β_{kj} = γ_{k0} + z_{j} γ_{k} + u_{kj} \qquad (5)

where now $x_{\tilde{k}}$ refers to a row vector containing (observed) microlevel characteristics excluding x_k, the variable whose effect is expected to vary with levels of the macrolevel characteristics z. u_0j, u_kj, and ϵ_ij refer to unobserved effects at the macrolevel and microlevel, and β and γ are regression parameters to be estimated. We have specified only a single CLI for illustrative purposes, but the model can contain multiple CLIs.

Again, the multiple-equation representation above can be converted into a single-equation representation by inserting (4) and (5) into (3):

y_{ij} = γ_{00} + z_{j} γ_{0} + γ_{k0} x_{kij} + z_{j} x_{kij} γ_{k} + x_{\tilde{k}ij} β_{\tilde{k}} + u_{0j} + u_{kj} x_{kij} + ϵ_{ij} \qquad (6)

Such models are commonly known as CLI models, because the effect of at least one microlevel variable is assumed to depend on the levels of at least one macrolevel variable.
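The substitution behind the single-equation form can be written out step by step. The block below is only a worked restatement of (3) to (5) in the notation used above; it introduces no new assumptions.

```latex
\begin{aligned}
y_{ij} &= \underbrace{(γ_{00} + z_{j} γ_{0} + u_{0j})}_{β_{0j}\ \text{from (4)}}
        + \underbrace{(γ_{k0} + z_{j} γ_{k} + u_{kj})}_{β_{kj}\ \text{from (5)}}\, x_{kij}
        + x_{\tilde{k}ij} β_{\tilde{k}} + ϵ_{ij} \\
       &= γ_{00} + z_{j} γ_{0} + γ_{k0} x_{kij} + z_{j} x_{kij} γ_{k}
        + x_{\tilde{k}ij} β_{\tilde{k}} + u_{0j} + u_{kj} x_{kij} + ϵ_{ij}
\end{aligned}
```

The product term z_j x_kij γ_k carries the CLI, and u_kj x_kij is the part of the composite error that makes the error variance depend on x_kij.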

Bryan and Jenkins (2016a, 4–5) differentiate between four general approaches to estimating the parameters of interest in DCE and CLI models. In all four approaches, the unmeasured effects u_0j, u_kj, and ϵ_ij are assumed to be uncorrelated with both x_ij and z_j. Moreover, u_0j and u_kj are assumed to be uncorrelated with ϵ_ij.

1. Pooling the data¹ using cluster–robust standard errors. This approach allows estimation of both DCE and CLI models. To account for the clustering of the data, this approach is implemented in Stata with the option vce(cluster varname) for regress (and many other estimation commands). Allowing for cluster–robust standard errors implies that the combined error term of the model (for example, u_0j + ϵ_ij for the DCE model) can be correlated within macrolevel unit j and be heteroskedastic across macrolevel units j.²

2. Pooling the data and including a set of dummy variables for the macrolevel units. By directly allowing fixed effects for all macrolevel units, this approach does not allow the inclusion of macrolevel characteristics z_j. Thus, fitting DCE models using this approach provides only estimates for β but not for γ. However, including CLIs in the model is perfectly possible. Note that this approach directly estimates u_0j, so that it is no longer part of the combined error term in the single-equation representations. Moreover, estimation of β is based only on variation of y and x within the macrolevel unit j. In Stata, this approach can be implemented by inserting dummy variables identifying the macrolevel units into the regression equation, but often, using areg with the option absorb() or xtreg with the option fe (see [R] areg or [XT] xtreg) is preferable to reduce the amount of output. Moreover, these commands may be combined with vce(cluster varname) or vce(robust) to account for unmodeled heteroskedasticity or correlated error terms, or both.

3. Pooling the data and specifying random effects. This approach permits fitting both DCE and CLI models; that is, some or all of the β coefficients are allowed to vary between the macrolevel units j. As in (4) and (5), β_0j and β_kj can be modeled depending on z_j and on unobserved macrolevel effects (that is, u_0j and u_kj). These unobserved effects at the macrolevel are assumed to be (multivariate) normally distributed. In Stata, this approach is implemented in mixed (see [ME] mixed) with model fitting via maximum likelihood or restricted maximum likelihood; a special case is xtreg with option re. With mixed, the structure of the covariance matrix for the random effects can be specified by covariance(vartype). Again, unmodeled heteroskedasticity and correlated error terms may be accounted for by using vce(cluster varname) or vce(robust). Moreover, by using the option residuals(restype) of mixed, you can model specific patterns of within-group errors.

4. Fitting separate models for each unit j of the macrolevel. When you stratify the analysis on the units of the macrolevel, all β coefficients are permitted to vary between the macrolevel units. This is to say we get estimates for β_0j and for the full set of β_j. In this approach, γ cannot be directly estimated because z_j is a constant in each of the separate models. Instead, estimates for γ can be obtained in a second step in which the estimates ${\hat{β}}_{j}$ are used as the outcome of a regression on z_j. This approach thus allows fitting CLI models. However, it does not allow estimation of DCE models, because ${\hat{β}}_{0j}$ represents the expected value of y_ij in unit j given that all x_ij in unit j are zero. Because of its two-step nature, this approach is commonly termed the two-step approach as opposed to the three one-step approaches discussed in items 1 to 3 (Hanushek 1974; Saxonhouse 1976; Borjas and Sueyoshi 1994; Card 1995; Kedar and Shively 2005; Achen 2005; Donald and Lang 2007).³
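The logic of the fourth approach can be sketched outside Stata in a few lines. The following Python illustration is not how twostep is implemented; it assumes a single microlevel covariate x, a single macrolevel covariate z, and noise-free toy data so that the second step recovers the CLI parameters exactly. The function names `ols_simple` and `two_step` and all data values are hypothetical.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def two_step(micro, macro):
    """Two-step estimate of a cross-level interaction.

    micro: dict mapping macrolevel unit -> (x values, y values)
    macro: dict mapping macrolevel unit -> value of the macrolevel covariate z
    Returns (gamma_k0, gamma_k) from an unweighted second-step regression.
    """
    units = sorted(micro)
    # Step 1: one microlevel regression per macrolevel unit; keep the slope.
    slopes = [ols_simple(*micro[unit])[1] for unit in units]
    # Step 2: regress the estimated slopes on the macrolevel covariate.
    z_values = [macro[unit] for unit in units]
    return ols_simple(z_values, slopes)


# Hypothetical toy data: the slope of x in unit j equals 0.5 + 2 * z_j.
macro = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0}
x_values = [0.0, 1.0, 2.0, 3.0]
micro = {}
for unit, z in macro.items():
    slope_j = 0.5 + 2.0 * z
    micro[unit] = (x_values, [1.0 + slope_j * x for x in x_values])

gamma_k0, gamma_k = two_step(micro, macro)
print(gamma_k0, gamma_k)
```

With real data the first-step slopes carry sampling error, which is why the weighting of the second step discussed in section 2.3 matters.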

Each approach has its pros and cons; see Bryan and Jenkins (2016b, 5–10) for a lucid comparison. In brief, the random-effects approach, if specified correctly, is more efficient than the two-step approach, and both approaches show better statistical performance than the cluster–robust and the fixed-effects approaches (for a simulation study, see Heisig, Schaeffer, and Giesecke [2017]). However, the two-step approach is a valuable and easy-to-implement alternative to the random-effects approach, in particular if the number of microlevel observations within each macrolevel unit is large and the number of macrolevel units is small.⁴ In the following, we highlight three arguments for applying the two-step approach to hierarchical data.

First, the two-step approach brings the scarcity of data points at the macrolevel to the attention of the researcher, which may motivate caution regarding influential data points. Consequently, the two-step approach also highlights to the researcher that there is a practical limit on the number of macrolevel variables that can be included in the second-step regression (Bryan and Jenkins 2016b, 10).

Second, the random-effects approach relies on several distributional assumptions to estimate the β and γ coefficients. Some of these assumptions, such as multivariate normally distributed unobserved macrolevel effects, are not necessary for the two-step approach. Moreover, in the random-effects approach, the estimates of the β_j coefficients are partly based on the microlevel data from macrolevel units other than j. This feature is colloquially termed borrowing strength. However, the underlying (distributional) assumptions to justify borrowing strength may sometimes be quite unrealistic. An example is when national culture or institutions make an effect of some intervention hardly transferable to other countries (Bryan and Jenkins 2016b, 6). Moreover, borrowing strength will lead to bias in β estimates if there is a correlation between u_j and the microlevel variables. In contrast, step 1 of the two-step method is immune to this.

A more formal statistical problem is that the standard errors of the macrolevel coefficients γ from one-step approaches with scarce macrolevel data are imprecise and likely to be biased downward (Hox, Moerbeek, and van de Schoot 2018, 212; Raudenbush and Bryk 2002, 283). Moreover, with respect to the random-effects approach, recent simulation studies show that researchers should allow far more random coefficients for data with scarce macrolevel units compared with the standard practice (Heisig, Schaeffer, and Giesecke 2017; Heisig and Schaeffer 2019). However, in practice this request often leads to problems of convergence in the estimation of the models. At the same time, an incorrectly specified random-effects structure may impede correct statistical inference in these models. The two-step approach is an alternative in such situations because it relies on much weaker distributional assumptions.

For such problems, the two-step approach could be used as a direct alternative to the one-step approach. However, the two-step approach has its own challenges. The model estimated at the second step is heteroskedastic because sampling errors of the parameters estimated at the microlevel are likely to vary between the macrolevel units. It has been pointed out, though, that with many observations at the microlevel, the sampling errors should be sufficiently small so that the second step can be done by ordinary least squares (OLS) (Donald and Lang 2007; Wooldridge 2010, 891–892). Moreover, Hanushek (1974), Borjas and Sueyoshi (1994), Lewis and Linzer (2005), and Donald and Lang (2007) provide weighting matrices for feasible generalized least squares (FGLS). According to Bryan and Jenkins (2016b, n. 4), those FGLS estimates are consistent (and normally distributed) only for large numbers of units at the macrolevel, leading Bryan and Jenkins (2016b) to prefer OLS over FGLS for the second step.

2.2 Descriptive hierarchical data analysis

While statistical inference about γ might be difficult if the number of observations at the macrolevel is small, we can nonetheless study the parameters of the microlevel models using less formal descriptive methods. These descriptive methods are useful irrespective of potential problems concerning statistical inference and may inspire improvements of the more formal hierarchical models.

Three specific (but strongly overlapping) descriptive purposes come to mind:

EDA. The two-step approach gives direct access to the first step’s model parameters. The resulting dataset of parameters can be studied with any of the many techniques for EDA; see, for example, Bowers and Drake (2005). In particular, scatterplots of the coefficients estimated in the microlevel models against macrolevel variables may inspire further hypotheses. twostep implements straightforward generalizations of the bivariate scatterplot for the multivariate case.

Descriptive graphs. The graphical techniques used for EDA can also be used as a graphical device for presenting the results. In addition, a simple graph of the coefficients estimated in the microlevel models for each macrolevel unit may be used. This graph can also be used to communicate estimates for β in research not interested in effects of higher-level variables.

Diagnostics. Because the two-step approach does not impose any assumptions on the distribution of the parameters in the microlevel models, the actual distribution of those parameters can be used to check the validity of the assumptions of the one-step approach.

The command twostep is designed to ease these descriptive usages of the two-step approach. Thus, it not only provides a syntactical framework for using the two-step approach to estimate CLIs but also allows the researcher to complement a specific one-step approach with EDA, descriptive graphs, and further diagnostics. Section 3 describes the syntax of the command and its subcommands.

Before describing the actual command, we document the formulas for twostep’s direct alternatives to the one-step hierarchical models.

2.3 Methods and formulas

Closely following Lewis and Linzer (2005), this subsection documents the formulas for all methods implemented in twostep’s macrolevel regression model. We start the presentation by reprinting (3) and refer to section 2.1 for the definition of the terms.

y_{ij} = β_{0j} + β_{kj} x_{kij} + x_{\tilde{k}ij} β_{\tilde{k}} + ϵ_{ij} \qquad (7)

Using the two-step approach, we first estimate (7) separately for each macrolevel unit j, yielding J estimates for each of the β_j coefficients (J being the number of macrolevel units). In the second step, we use these estimates as the outcome variable, which is regressed on macrolevel characteristics z. Importantly, in the second step, we have an estimated dependent variable (EDV), which should be appropriately considered in the estimation of the macrolevel model. For example, ${\hat{β}}_{kj}$ is then treated as a composite of the true value β_kj and the sampling error in ${\hat{β}}_{kj}$, which we denote by ζ_kj, with E(ζ_kj) = 0 and $Var(ζ_{kj}) = E(ζ_{kj}^{2})$:

{\hat{β}}_{k j} = β_{k j} + ζ_{k j}

If β_kj is modeled to depend on z as in (5), this results in

\hat{β}_{kj} = γ_{k0} + z_{j} γ_{k} + \underbrace{u_{kj} + ζ_{kj}}_{ν_{kj}} \qquad (8)

The error term in (8) is a composite of the sampling error ζ_kj and the error of the macrolevel regression. In line with (5), we denote the latter with u_kj and the joint error term with ν_kj.

The sampling variance of ${\hat{β}}_{k j}$ is likely to differ between macrolevel units (because of different model fits or numbers of observations when fitting the first step models, for example). In this case, the joint error term of the macrolevel regression (ν_kj) is heteroskedastic to the extent of the variation in ζ_kj. Assuming that u_kj and ζ_kj are independent of each other and that there is a constant variance of u_kj across all macrolevel units j [that is, $Var (u_{k j}) = σ_{k}^{2}$ ], the variance of ν_kj is

Var(ν_{kj}) = σ_{k}^{2} + Var(ζ_{kj}) \qquad (9)

where $σ_{k}^{2}$ is the (homoskedastic) variance of u_kj and Var(ζ_kj) is the heteroskedastic sampling variance of ζ_kj. If both $σ_{k}^{2}$ and Var(ζ_kj) were known, an efficient estimator of γ would be given by the weighted least-squares regression

\hat{β}_{kj} w_{kj} = γ_{k0} w_{kj} + z_{j} γ_{k} w_{kj} + ν_{kj} w_{kj} \qquad (10)

where the weights w_kj are defined as

w_{kj} = \frac{1}{\sqrt{σ_{k}^{2} + Var(ζ_{kj})}} \qquad (11)

Equation (11) demonstrates that those macrolevel units j with lower (higher) sampling error with respect to estimating β_kj in the first step get higher (lower) weights in the second-step estimation.
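Why these weights restore homoskedasticity follows directly from (9) and (11). The short derivation below is a worked step in the notation of this section, treating $σ_{k}^{2}$ and Var(ζ_kj) as known:

```latex
Var(ν_{kj} w_{kj}) = w_{kj}^{2} \, Var(ν_{kj})
                   = \frac{σ_{k}^{2} + Var(ζ_{kj})}{σ_{k}^{2} + Var(ζ_{kj})}
                   = 1 \quad \text{for all } j
```

The transformed error term in the weighted regression therefore has the same variance in every macrolevel unit, so OLS applied to the weighted equation is efficient.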

Using (9), the proportion of the total error variance in the macrolevel model arising from the sampling variance in the microlevel models can be defined as

\begin{aligned} \text{Sampling variance proportion for EDV } \hat{β}_{k} &= \frac{\sum_{j} Var(ζ_{kj})}{\sum_{j} Var(ν_{kj})} \\ &= \frac{\sum_{j} Var(ζ_{kj})}{J \times σ_{k}^{2} + \sum_{j} Var(ζ_{kj})} \end{aligned} \qquad (12)

The sampling variance proportion is displayed below the table of coefficients in the corresponding output of twostep. A sampling variance proportion of either zero or one indicates that constraints have been applied upon request of the users or to overcome problems in the estimation of either $σ_{k}^{2}$ or Var(ζ_kj).
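To make the weights in (11) and the sampling variance proportion in (12) concrete, the following Python sketch evaluates both for hypothetical values. It assumes that $σ_{k}^{2}$ and the sampling variances are known, which is not the case in practice; the function names and numbers are illustrative only and are not part of twostep.

```python
import math


def wls_weights(sigma2_k, sampling_vars):
    """Weights from (11): 1 / sqrt(sigma_k^2 + Var(zeta_kj)) for each unit j."""
    return [1.0 / math.sqrt(sigma2_k + v) for v in sampling_vars]


def sampling_variance_proportion(sigma2_k, sampling_vars):
    """Share of the total macrolevel error variance due to sampling error, as in (12)."""
    total_sampling = sum(sampling_vars)
    return total_sampling / (len(sampling_vars) * sigma2_k + total_sampling)


# Hypothetical values: four macrolevel units with increasing sampling variance.
sampling_vars = [0.01, 0.02, 0.03, 0.04]
sigma2_k = 0.025

weights = wls_weights(sigma2_k, sampling_vars)
proportion = sampling_variance_proportion(sigma2_k, sampling_vars)
print(weights)
print(proportion)
```

Units with smaller sampling variance receive larger weights, and setting `sigma2_k` to zero drives the proportion to 1, which mirrors the constrained cases mentioned above.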

In practice, $σ_{k}^{2}$ is unknown, and Var(ζ_kj) is estimated only in the microlevel regressions. Thus, w_kj needs to be estimated to apply (5). twostep provides four such estimates:

The simplest case sets all weights to be constant; that is,

w_{kj} = \text{constant}

which basically assumes that Var(ζ_kj) does not vary between macrolevel units so that Var(ν_kj) is assumed to be constant as well. Setting all w_kj to a constant term results in (5) being estimated by OLS. This yields efficient estimates and consistent standard errors of the γ coefficients as long as Var(ζ_kj) is constant across all macrolevel units. OLS might still perform well if Var(ζ_kj) varies across macrolevel units, but then Var(ζ_kj) has to be very small relative to $σ_{k}^{2}$ [which would result in a small sampling variance proportion; see (12)]. Typically, this happens if the number of microlevel units within each macrolevel unit is large, so that the contribution of ζ_kj to Var(ν_kj) becomes negligible; see Donald and Lang (2007, 223–224). However, if Var(ζ_kj) varies substantially, OLS leads to inefficient estimation of γ and to inconsistent estimation of standard errors; see Lewis and Linzer (2005, 350). This will also be the case if $σ_{k}^{2}$ is nonconstant across the macrolevel units (that is, if the u_kj turn out to be heteroskedastic). To safeguard statistical inference in situations like these, twostep allows the option vce(robust) to request robust standard errors for the macrolevel model.

Saxonhouse (1976) proposes setting the weights to

w_{k j} = \frac{1}{{\hat{ζ}}_{k j}}

where ${\hat{ζ}}_{kj}$ is the estimated standard error of the regression coefficient ${\hat{β}}_{kj}$ estimated using the microlevel data of macrolevel unit j. This approach assumes that all uncertainty in (8) stems from the uncertainty in the microlevel estimation of β_kj, while $σ_{k}^{2}$ is assumed to be zero. This assumption also implies that the sampling variance proportion is 1 [see (12)]. It means that the macrolevel model would explain β_kj completely if it could be measured without sampling error. If this assumption is violated, weighted least squares with these weights leads to inefficient estimation of γ and inconsistent estimation of its standard errors; see Lewis and Linzer (2005, 350).

Hanushek (1974) proposes to set the weights to

w_{k j} = \frac{1}{\sqrt{{\hat{σ}}_{k}^{2} + {\hat{ζ}}_{k j}^{2}}}

with ${\hat{ζ}}_{k j}$ being the estimated standard errors of the microlevel regression coefficient ${\hat{β}}_{k j}$ and ${\hat{σ}}_{k}^{2}$ defined as

\hat{σ}_{k}^{2} = \frac{\sum_{j} \hat{ν}_{kj}^{2} - \sum_{j} \hat{ζ}_{kj}^{2} + tr\{(Z' Z)^{-1} Z' G_{k} Z\}}{J - cols(Z)}

with ${\hat{ν}}_{kj}$ being the residuals of an unweighted regression of ${\hat{β}}_{kj}$ on the macrolevel variables z_j [as in (8)]; that is,

\hat{ν}_{kj} = \hat{β}_{kj} - (\hat{γ}_{k0} + z_{j} \hat{γ}_{k})

and G_k is a J × J diagonal matrix holding ${\hat{ζ}}_{kj}^{2}$ on the diagonal. Z is a matrix containing all variables summarizing macrolevel characteristics for all macrolevel units plus a constant term. tr(·) is the trace operator, and cols(Z) denotes the number of columns in Z, that is, the number of macrolevel variables plus one. ${\hat{σ}}_{k}^{2}$ is set to zero if the estimation returns negative values. This is indicated in the output of twostep by a warning message and, in line with (12), a sampling variance proportion of 1.

Hanushek’s method assumes that the sampling variances $ζ_{kj}^{2}$ of the microlevel estimates are known or that they can at least be reliably estimated. The latter becomes increasingly justifiable as the number of microlevel observations increases; see the simulations by Lewis and Linzer (2005).
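The estimator of $σ_{k}^{2}$ and the resulting weights can be traced numerically. The Python sketch below is an illustration under simplifying assumptions, not twostep's code: it handles only one macrolevel variable plus a constant, so cols(Z) = 2, and it uses the identity that the trace term equals the sum over j of the leverage of unit j times its squared standard error. The function name `hanushek_weights` and all input values are hypothetical.

```python
import math


def hanushek_weights(beta_hat, se_beta, z):
    """Return (sigma2_k, weights) for one macrolevel covariate z plus a constant.

    beta_hat: first-step coefficient estimates, one per macrolevel unit
    se_beta:  their estimated standard errors (assumed strictly positive)
    z:        macrolevel covariate values
    """
    J = len(beta_hat)
    z_bar = sum(z) / J
    b_bar = sum(beta_hat) / J
    szz = sum((zj - z_bar) ** 2 for zj in z)

    # Unweighted second-step regression and its residuals.
    gamma_k = sum((zj - z_bar) * (bj - b_bar) for zj, bj in zip(z, beta_hat)) / szz
    gamma_k0 = b_bar - gamma_k * z_bar
    resid = [bj - (gamma_k0 + gamma_k * zj) for zj, bj in zip(z, beta_hat)]

    samp_var = [s ** 2 for s in se_beta]
    # tr{(Z'Z)^-1 Z'GZ} = sum_j h_jj * se_j^2, with h_jj the leverage of unit j.
    leverage = [1.0 / J + (zj - z_bar) ** 2 / szz for zj in z]
    trace_term = sum(h * v for h, v in zip(leverage, samp_var))

    sigma2_k = (sum(r ** 2 for r in resid) - sum(samp_var) + trace_term) / (J - 2)
    sigma2_k = max(sigma2_k, 0.0)  # negative estimates are set to zero
    weights = [1.0 / math.sqrt(sigma2_k + v) for v in samp_var]
    return sigma2_k, weights


# Hypothetical first-step results for four macrolevel units.
z = [0.0, 1.0, 2.0, 3.0]
beta_hat = [0.0, 1.5, 1.5, 3.0]
sigma2_small_se, weights_small_se = hanushek_weights(beta_hat, [0.1] * 4, z)
sigma2_large_se, weights_large_se = hanushek_weights(beta_hat, [1.0] * 4, z)
print(sigma2_small_se, sigma2_large_se)
```

In the second call the standard errors are so large that the raw estimate is negative; it is truncated at zero, and the weights reduce to the reciprocals of the standard errors.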

Lewis and Linzer (2005, 353) derive another set of weights for the case that the number of observations used for microlevel regressions is known, but the standard errors of the β coefficients are not. In this situation, one can use the fact that Var(ζ_kj) is proportional to 1/N_j and that $E(ν_{kj}^{2}) = σ_{k}^{2} + Var(ζ_{kj})$. The weights are defined as

w_{k j} = \frac{1}{\sqrt{\hat{a} + \hat{b} \times \frac{1}{N_{j}}}}

with $\hat{a}$ and $\hat{b}$ being the estimated constant and coefficient of an OLS regression of ${\hat{ν}}_{kj}^{2}$ on 1/N_j. Thereby, ${\hat{ν}}_{kj}^{2}$ are the squared residuals of an unweighted OLS regression of ${\hat{β}}_{kj}$ on the macrolevel variables z_j. If $\hat{a}$ is negative, the regression model is refit with $a$ constrained to be zero. Because $\hat{a}$ is taken as an estimator of $σ_{k}^{2}$, constraining $a$ to zero leads to a sampling variance proportion of 1. Again, twostep will issue a warning message if constraints are applied.
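The auxiliary regression behind these weights can be illustrated as follows. This Python sketch follows the description above and is not twostep's implementation; the function names, the error raised for nonpositive implied variances, and the input values are assumptions made for the example.

```python
import math


def ols_line(x, y):
    """Return (constant, coefficient) of an OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    coef = sxy / sxx
    return mean_y - coef * mean_x, coef


def lewis_linzer_weights(resid_sq, n_micro):
    """Return (a_hat, b_hat, weights) from regressing squared residuals on 1/N_j."""
    inv_n = [1.0 / n for n in n_micro]
    a_hat, b_hat = ols_line(inv_n, resid_sq)
    if a_hat < 0:
        # Refit with the constant constrained to zero (regression through the origin).
        a_hat = 0.0
        b_hat = sum(r * v for r, v in zip(resid_sq, inv_n)) / sum(v * v for v in inv_n)
    variances = [a_hat + b_hat * v for v in inv_n]
    if min(variances) <= 0:
        raise ValueError("implied error variance is not positive")
    weights = [1.0 / math.sqrt(v) for v in variances]
    return a_hat, b_hat, weights


# Hypothetical squared residuals that follow 0.1 + 2 / N_j exactly.
a1, b1, w1 = lewis_linzer_weights([0.3, 0.2, 0.15, 0.12], [10, 20, 40, 100])

# Hypothetical case in which the unconstrained constant is negative.
a2, b2, w2 = lewis_linzer_weights([0.16, 0.06, 0.01], [10, 20, 40])
print(a1, b1, a2, b2)
```

In the first call the macrolevel units with more microlevel observations receive larger weights; in the second call the constant is constrained to zero, which corresponds to a sampling variance proportion of 1.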

Simulations for more specific characteristics of the estimators under various conditions can be found in Lewis and Linzer (2005), Donald and Lang (2007), Bryan and Jenkins (2016a), Heisig, Schaeffer, and Giesecke (2017), and Heisig and Schaeffer (2019).

3 The command twostep

twostep is not a single command but an interface to a suite of programs with a common syntax. The following description is streamlined to highlight the commonalities of all the programs implemented in twostep. We start by discussing the general syntax of twostep and then give an overview of each of its building blocks.

3.1 General syntax

The general syntax of twostep consists of three sections, one to declare the hierarchical structure of the data, one to specify the models to be estimated at the microlevel, and one to specify the analysis to be performed at the macrolevel. The three sections are separated as follows:

twostep declaration : microlevel_command || macrolevel_command

In the following, we discuss commonalities of the syntax for each of these three sections. We then continue by describing the microlevel commands as well as the macrolevel commands.

3.1.1 Declaration

The hierarchical structure of the data is declared using the prefix command twostep with the following syntax:

twostep macroid [, stats(namelist)] :

macroid refers to a varlist that identifies the macrolevel units in the combined micro–macro-level data. Factor-variable notation is not allowed.

The option stats(namelist) can be used to pass summary statistics of the first step’s microlevel models to some of the second step’s macrolevel commands. namelist can be any scalar stored by the estimation command used for the microlevel regression.

3.1.2 Microlevel command

The microlevel_command can be either a Stata estimation command or one of the special-purpose commands microcpr and microdfb. The standard case is to specify an estimation command.

Users may specify any estimation command such as regress, logit, areg, etc.; see [U] 20 Estimation and postestimation commands. We refer to the description of those commands for their syntax. Any options except the reporting options allowed for the selected microlevel model can be used.

The special-purpose command microcpr draws component-plus-residual plots for all the microlevel regression models; see [R] regress postestimation diagnostic plots. Correspondingly, the special-purpose command microdfb draws boxplots of DFBETAs for each microlevel regression model; see [R] regress postestimation.

3.1.3 Macrolevel command

The macrolevel_command may be one of the following special-purpose commands, which we order here according to the typical uses proposed in section 2:

Estimating CLIs

edv macrodepvar macroindepvars [if] [in] [using filename] [, options]

EDA

avplot macrodepvar macroindepvars [if] [in] [using filename] [, options]

cprplot macrodepvar macroindepvars [if] [in] [using filename] [, options]

regby macrodepvar macroindepvars [if] [in] [using filename] [, options]

macrodepvar macroindepvars [if] [in] [using filename] [, options]

(The last syntax does not contain the name of a macrolevel command; see section 3.3.)

Descriptive graphs

dot macrodepvar macroindepvars [if] [in] [using filename] [, options]

Diagnostics

cmd macrodepvar macroindepvars [if] [in] [using filename] [, options]

Allow more customized analyses

mk2nd macrodepvar macroindepvars [if] [in] [using filename] [, options]

All the macrolevel commands have a common syntax. For all the commands, macrodepvar refers to a coefficient or summary statistic of the first step’s microlevel model. The coefficients are referred to by adding _b_ in front of a variable name or, for factor variables, by adding _b_#_, with # referring to the category of the respective variable. We suggest using the macrolevel command mk2nd for checking the implied coefficient names for factor variables with interactions. Summary statistics requested for the microlevel models can be accessed in the second step’s analysis by preceding their names with _stat_.

macroindepvars refer to a varlist of macrolevel variables. Factor-variable notation is allowed; see [U] 11.4.3 Factor variables. The variables specified must exist, either in the data in memory or in the file specified by using.

filename refers to a dataset containing the macrolevel variables except for the statistics estimated at the microlevel. In this dataset, the macroid specified in the declaration of twostep must uniquely identify the observations. If using is not specified, the macrolevel variables must be stored in the dataset in memory.

weights are not allowed for any of the macrolevel commands.

3.2 Specific remarks on macrolevel commands

3.2.1 edv

The macrolevel command edv fits the estimated dependent variable model (EDV model) of macrodepvar on macroindepvars; in other words, it estimates the coefficients of the CLIs between a selected microlevel variable and macroindepvars. More specifically, twostep with edv performs three tasks:

It fits the microlevel models separately for each macrolevel unit.

It creates a variable for weighting the macrolevel model; see section 2.3 for the calculation of the weighting variable.

It runs a linear regression of the selected coefficient on the specified macrolevel variables, weighted by the created weight variable.

twostep allows any microlevel estimation command in the first step. Hence, it is technically possible to run the EDV model on microlevel coefficients estimated by logit, probit, xtreg, etc. Because the statistical properties of the EDV model have been shown only for linear regression, twostep issues a warning message when the microlevel command is not regress.

twostep with edv is an estimation command that stores its results in e(). This provides access to postestimation techniques, such as estimates table. However, because the model is fit in the macrolevel data, and twostep leaves the original microlevel data unchanged, postestimation commands such as predict cannot be used. Users who want to apply the full set of postestimation commands should use the macrolevel command mk2nd and the standalone command edv.

The command shares the syntax of all macrolevel commands but allows the following specific options:

method(name) defines the function used to weight the observations in the EDV model. name can be any of fgls1 (default), fgls2, ols, and wls:

fgls1 weights the observations using the FGLS approach described by Hanushek (1974). This is the default.

fgls2 weights the observations using the alternative FGLS approach described by Lewis and Linzer (2005, 352–354). This option is particularly useful if the standalone version of edv is invoked on a macrolevel dataset created from estimates printed in an article without reporting standard errors.

ols does not apply weights to the second-level regression. Because the error term of the macrolevel regression is likely to be heteroskedastic, users choosing edv with method ols may also want to consider using option vce(robust).

wls weights the observations by the reciprocal value of the coefficient’s standard errors estimated in the first-step regression; see Saxonhouse (1976) and King (1997, 290); also see Lewis and Linzer (2005, 350).

regress_options are all options allowed for regress; see [R] regress.

3.2.2 avplot

The macrolevel command avplot creates added-variable plots ([R] regress postestimation diagnostic plots) for an EDV model. More specifically, with avplot, twostep performs three tasks:

It fits the microlevel models separately for each macrolevel unit.

It runs a linear regression of macrodepvar on macroindepvars; by default, the regression is weighted according to the default setting of the command edv, that is, method(fgls1).

It creates an added-variable plot with an overlaid regression line, whereby the marker symbols are drawn in a size proportional to the weights of the EDV model. In addition, all marker symbols are labeled in a lighter shade of gray with the value labels or values of macroid.

In the special case of a macrolevel model with just one independent variable, the added-variable plot is identical to a bivariate scatterplot of the mean-centered outcome variable against the mean-centered independent variable.
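This special case can be checked with official Stata commands. The sketch below uses the auto data shipped with Stata as a stand-in and an unweighted regression; all variable names ending in _c are created here for illustration only. With weights, the centering would have to use weighted means.

```stata
* Stand-in data; price plays the role of the outcome, weight the covariate
sysuse auto, clear
regress price weight
scalar b_full = _b[weight]

* Mean-center the outcome and the single independent variable
summarize price, meanonly
generate double price_c = price - r(mean)
summarize weight, meanonly
generate double weight_c = weight - r(mean)

* The slope through the centered data equals the original coefficient
regress price_c weight_c
display "original slope: " b_full "   centered slope: " _b[weight_c]

* Scatterplot of the centered variables with the regression line,
* which is what the added-variable plot reduces to in this case
twoway (scatter price_c weight_c) (lfit price_c weight_c)
```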

The command shares the syntax of all macrolevel commands but allows the following specific options:

regress_options refer to the options noconstant, hascons, tsscons, and vce(vcetype); see [R] regress.

method(name) refers to the option method() of twostep’s macrolevel command edv.

mlabopts(options) defines the labels used to describe the markers of the graph. options can be any of the options allowed for twoway scatter, with mlabel(), mlabsize(), mlabpos(), and mlabcolor() being the most natural choices. msymbol() can be used also, but in this case the graph will display a marker symbol in addition to those shown already. Use mlabopts(mlabcolor(none)) to get rid of all the labeling.

regopts(options) defines the appearance of the regression line. options can be any of the options allowed for twoway line. Among these options, lcolor(), lwidth(), and lpattern() may be particularly useful.

scopts(options) defines the look of the symbols for the macrolevel observations. options can be any of the options allowed for twoway scatter. Among these options, msymbol(), msize(), and mcolor() may be particularly useful.

twoway_options are options allowed for [G-2] graph twoway.

3.2.3 cprplot

The macrolevel command cprplot creates component-plus-residual plots for an EDV model. More specifically, twostep with cprplot performs three tasks:

It fits the microlevel models separately for each macrolevel unit.

It runs a linear regression of macrodepvar on macroindepvars; by default, the regression is weighted according to the default setting of the macrolevel command edv.

It draws a component-plus-residual plot with an overlaid regression line and LOWESS, whereby the marker symbols are drawn in a size proportional to the weights. In addition, all marker symbols are labeled with the value labels or values of macroid.

In the special case of a macrolevel model with just one independent variable, the component-plus-residual plot is identical to a bivariate scatterplot of the outcome variable, centered on the constant of the macrolevel regression, against the independent variable.

The command shares the syntax of all macrolevel commands. It allows the same options as the macrolevel command avplot and the following:

lowessopts(options) defines the appearance of the nonparametric regression (LOWESS) line. options can be any of the options allowed for twoway line. Among these options, lcolor(), lwidth(), and lpattern() may be particularly useful.

3.2.4 regby

The macrolevel command regby creates a plot that shows regression lines of all the microlevel regressions by groups created from macrolevel and microlevel data similarly to what has been proposed by Bowers and Drake (2005). More specifically, with regby, twostep performs three tasks:

It fits the microlevel models separately for each macrolevel unit.

It classifies the macrolevel observations into groups defined by macrolevel variables specified with macroindepvars. By default, these groups are created by cross-classifying dichotomized versions of those macrolevel variables.

It creates plots of the coefficients estimated in the first step and selected by macrodepvar. More specifically, it plots regression lines for each macrolevel observation and groups these plots according to the (cross-classified) groups defined by macroindepvars. The plotted regression lines are defined by a slope that corresponds to the value of the coefficient estimated in the first step and are plotted over the range of the underlying microlevel variable.

To ease the comparison between groups, each subgraph also shows the lines for all macrolevel observations in a lighter shade of gray.

The command shares the syntax of all macrolevel commands and allows the following specific options:

allopts(options) defines the appearance of the lines in the background of the subgraphs. By default, regby draws a background graph showing the regression lines for all macrolevel units. This is intended to ease the comparison of each subgroup with the other groups. However, the background graph can be a nuisance in some situations and is perhaps also a matter of taste. The background options allow the user to fine-tune the background graph or to remove it completely. options can be any of the options listed in [G-3] line_options. Specify allopts(lcolor(none)) to remove the background graph.

byopts(options) is used to define the overall arrangement of the regby plot. options can be any of the options allowed for [G-3] by_option. Among these options, rows(#), cols(#), and compact may be particularly useful. We stress that the overall titles, subtitles, notes, and captions should be specified here.

discrete(macrovarlist) turns off the default categorization of the variables in the macrovarlist (a subset of macroindepvars). Assuming continuous macrolevel variables to be the standard case, regby dichotomizes all the macrolevel variables by default. This automatic grouping is turned off for all variables in the macrovarlist. The option allows any groupings of the macrolevel units by feeding custom-made variables into the discrete() option. See option nquantiles() for other means to change the default categorization of macrolevel variables.

nquantiles(#) defines the number of groups created from each of the macrolevel variables. Assuming continuous macrolevel variables to be the standard case, regby dichotomizes all the macrolevel variables at their median by default. The option nquantiles(#) defines the number of quantile groups to be created as #. For example, nquantiles(4) groups macrolevel variables into four groups by using the first, second, and third quartiles. See option discrete() for other means to control the grouping.
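The kind of grouping described for nquantiles() can be illustrated with the official command xtile. This is only a sketch of the idea on stand-in data (nlsw88, with a hypothetical macrolevel covariate mgrade); how regby forms its groups internally is not documented here, and a custom variable of this kind could instead be passed through discrete().

```stata
* Stand-in macrolevel data: one observation per industry
sysuse nlsw88, clear
drop if missing(industry)
collapse (mean) mgrade = grade, by(industry)

* Two groups split at the median (the default dichotomization described above)
xtile grp2 = mgrade, nquantiles(2)

* Four groups split at the first, second, and third quartiles
xtile grp4 = mgrade, nquantiles(4)

tabulate grp2 grp4
```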

regopts(options) defines the appearance of the regression lines. options can be any of the [G-3] line_options. The line options lcolor() and lwidth() may be particularly useful.

microby(varlist) allows additional groupings based on microlevel variables. In the presence of microby(), the microlevel regression models are fit separately for each combination of the macrolevel identifier and the variables specified. The microby() variables are then also used for the definition of the various subgraphs.

twoway_options are options allowed for [G-2] graph twoway.

3.2.5 dot

The macrolevel command dot creates horizontally labeled dot charts with confidence intervals of the microlevel regression coefficients. More specifically, with dot, twostep performs three tasks:

It fits the microlevel models separately for each macrolevel unit.

It sorts the estimated coefficients or summary statistics according to the predefined or a user-specified criterion.

It draws a horizontally labeled dot chart of the coefficients or summary statistics. If coefficients are displayed, the dots are amended by confidence intervals.

The command shares the syntax of all macrolevel commands and allows the following specific options:

ciopts(options) defines the appearance of the confidence intervals (if any). options can be any of the options allowed for twoway rcap. Among these options, lcolor() and lwidth() may be particularly useful.

scopts(options) defines the appearance of the dots. options can be any of the options allowed for twoway scatter. Among these options, msymbol(), msize(), and mcolor() may be particularly useful.

twoway_options are options allowed for [G-2] graph twoway.

3.2.6 cmd

Besides the more specific commands for the second step, twostep also allows users to specify any Stata command expecting a varlist immediately after the command. To be used as a macrolevel command, the first variable of that varlist must refer to macrodepvar.

While this feature may be used as a kind of fallback mode for analyses not implemented directly into twostep, the most practical use is to create distributional diagnostic plots for the coefficients of the microlevel model; see [R] Diagnostic plots.

3.2.7 mk2nd

The macrolevel command mk2nd creates macrolevel datasets containing selected estimated coefficients, their standard errors, and selected macrolevel variables. This is especially useful for performing more customized analyses on the microlevel coefficients and for gaining flexibility in the creation of graphs for the final publication of the results. Because the syntax deviates slightly from the standard syntax of all the other macrolevel commands, we present it here again:

mk2nd macrodepvar | _all [macroindepvars] [using filename] [, clear]

The definitions of the terms are the same as for all the other macrolevel commands. The major difference is that a user can specify _all instead of macrodepvar, in which case mk2nd writes all coefficients and standard errors of the microlevel models, as well as requested microlevel summary statistics, into the macrolevel data.

We stress that the EDV model can be fit in a dataset created by the macrolevel command mk2nd using the standalone version of edv. The standalone version is used by issuing the command edv without twostep, the declaration, or the specification of the microlevel command. The standalone version of edv allows the same options as the macrolevel command edv (see section 3.2.1). It can also be invoked on data not created with mk2nd if the variables holding the coefficients, the standard errors, and the number of observations exist and follow the naming convention used by mk2nd.

Finally, the option clear specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk.

3.3 The special-purpose microlevel commands

In the standard case, twostep expects the specification of an estimation command as microlevel command. However, there are two special-purpose microlevel commands, microcpr and microdfb, which are described here. Both commands have a common syntax:

microcpr depvar varlist [, hascons noconstant tsscons vce(options)]

microdfb depvar varlist [, hascons noconstant tsscons vce(options)]

depvar and varlist define the dependent and independent variables of a linear regression at the microlevel. Note that using twostep with one of these microlevel commands does not require a specification of a certain macrolevel command. However, the macrolevel section must still be used to define the output (see below).

microcpr and microdfb enter into the syntax instead of a microlevel model. Correspondingly, they perform analyses only at the microlevel. Specifically, microcpr does the following:

It separately estimates a linear regression of the specified dependent variable on the specified independent variables for each macrolevel unit.

It draws a component-plus-residual plot of a selected coefficient for each of the microlevel regressions and combines them into one single graph. By default, the various subgraphs are ordered according to the size of the selected coefficient.

microdfb does the following:

It separately estimates a linear regression of the specified dependent variable on the specified independent variables for each macrolevel unit.

It calculates the DFBETAs of a selected coefficient for each of the fitted microlevel regression models.

It draws the box plots of the DFBETAs over the categories of each macrolevel unit.
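The three steps performed by microdfb can be sketched with official Stata commands. The following is an illustration under stated assumptions, not the code used by twostep: it uses the nlsw88 data shipped with Stata, with industry as a stand-in macrolevel identifier and tenure as the selected regressor; dfb is a variable name chosen here.

```stata
* Stand-in data: workers (microlevel) nested within industries (macrolevel)
sysuse nlsw88, clear
drop if missing(industry)
generate double dfb = .

* Fit the microlevel regression per macrolevel unit and store the DFBETAs
levelsof industry, local(groups)
foreach g of local groups {
    quietly regress wage tenure if industry == `g'
    tempvar d
    * DFBETA of the tenure coefficient for each observation in this unit
    quietly predict double `d' if e(sample), dfbeta(tenure)
    quietly replace dfb = `d' if e(sample)
    drop `d'
}

* Box plots of the DFBETAs over the macrolevel units
graph box dfb, over(industry, label(angle(45)))
```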

As mentioned above, in the case of using the special-purpose microlevel commands microcpr and microdfb, twostep does not expect a specific macrolevel command. Instead, it expects the following input in its macrolevel section:

…|| macrodepvar [macroindepvars] [using filename] [, options]

Despite this difference, the specific meanings of terms remain the same as for all the other macrolevel commands. macrodepvar is used to select the independent variable for which the microlevel component-plus-residual plots and the box plots of the DFBETAs are drawn. macroindepvars specifies the order of the subgraphs and boxes, and the options are used to further control the appearance of the graphs. All options available for the macrolevel command cprplot are allowed for microcpr. Many of the options permitted for graph box are applicable for microdfb; for details, please refer to the help file of twostep.

4 Examples

All examples in this section rely on fabricated data mimicking the European Quality of Life Survey 2003 (see European Foundation for the Improvement of Living and Working Conditions, Wissenschaftszentrum Berlin fuer Sozialforschung [2006]), a cross-nationally comparative survey carried out in 28 European countries.

We wish to study the effect of household income (hhinc) on general life satisfaction (lsat). Loosely following the theory of the silent revolution (Inglehart 1977), we expect a lower effect of household income on life satisfaction in affluent countries. We approach this question by regressing life satisfaction on a CLI between household income and a country’s gross domestic product per capita, while controlling for sex and age.

4.1 Estimating CLIs

The two-step approach may be used as an alternative to the one-step approach to estimate the coefficients of CLIs. This is particularly useful if the assumptions of the various one-step approaches cannot be satisfied or if the parameters of the correct model of the one-step approach cannot be estimated because of issues of convergence. In the following, we provide Stata commands for fitting the CLI model with each of the four approaches discussed by Bryan and Jenkins (2016b).

A common strategy to estimate a CLI would be to use one of the one-step approaches discussed in section 2. For the sake of brevity, the following presents the Stata commands for each of the three one-step approaches without showing the output, but see table 1 for a comparison of the key findings from all four approaches. For simplicity, and because it is not unusual in the field of life-satisfaction research (for example, Hagerty and Veenhoven [2003] and Radcliff [2013]), we also rely on linear regression instead of using ordered logit or ordered probit models.

An example of the CLI model on the pooled data with cluster–robust standard errors is

. regress lsat c.hhinc##c.gdppcap i.sex age [pw = wght], vce(cluster cntrynum)

(output omitted )

One of many ways to calculate the model assuming fixed country effects is to use areg. In the following, we forgo the request for cluster–robust standard errors, although this could of course be done.

. areg lsat c.hhinc##c.gdppcap i.sex age [pw = wght], absorb(cntrynum)

(output omitted )

Another possibility is to revert to mixed, which offers great flexibility in setting up random effects for both the intercepts and slopes. Following the recommendations of Heisig, Schaeffer, and Giesecke (2017) and Heisig and Schaeffer (2019), we allow random effects for all microlevel regressors as well as a flexible covariance matrix for these random effects:

. mixed lsat c.hhinc##c.gdppcap i.sex age || cntry: hhinc i.sex age [pw = wght],

> covariance(unstructured)

(output omitted )

As theoretically expected, all models produce a statistically significant negative coefficient for the interaction term. The two-step analogue to estimate the corresponding CLI is the following:
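A sketch of the corresponding command, assuming the declaration, microlevel model, and weights used in the other twostep examples of this section and the default method(fgls1); it requires the twostep package and the example data:

```stata
* Microlevel: regress per country; macrolevel: EDV model of the hhinc
* coefficient on GDP per capita (assumed specification of the running example)
twostep cntry: regress lsat hhinc i.sex age [pw = wght] ///
    || edv _b_hhinc gdppcap
```

(output omitted )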

In this output, the coefficient for gdppcap corresponds to the CLI of the one-step approach. The coefficient is negative and statistically significant and is in line with the results of the one-step approaches; see table 1 for a comparison. The sampling variance proportion defined in (12) in section 2.3 is offered below the regression table. We conclude that about half the error variance in the macrolevel model stems from the sampling variance.

Table 1.

Estimated CLI from the four approaches

Approach	Coefficient	Standard error	t
Cluster–robust	−0.0322371	0.0100106	−3.220
Fixed effects	−0.0213635	0.0045595	−4.685
Random effects	−0.0248544	0.0086649	−2.868
Two-step approach	−0.0235947	0.0071744	−3.289

Note that separate models must be run if one is interested in the CLIs with other microlevel variables. However, one can estimate the CLI of one microlevel variable with several macrolevel variables in one model. To illustrate the flexibility of twostep, we also show a command that uses an ordered probit model⁵ at the microlevel and two macrolevel covariates for the CLI (gdppcap and urbanpop). We also add some options to both the microlevel and the macrolevel commands.

. twostep cntry: oprobit lsat c.hhinc i.sex age [pw = wght], vce(cluster cntrynum)

> || edv _b_hhinc gdppcap urbanpop, method(ols) vce(robust)

(output omitted )

4.2 EDA

The two-step approach is very useful for EDA. The two variants of the component-plus-residual plots may be used to point to nonlinear associations in both the microlevel models (microcpr) and the macrolevel model (cprplot). Moreover, the microlevel command microdfb and the macrolevel added-variable plot (avplot) may be used to detect influential cases in the microlevel and macrolevel analyses, respectively. Finally, the regby plot proposed by Bowers and Drake (2005) is specifically designed as a tool for EDA.

We stress that there is a large overlap between the kind of EDA presented here and the descriptive graphs and regression diagnostics presented in the next sections. More specifically, all the examples in this section can also be seen as regression diagnostics and as powerful devices for presenting the results. However, in the context of EDA, we consider the graphs a tool for exploration only, not meant to be presented to a larger audience. Thus, we primarily present the graphs here in their default settings. To use them for presentation, we suggest using options similar to those shown in one of the examples of section 4.3.

4.2.1 Microlevel component-plus-residual plot

Building on our running example, we use the following command to create a component-plus-residual plot for the regressor hhinc in the microlevel regressions (figure 1). The command is very similar to the command for the EDV model above. We use cntryiso in the declaration section to prevent overplotting of the subgraph titles, exchange regress with microcpr in the microlevel command, and remove the macrolevel command. Specifying the name of the coefficient in twostep’s macrolevel section is done to select the covariate for which the plot is shown and to define the order of the subgraphs.
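Following this description and the syntax of section 3.3, the command would read roughly as follows. This is a sketch that assumes the microlevel model and weights of the running example; it requires the twostep package and the example data.

```stata
* cntryiso in the declaration, microcpr instead of regress, and only the
* coefficient name (no macrolevel command) in the macrolevel section
twostep cntryiso: microcpr lsat hhinc i.sex age [pw = wght] ///
    || _b_hhinc
```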

Figure 1.

Component-plus-residual plot for the regression coefficient of household income for all countries

The purpose of figure 1 is to examine whether the linearity assumptions invoked for the microlevel models hold. To detect deviations, we should particularly search for countries in which the LOWESS lines (Cleveland 1979) deviate strongly from the regression lines in dense data regions; deviations in sparse regions may simply result from outliers. A close inspection of the graph suggests some suspicious deviations in countries such as Slovakia, Estonia, Hungary, and others. Such deviations indicate the need for further inspection of the data in these countries, as do the strong deviations that might be attributable to outliers. A very close inspection also reveals that the dispersion of household income in Bulgaria is very small, except for one outlier.

4.2.2 Microlevel plot of DFBETAs

The finding of a potentially high-influence point for the microlevel regression in the case of Bulgaria can be further investigated using twostep’s special-purpose microlevel command microdfb. The command shows box plots for selected DFBETAs of the microlevel models; see [R] regress postestimation. DFBETAs measure an observation’s influence on the estimated regression coefficient by calculating the change in the coefficient after the removal of the observation from the data.

In its simplest case, we need to replace only the microlevel command microcpr with microdfb. In our example shown in figure 2, however, we also had to remove the weights because DFBETAs cannot be calculated for weighted regressions:

Figure 2.

Box plots of DFBETAs for household income in the microlevel models by country

The figure shows box plots for the DFBETAs of household income in the microlevel models by country. The boxes themselves are barely visible because of one extraordinarily influential data point in the Bulgarian data. Issuing the command with the macrolevel option marker(),

. twostep cntryiso: microdfb lsat hhinc i.sex age || _b_hhinc, marker(mlabel(id))

(output omitted )

reveals that this influential data point has the ID number 13327. It may be worth noting that fitting the EDV model without that data point substantially reduces the sampling variance proportion of the EDV model while considerably increasing the absolute value of the CLI term.

4.2.3 Macrolevel added-variable plot

The EDV model has indicated that the effect of household income on life satisfaction decreases with a country’s prosperity. Because the estimation of the CLI term relies on only 28 observations, we should study whether this effect merely results from a few highly influential data points. A standard tool to study this is the “added-variable plot”, also known as the partial-regression leverage plot, partial regression plot, or adjusted partial residual plot; see [R] regress postestimation diagnostic plots.

The twostep macrolevel command avplot implements the added-variable plot for the macrolevel model. The syntax is almost identical to that of the EDV model. For the running example, we simply exchange edv with avplot in the macrolevel command; see figure 3:
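Exchanging edv with avplot as described gives a command along the following lines. This is a sketch that assumes the declaration, microlevel model, and weights of the running example; it requires the twostep package and the example data.

```stata
* Same declaration and microlevel model as for the EDV model;
* only the macrolevel command changes from edv to avplot
twostep cntry: regress lsat hhinc i.sex age [pw = wght] ///
    || avplot _b_hhinc gdppcap
```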

Figure 3.

Added-variable plot for GDP per capita for the EDV model

The figure shows the added-variable plot for the macrolevel model of the coefficients of household income on GDP per capita. Because the macrolevel model has only one covariate, the plot is a bivariate scatterplot of the mean-centered regression coefficients on mean-centered GDP per capita. The regression line has the slope of the coefficient of the default EDV model, that is, edv with option method(fgls1). The size of the marker symbols is proportional to the weights used for the EDV model. Observations with both a high leverage (that is, high and low x values) and a high deviance (that is, far away from the regression line) require attention. In this example, Bulgaria is most suspicious, but the weight of Bulgaria is very small. This is because the microlevel regression coefficient for Bulgaria has a large standard error, which is due to the outlier identified above.

4.2.4 Macrolevel component-plus-residual plot

In the EDV model, the effect of household income on life satisfaction is estimated to linearly decrease with a country’s prosperity. However, it is possible that this CLI effect is actually nonlinear, so that the linear model provides a poor description of the way in which a country’s prosperity affects the effect of a person’s household income on their life satisfaction.

A common tool to investigate nonlinearity for multiple regression is the component-plus-residual plot, also known as the partial residual plot; see [R] regress postestimation diagnostic plots. The macrolevel command cprplot creates such a plot for a specified macrolevel model. The syntax is identical to the syntax of the macrolevel added-variable plot. Thus, for figure 4 we simply have to replace avplot with cprplot:
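Under the same assumptions as for the EDV model of the running example (declaration cntry, weighted microlevel regress), the command described here would read roughly as follows; it requires the twostep package and the example data.

```stata
* Macrolevel component-plus-residual plot of the hhinc coefficient
* against GDP per capita (assumed specification of the running example)
twostep cntry: regress lsat hhinc i.sex age [pw = wght] ///
    || cprplot _b_hhinc gdppcap
```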

Figure 4.

The component-plus-residual plot for GDP per capita for the EDV model

Because the macrolevel model has only one covariate, the figure is a bivariate scatterplot of the microlevel regression coefficients centered on the constant of the macrolevel regression on GDP per capita. The slope of the regression line is equal to the coefficient of gdppcap in the macrolevel model. The nonparametric regression curve (LOWESS) shows some indication of a nonlinear relationship. However, this might be attributable only to the potentially high-influence points of Bulgaria, Lithuania, and the Czech Republic.

4.2.5 The regby plot

The macrolevel command regby creates a plot of microlevel regression slopes by groups defined using macrolevel variables. This graph was proposed by Bowers and Drake (2005) for EDA of hierarchical data. The syntax of regby is very similar to that of the previously discussed macrolevel commands; see figure 5:
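A sketch of such a call with default settings, assuming the declaration, microlevel model, and weights of the running example; it requires the twostep package and the example data. With a single macrolevel variable and no nquantiles() or discrete() option, the countries are split at the median of gdppcap.

```stata
* Regression lines of the microlevel hhinc slopes, grouped by
* dichotomized GDP per capita (assumed specification)
twostep cntry: regress lsat hhinc i.sex age [pw = wght] ///
    || regby _b_hhinc gdppcap
```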

Figure 5.

Regression lines of household income on life satisfaction by groups of GDP per capita

The command creates two groups of countries, first the countries with a GDP per capita below the median, second the countries above the median. The graph shows the regression slopes for household income in the microlevel models. Ideally, the slopes should be similar within each group and dissimilar between groups. To ease visual inspection, gray-shaded lines in the background show the slope for all countries. In this example, we see that the more affluent countries quite homogeneously have smaller coefficients for household income when compared with the poorer countries, whose coefficients are also more heterogeneous.

Variants of the plot allow users to control the grouping in various ways. This can be done by adding further macrolevel variables to macroindepvars or by using the options discrete(), nquantiles(), and microby().

4.3 Descriptive graphs

Instead of summarizing the results of the microlevel models with a regression at the second step, the two-step approach also suggests presenting the key parameters of the microlevel models graphically. In twostep, the primary workhorse to this end is the macrolevel command dot. It creates horizontally labeled dot charts of a selected estimated coefficient with confidence intervals, as proposed by Bowers and Drake (2005, figure 1). The basic syntax resembles the syntax of all other macrolevel commands of twostep. Thus, the user simply has to alter the name of the macrolevel command. For the running example, the easiest way of specifying this command would be

. twostep cntry: regress lsat hhinc i.sex age [pweight = wght]

> || dot _b_hhinc

(output omitted )

In the following, however, we present the plot after tweaking it with some common options (figure 6):

Figure 6.

Dot chart with options to tweak the look and feel

Apart from using Stata’s standard twoway and graph options, the most important change to the former command is that we added the variable eu15 to the macroindepvars, which leads to sorting the dots into groups of countries by old and new European Union member countries. Within the groups, the dots are arranged in descending order by estimated coefficients, which is the default sorting. Adding yet another variable would change the order within groups of eu15.

The prime observation of figure 6 certainly is that all regression coefficients are larger than zero, with Bulgaria being exceptional in having both the largest effect of all countries and very large standard errors. Thus, personal affluence increases life satisfaction everywhere. We also see that the effect of household income on life satisfaction is on average a bit larger in the new EU member countries than in the older EU member countries. Finally, the coefficients vary slightly more among the new member countries than among the old member countries. Moreover, the confidence intervals tend to be larger in the former group. This creates suspicion of heteroskedasticity in the corresponding EDV model.

Note that the macrolevel command dot—like any other of twostep’s macrolevel commands—allows using to deal with the macrolevel data stored in a separate file. In the following example, the dots are ordered by a country’s corruption index, which is stored in aggregates.dta but not in the European Quality of Life Survey:

4.4 Diagnostics

The two-step approach can also be used to study the formal assumptions invoked when performing multilevel analysis using the one-step approach. A standard way to check the normality assumption for a random coefficient in a mixed model is to predict the random effects and check their distribution with, say, qnorm. Figure 7 does this to check the random coefficient for household income:

Figure 7.

Quantiles of the predicted random effects for the household income coefficients against quantiles of a normal distribution

Note, though, that these random coefficients are estimated under the assumption that the parameterization of the model is correct. The two-step approach allows checking the same assumption without making a distributional assumption beforehand. This can be done by using a distributional diagnostic plot of the estimated parameters of the microlevel model. The twostep command for this is quite straightforward because twostep allows all plots described in [R] Diagnostic plots to be invoked as macrolevel commands; see figure 8.⁶
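For a normal quantile plot, the official command qnorm can be placed in the macrolevel section. The following is a sketch that assumes the declaration, microlevel model, and weights of the running example; it requires the twostep package and the example data.

```stata
* qnorm is used as a macrolevel command (fallback mode); its varlist
* starts with the microlevel coefficient to be inspected
twostep cntry: regress lsat hhinc i.sex age [pw = wght] ///
    || qnorm _b_hhinc
```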

Figure 8.

Quantiles of country-specific regression coefficients for household income against quantiles of a normal distribution

The figure reveals that there are more very small and very large effects of household income than what can be expected from a normal distribution. This corroborates the notion that the equivalent random-effects models may underestimate the standard error of the CLI.

Technically speaking, the distributional diagnostic plots are examples of twostep’s fallback mode. The fallback mode allows any Stata command to be invoked as a macrolevel command, provided that it is invoked with a varlist in which the first variable refers to a microlevel coefficient or a microlevel summary statistic and any other variables are macrolevel covariates. Thus, twostep allows macrolevel analyses that are described neither here nor in the help file.

Another issue concerns more fine-tuned regression diagnostics of the EDV model. As stated above, twostep does not allow predictions from the macrolevel model. Therefore, twostep comes with the standalone command edv, which can be invoked separately in a dataset created by the macrolevel command mk2nd. The command mk2nd creates a macrolevel dataset with all the coefficients and standard errors from the microlevel model and the requested macrolevel variables:

. twostep cntry: regress lsat hhinc i.sex age [pw = wght] || mk2nd _all gdppcap,
> clear
(output omitted )

This allows any kind of follow-up analysis on the macrolevel data. The EDV model can be invoked in the created dataset by using edv as a standalone command:

Because edv is an estimation command, we can proceed with predictions as usual, for example:

5 Conclusions

The two-step approach is a powerful addition to the statistical toolbox for analyzing hierarchical data. It is particularly useful if the number of microlevel observations within each macrolevel unit is large and the number of observations at the macrolevel is small. In such situations, the two-step approach can be used to estimate coefficients of CLIs and thus serves as an alternative to common one-step approaches such as cluster-corrected standard errors or fixed- and random-effects models on the pooled data. More importantly, however, the two-step approach gives direct access to less formal techniques of EDA that may provide additional insights into the research question at hand. The command twostep is designed to ease these additional applications of the two-step approach.

There are some limitations for the applied researcher when using twostep. First, twostep is primarily intended to analyze hierarchical data with two levels, that is, one microlevel and one macrolevel. While this is the classical textbook case, the hierarchical data to be analyzed might consist of more than two levels. For example, the data might contain information on individuals nested in countries, but for each country there are several repeated cross-sections. In this case, the macrolevel itself can be regarded as pooled cross-sectional (time-series) data, and thus the data represent hierarchical information on three levels (individuals, countries, time). For these kinds of data, twostep might be a starting point, but twostep’s full functionality does not easily transfer to hierarchical data with more than two levels. It is worth noting, though, that cluster–robust standard errors can be requested in the second estimation step of twostep to account for clustering at higher levels.

Second, because the two-step approach is intended to estimate coefficients of CLIs, twostep does not automatically provide point estimates and corresponding standard errors of the average effects of the microlevel characteristics included in the model. However, users interested in these average effects can estimate separate constant-only macrolevel edv models for some or even all microlevel characteristics.


Supplemental Material, sj-zip-1-stj-10.1177_1536867X241257801 for Two-step analysis of hierarchical data by Johannes Giesecke and Ulrich Kohler in The Stata Journal

6 Acknowledgments

twostep with edv is based on edvreg, originally written by Jeffrey Lewis, with contributions by Eduardo Leoni. Jeffrey Lewis (UCLA) gave valuable suggestions for the EDV model. We wish to thank Kekeli Abbey, Lena Hipp, and Armin Sauermann for beta testing. We also want to particularly thank Stephen Jenkins and the anonymous reviewer of the Stata Journal for their critical remarks and tremendous input, which made this article and the programs much better. Ulrich Kohler wishes to thank the participants of the summer 2017 and winter 2020/21 multilevel seminars at the University of Potsdam for commenting on earlier versions of twostep.

7 Programs and supplemental material

To install the software files as they existed at the time of publication of this article, type

References

Achen, C. H. 2005. Two-step hierarchical estimation: Beyond regression analysis. Political Analysis 13: 447–456. https://doi.org/10.1093/pan/mpi033.

Allison, P. D. 1999. Comparing logit and probit coefficients across groups. Sociological Methods and Research 28: 186–208. https://doi.org/10.1177/0049124199028002003.

Borjas, G. J., and G. T. Sueyoshi. 1994. A two-stage estimator for probit models with structural group effects. Journal of Econometrics 64: 165–182. https://doi.org/10.1016/0304-4076(94)90062-0.

Bowers, J., and K. W. Drake. 2005. EDA for HLM: Visualization when probabilistic inference fails. Political Analysis 13: 301–326. https://doi.org/10.1093/pan/mpi031.

Bryan, M. L., and S. P. Jenkins. 2016a. Multilevel modelling of country effects: A cautionary tale. European Sociological Review 32: 3–22. https://doi.org/10.1093/esr/jcv059.

Bryan, M. L., and S. P. Jenkins. 2016b. Supplementary material to “Multilevel modelling of country effects: A cautionary tale”. European Sociological Review 32: 3–22. https://doi.org/10.1093/esr/jcv059.

Card, D. 1995. The wage curve: A review. Journal of Economic Literature 33: 785–799.

Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74: 829–836. https://doi.org/10.1080/01621459.1979.10481038.

Donald, S. G., and K. Lang. 2007. Inference with difference-in-differences and other panel data. Review of Economics and Statistics 89: 221–233. https://doi.org/10.1162/rest.89.2.221.

European Foundation for the Improvement of Living and Working Conditions, Wissenschaftszentrum Berlin fuer Sozialforschung. 2006. European Quality of Life Survey, 2003. [Data collection]. U.K. Data Service. SN: 5260. https://doi.org/10.5255/UKDA-SN-5260-1.

Hagerty, M. R., and R. Veenhoven. 2003. Wealth and happiness revisited—Growing national income does go with greater happiness. Social Indicators Research 64: 1–27. https://doi.org/10.1023/A:1024790530822.

Hanushek, E. A. 1974. Efficient estimators for regressing regression coefficients. American Statistician 28: 66–67. https://doi.org/10.1080/00031305.1974.10479073.

Hayes, R. J., and L. H. Moulton. 2017. Cluster Randomised Trials. 2nd ed. New York: Chapman and Hall/CRC. https://doi.org/10.4324/9781315370286.

Heisig, J. P., and M. Schaeffer. 2019. Why you should always include a random slope for the lower-level variable involved in a cross-level interaction. European Sociological Review 35: 258–279. https://doi.org/10.1093/esr/jcy053.

Heisig, J. P., M. Schaeffer, and J. Giesecke. 2017. The costs of simplicity: Why multilevel models may benefit from accounting for cross-cluster differences in the effects of controls. American Sociological Review 82: 796–827. https://doi.org/10.1177/0003122417717901.

Hox, J. J., M. Moerbeek, and R. van de Schoot. 2018. Multilevel Analysis: Techniques and Applications. 3rd ed. New York: Routledge. https://doi.org/10.4324/9781315650982.

Inglehart, R. 1977. The Silent Revolution. Changing Values and Political Styles Among Western Publics. Princeton, NJ: Princeton University Press. https://doi.org/10.1515/9781400869589.

Kedar, O., and W. P. Shively. 2005. Introduction to the special issue. Political Analysis 13: 297–300. https://doi.org/10.1093/pan/mpi027.

King, G. 1997. A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton, NJ: Princeton University Press. https://doi.org/10.1515/9781400849208.

Lewis, J. B., and D. A. Linzer. 2005. Estimating regression models in which the dependent variable is based on estimates. Political Analysis 13: 345–364. https://doi.org/10.1093/pan/mpi026.

Radcliff, B. 2013. The Political Economy of Human Happiness: How Voters’ Choices Determine the Quality of Life. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139344371.

Raudenbush, S. W., and A. S. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods. 2nd ed. Thousand Oaks, CA: Sage.

Saxonhouse, G. R. 1976. Estimated parameters as dependent variables. American Economic Review 66: 178–183.

Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.
