Multiple Imputation of Squared Terms

Abstract

We propose a new multiple imputation technique for imputing squares. Current methods either yield unbiased regression estimates or preserve data relations. No method, however, seems to deliver both, which limits researchers in the implementation of regression analysis in the presence of missing data. Moreover, current methods only work under a missing completely at random (MCAR) mechanism. Our method for imputing squares uses a polynomial combination. The proposed method yields unbiased regression estimates while preserving the quadratic relations in the data, under both missing at random (MAR) and MCAR mechanisms.

Keywords

multiple imputation, polynomial combination, quadratic relation, regression estimate, squared terms

Introduction

Multiple imputation (MI) is the method of choice for many incomplete data problems. MI incorporates the uncertainty about the missing data by creating $m > 2$ imputed data sets. Missing values are filled in under an imputation model. The imputed data that result from the imputation model are then analyzed by the analysis model. The separate analyses can be combined into a single inference or set of estimates by means of the combining rules derived by Rubin (1987).
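As a rough illustration of the combining rules, the following Python sketch pools m point estimates together with their squared standard errors. The function name `pool_rubin` and the example numbers are hypothetical and are not taken from this study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their within-imputation variances (Rubin 1987)."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(variances)          # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Hypothetical slope estimates and squared standard errors from m = 5 imputed data sets.
q_bar, total = pool_rubin([1.02, 0.97, 1.01, 0.99, 1.01],
                          [0.010, 0.011, 0.009, 0.010, 0.010])
```

The total variance adds the between-imputation variance, inflated by the factor 1 + 1/m, to the average within-imputation variance, which is how the uncertainty about the missing data enters the final inference.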

The most critical part of MI is specification of the imputation model. It is widely accepted that the imputation model should embrace all relations of scientific interest. Usually, this is done by incorporating the variables of interest as main factors. However, things become less clear if the scientific model contains nonlinear terms.

As an example, if we want to predict Y from X and its square $X^{2}$ , then both X and $X^{2}$ should be included in the imputation model. Leaving the term $X^{2}$ out of the imputation model will result in a downward bias of the slopes when we perform a regression analysis on the imputed data. However, although it is generally agreed that all squares and interactions should be accounted for in MI, no consensus on how to do this has been reached.

Von Hippel (2009) reviewed several approaches to imputing squares. The “transform, then impute” method calculates the squares and interactions in the incomplete data for the cases that have no missing values, and then imputes the derived variable like any other variable. The “impute, then transform” method imputes variables in their raw form, and then calculates the derived variable in the imputed data after imputation. These methods were compared to the passive imputation method (Van Buuren, Boshuizen, and Knook 1999), implemented in the mice package in R (Van Buuren and Groothuis-Oudshoorn 2011), and the ice command for Stata (Royston 2005).

Von Hippel (2009) advises using the transform-then-impute method, which delivers acceptable regression estimates but heavily distorts the relationship between X and $X^{2}$. Figure 1 shows that for the transform-then-impute method, imputations do not follow the relation in the population (observed) data. We agree with Von Hippel's conclusion, but do not want to overlook that the transform-then-impute method yields combinations of imputed values that would never occur had the data been observed. Such imputations are implausible and should be rejected on that ground.

Figure 1.

Transform-then-impute imputations. Observed (blue) and imputed values (red) for X and $X^{2}$ .

We must note that Von Hippel's conclusions are based on a missing completely at random (MCAR) mechanism (Seaman, Bartlett, and White 2012), where the missingness does not depend on the data, which is a limitation in practice. An imputation method would be more widely applicable if it yielded acceptable inference under the missing at random (MAR) mechanism, where the missingness may depend on the observed data, but must not depend on the missing data itself.

Because existing methods for imputing squared terms are severely limited, we propose the polynomial combination approach, which yields unbiased regression estimates, while preserving the consistency between the imputed values, for MAR and MCAR mechanisms.

Method

Formulation of the Problem

The model of scientific interest is

$$Y = α + X β_{1} + X^{2} β_{2} + ϵ, \tag{1}$$

with $ϵ \sim N(0, σ^{2})$. We assume that Y is complete and that $X = (X_{obs}, X_{mis})$ and $X^{2} = (X_{obs}^{2}, X_{mis}^{2})$ are partially missing. The problem is to find imputations for X such that estimates of α, $β_{1}$, $β_{2}$, and $σ^{2}$ are unbiased, while ensuring that the quadratic relation between X and $X^{2}$ will also hold in the imputed data.

Polynomial Combination Method

Define the polynomial combination $Z = (Z_{o b s}, Z_{m i s})$ as the linear combination $Z = X β_{1} + X^{2} β_{2}$ . The idea is to impute the missing values in Z instead of X and $X^{2}$ , followed by decomposing the imputed data Z into components X and $X^{2}$ . Imputing Z reduces the multivariate imputation problem to a univariate problem, which is easier to manage.

Under the assumption that $P (Y, Z)$ is multivariate normal, we can impute the missing part of Z as $Y β^{*} + ϵ^{*}$. Here $β^{*}$ is a random draw from the posterior distribution of the coefficient in the linear regression of Z on Y, and $ϵ^{*}$ is a draw from the distribution of the residuals $Z - Y \hat{β}$. In cases where a normal residual distribution is unrealistic, we can use predictive mean matching (PMM; Little 1988).
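To make the univariate imputation step more concrete, here is a minimal Python sketch of predictive mean matching for Z given Y. It is a simplified variant under stated assumptions: the regression coefficients are fixed at their least-squares estimates instead of being drawn from their posterior, the number of donors `k = 5` is an arbitrary choice, and the function name `pmm_impute` and the simulated inputs are hypothetical.

```python
import random


def pmm_impute(y_obs, z_obs, y_mis, k=5, rng=random):
    """Simplified predictive mean matching of Z given Y.

    Assumption: coefficients are fixed at their least-squares estimates;
    a proper PMM would first draw them from their posterior distribution.
    """
    n = len(y_obs)
    y_bar = sum(y_obs) / n
    z_bar = sum(z_obs) / n
    syz = sum((y - y_bar) * (z - z_bar) for y, z in zip(y_obs, z_obs))
    syy = sum((y - y_bar) ** 2 for y in y_obs)
    slope = syz / syy
    intercept = z_bar - slope * y_bar

    pred_obs = [intercept + slope * y for y in y_obs]
    imputations = []
    for y in y_mis:
        pred = intercept + slope * y
        # Indices of the k observed cases whose predicted Z is closest.
        donors = sorted(range(n), key=lambda i: abs(pred_obs[i] - pred))[:k]
        # The imputation is the observed Z of a randomly chosen donor.
        imputations.append(z_obs[rng.choice(donors)])
    return imputations


rng = random.Random(1)
y_obs = [rng.gauss(0, 1) for _ in range(200)]
z_obs = [0.8 * y + rng.gauss(0, 0.5) for y in y_obs]  # hypothetical observed Z
y_mis = [rng.gauss(0, 1) for _ in range(50)]
z_star = pmm_impute(y_obs, z_obs, y_mis, rng=rng)
```

Because every imputation is borrowed from an observed case, the imputed Z values stay within the range of the observed Z values, which is why PMM is attractive when normal residuals are unrealistic.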

The next step is to decompose Z into X and $X^{2}$. Under model (1) this is straightforward. The imputed value Z has two distinct real roots:

$$X_{-} = -\frac{1}{2 β_{2}} \left(\sqrt{4 β_{2} Z + β_{1}^{2}} + β_{1}\right) \tag{2}$$

and

$$X_{+} = \frac{1}{2 β_{2}} \left(\sqrt{4 β_{2} Z + β_{1}^{2}} - β_{1}\right), \tag{3}$$

where the discriminant $Δ = 4 β_{2} Z + β_{1}^{2}$ must be greater than 0. The case $Δ = 0$ results in just one distinct real root, namely $X_{0} = - β_{1} / 2 β_{2}$; it occurs when Z lies exactly at the extremum of the parabola, $Z = - β_{1}^{2} / 4 β_{2}$. If both $β_{1}$ and $β_{2}$ are exactly 0, X cannot be recovered from Z at all. Since incorporating nonexistent relationships in the analysis serves no further purpose, we assume that regression estimates are always unequal to 0.
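For completeness, the two roots follow from rewriting the polynomial combination as a quadratic equation in X and applying the quadratic formula:

$$\begin{aligned}
Z = X β_{1} + X^{2} β_{2}
  &\iff β_{2} X^{2} + β_{1} X - Z = 0 \\
  &\iff X = \frac{-β_{1} \pm \sqrt{β_{1}^{2} + 4 β_{2} Z}}{2 β_{2}}, \qquad β_{2} \neq 0 .
\end{aligned}$$

As an arbitrary numerical illustration (not taken from the simulation), $β_{1} = β_{2} = 1$ and $Z = 2$ give $Δ = 9$, so that $X_{-} = -2$ and $X_{+} = 1$; both values satisfy $X + X^{2} = 2$.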

Given this assumption, for any given Z, we can take either $X$ = $X_{-}$ or $X$ = $X_{+}$ , and square it to obtain $X^{2}$ . Either root is consistent with $Z = X β_{1} + X^{2} β_{2}$ , but choice among these two options requires care. Note that the minimum of the parabola is located at $X_{m i n} = - β_{1} / 2 β_{2}$ . If we choose $X_{-}$ for all Z, then all imputed $X \leq X_{m i n}$ will correspond to points located on the left arm of the parabolic function. This is generally not as intended. A sampling mechanism to determine whether to choose from $X_{-}$ or $X_{+}$ for a given Z is needed.

The choice between the roots is made by random sampling, conditional on Y, Z, and their interaction YZ. Let $V = (V_{o b s}, V_{m i s})$ , where $V_{o b s}$ is a binary random variable defined as 0 if $X_{o b s} \leq X_{m i n}$ and 1 otherwise. We model the probability $P (V = 1)$ by logistic regression as

$$\operatorname{logit} P (V = 1) = Y β_{Y} + Z β_{Z} + Y Z β_{Y Z},$$

on the observed data. Assuming that the same model applies to the missing values in X (i.e., that the missingness mechanism is ignorable), we calculate the predicted probability $P (V = 1)$. As a final step, a random draw from the binomial distribution is made, and the corresponding (negative or positive) root is selected as the imputation. This is repeated for each missing value.
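The prediction and the random draw can be sketched in a few lines of Python. The coefficient values below are hypothetical placeholders for estimates that would come from fitting the logistic model to the observed data, and the function names `prob_v1` and `draw_v` are illustrative only.

```python
import math
import random


def prob_v1(y, z, b_y, b_z, b_yz):
    """Predicted P(V = 1) from a logistic model with terms Y, Z and YZ."""
    eta = b_y * y + b_z * z + b_yz * y * z
    return 1.0 / (1.0 + math.exp(-eta))


def draw_v(p, rng):
    """One Bernoulli draw: 1 selects the root X_+, 0 selects the root X_-."""
    return 1 if rng.random() < p else 0


rng = random.Random(42)
# Hypothetical coefficients; in practice they are estimated on the observed cases.
p = prob_v1(y=2.0, z=1.5, b_y=0.4, b_z=-0.2, b_yz=0.1)
v = draw_v(p, rng)
```

Drawing V at random, rather than assigning the more likely root, keeps the uncertainty about the side of the parabola in the imputations.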

Imputation Algorithm

The procedure leads to the following algorithm for imputing squares:

1. Calculate $X_{obs}^{2}$ for the observed X.

2. Use PMM to multiply impute $X_{mis}$ and $X_{mis}^{2}$ as if they were unrelated, resulting in imputations $X^{*}$ and $X^{* 2}$.

3. Estimate the pooled estimates ${\hat{β}}_{1}$ and ${\hat{β}}_{2}$ by linear regression of Y, given $X = (X_{obs}, X^{*})$ and $X^{2} = (X_{obs}^{2}, X^{* 2})$.

4. Calculate the polynomial combination $Z = X {\hat{β}}_{1} + X^{2} {\hat{β}}_{2}$.

5. Multiply impute $Z_{mis}$ by PMM, resulting in imputations $Z^{*}$.

6. Calculate roots $X_{-}$ and $X_{+}$ given ${\hat{β}}_{1}$, ${\hat{β}}_{2}$, and $Z^{*}$ using equations (2) and (3).

7. Calculate the abscissa at the parabolic minimum/maximum $X_{min} = - {\hat{β}}_{1} / 2 {\hat{β}}_{2}$.

8. Calculate $V_{obs} = 0$ if $X_{obs} \leq X_{min}$, else $V_{obs} = 1$.

9. Impute $V_{mis}$ by logistic regression of V given Y, Z, and YZ, resulting in imputations $V^{*}$.

10. If $V^{*} = 0$, then assign $X^{*} = X_{-}$, else set $X^{*} = X_{+}$.

11. Calculate $X^{* 2}$.

The imputations $Z^{*}$ will satisfy $Z^{*} = X^{*} {\hat{β}}_{1} + X^{* 2} {\hat{β}}_{2}$ .
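The decomposition part of the algorithm (the root calculation, the root assignment, and the squaring) can be sketched as follows. This is an illustration under assumptions, not the `mice` implementation: the inputs `z_star` and `v_star` are hypothetical stand-ins for the imputations produced by the PMM and logistic regression steps, and `b1` and `b2` are hypothetical pooled estimates.

```python
import math


def decompose(z_star, v_star, b1, b2):
    """Turn imputed Z* and root indicators V* into imputations (X*, X*^2).

    b1 and b2 stand for the pooled slope estimates; b2 must be nonzero.
    """
    x_star, x2_star = [], []
    for z, v in zip(z_star, v_star):
        disc = 4 * b2 * z + b1 ** 2
        if disc < 0:
            raise ValueError("no real root for this Z*")
        root = math.sqrt(disc)
        x_minus = -(root + b1) / (2 * b2)
        x_plus = (root - b1) / (2 * b2)
        x = x_minus if v == 0 else x_plus   # V* = 0 selects X_-, else X_+
        x_star.append(x)
        x2_star.append(x * x)
    return x_star, x2_star


b1, b2 = 1.0, 1.0                  # hypothetical pooled estimates
z_star = [2.0, 0.0, 6.0, -0.25]    # hypothetical imputed Z*
v_star = [0, 1, 1, 0]              # hypothetical imputed V*
x_star, x2_star = decompose(z_star, v_star, b1, b2)
x_min = -b1 / (2 * b2)             # abscissa of the parabolic extremum
```

With these inputs every imputed pair reproduces its $Z^{*}$ exactly, and the cases with $V^{*} = 0$ end up at or to the left of `x_min` because `b2` is positive here.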

Results

To illustrate the polynomial combination method, we compared its performance in a simulation with that of all methods discussed by Von Hippel (2009). Data were generated according to the model $Y = α + X β_{1} + X^{2} β_{2} + ϵ$, where X is randomly generated from a standard normal distribution. A large sample size (n = 10,000) was chosen to demonstrate convergence; the method also works well for smaller sample sizes. We fixed the population intercept α at 0 and the residual standard deviation $σ_{ϵ}$ at 1. Deviations seem to be larger when the slopes of X and $X^{2}$ are larger, hence the population slopes $β_{1}$ and $β_{2}$ were set to 1. Let R be a response indicator with

$$R = \begin{cases} 1 & \text{if } X \text{ is observed} \\ 0 & \text{if } X \text{ is missing} \end{cases},$$

and let $Z_{mis}$ denote the missing values in Z. Given these settings we created 50 percent joint missingness in X and $X^{2}$ according to four MAR mechanisms that follow

$$P (R = 0 \mid Z_{obs}, Z_{mis}, Y) = P (R = 0 \mid Z_{obs}, Y),$$

using a random draw from a binomial distribution of the same length as Y and of size 1 with missingness probability equal to the inverse logit

$$P (R = 0) = \frac{e^{a}}{1 + e^{a}} .$$

Setting $a = (- \bar{X} + X_{i}) / SD_{X}$ gives 50 percent left-tailed MAR missingness. Right-tailed, centered and tailed MAR missingness can be created by setting $a = (\bar{X} - X_{i}) / SD_{X}$, $a = .75 - [(\bar{X} - X_{i}) / SD_{X}]$, and $a = - .75 + [(\bar{X} - X_{i}) / SD_{X}]$, respectively. Adding or subtracting a constant moves the sigmoid curve, which results in different missingness proportions.
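A minimal Python sketch of how such a response indicator could be generated is given below. It covers only the first two expressions for a, and the labels "left" and "right" simply follow the text; the function name `missing_prob` and the seed are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def missing_prob(x_i, x_bar, sd_x, tail):
    """P(R = 0) as the inverse logit of a, for the two one-sided mechanisms."""
    if tail == "left":
        a = (-x_bar + x_i) / sd_x
    elif tail == "right":
        a = (x_bar - x_i) / sd_x
    else:
        raise ValueError("tail must be 'left' or 'right'")
    return math.exp(a) / (1 + math.exp(a))


rng = random.Random(2013)
x = [rng.gauss(0, 1) for _ in range(10_000)]
x_bar, sd_x = mean(x), stdev(x)

# R = 1 if X is observed, 0 if X is missing (one Bernoulli draw per case).
r = [0 if rng.random() < missing_prob(xi, x_bar, sd_x, "left") else 1 for xi in x]
prop_missing = r.count(0) / len(r)
```

Because a is centered at zero, the average missingness probability is one half, so about 50 percent of the cases end up with R = 0.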

As an analysis, we used linear regression to see whether the population values could be estimated after imputation. We repeated the analyses 100 times.
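The data-generating model and the analysis model can be illustrated on complete data, before any missingness is imposed. The sketch below uses only the Python standard library; the helper names `solve` and `fit_quadratic`, the seed, and the reduced sample size of 5,000 are choices made for this illustration and are not part of the study.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_quadratic(x, y):
    """Least-squares estimates (alpha, beta1, beta2) for Y ~ 1 + X + X^2."""
    rows = [(1.0, xi, xi * xi) for xi in x]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)


rng = random.Random(7)
alpha, beta1, beta2, sigma = 0.0, 1.0, 1.0, 1.0      # population values
x = [rng.gauss(0, 1) for _ in range(5_000)]
y = [alpha + beta1 * xi + beta2 * xi ** 2 + rng.gauss(0, sigma) for xi in x]
a_hat, b1_hat, b2_hat = fit_quadratic(x, y)
```

On complete data the three estimates should lie close to 0, 1, and 1; an imputation method is judged by how well the same regression recovers these values after the missing entries have been imputed.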

The regression estimates after applying the polynomial combination imputation can be found in Table 1. The estimated coefficients of the imputed X and $X^{2}$ , the coefficient of the intercept α and the residual standard deviation $σ_{ϵ}$ are all close to their respective population values. Missingness mechanisms that involve the right tail show slightly larger deviations.

Table 1.

Average Parameter Estimates for Different Imputation Methods Under Five Different Missingness Mechanisms Over 100 Imputed Data Sets (n = 10,000) With 50 Percent Missing Data.

	Missingness Mechanism
	MCAR	MARleft	MARmid	MARtail	MARright
Polynomial combination
Intercept (α)	0	−0.01	−0.01	−0.05	−0.07
Slope of X (β₁)	1	1	1	0.96	0.96
Slope of X ² (β₂)	1	1	1.01	1.06	1.09
Residual SD (σ_∊)	1	1	1	1.03	1.05
R ²	.75	.75	.75	.73	.73
Impute, then transform
Intercept (α)	0.39	0.29	0.26	0.52	0.56
Slope of X (β₁)	0.93	0.94	0.87	1.01	1.06
Slope of X ² (β₂)	0.61	0.60	0.67	0.56	0.66
Residual SD (σ_∊)	1.48	1.44	1.41	1.56	1.62
R ²	.45	.48	.50	.39	.34
Passive imputation
Intercept (α)	0.39	0.29	0.26	0.52	0.56
Slope of X (β₁)	0.93	0.94	0.87	1.01	1.05
Slope of X ² (β₂)	0.61	0.60	0.68	0.56	0.66
Residual SD (σ_∊)	1.48	1.45	1.41	1.57	1.62
R ²	.45	.48	.50	.38	.34
Transform, then impute
Intercept (α)	0	0.19	−0.13	0.01	−0.05
Slope of X (β₁)	1	0.91	0.97	1.14	1.32
Slope of X ² (β₂)	1	0.91	0.95	1.14	1.32
Residual SD (σ_∊)	1	0.95	1	1.06	1.15
R ²	.75	.77	.75	.72	.67

Note: The population parameters are $α = 0$ , $β_{1} = 1$ , $β_{2} = 1$ , $σ_{ϵ} = 1$ , and $R^{2} = .75$ .
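The population $R^{2}$ in the note can be verified from the simulation design. With X standard normal, $\operatorname{Var}(X) = 1$, $\operatorname{Var}(X^{2}) = 2$, and $\operatorname{Cov}(X, X^{2}) = 0$, so that for $β_{1} = β_{2} = 1$ and $σ_{ϵ} = 1$

$$\operatorname{Var}(X β_{1} + X^{2} β_{2}) = β_{1}^{2} + 2 β_{2}^{2} = 3, \qquad \operatorname{Var}(Y) = 3 + σ_{ϵ}^{2} = 4, \qquad R^{2} = \frac{3}{4} = .75 .$$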

In contrast, the impute-then-transform method yields biased regression estimates under the same simulation conditions, even under MCAR (see Table 1).

Table 1 also shows the performance of the passive imputation method. Passive imputation performance is similar to the problematic performance of the impute-then-transform method, as both methods calculate $X^{2}$ afterward.

Finally, the transform-then-impute method yields unbiased regression estimates, but only under MCAR. Although some estimates remain close to their population values, performance is severely impaired under MAR (see Table 1).

All in all, the polynomial combination method yields unbiased regression estimates and preserves the relation between X and $X^{2}$: the population relation between X and its square is reproduced exactly in the imputed data. See Figure 2 for a graphical representation of the population and imputed data relations between X and $X^{2}$, as generated by the polynomial combination method.

Figure 2.

Polynomial combination imputation. Observed (blue) and imputed values (red) for X and $X^{2}$ .

We also looked at the mean and covariance matrix as reproduced by the imputed data and compared them to the population values. The mean and covariance matrix of $(X, X^{2}, Y)$ are

$$μ = \begin{bmatrix} 0 \\ 1 \\ β_{2} \end{bmatrix} \quad \text{and} \quad Σ = \begin{bmatrix} 1 & & \\ 0 & 2 & \\ β_{1} & 2 β_{2} & 1 + β_{1}^{2} + 2 β_{2}^{2} \end{bmatrix} .$$
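These population moments can be checked numerically. The following Python sketch simulates complete data from the model with $α = 0$ and $σ_{ϵ} = 1$ and compares the sample moments with μ and Σ; the helper name `moments`, the seed, and the sample size of 100,000 are arbitrary choices for this illustration.

```python
import random


def moments(columns):
    """Sample means and covariance matrix (divisor n - 1) of equal-length columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    cov = [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
            for cb, mb in zip(columns, means)]
           for ca, ma in zip(columns, means)]
    return means, cov


rng = random.Random(11)
beta1, beta2 = 1.0, 1.0                       # slopes used in the simulation
x = [rng.gauss(0, 1) for _ in range(100_000)]
x2 = [xi * xi for xi in x]
y = [beta1 * xi + beta2 * x2i + rng.gauss(0, 1) for xi, x2i in zip(x, x2)]

means, cov = moments([x, x2, y])
mu = [0.0, 1.0, beta2]
sigma = [[1.0, 0.0, beta1],
         [0.0, 2.0, 2 * beta2],
         [beta1, 2 * beta2, 1 + beta1 ** 2 + 2 * beta2 ** 2]]
max_mean_err = max(abs(m - t) for m, t in zip(means, mu))
max_cov_err = max(abs(cov[i][j] - sigma[i][j]) for i in range(3) for j in range(3))
```

Both maximum deviations should be small relative to the entries of μ and Σ, shrinking further as the simulated sample grows.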

A set of k mean values can be pooled to a single residual mean value with

$$Δ_{μ} = \frac{1}{k} \sum_{i = 1}^{k} | μ_{i} - m_{i} |, \tag{8}$$

where $m_{i}$ is the ith mean value for the imputed data. Likewise, a pooled residual covariance matrix can be created by

$$Δ_{Σ} = \frac{1}{k} \sum_{i = 1}^{k} | Σ_{i} - S_{i} |, \tag{9}$$

where $S_{i}$ is the ith covariance matrix of the imputed data. Performing a small simulation of n = 100 with various regression weights and combining the results with equations (8) and (9) yields the following pooled residual mean and covariance matrix.

$$Δ_{μ} = \begin{bmatrix} 0.003 \\ -0.004 \\ -0.003 \end{bmatrix} \quad \text{and} \quad Δ_{Σ} = \begin{bmatrix} -0.004 & & \\ 0 & 0.007 & \\ -0.004 & 0 & -0.012 \end{bmatrix} . \tag{10}$$

The results in equation (10) suggest that the mean and covariance matrix in the population data are accurately preserved in the imputed data. Given that only normal imputations that preserve the mean and covariance matrix from the population data can yield unbiased imputations, we can now confidently say that the polynomial combination method yields unbiased regression estimates and delivers transformed variable imputations that are consistent with each other.

All computations in this study have been carried out in R and all imputations are generated with the mice package in R (Van Buuren and Groothuis-Oudshoorn 2011) with m = 5 multiple imputations. A mice.impute.quadratic routine that implements the polynomial combination method is available in mice.

Conclusion

The polynomial combination method as developed here provides unbiased estimates for problems where incomplete X and $X^{2}$ are both in the complete data model. It merges imputation techniques and decomposition of the quadratic equation to obtain the same unbiased regression estimates as the basic transform-then-impute method, while preserving the relations between X and $X^{2}$ . Also, it performs well under both MCAR and MAR missingness mechanisms. Our advice is to use the polynomial combination method to impute transformed variables with squared relations.

We note that the simulation conditions used are rather harsh. For example, 50 percent of X is missing and some missingness mechanisms severely limit the amount of usable predictive information, especially right-tailed MAR missingness. Also, note that imputations are based on just one covariate. In real-life data sets, conditions for imputing the data are often much better. Yet, also for simpler incomplete data problems, the polynomial combination method yields the best possible inferences even though the difference with the results from other methods may be smaller.

We limited our calculations and analyses to squares, which are essentially interactions between two identical variables. For now, interactions between different variables remain best imputed using the transform-then-impute method. The polynomial combination method can, however, be generalized to more complex nonlinear combinations. We expect that the proposed method also applies to problems in which the scientifically interesting model contains multiple versions or transformations of X, such as interactions between different variables, higher degree polynomial equations and perhaps even splines, which are essentially piecewise polynomials. Exploring such applications of the polynomial combination method is a subject for future work.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Little. 1988. “Missing-data Adjustments in Large Surveys.” Journal of Business and Economic Statistics 6:287–96.

Royston. 2005. “Multiple Imputation of Missing Values: Update of Ice.” Stata Journal 5:527–36.

Rubin. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley.

Seaman, Bartlett, and White. 2012. “Multiple Imputation of Missing Covariates with Non-linear Effects and Interactions: An Evaluation of Statistical Methods.” BMC Medical Research Methodology 12:46.

Van Buuren, Boshuizen, and Knook. 1999. “Multiple Imputation of Missing Blood Pressure Covariates in Survival Analysis.” Statistics in Medicine 18:681–94.

Van Buuren and Groothuis-Oudshoorn. 2011. “MICE: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45:1–67.

Von Hippel. 2009. “How to Impute Interactions, Squares, and Other Transformed Variables.” Sociological Methodology 39:265–91.