Robust integration of secondary outcomes information into primary outcome analysis in the presence of missing data

Abstract

In clinical and observational studies, secondary outcomes are frequently collected alongside the primary outcome for each subject, yet their potential to improve the analysis efficiency remains underutilized. Moreover, missing data, commonly encountered in practice, can introduce bias to estimates if not appropriately addressed. This article presents an innovative approach that enhances the empirical likelihood-based information borrowing method by integrating missing-data techniques, ensuring robust data integration. We introduce a plug-in inverse probability weighting estimator to handle missingness in the primary analysis, demonstrating its equivalence to the standard joint estimator under mild conditions. To address potential bias from missing secondary outcomes, we propose a uniform mapping strategy, imputing incomplete secondary outcomes into a unified space. Extensive simulations highlight the effectiveness of our method, showing consistent, efficient, and robust estimators under various scenarios involving missing data and/or misspecified secondary models. Finally, we apply our proposal to the Uniform Data Set from the National Alzheimer’s Coordinating Center, exemplifying its practical application.

Keywords

Data borrowing, empirical likelihood, missing data, multiple imputations, secondary outcomes

1. Introduction

Clinical trials and observational studies often collect secondary outcomes alongside the primary outcome. These secondary outcomes can be highly associated with the primary outcome and therefore contain information that could improve the efficiency of the primary model. In our study, we use the Uniform Data Set from the National Alzheimer’s Coordinating Center (NACC) to explore risk factors for the development of dementia in the short term. Thus, the primary outcome of clinical interest here is a cross-sectional variable such as three-year dementia status. Along with the primary outcome, some secondary outcomes are also collected during the study. For example, one of the diagnostic tools, the Mini Mental State Examination (MMSE), is administered to gather information on overall cognitive ability and to monitor progression.^1,2 Additionally, the clinical dementia rating scale is another informative secondary outcome that is used in the definition of dementia status.³ Typically, secondary outcomes are analyzed separately, or only baseline observations are used for the primary regression analysis. These approaches fail to fully account for the high association between the outcomes or to incorporate the information contained in the longitudinally measured secondary outcomes. Efficient integration of secondary outcomes into the primary model fitting is desirable to further improve the statistical performance of estimating the effects of risk factors on the primary outcome. However, naive solutions such as including both baseline variables and secondary outcomes as covariates could be inappropriate or misleading, since both primary and secondary outcomes might be affected by risk factors in a similar manner. Adding secondary outcomes to the primary model can therefore dilute or even eliminate the effects of the baseline variables, which can render the primary model less clinically interpretable. This phenomenon is known as collider bias in causal inference.
So far, little work exists in the literature on how to integrate secondary outcomes into the primary outcome analysis in an effective and robust manner. One of our study goals is to fill this gap by proposing novel approaches that effectively integrate secondary outcomes into the primary analysis to enhance estimation efficiency.

Here, we consider auxiliary information collected within the same study, which is distinct from information obtained from external independent studies. While several information borrowing techniques have been developed for independent studies, such as constrained maximum likelihood using summary information from independent studies,^4,5 calibration estimation utilizing some shared effects across studies,⁶ and Bayesian integration introducing external information via prior distributions,⁷ there is relatively little work concerning within-study auxiliary information. Recently, Chen et al.⁸ proposed an empirical likelihood-based method for incorporating secondary outcomes to improve the estimation precision of the primary analysis. However, this work only applies to cases where the primary data are fully observed or missing completely at random (MCAR), an assumption that is easily violated in practice, including in our data application.⁹ These existing methods are therefore not directly applicable to our setting because they are susceptible to bias. Thus, another goal of our study is to develop novel approaches to reduce the bias due to missingness in primary or secondary outcomes.

In the NACC, patients were followed approximately annually, and the diagnosis of dementia was in accordance with published research criteria (e.g. the Diagnostic and Statistical Manual 4th edition). The primary outcome of interest, three-year dementia, could be missing due to subjects being lost to follow-up, missing a visit, or refusing to provide a response, which is commonly encountered in reality. Ignoring or improperly handling missing data could lead to less efficient or even biased inference when the data are not MCAR.⁹ To address these issues, we extend the empirical likelihood-based information borrowing method proposed by Chen et al.⁸ with the adoption of the inverse probability weighting (IPW) technique¹⁰ and the uniform mapping strategy. Specifically, we incorporate the IPW to adjust for missing data in the primary model, and propose the uniform mapping strategy to facilitate unbiased information borrowing by homogenizing incomplete secondary outcomes. By doing so, we obtain an efficiency-improved and robust parameter estimator when the data are missing at random (MAR).⁹ This approach is particularly relevant for our study given the complex features and missingness of the NACC data.

The structure of the article is as follows. In Section 2, we present our proposed method. In Section 3, we perform empirical simulations to verify the consistency, high efficiency, and robustness of our method. In Section 4, we apply our method to the NACC study for illustration. Finally, the conclusions with some discussion are summarized in Section 5. All technical details, regularity conditions, and proofs are presented in the Supplemental Material.

2. Method

For subject $i = 1, 2, \dots, n$, let $Y$ be the primary outcome of interest with a covariate vector $X$; let $\tilde{Y}$ be secondary outcomes that are highly correlated with the primary outcome, with a covariate matrix $\tilde{X}$. Note that $\tilde{X}$ and $X$ could overlap (e.g. baseline risk factors). Motivated by the data application, we consider the setting with $Y$ as cross-sectional data and $\tilde{Y}$ as longitudinal measures. For ease of illustration, we first assume that all variables in $X$ and $\tilde{X}$ are fully observed. This assumption can be relaxed later, as we encountered missing covariates in our data application (see Sections 4 and 5 for more details). In addition, we assume the outcomes $Y$ and $\tilde{Y}$ are MAR, an assumption that is common and reasonable in many circumstances. Let $R$ be the observation indicator for the primary outcome, with $R = 1$ indicating that $Y$ is observed and $R = 0$ otherwise. Similarly, let $\tilde{R}$ be the observation indicator vector, with each element (1 for yes and 0 for no) indicating whether the secondary outcome is observed at each visit time. Thus, for the $i$th subject, we have the observed data $D_{i}^{p} = \{R_{i} Y_{i}, X_{i}, R_{i}\}$ and $D_{i}^{s} = \{{\tilde{R}}_{i} ⊙ {\tilde{Y}}_{i}, {\tilde{X}}_{i}, {\tilde{R}}_{i}\}$ for the primary and secondary analyses, respectively, where $⊙$ denotes element-wise multiplication for vectors. In the following, we suppress the subscript $i$, based on the independent and identically distributed (i.i.d.) property, to simplify the notation when the context is clear.

2.1. Empirical likelihood weighting

Suppose the regression parameter vector of the mean model between $Y$ and $X$ is of main interest and is denoted by $β$. Let $f^{*} (Y, X; β)$ be any estimating function for solving $β$. It can be a score function, the derivative of a least-squares criterion, or the function in a generalized estimating equation.¹¹ Let $β_{0}$ be the true parameter value such that $E \{f^{*} (Y, X; β_{0})\} = 0$. In the presence of missing data, one naive estimation of $β$, known as the complete case analysis, is to solve the following estimating equation

$\sum_{i = 1}^{n} R_{i} f^{*} (Y_{i}, X_{i}; β) = 0$

When the data are MCAR, the corresponding solution, denoted by ${\hat{β}}_{naive}$, is consistent for $β_{0}$ under some regularity conditions.^9,12 However, this MCAR assumption can be easily violated in practice, and we will relax it next. In addition to the primary outcome, some subjects may have secondary outcomes that are highly associated with the primary outcome and thus can help improve the estimation precision of $β$. Chen et al.⁸ proposed to extract information from the secondary analysis by using an empirical likelihood-based weighting (ELW) technique. They obtained the estimator ${\hat{β}}_{ELW}$ by solving the weighted estimating equation

$\sum_{i = 1}^{n} {\hat{p}}_{i} R_{i} f^{*} (Y_{i}, X_{i}; β) = 0$

where the non-negative weights ${\hat{p}}_{i}$, $i = 1, 2, \dots, n$, are the maximizers of the following constrained function

$\max \prod_{i = 1}^{n} p_{i}, \quad \text{s.t.} \quad \sum_{i = 1}^{n} p_{i} = 1, \quad \sum_{i = 1}^{n} p_{i} h (D_{i}^{s}; θ) = 0$ (1)

with $h (D^{s}; θ)$ as an estimating function based on some working models for the data $D^{s}$ satisfying $E \{h (D^{s}; θ_{*})\} = 0$ for a value $θ_{*}$. The empirical likelihood maximization can be solved by applying a Lagrange multiplier.¹³ The function $h (D^{s}; θ_{*})$ implicitly involves the observation indicators $\tilde{R}$ so that it allows secondary outcomes to be missing as well. Note that the dimension of $h (D^{s}; θ)$ should be greater than that of $θ$, and the resulting estimator ${\hat{β}}_{ELW}$ is consistent under some regularity conditions and has been shown to be more efficient than ${\hat{β}}_{naive}$.
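To make the Lagrange-multiplier step concrete, the following is a minimal sketch of how the weights in (1) could be computed when $θ$ is held fixed. It is an illustration under stated assumptions rather than the implementation of Chen et al.: the function name `el_weights` and the Newton iteration with step-halving are our own choices, and in the actual method $θ$ is estimated together with the weights. For fixed $θ$, the maximizers take the form $p_{i} = 1 / [n \{1 + λ^{T} h (D_{i}^{s}; θ)\}]$, where $λ$ solves $\sum_{i} h (D_{i}^{s}; θ) / \{1 + λ^{T} h (D_{i}^{s}; θ)\} = 0$.

```python
import numpy as np

def el_weights(h, n_iter=50, tol=1e-10):
    """Empirical likelihood weights for a FIXED theta (illustrative sketch).

    h : (n, q) array whose i-th row is h(D_i^s; theta).
    Returns p with p_i = 1 / (n * (1 + lam @ h_i)), where lam solves
    sum_i h_i / (1 + lam @ h_i) = 0 (the Lagrange-multiplier equation).
    """
    n, q = h.shape
    lam = np.zeros(q)
    for _ in range(n_iter):
        denom = 1.0 + h @ lam
        scaled = h / denom[:, None]
        grad = scaled.sum(axis=0)          # left-hand side of the equation for lam
        jac = -scaled.T @ scaled           # its Jacobian with respect to lam
        step = np.linalg.solve(jac, grad)  # Newton update: lam - step
        t = 1.0
        # Step-halving keeps every 1 + lam @ h_i strictly positive.
        while np.any(1.0 + h @ (lam - t * step) <= 1e-8):
            t /= 2.0
        lam = lam - t * step
        if np.linalg.norm(t * step) < tol:
            break
    return 1.0 / (n * (1.0 + h @ lam))
```

The returned weights are positive, sum to one, and satisfy the moment constraint in (1) up to numerical tolerance, provided the zero vector lies inside the convex hull of the rows of `h`.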

The work by Chen et al.⁸ has two limitations. First, it assumes that the primary data are either fully observed or MCAR, which is too idealistic in practice. Second, it takes for granted that there exists a value $θ_{*}$ such that $E \{h (D^{s}; θ_{*})\} = 0$ holds. Although this assumption is mild and may be satisfied in some cases, it could be easily violated when secondary outcomes are MAR. Violation of either of these assumptions could lead to bias in the primary analysis, as demonstrated in Section 3. In the following, we propose a novel and robust estimator to address these limitations.

2.2. The proposed method for handling missing data

When dealing with a MAR primary outcome, various standard approaches such as the IPW and imputation techniques have been widely applied in the literature.^14–16 In this article, we focus on the IPW for illustration and rigorous theoretical investigation. Other extensions are discussed in Section 5. We assume that there is a correctly specified model for the observation indicator of the primary outcome, $R$, with the covariates $Z$ and the associated parameter vector $η$. Under the MAR assumption, the variables in $Z$ can be any observed variables, such as $X$, the history of $Y$, or any other observed measurements not used for the primary or secondary analysis. Let $η_{0}$ be the underlying true parameter value. We have the data from the $i$th subject, $D_{i}^{m} = \{R_{i}, Z_{i}\}$, for the missing data analysis. Moreover, we first assume that the secondary outcomes $\tilde{Y}$ are fully observed ($\tilde{R} = 1$) in the subsections on joint estimation and plug-in estimation. Later, we extend the method by relaxing this restriction.

2.2.1. Joint estimation

In order to properly use the IPW technique in the context of information borrowing, let us first consider the estimator obtained by solving

$\sum_{i = 1}^{n} {\hat{p}}_{i} F (D_{i}^{p}, D_{i}^{m}; β, η) = 0$

where ${\hat{p}}_{i}$, $i = 1, 2, \dots, n$, are defined as the maximizers of (1), and $F (D^{p}, D^{m}; β, η)$ is a full estimating function defined as

$F (D^{p}, D^{m}; β, η) = \begin{pmatrix} f (D^{p}, D^{m}; β, η) \\ g (D^{m}; η) \end{pmatrix}$ (2)

where $f (D^{p}, D^{m}; β, η)$ is the adjusted primary estimating function that takes the form

$f (D^{p}, D^{m}; β, η) = \frac{R}{P (R = 1 | Z; η)} f^{*} (Y, X; β)$ (3)

where $P (R = 1 | Z; η)$ is the probability of observing the outcome, and $g (D^{m}; η)$ is the estimating function for the missing data model. We consider this estimating function $F (D^{p}, D^{m}; β, η)$ because both $β$ and $η$ appear in the adjusted primary estimating function. To account for both sources of uncertainty simultaneously and to obtain valid information borrowing, a natural way is to jointly estimate these parameters. We name the resulting estimator ${\hat{β}}_{FJ}$, where FJ stands for the full joint estimating function, with its asymptotic property summarized in the following theorem.

Theorem 1

Under some regularity conditions, suppose there exists a parameter $θ_{*}$ such that $E \{h (D^{s}; θ_{*})\} = 0$; then we have

$n^{1 / 2} ({\hat{β}}_{FJ} - β_{0}) \to N (0, V_{EN})$

where

$V_{EN} = Γ_{11}^{- 1} (Σ_{11} - Σ_{12} Σ_{22}^{- 1} Σ_{21} - Λ_{1} S Λ_{1}^{T}) Γ_{11}^{- 1, T}$

with $Γ_{11} = E \{\partial f (D^{p}, D^{m}; β_{0}, η_{0}) / \partial β^{T}\}$, $Σ_{11} = E \{f (D^{p}, D^{m}; β_{0}, η_{0}) f^{T} (D^{p}, D^{m}; β_{0}, η_{0})\}$, $Σ_{12} = E \{f (D^{p}, D^{m}; β_{0}, η_{0}) g^{T} (D^{m}; η_{0})\} = Σ_{21}^{T}$, $Σ_{22} = E \{g (D^{m}; η_{0}) g^{T} (D^{m}; η_{0})\}$, $Λ_{1} = E \{f (D^{p}, D^{m}; β_{0}, η_{0}) h^{T} (D^{s}; θ_{*})\}$, and

$S = S_{11}^{- 1} - S_{11}^{- 1} S_{12} (S_{21} S_{11}^{- 1} S_{12})^{- 1} S_{21} S_{11}^{- 1}$

where $S_{11} = E \{h (D^{s}; θ_{*}) h^{T} (D^{s}; θ_{*})\}$ and $S_{12} = E (\partial h (D^{s}; θ_{*}) / \partial θ^{T}) = S_{21}^{T}$.

Let ${\hat{β}}_{IPW}$ denote the estimator obtained by solving $\sum_{i = 1}^{n} F (D_{i}^{p}, D_{i}^{m}; β, η) = 0$, which is the regular IPW estimator without incorporating ELW. Let $V_{IPW}$ be its asymptotic variance. Examining the formula for $V_{EN}$, the first part, $Γ_{11}^{- 1} (Σ_{11} - Σ_{12} Σ_{22}^{- 1} Σ_{21}) Γ_{11}^{- 1, T}$, is exactly $V_{IPW}$. Since $S$ is non-negative definite, $V_{EN}$ is no larger than $V_{IPW}$. This indicates that our proposed estimator is at least as efficient as ${\hat{β}}_{IPW}$. Note that $Λ_{1}$ measures the association between the primary and secondary estimating functions: the stronger the association, the greater the variance reduction. Thus, we recommend selecting a secondary outcome $\tilde{Y}$ that is highly correlated with the primary outcome $Y$ in practice. Naturally, if $\tilde{Y}$ is uncorrelated with $Y$ given covariates, then $Λ_{1} = 0$ and $V_{EN}$ reduces to $V_{IPW}$, indicating that the secondary outcome carries no information about the primary outcome and cannot improve the estimation efficiency. A well-selected working model for secondary outcomes can be important, as it may capture more information in the secondary outcomes and strengthen the association between the primary and secondary estimating functions. In practice, we recommend using the expanded generalized estimating equation (GEE),¹⁷ whose performance has been assessed through simulation studies and data applications. The detailed specification can be found in the Supplemental Material.
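The variance comparison above can be checked numerically. The sketch below is only an illustration with hypothetical function names (`s_matrix`, `variance_pair`): it builds $S$ from given $S_{11}$ and $S_{12}$, and returns $V_{EN}$ together with $V_{IPW}$ from plug-in values of the matrices in Theorem 1, so that the non-negative definiteness of $S$ and of $V_{IPW} - V_{EN}$ can be inspected through their eigenvalues.

```python
import numpy as np

def s_matrix(S11, S12):
    """S = S11^{-1} - S11^{-1} S12 (S21 S11^{-1} S12)^{-1} S21 S11^{-1}, with S21 = S12^T."""
    S11_inv = np.linalg.inv(S11)
    A = S11_inv @ S12                      # S11^{-1} S12
    return S11_inv - A @ np.linalg.inv(S12.T @ A) @ A.T

def variance_pair(Gamma11, Sigma11, Sigma12, Sigma22, Lambda1, S):
    """Return (V_EN, V_IPW) following the formulas in Theorem 1."""
    G_inv = np.linalg.inv(Gamma11)
    v_ipw = G_inv @ (Sigma11 - Sigma12 @ np.linalg.inv(Sigma22) @ Sigma12.T) @ G_inv.T
    reduction = G_inv @ Lambda1 @ S @ Lambda1.T @ G_inv.T
    return v_ipw - reduction, v_ipw
```

When $Λ_{1}$ is a zero matrix, the two returned matrices coincide, which mirrors the case in which the secondary outcome carries no information about the primary outcome.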

2.2.2. Plug-in estimation

Despite its theoretical appeal, the joint estimation may suffer from high estimation complexity as the dimension of the parameters increases. To alleviate this issue, we alternatively consider a two-stage estimation: in the first step, obtain an estimator $\hat{η}$ by solving $\sum_{i = 1}^{n} g (D_{i}^{m}; η) = 0$; in the second step, calculate the estimator by solving the following estimating equation

$\sum_{i = 1}^{n} {\hat{p}}_{i} f (D_{i}^{p}, D_{i}^{m}; β, \hat{η}) = 0$

The parameters in the missing data model are thus solved separately from the primary parameters, which may considerably reduce the estimation complexity compared to the joint estimation. Based on this method of partial estimating equations and plug-in estimation (PP), we denote the corresponding estimator by ${\hat{β}}_{PP}$. The asymptotic property of ${\hat{β}}_{PP}$ is summarized below.

Theorem 2

Under some regularity conditions, suppose there exists a parameter $θ_{*}$ such that $E \{h (D^{s}; θ_{*})\} = 0$; then we have

$n^{1 / 2} ({\hat{β}}_{PP} - β_{0}) \to N (0, V_{EN})$

where $V_{EN}$ is defined in Theorem 1.
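The two-stage plug-in procedure can be sketched as follows. This is a minimal illustration under assumptions that are ours, not the article's: both the missing data model and the primary model are taken to be logistic regressions, the helper names `fit_logistic` and `plug_in_ipw` are hypothetical, and the ELW weights ${\hat{p}}_{i}$ are passed in as an optional argument (equal weights are used when they are omitted, which gives the ordinary IPW estimator).

```python
import numpy as np

def fit_logistic(X, y, w=None, n_iter=50, tol=1e-10):
    """Solve sum_i w_i * X_i * (y_i - expit(X_i @ b)) = 0 by Newton-Raphson."""
    n, d = X.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    b = np.zeros(d)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ b))
        score = X.T @ (w * (y - mu))
        info = (X * (w * mu * (1.0 - mu))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        b = b + step
        if np.linalg.norm(step) < tol:
            break
    return b

def plug_in_ipw(Y, X, R, Z, p_hat=None):
    """Two-stage plug-in estimator (sketch with assumed logistic working models)."""
    # Stage 1: estimate eta from the missingness model P(R = 1 | Z; eta).
    eta_hat = fit_logistic(Z, R)
    pi_hat = 1.0 / (1.0 + np.exp(-Z @ eta_hat))
    # Stage 2: weight each observed subject by p_hat_i * R_i / pi_hat_i.
    p_hat = np.ones(len(R)) if p_hat is None else np.asarray(p_hat, dtype=float)
    w = p_hat * R / pi_hat
    Y_filled = np.where(R == 1, Y, 0.0)    # rows with R = 0 have weight 0
    beta_hat = fit_logistic(X, Y_filled, w)
    return beta_hat, eta_hat
```

Because the second-stage equation is homogeneous in the weights, rescaling ${\hat{p}}_{i}$ by a constant does not change the solution for $β$.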

In terms of efficiency, we find that ${\hat{β}}_{FJ}$ and ${\hat{β}}_{PP}$ are asymptotically equivalent. Recall that the equivalence of plug-in and joint estimation for the classic IPW estimator is standard, so we do not distinguish between them and denote the estimator and its asymptotic variance by ${\hat{β}}_{IPW}$ and $V_{IPW}$, respectively.¹² Here we extend this result to the case where secondary outcomes are involved. An intuitive explanation is that, when all secondary outcomes are observed, the estimation of the missing data model becomes a pure nuisance and is exogenous to the information from the secondary outcomes.

Proposition 1

If the primary outcome is MAR and all secondary outcomes are fully observed, we have (a) $g (D^{m}; η_{0})$ and $h (D^{s}; θ_{*})$ are uncorrelated; (b) ${\hat{β}}_{FJ}$ and ${\hat{β}}_{PP}$ are asymptotically equivalent.

The above results hold when all secondary outcomes are observed across times and subjects. When some secondary outcomes are missing, we may consider imputing secondary outcomes before estimation. By doing so, we are back to the case with “observed” data. We refer readers to the next subsection for more details. For completeness, we also consider the method of full estimating equations and plug-in estimation (FP), where we plug the estimator $\hat{η}$ into (2) instead of (3). This can be viewed as an intermediate approach between the FJ and PP methods. More details can be found in the Supplemental Material.

2.3. Issues related to secondary outcomes analysis

Robustness to secondary model misspecification: Recall that we require the existence of $θ_{*}$ such that $E {h (D^{s}; θ_{*})} = 0$ . The parameter $θ_{*}$ is not necessarily the true value $θ_{0}$ that generates secondary outcomes, and we are not interested in making inferences from the secondary outcomes. Instead, we only want to extract information from them. Hence, an incorrect $θ_{*}$ can still be useful in the sense that it serves as a bridge to link the primary and secondary analyses. This implies that our estimator is robust to model misspecification of secondary outcomes, ensuring its consistency and greater efficiency under mild conditions, as shown in simulations.

Missing data in secondary outcomes: When there are no missing secondary outcomes, it is common in practice that $θ_{*}$ exists. However, this situation is somewhat idealistic. In real applications, we often observe incomplete secondary outcomes due to dropouts, competing risk events (e.g. death), administrative issues, etc. In the presence of missing secondary outcomes, a unified $θ_{*}$ may not exist based on the observed data. For example, in a longitudinal study, the outcomes are partially observed and the missing rate varies across time points. Hence, the observed data are not homogeneous across time, and a unified $θ_{*}$ leading to $E \{h (D^{s}; θ_{*})\} = 0$ at all time points may not exist. As shown in simulations, the classic ELW-based estimator then exhibits substantial bias. A well-selected working model for $h (D^{s}; θ)$ may facilitate a unified $θ_{*}$; however, based on our extensive simulation study, such a model is often difficult to construct and the efficiency gain could be minimal.

To address this issue, we propose using multiple imputations to create complete data. Distinct from imputations in the literature that aim at recovering the true model, the goal here is to find a unified $θ_{*}$ satisfying $E \{h (D^{s}; θ_{*})\} = 0$, but not necessarily the true one. This relaxed, mild condition motivates us to propose the following mapping strategy, named “uniform mapping.” Specifically, given an imputation model, the uniform mapping first estimates the imputation model parameters using the observed outcomes and covariates; it then generates plausible values for the missing outcomes using the covariates and the posterior distribution of the model parameters. In this fashion, we essentially embed the incomplete data into a unified space. This mapping promotes the existence of a unified $θ_{*}$ for secondary outcomes and improves the robustness of the secondary analysis. We emphasize that a unified $θ_{*}$ can differ from the underlying truth $θ_{0}$; nevertheless, the uniform mapping procedure remains useful as long as a unified $θ_{*}$ exists. This appealing property significantly enriches the candidate pool of imputation models and hence brings more flexibility. Furthermore, after the uniform mapping, the secondary outcomes become “complete,” which validates the use of Proposition 1.
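The two steps of the uniform mapping (estimate the imputation model on the observed data, then draw plausible values for the missing outcomes) can be sketched as below. This is a simplified illustration and not the imputation procedure used in the article: we assume a single normal linear imputation model for stacked secondary outcomes, with approximate posterior draws of its parameters, and the function name `uniform_mapping` is hypothetical.

```python
import numpy as np

def uniform_mapping(y, X, observed, n_imputations=5, seed=0):
    """Return a list of completed outcome vectors (sketch, normal linear model assumed).

    y        : (N,) stacked secondary outcomes; entries with observed == False are ignored.
    X        : (N, d) fully observed covariates.
    observed : (N,) boolean mask of observed outcomes.
    """
    rng = np.random.default_rng(seed)
    Xo, yo = X[observed], y[observed]
    # Step 1: fit the imputation model on the observed outcomes and covariates.
    XtX_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = XtX_inv @ Xo.T @ yo
    resid = yo - Xo @ beta_hat
    df = len(yo) - X.shape[1]
    n_mis = int((~observed).sum())
    completed = []
    for _ in range(n_imputations):
        # Step 2: draw parameters from their approximate posterior, then impute.
        sigma2 = resid @ resid / rng.chisquare(df)
        beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
        y_imp = y.copy()
        y_imp[~observed] = X[~observed] @ beta + rng.normal(0.0, np.sqrt(sigma2), n_mis)
        completed.append(y_imp)
    return completed
```

Observed outcomes are left unchanged in every completed data set; only the missing entries are filled, so that the secondary estimating function can be evaluated on “complete” data.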

Guidance for secondary outcomes analysis: The performance of our proposal depends on many factors. For example, there are two popular imputation frameworks, joint modeling (JM) and multiple imputations by chained equations (MICEs), and there are various specific imputation models within them. Furthermore, there are different choices of working models for $h (D^{s}; θ)$. Since it is difficult to determine the optimal strategy for secondary analysis, we provide some guidance for practical usage. (i) Test the equivalence of the ELW-based estimator. Despite all efforts to facilitate a unified $θ_{*}$, it is still risky to assert its existence in practice. Thus, we may conduct a hypothesis test to assess whether the assumption is violated. Specifically, if there exists a unified $θ_{*}$, by examining the influence functions of ${\hat{β}}_{IPW}$ and ${\hat{β}}_{PP}$ derived in the Supplemental Material, we have

$n^{1 / 2} ({\hat{β}}_{IPW} - {\hat{β}}_{PP}) \to N (0, Γ_{11}^{- 1} Λ_{1} S Λ_{1}^{T} Γ_{11}^{- 1, T})$

and we can conduct a Wald test to evaluate whether ${\hat{β}}_{IPW}$ is equivalent to ${\hat{β}}_{PP}$. (ii) Use the index for information borrowing (IIB) to evaluate the precision improvement. Recall that $V_{IPW}$ is the asymptotic variance of the estimator without using ELW. Intuitively, the difference between $V_{IPW}$ and $V_{EN}$ can serve as a measure of precision improvement. This motivates us to consider the IIB statistic as follows:

$\mathrm{IIB} = \mathrm{trace} \{\mathrm{diag} ({\hat{V}}_{IPW})^{- 1} ({\hat{V}}_{IPW} - {\hat{V}}_{EN})\} / p$

where ${\hat{V}}_{IPW}$ and ${\hat{V}}_{EN}$ are consistent estimators of $V_{IPW}$ and $V_{EN}$, and $p$ is the dimension of $β$. Note that the IIB, representing the average proportion of variance reduction, was initially introduced by Chen et al.⁸ In our context, we adjust it by multiplying by a factor of $1 / p$. A larger value of IIB indicates better precision improvement.
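Both diagnostics can be computed directly from the estimated variance matrices. The sketch below uses hypothetical function names (`iib`, `wald_statistic`) and relies on the identity $Γ_{11}^{- 1} Λ_{1} S Λ_{1}^{T} Γ_{11}^{- 1, T} = V_{IPW} - V_{EN}$ implied by Theorem 1. Because this difference can be rank-deficient, the sketch uses a pseudo-inverse and reports the numerical rank as the degrees of freedom, which is our own assumption about how the Wald test might be carried out.

```python
import numpy as np

def iib(v_ipw, v_en):
    """IIB = trace{diag(V_IPW)^{-1} (V_IPW - V_EN)} / p."""
    p = v_ipw.shape[0]
    return np.trace(np.diag(1.0 / np.diag(v_ipw)) @ (v_ipw - v_en)) / p

def wald_statistic(beta_ipw, beta_pp, v_ipw, v_en, n):
    """Wald statistic for H0: beta_IPW and beta_PP share the same limit.

    v_ipw, v_en are estimates of the ASYMPTOTIC variances (of n^{1/2}-scaled
    estimators), so the covariance of the raw difference is (v_ipw - v_en) / n.
    Returns (statistic, degrees of freedom).
    """
    diff = np.asarray(beta_ipw, dtype=float) - np.asarray(beta_pp, dtype=float)
    cov = (v_ipw - v_en) / n
    stat = float(diff @ np.linalg.pinv(cov) @ diff)
    df = int(np.linalg.matrix_rank(cov))
    return stat, df
```

Under the null hypothesis, the statistic would be compared with a chi-square distribution with `df` degrees of freedom; a large value suggests that a unified $θ_{*}$ may not exist.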

3. Simulation

In this section, we examine the numerical performance of our proposed estimator under a variety of scenarios. For each scenario, we consider the sample size $n = 500, 1000, 2000$ and the number of repeated measures is $m = 4$ for the longitudinal follow-up. The parameters are selected to mimic the data characteristics in application.

For each subject $i$ ($i = 1, \dots, n$), the primary outcome of interest is the measure at the last visit, $Y_{i m}$, which is generated from a binary random vector $Y_{i} = (Y_{i 1}, \dots, Y_{i m})^{T}$ with the success probability $p_{i} = (p_{i 1}, \dots, p_{i m})^{T}$. The within-subject correlation structure for the primary outcome is exchangeable with a correlation coefficient of 0.5. The marginal probability is $p_{i j} = 1 / \{1 + \exp (- X_{i j}^{T} β_{0})\}, j = 1, 2, \dots, m$, where $β_{0} = (β_{0}, β_{1}, β_{2}, β_{3})^{T} = (1, - 1, 0.3, - 0.5)^{T}$ is the true regression parameter vector, and $X_{i j} = (1, X_{i 1}, X_{i j 2}, X_{i j 3})^{T}$ is the covariate vector. Specifically, $X_{i 1}$ is a subject-level covariate generated from the uniform distribution over $[0, 1]$; $X_{i j 2} = \log_{10} (j) + ϵ_{i j}^{N}$ is a time-dependent covariate, where the error term $ϵ_{i j}^{N} \sim N (0, {0.3}^{2})$ is i.i.d. among different subjects and time points; $X_{i j 3}$ is a variable generated from the binary vector $({\check{X}}_{i 3}^{T}, {\acute{X}}_{i 3}^{T})^{T}$, where ${\check{X}}_{i 3} = (X_{i 13}, \dots, X_{i j 3}, \dots, X_{i m 3})^{T}$ and ${\acute{X}}_{i 3} = ({\tilde{X}}_{i 13}, \dots, {\tilde{X}}_{i j 3}, \dots, {\tilde{X}}_{i m 3})^{T}$, the latter being used for the secondary model. Note that $E (X_{i j 3}) = 0.4$, $E ({\tilde{X}}_{i j 3}) = 0.5$, and for $1 \leq j, j^{'} \leq m$, $Cor (X_{i j 3}, {\tilde{X}}_{i j^{'} 3})$ equals 0.6 if $j = j^{'}$ and equals 0.1 otherwise, $i = 1, \dots, n; j = 1, \dots, m$.

The generating model for secondary outcomes is ${\tilde{Y}}_{i j} = {\tilde{X}}_{i j}^{T} θ_{0} + {\tilde{ϵ}}_{i j}$ , where the true parameter is $θ_{0} = (θ_{0}, θ_{1}, θ_{2}, θ_{3})^{T} = (2, - 0.5, 1, - 1)^{T}$ and the covariate is ${\tilde{X}}_{i j} = (1, X_{i 1}, X_{i j 2}, {\tilde{X}}_{i j 3})^{T}$ , where ${\tilde{X}}_{i j 3}$ is defined above. To induce an association between the primary and secondary outcomes, we generate the error term by ${\tilde{ϵ}}_{i j} = r_{0} W_{i j} + (1 - r_{0}^{2})^{0.5} ϵ_{i j}$ , where $W_{i j}$ is the standardized value of $Y_{i j}$ , that is, $W_{i j} = (Y_{i j} - p_{i j}) / \sqrt{p_{i j} (1 - p_{i j})}$ , and $ϵ_{i j} \sim N (0, 1)$ . The parameter $r_{0}$ is set to be $0.89$ so that the correlation between $Y_{i m}$ and ${\tilde{Y}}_{i j}$ is around $0.8$ for $j = m$ , and is around $0.4$ otherwise.
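The construction of the secondary error term can be illustrated as follows. This sketch takes the binary outcomes and their marginal probabilities as given inputs (the generation of exchangeably correlated binary vectors is not reproduced here), and the function name `secondary_errors` is hypothetical. Since $W_{i j}$ has mean 0 and variance 1 and is independent of $ϵ_{i j}$, the resulting error has unit variance and correlation $r_{0}$ with $W_{i j}$.

```python
import numpy as np

def secondary_errors(Y, p, r0=0.89, seed=0):
    """Errors r0 * W + sqrt(1 - r0^2) * eps, where W standardizes the binary Y.

    Y : array of 0/1 primary outcomes; p : array of their success probabilities.
    """
    rng = np.random.default_rng(seed)
    W = (Y - p) / np.sqrt(p * (1.0 - p))   # standardized primary outcome
    eps = rng.standard_normal(np.shape(Y)) # independent N(0, 1) noise
    return r0 * W + np.sqrt(1.0 - r0**2) * eps
```

Adding these errors to ${\tilde{X}}_{i j}^{T} θ_{0}$ yields secondary outcomes whose association with the primary outcome is controlled by $r_{0}$.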

Next, we add dropouts for the binary vector $Y_{i}$ and generate $Y_{i m}$ with missing data under MAR. The probability of observing the outcome at time $j$ for subject $i$ is calculated as $w_{i j} = ξ_{i 1} \times \dots \times ξ_{i j}$, where $ξ_{i 1} = 1$ given that the data are always observed at baseline; for $j = 2, \dots, m$, we have $ξ_{i j} = \Pr (R_{i j} = 1 | Z_{i j}) = 1 / \{1 + \exp (- Z_{i j}^{T} η_{0})\}$ with the true parameter $η_{0} = (η_{0}, η_{1}, η_{2})^{T} = (2.65, - 3, 3)^{T}$ and $Z_{i j} = (1, X_{i j 2}, Y_{i, j - 1})^{T}$. Note that $R_{i j} = 0$ if $R_{i, j - 1} = 0$ because of dropout. Under this parameter set-up, the observation probability at the last visit, $\Pr (R_{m} = 1)$, is around 70%.
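The monotone dropout mechanism described above can be sketched as follows. The function name `simulate_dropout` and the array layout (one row per subject, one column per visit) are our own assumptions; the default value of `eta` is the $η_{0}$ stated in the text.

```python
import numpy as np

def simulate_dropout(X2, Y, eta=(2.65, -3.0, 3.0), seed=0):
    """Generate monotone observation indicators R of shape (n, m).

    X2 : (n, m) time-dependent covariate X_{ij2}.
    Y  : (n, m) binary outcomes Y_{ij}.
    Column 0 is the baseline visit, which is always observed.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    R = np.ones((n, m), dtype=int)
    for j in range(1, m):
        # Z_ij = (1, X_ij2, Y_{i,j-1}); xi is the conditional observation probability.
        lin = eta[0] + eta[1] * X2[:, j] + eta[2] * Y[:, j - 1]
        xi = 1.0 / (1.0 + np.exp(-lin))
        # Once a subject drops out (R = 0), all later indicators stay 0.
        R[:, j] = R[:, j - 1] * rng.binomial(1, xi)
    return R
```

The last column of the returned array plays the role of $R$ for the primary outcome $Y_{i m}$.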

For the primary, missing data, and secondary analysis, we consider the logistic regression, the marginal model of weighted GEE (WGEE),¹⁸ and the over-identified estimating functions suggested by Chen et al.,⁸ respectively. The specification of these estimating functions is presented in the Supplemental Material. In order to evaluate the estimation performance, summary statistics including bias, relative bias (RB), Monte Carlo standard deviation (SD), standard error (SE), relative efficiency (RE), and coverage probability (CP) are reported based on 1000 Monte Carlo replications. In the following, we first examine the cases with fully observed secondary outcomes to investigate the equivalence of the FJ and PP estimators and the robustness to secondary model misspecification. We proceed to examine the cases with missing secondary outcomes.
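For reference, most of the summary statistics listed above can be computed from the Monte Carlo replications as in the sketch below. The definitions are our reading of the table footnotes and should be treated as assumptions: RB is the absolute ratio of bias to the true value, SE is the average of the estimated standard errors, and CP is the proportion of 95% Wald intervals covering the truth. The RE is omitted because its reference estimator is not specified in this section, and the function name `summarize` is hypothetical.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.96):
    """Monte Carlo summary for one parameter across replications."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    bias = est.mean() - truth
    covered = np.abs(est - truth) <= z * se   # Wald interval covers the truth
    return {
        "bias": bias,
        "RB": abs(bias / truth),              # relative bias
        "SD": est.std(ddof=1),                # Monte Carlo standard deviation
        "SE": se.mean(),                      # average estimated standard error
        "CP": covered.mean(),                 # coverage probability
    }
```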

3.1. Equivalence of the FJ and PP estimators

As shown in Table 1, the estimation results of the FJ and PP estimators are comparable in finite samples. Compared to the complete case analysis, which does not adjust for the missing data, the bias of the PP estimator is substantially reduced. This confirms that the FJ and PP methods are effective ways to handle missing data. Among the four parameters, the estimate for $β_{2}$ has the largest bias, not only because the missing mechanism directly involves $X_{2}$, but also because the magnitude of the true value of $β_{2}$ is small (0.3). The performance of the estimators seems unsatisfactory when $n = 500$, but as the sample size increases, the bias gradually shrinks towards 0, the SE deviates less from the SD, and the CP approaches the nominal level. Note that a bias with RB around 1% is negligible and can be attributed to finite-sample error, which agrees with our theoretical derivation. Across all scenarios, the RE is > 1 and the IIB is about 0.2, indicating that the efficiency of all estimators has improved and that the variance has been reduced by 20% on average after applying the ELW. This can serve as a benchmark for comparison with the remaining cases. As the equivalence is verified, we use the PP estimator in the following cases.

Table 1.
Estimation results to evaluate equivalence of the FJ and PP estimators. All secondary outcomes are observed. All statistics except RE are multiplied by 100.

		Naive		FJ							PP
n	Para	Bias	RB	Bias	RB	SD	SE	RE	CP (%)	IIB (%)	Bias	RB	SD	SE	RE	CP (%)	IIB (%)
500	$β_{0}$	20.0	20.0	−0.2	0.2	35.6	32.8	1.26	92.9	21.9	−0.2	0.2	35.7	32.8	1.26	92.6	21.8
	$β_{1}$	1.6	1.6	−1.3	1.3	42.4	42.2	1.09	95.1		−1.3	1.3	42.4	42.2	1.09	95.3
	$β_{2}$	40.5	135.1	4.6	15.5	42.2	37.7	1.41	92.2		4.6	15.5	42.3	37.7	1.41	92.2
	$β_{3}$	−3.6	7.3	−0.3	0.7	25.6	24.8	1.10	93.2		−0.3	0.6	25.6	24.8	1.10	93.1
1000	$β_{0}$	18.9	18.9	−0.8	0.8	25.4	23.6	1.27	92.6	20.3	−0.8	0.8	25.4	23.6	1.27	92.7	20.3
	$β_{1}$	1.8	1.8	0.2	0.2	32.0	30.2	1.09	94.1		0.3	0.3	32.0	30.2	1.09	94.1
	$β_{2}$	40.2	134.0	1.8	6.0	29.3	27.7	1.35	92.8		1.8	6.0	29.3	27.7	1.35	92.7
	$β_{3}$	−3.1	6.2	0.2	0.5	18.9	17.7	1.06	92.8		0.2	0.5	18.9	17.7	1.06	92.8
2000	$β_{0}$	19.9	19.9	0.0	0.0	16.9	16.9	1.36	94.1	19.4	0.0	0.0	16.9	16.9	1.36	94.2	19.4
	$β_{1}$	1.1	1.1	−0.7	0.7	21.7	21.5	1.10	94.0		−0.6	0.6	21.7	21.5	1.10	94.0
	$β_{2}$	39.1	130.2	1.2	3.9	20.6	20.0	1.41	93.5		1.2	3.9	20.6	20.0	1.41	93.6
	$β_{3}$	−3.7	7.3	−0.4	0.9	13.0	12.6	1.12	94.1		−0.4	0.9	13.0	12.6	1.12	94.1

Naive: complete case model; Para: regression parameters $β$ ; RB: relative bias, absolute ratio of bias over the true parameter values; SD: Monte Carlo standard deviation; SE: standard error; RE: relative efficiency; CP: 95% coverage probability; IIB: index for information borrowing; JM: joint modeling.

3.2. Robustness to secondary model misspecification

Now we examine the robustness of our proposed method to model misspecification in the secondary analysis. Recall that the generating model for secondary outcomes takes the form $\tilde{Y} \sim 1 + X_{1} + X_{2} + {\tilde{X}}_{3}$, so we consider two misspecification scenarios. In the first scenario, we remove $X_{1}$ from the secondary analysis; in other words, we regress $\tilde{Y}$ on $\{1, X_{2}, {\tilde{X}}_{3}\}$. Then we derive new propensity scores and apply them to the primary analysis as before. Similarly, in the second scenario, we use $X_{3}$ instead of ${\tilde{X}}_{3}$ for the secondary analysis, which means we regress $\tilde{Y}$ on $\{1, X_{1}, X_{2}, X_{3}\}$. Note that even though $X_{3}$ and ${\tilde{X}}_{3}$ are correlated, they are not identical. Hence the second case is also a misspecified model.

The results are summarized in Table 2. Again, the estimator is consistent to the true value, and the SE and CP are close to the SD and nominal level, respectively. Next, we look at the precision improvement. In the first case, the IIB becomes smaller, which is about 0.15, compared with the benchmark. This is mainly because the RE of $β_{1}$ is slightly below 1, which is not surprising because $X_{1}$ is not included in secondary analysis. The RE for the other parameters are similar to or slightly smaller than the one with correctly specified model. As for the second case, it is interesting to note that the IIB is even bigger than the benchmark. This difference is mainly due to the larger RE for $β_{3}$ . It suggests that using covariates in the primary analysis as the regressors in secondary analysis may lead to better variance reduction. Overall, this simulation verifies that our method is robust to the model misspecification.

Table 2.
Estimation results to evaluate robustness to secondary model misspecification. All secondary outcomes are observed. All statistics except RE are multiplied by 100.

		No $X_{1}$							Use $X_{3}$ instead of ${\tilde{X}}_{3}$
n	Para	Bias	RB	SD	SE	RE	CP (%)	IIB (%)	Bias	RB	SD	SE	RE	CP (%)	IIB (%)
500	$β_{0}$	−0.1	0.1	36.4	34.1	1.21	93.4	16.7	0.3	0.3	35.9	33.4	1.24	93.3	23.3
	$β_{1}$	−1.4	1.4	44.5	45.3	0.99	95.5		−1.5	1.5	42.9	42.8	1.07	95.3
	$β_{2}$	4.7	15.5	42.5	38.0	1.39	92.3		4.1	13.5	43.6	38.9	1.33	91.8
	$β_{3}$	−0.4	0.8	25.6	24.9	1.10	92.9		0.1	0.2	23.8	22.5	1.28	93.3
1000	$β_{0}$	−0.7	0.7	25.9	24.4	1.22	93.9	15.5	−0.6	0.6	25.9	24.0	1.22	93.6	21.8
	$β_{1}$	0.2	0.2	33.6	32.3	0.99	93.8		0.1	0.1	32.2	30.6	1.07	93.8
	$β_{2}$	1.8	6.0	29.4	27.8	1.34	93.1		1.7	5.6	30.0	28.5	1.29	92.6
	$β_{3}$	0.2	0.4	19.0	17.7	1.06	92.7		0.0	0.0	17.2	16.1	1.29	93.4
2000	$β_{0}$	0.2	0.2	17.4	17.4	1.28	94.9	14.8	0.0	0.0	17.1	17.1	1.32	94.7	20.9
	$β_{1}$	−1.0	1.0	22.8	23.0	1.00	94.1		−0.7	0.7	21.9	21.8	1.09	94.8
	$β_{2}$	1.2	4.0	20.6	20.1	1.40	94.0		1.3	4.3	21.2	20.5	1.33	94.2
	$β_{3}$	−0.4	0.9	13.0	12.6	1.13	94.0		−0.4	0.8	12.0	11.5	1.31	94.0

Para: regression parameters $β$ ; RB: relative bias, absolute ratio of bias over the true parameter values; SD: Monte Carlo standard deviation; SE: standard error; RE: relative efficiency; CP: 95% coverage probability; IIB: index of information borrowing.

3.3. Secondary outcomes with missing data

In the previous cases, we assumed that secondary outcomes are completely observed. We now relax this assumption and allow missing data in secondary outcomes, as is commonly encountered in practice. We consider two scenarios. In the first, the secondary outcome has exactly the same missing pattern as the primary outcome; in other words, the secondary outcome is missing simultaneously with the primary one. In the second, the primary and secondary outcomes have different missing patterns. For subject $i$ at time $j = 1, 2, \dots, m$, the missing indicator for the secondary outcome is ${\tilde{R}}_{ij} = R_{ij} R^{*}_{ij}$, where $R^{*}_{ij}$ is another dropout missing indicator defined as follows. Let $w^{*}_{ij} = \Pr(R^{*}_{ij} = 1) = ξ^{*}_{i1} \times \dots \times ξ^{*}_{ij}$, where $ξ^{*}_{i1} = \Pr(R^{*}_{i1} = 1) = 1$; for $j = 2, \dots, m$, $ξ^{*}_{ij} = \Pr(R^{*}_{ij} = 1 | Z^{*}_{ij}) = 1 / \{1 + \exp(- Z_{ij}^{*T} \boldsymbol{η}^{*}_{0})\}$ with the true parameter $\boldsymbol{η}^{*}_{0} = (η^{*}_{0}, η^{*}_{1}, η^{*}_{2})^{T} = (6.5, 1, -2)^{T}$ and covariates $Z^{*}_{ij} = (1, {\tilde{X}}_{ij3}, {\tilde{Y}}_{i,j-1})^{T}$. Similar to above, $R^{*}_{ij} = 0$ if $R^{*}_{i,j-1} = 0$. Under this parameter set-up, the observation probability at the last visit, $\Pr(R_{m} = 1)$, is around 50%, and secondary outcomes also follow a dropout missing pattern but with a higher missing probability than the primary outcome. For uniform mapping, we use JM imputation instead of MICE because JM yields a larger IIB.

The simulation results are summarized in Table 3. As discussed, the classic ELW method may introduce bias when there is no unified $θ_{*}$ that ensures $E[h(D^{s}; θ_{*})] = 0$. As shown in Table 3, the estimators that simply use the IPW method for the primary analysis and do not handle missing secondary outcomes are substantially biased. For example, in the simultaneous missing case, the estimate of $β_{2}$ has about 15% RB even when the sample size is large (e.g. $n = 2000$). In comparison, after implementing uniform mapping to handle secondary outcomes, the estimates regain consistency in both cases. In terms of variance reduction, the combination of IPW and uniform mapping in the simultaneous missing case has an IIB similar to the benchmark. This illustrates the power of uniform mapping, which can potentially “recover” secondary outcomes in the sense that it provides as much efficiency improvement as the fully observed data. On the other hand, the IIB in the different missing case is slightly smaller than the benchmark, which is not surprising because there are more missing data in secondary outcomes and the missing pattern is more complicated. Overall, our proposed estimator still works well after the missing secondary outcomes are properly handled by the uniform mapping strategy.
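The monotone dropout mechanism for the secondary outcome described above can be simulated as in the following sketch. The logistic form and the value $(6.5, 1, -2)$ come from the text. The function name `simulate_dropout`, the number of visits, the toy covariate and outcome distributions, and the all-ones placeholder for the primary indicator $R_{ij}$ are assumptions made for illustration.

```python
import numpy as np

def simulate_dropout(x3_tilde, y_tilde, eta=(6.5, 1.0, -2.0), rng=None):
    """Simulate a monotone dropout indicator R* (hypothetical helper).

    x3_tilde, y_tilde: arrays of shape (n, m), one column per visit.
    Returns (r_star, w_star), where w_star holds the cumulative products of
    the visit-wise probabilities xi*.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m = x3_tilde.shape
    xi = np.ones((n, m))                 # xi*_{i1} = 1: everyone observed at visit 1
    r_star = np.ones((n, m), dtype=int)
    for j in range(1, m):
        # Z*_{ij} = (1, X3_tilde at visit j, Y_tilde at visit j - 1)
        lin = eta[0] + eta[1] * x3_tilde[:, j] + eta[2] * y_tilde[:, j - 1]
        xi[:, j] = 1.0 / (1.0 + np.exp(-lin))
        stay = rng.random(n) < xi[:, j]
        # Dropout is absorbing: R*_{ij} = 0 whenever R*_{i,j-1} = 0
        r_star[:, j] = r_star[:, j - 1] * stay
    w_star = np.cumprod(xi, axis=1)
    return r_star, w_star

# Toy inputs (assumed distributions; m = 4 visits is an assumption)
rng = np.random.default_rng(3)
n, m = 1000, 4
x3_tilde = rng.normal(size=(n, m))
y_tilde = rng.normal(loc=3.0, size=(n, m))
r_star, w_star = simulate_dropout(x3_tilde, y_tilde, rng=rng)

r_primary = np.ones((n, m), dtype=int)   # placeholder for the primary indicator R
r_tilde = r_primary * r_star             # R_tilde = R * R*
```

Because ${\tilde{R}}_{ij}$ is a product of two indicators, a secondary outcome can be observed only when the primary dropout indicator is also 1, which is why the secondary outcomes have the higher missing probability.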

Table 3.
Estimation results for the scenarios with missing secondary outcomes. All statistics except RE are multiplied by 100.

		Simultaneous missing									Different missing
		No UM		Use UM							No UM		Use UM
n	Para	Bias	RB	Bias	RB	SD	SE	RE	CP (%)	IIB (%)	Bias	RB	Bias	RB	SD	SE	RE	CP (%)	IIB (%)
500	$β_{0}$	−9.9	9.9	−2.2	2.2	33.6	32.2	1.42	94.9	23.1	3.6	3.6	−1.5	1.5	36.4	33.1	1.21	93.0	20.4
	$β_{1}$	−4.9	4.9	−1.0	1.0	41.1	41.7	1.16	95.7		−3.9	3.9	−1.7	1.7	41.6	42.5	1.13	95.5
	$β_{2}$	7.7	25.8	3.6	12.1	39.4	36.6	1.63	92.9		−5.4	18.1	3.2	10.6	42.5	37.8	1.40	91.5
	$β_{3}$	0.0	0.0	−0.3	0.5	24.9	24.5	1.17	93.4		−2.0	3.9	0.1	0.1	25.6	24.7	1.11	93.9
1000	$β_{0}$	−10.1	10.1	−3.4	3.4	24.9	23.1	1.30	92.6	21.8	3.6	3.6	−2.0	2.0	25.8	23.7	1.22	92.7	19.2
	$β_{1}$	−3.9	3.9	0.6	0.6	31.0	29.8	1.16	94.7		−2.9	2.9	−0.3	0.3	31.4	30.3	1.13	94.4
	$β_{2}$	4.9	16.3	2.0	6.6	28.1	26.8	1.47	93.8		−8.3	27.6	1.1	3.7	30.4	27.6	1.26	92.5
	$β_{3}$	0.8	1.6	0.4	0.8	18.9	17.5	1.07	93.4		−1.3	2.6	0.2	0.4	19.1	17.6	1.04	93.1
2000	$β_{0}$	−9.4	9.4	−1.7	1.7	16.4	16.5	1.43	94.7	20.9	4.5	4.5	−0.3	0.3	17.8	16.9	1.23	94.2	18.4
	$β_{1}$	−4.2	4.2	−0.7	0.7	20.6	21.2	1.22	95.3		−3.4	3.4	−1.6	1.6	21.6	21.6	1.11	95.5
	$β_{2}$	4.3	14.3	0.6	1.9	19.5	19.4	1.58	94.1		−9.2	30.6	−0.5	1.8	21.1	19.9	1.34	93.4
	$β_{3}$	0.0	0.1	−0.4	0.8	12.9	12.4	1.14	94.5		−2.2	4.4	−0.5	0.9	13.1	12.5	1.10	94.1

4. Data application

Dementia is a common syndrome among the elderly, characterized by a decline in memory and cognitive functioning leading to a loss of independent living. With the rapid growth of the elderly population, the number of people living with dementia is expected to reach 65.7 million by 2030 and 115.4 million by 2050 worldwide.¹⁹ To illustrate our method and meet clinical needs, we use the Uniform Data Set from the National Alzheimer’s Coordinating Center (NACC)²⁰ to investigate baseline risk factors for developing dementia after a three-year follow-up, which holds primary clinical significance. The onset of dementia is a gradual process that unfolds over a relatively extended period. Gaining insight into dementia risk within this timeframe and pinpointing associated risk factors can facilitate early disease detection and enhance healthcare management.^21,22

Our analysis focuses on subjects aged 55 or older who have not developed dementia at baseline. To better demonstrate the empirical performance of our method, we randomly select one Alzheimer’s Disease Research Center, with site ID 8646, and use the subjects enrolled between 2005 and 2014 for our analysis, resulting in a sample size of 733. The summary statistics of subjects in this site and the others are provided in the Supplemental Materials. The baseline risk factors of interest include age, sex (female vs. male), race (non-white vs. white), years of education, body mass index (BMI), depression, and APOE genotype (including the $ε$4 allele vs. others). The secondary outcomes are the MMSE scores, on a scale ranging from 0 to 30, where a lower score indicates weaker cognitive functioning. The complete case correlation between MMSE and occurrence of dementia at three-year follow-up is $-0.52$. Although MMSE is correlated with the primary outcome, it is not straightforward to incorporate it into the primary analysis as a covariate since it is in longitudinal format. Ideally, each subject has four measurements during the three-year follow-up (one at baseline and three annual follow-ups), but some subjects dropped out, and only 457 (62%) of the 733 subjects had complete measurements. Trajectories of the primary and secondary outcomes for five randomly selected subjects are displayed in Figure 1. The figure shows that all subjects were observed at baseline, but some dropped out during the follow-up; for those who dropped out, the primary outcome of interest (i.e. developing dementia 3 years later) was not observed, and their secondary outcomes were only partially observed. These data features and challenges provide an opportunity to demonstrate the benefits of our proposed method.

Figure 1.

Trajectories of outcomes for five randomly selected subjects from the National Alzheimer’s Coordinating Center (NACC) study (panel A: primary outcomes and panel B: secondary outcomes).

We adopt logistic regression for the primary analysis, regressing the primary outcome on the baseline risk factors of interest listed above. For the missing data analysis, we use the dropout model by Robins et al.¹⁸ as in the simulation. The covariates are the risk factors and MMSE at the last visit instead of the baseline measurements, because we believe the last measurements are more informative; indeed, the resulting model has a smaller AIC, indicating a better fit. The analysis reveals that higher MMSE at the last visit and more years of education are significant factors for subjects to be observed in follow-up. Additionally, dropout is more likely to occur for individuals with dementia at the last visit (marginally significant), higher age at the last visit, and non-white race. These findings provide evidence that the primary outcome is not MCAR; the detailed results are presented in the Supplemental Materials. For the secondary analysis, we regress MMSE on risk factors at the same visit. Because some time-varying covariates such as BMI are missing simultaneously with the outcome MMSE, we impute them as well. We use JM imputation for uniform mapping, and we impute the data 100 times. After imputing the data, we conduct the regression with the same working model used in the simulation. The p-value for testing whether ${\hat{β}}_{PP}$ is equivalent to ${\hat{β}}_{IPW}$ is $0.10$. In comparison, the p-value without uniform mapping is below $0.0001$. This validates the role of uniform mapping in reducing bias and the robustness of our proposed method.

The results of the primary analysis using different models are summarized in Table 4. The naive and IPW models produce similar results, with both indicating the significance of the covariates APOE, age, depression, and BMI. Specifically, an APOE genotype carrying the $ε$4 allele, higher age, and a diagnosis of depression increase the risk of dementia, whereas higher BMI is protective against the development of dementia. These results are consistent with the existing literature.^23–25 In comparison with the IPW model, our proposed method yields similar coefficient estimates and lower variability in general. All estimates, except that of the education covariate, show an improvement in efficiency. Our method reduces variance by $\sim$16% on average, as indicated by the IIB. Besides detecting the covariates found significant by the IPW model, our proposed method further identifies the significance of the race covariate, owing to the substantial efficiency improvement in its coefficient (RE = 1.53). This suggests that non-white people are more prone to develop dementia, which has been supported by some literature based on large-scale studies.^26,27
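The IPW step of the primary analysis can be sketched as a weighted logistic regression in which each observed subject is weighted by the inverse of the probability of being observed. The sketch below is a simplified illustration under stated assumptions: the observation probabilities are treated as known rather than estimated from the dropout model, the toy data and the helper name `ipw_logistic` are hypothetical, and the uniform mapping and ELW steps are not included.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def ipw_logistic(X, y, observed, obs_prob, n_iter=50, tol=1e-10):
    """Solve sum_i (R_i / w_i) x_i {y_i - expit(x_i' beta)} = 0 by Newton-Raphson."""
    X = np.asarray(X, dtype=float)
    observed = np.asarray(observed)
    weights = np.where(observed == 1, 1.0 / np.asarray(obs_prob, dtype=float), 0.0)
    y_used = np.where(observed == 1, y, 0.0)   # unobserved outcomes carry weight 0
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = expit(X @ beta)
        score = X.T @ (weights * (y_used - mu))
        info = (X * (weights * mu * (1.0 - mu))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy data (assumed): missingness depends on an always-observed auxiliary
# variable that is related to the outcome, so the outcome is not MCAR.
rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, expit(X @ true_beta))
aux = y + rng.normal(size=n)
obs_prob = expit(0.5 + aux)
observed = rng.binomial(1, obs_prob)

beta_ipw = ipw_logistic(X, y, observed, obs_prob)       # weighted by 1 / obs_prob
beta_naive = ipw_logistic(X, y, observed, np.ones(n))   # complete-case analysis
```

In this toy setting the complete-case intercept is shifted away from the truth, whereas the weighted fit is not, which mirrors the contrast between the naive and IPW columns reported for the simulations.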

Table 4.

Estimation results for the risk-effect model with three-year dementia as the primary outcome using the NACC study.

	Naive			IPW			IPW $+$ UM $+$ ELW (IIB: 16%)
	EST	SE	P	EST	SE	P	EST	SE	P	RE
Intercept	−9.50	2.56	0.00	−9.39	2.55	0.00	−7.79	2.38	0.00	1.15
Female	−0.32	0.35	0.36	−0.23	0.37	0.53	−0.10	0.35	0.77	1.13
Education	0.00	0.06	0.99	0.04	0.06	0.51	0.07	0.06	0.20	0.99
Race	0.53	0.52	0.30	0.81	0.56	0.15	1.01	0.45	0.03	1.53
APOE	1.17	0.34	0.00	1.39	0.36	0.00	1.22	0.33	0.00	1.20
Age	0.14	0.03	0.00	0.13	0.03	0.00	0.09	0.02	0.00	1.34
Depression	1.04	0.42	0.01	0.97	0.42	0.02	0.84	0.40	0.04	1.09
BMI	−0.13	0.05	0.00	−0.14	0.05	0.01	−0.12	0.05	0.01	1.25

Naive: complete case model, not using IPW or ELW; IPW: model using IPW only; IPW $+$ UM $+$ ELW: the proposed model, using IPW, uniform mapping (UM), and ELW; IIB: index for information borrowing; EST: parameter estimates; SE: estimated standard error; P: p-value; RE: relative efficiency between the coefficients of the IPW model and those of the IPW $+$ UM $+$ ELW model; Education: years of education; Race: non-white versus white; APOE: genotype including the $ε$4 allele versus others.
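As a reading aid for Table 4, the sketch below recomputes two-sided p-values from the rounded EST and SE entries for the race covariate. It assumes the reported P values are Wald-type tests based on a normal reference distribution, which is not stated explicitly in the text. Because the table entries are rounded, the recomputed values agree with the P column only approximately.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value of a Wald z-test (assumed test; hypothetical helper)."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))

# Race coefficient, rounded values taken from Table 4
race_ipw = wald_p_value(0.81, 0.56)        # IPW model
race_proposed = wald_p_value(1.01, 0.45)   # IPW + UM + ELW model
```

The smaller standard error under the proposed method moves the race coefficient from above to below the conventional 0.05 threshold, which is the efficiency effect discussed in the text.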

For a more in-depth investigation, we also consider another longitudinal continuous variable, the global score from the Clinical Dementia Rating (CDR) Dementia Staging Instrument, as the secondary outcome. The post-hoc analysis results are presented in the Supplemental Material owing to space limitations and the relatively weaker efficiency improvement. In practice, we recommend that researchers consider different variables as the secondary outcome for sensitivity analysis and explore their potential for efficiency gain. How to synthesize information from multiple secondary outcomes simultaneously into the primary analysis to further improve efficiency and robustness is part of our ongoing work.²⁸ Furthermore, we have not explicitly considered competing risk events, such as deaths, because of the relatively low death rate within our cohort over the three-year period (2%). Under such a scenario, any impact on our results is expected to be minimal, which is further supported by our post-hoc empirical studies. However, informative censoring resulting from deaths warrants further investigation as part of our model extensions, a direction we intend to explore in future work.

5. Discussion

In this work, we generalize an existing data borrowing method and extend it to the context of missing data with improved efficiency and robustness. We adopt the IPW technique to ensure the unbiasedness of the primary outcome analysis and further propose a uniform mapping strategy for the secondary outcome analysis, which employs multiple imputations to make the secondary outcomes homogeneous in the sense that they are governed by a unified $θ_{*}$. Our method is robust to misspecified working models and to scenarios with missing secondary outcomes. The idea of uniform mapping represents a pioneering approach in the literature, opening new avenues for multiple imputations and demonstrating the potential of multiple imputations in robust statistics and missing data analysis.

It’s worth noting that there are alternative techniques for leveraging information from secondary outcomes, such as structural equation modeling (SEM), although they are not applied in our current context. SEM is a statistical method used to explore complex (causal and directional) relationships among variables, both observed and latent, requiring the specification of a path diagram and a model between primary and secondary outcomes. However, our method pursues a distinct objective aiming to improve the efficiency of our primary outcome analysis. Also, we do not necessitate the specification of directional relationships between primary and secondary outcomes, which could be challenging to disentangle in various circumstances. Therefore, our method offers broader applicability, robustness, and generalization. It’s worth emphasizing that our approach can be integrated into SEM to enhance causal inference within that specific framework, a direction we plan to delve into further in our future research.

Motivated by our application, we currently consider a cross-sectional primary outcome and longitudinal secondary outcomes. Of note, our method can be flexibly extended to other contexts. For instance, the IPW method can be applied to longitudinal data, as in WGEE.¹⁸ Our method can also handle cross-sectional secondary outcomes by constructing an over-identified estimating equation based on them.⁸ In addition, there are other ways to handle missing data in the primary analysis, such as imputation and the EM algorithm. How to combine these methods with ELW and uniform mapping effectively and efficiently is of great interest. We also aim to explore borrowing information from multiple secondary outcomes to further improve our method’s effectiveness. Furthermore, our current focus in the data application lies on examining the main effects of the risk factors. It would be intriguing to explore their interactions within the model and devise information criteria for selecting interaction terms, thereby potentially enhancing the informativeness of our findings. These topics will be investigated in future studies.

While our work mainly assumes fully observed covariates, we acknowledge that in practice, the covariates are often missing simultaneously with the outcomes. Nevertheless, our proposed method remains valid when the outcomes are MAR, and we can easily extend our method to impute the missing covariates along with the outcomes. In some cases, the uniform mapping may introduce variability to the estimator, resulting in reduced efficiency compared to the ELW estimator not involving the uniform mapping. Thus, the uniform mapping strategy can be viewed as a remedy to the secondary analysis when a unified $θ_{*}$ cannot be found due to secondary outcomes under MAR. In future work, we will exclusively evaluate the strategy’s properties under our framework with finite samples to broaden our method’s practical applications. Furthermore, while the exact cause of dropouts is not clear from the data, it is likely that participants experiencing worsening cognitive impairment, functional difficulties and neuropsychiatric symptoms are more prone to being lost to follow-up. Therefore, addressing non-death-related dropouts requires further investigation and attention.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802241254195 - Supplemental material for Robust integration of secondary outcomes information into primary outcome analysis in the presence of missing data

Supplemental material, sj-pdf-1-smm-10.1177_09622802241254195 for Robust integration of secondary outcomes information into primary outcome analysis in the presence of missing data by Daxuan Deng, Vernon M Chinchilli, Hao Feng, Chixiang Chen and Ming Wang in Statistical Methods in Medical Research

Footnotes

Acknowledgements

Wang’s work was supported by the start-up funding from Department of Population and Quantitative Health Sciences at Case Western Reserve University. The contents of this article are solely the responsibility of the authors, and do not represent the official views of NIH.

This article was prepared using the NACC database, which is funded by the National Institute on Aging (NIA) of the NIH Grant U24 AG072122. NACC data are contributed by the NIA-funded ADRCs: P30 AG062429 (PI James Brewer, MD, PhD), P30 AG066468 (PI Oscar Lopez, MD), P30 AG062421 (PI Bradley Hyman, MD, PhD), P30 AG066509 (PI Thomas Grabowski, MD), P30 AG066514 (PI Mary Sano, PhD), P30 AG066530 (PI Helena Chui, MD), P30 AG066507 (PI Marilyn Albert, PhD), P30 AG066444 (PI John Morris, MD), P30 AG066518 (PI Jeffrey Kaye, MD), P30 AG066512 (PI Thomas Wisniewski, MD), P30 AG066462 (PI Scott Small, MD), P30 AG072979 (PI David Wolk, MD), P30 AG072972 (PI Charles DeCarli, MD), P30 AG072976 (PI Andrew Saykin, PsyD), P30 AG072975 (PI David Bennett, MD), P30 AG072978 (PI Neil Kowall, MD), P30 AG072977 (PI Robert Vassar, PhD), P30 AG066519 (PI Frank LaFerla, PhD), P30 AG062677 (PI Ronald Petersen, MD, PhD), P30 AG079280 (PI Eric Reiman, MD), P30 AG062422 (PI Gil Rabinovici, MD), P30 AG066511 (PI Allan Levey, MD, PhD), P30 AG072946 (PI Linda Van Eldik, PhD), P30 AG062715 (PI Sanjay Asthana, MD, FRCP), P30 AG072973 (PI Russell Swerdlow, MD), P30 AG066506 (PI Todd Golde, MD, PhD), P30 AG066508 (PI Stephen Strittmatter, MD, PhD), P30 AG066515 (PI Victor Henderson, MD, MS), P30 AG072947 (PI Suzanne Craft, PhD), P30 AG072931 (PI Henry Paulson, MD, PhD), P30 AG066546 (PI Sudha Seshadri, MD), P20 AG068024 (PI Erik Roberson, MD, PhD), P20 AG068053 (PI Justin Miller, PhD), P20 AG068077 (PI Gary Rosenberg, MD), P20 AG068082 (PI Angela Jefferson, PhD), P30 AG072958 (PI Heather Whitson, MD), and P30 AG072959 (PI James Leverenz, MD).

Author contributions

All authors have made important contributions and have approved this work.

Data availability

The data that support the findings of this study are from the NACC database, which is funded by the National Institute on Aging (NIA) of the NIH Grant U24 AG072122. The NACC data are available upon request, which should be directed to the National Alzheimer’s Coordinating Center.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

ORCID iDs

Hao Feng

Ming Wang

Supplemental material

Supplemental materials for this article are available online. Simulation codes for this article are available at GitHub.

References

1. Arevalo-Rodriguez, Smailagic, Roqué-Figuls, et al. Mini-mental state examination (MMSE) for the early detection of dementia in people with mild cognitive impairment (MCI). Cochrane Database Syst Rev 2021; 7(7): CD010783.

2. Pangman, Sloan, Guse. An examination of psychometric properties of the mini-mental state examination and the standardized mini-mental state examination: implications for clinical practice. Appl Nurs Res 2000; 13: 209–213.

3. Hughes, Berg, Danziger, et al. A new clinical scale for the staging of dementia. Br J Psychiatry 1982; 140: 566–572.

4. Chatterjee, Chen, Maas, et al. Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. J Am Stat Assoc 2016; 111: 107–117.

5. Han, Lawless. Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Stat Sin 2019; 29: 1321–1342.

6. Lumley, Shaw, Dai. Connections between survey calibration estimators and semiparametric models for incomplete data. Int Stat Rev 2011; 79: 200–220.

7. Cheng, Taylor, et al. Informing a risk prediction model for binary outcomes with external coefficient information. J R Stat Soc: Ser C (Appl Stat) 2019; 68: 121–139.

8. Chen, Han. Improving main analysis by borrowing information from auxiliary data. Stat Med 2022; 41: 567–579.

9. Rubin. Inference and missing data. Biometrika 1976; 63: 581–592.

10. Robins, Rotnitzky, Zhao. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 1994; 89: 846–866.

11. Godambe. Estimating functions. Oxford: Oxford University Press, 1991.

12. Newey, McFadden. Large sample estimation and hypothesis testing. Handbook Economet 1994; 4: 2111–2245.

13. Qin, Lawless. Empirical likelihood and general estimating equations. Ann Stat 1994; 22: 300–325.

14. Carpenter, Kenward, Vansteelandt. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. J R Stat Soc: Ser A (Stat Soc) 2006; 169: 571–584.

15. Perkins, Cole, Harel, et al. Principled approaches to missing data in epidemiologic studies. Am J Epidemiol 2017; 187: 568–575.

16. Carpenter, Smuk. Missing data: a statistical framework for practice. Biometrical J 2021; 63: 915–947.

17. Lindsay. Improving generalised estimating equations using quadratic inference functions. Biometrika 2000; 87: 823–836.

18. Robins, Rotnitzky, Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 1995; 90: 106–121.

19. Prince, Bryce, Albanese, et al. The global prevalence of dementia: a systematic review and metaanalysis. Alzheimers Dement 2013; 9: 63–75.

20. Beekly, Ramos, Lee, et al. The national Alzheimer’s coordinating center (NACC) database: the uniform data set. Alzheimer Dis Assoc Disord 2007; 21: 249–258.

21. Grober, Sanders, Hall, et al. Free and cued selective reminding identifies very mild dementia in primary care. Alzheimer Dis Assoc Disord 2010; 24: 284.

22. Derby, Burns, Wang, et al. Screening for predementia AD: time-dependent operating characteristics of episodic memory tests. Neurology 2013; 80: 1307–1314.

23. Schaffert, LoBue, White III, et al. Risk factors for earlier dementia onset in autopsy-confirmed Alzheimer’s disease, mixed Alzheimer’s with Lewy bodies, and pure Lewy body disease. Alzheimers Dement 2020; 16: 524–530.

24. Fitzpatrick, Kuller, Lopez, et al. Midlife and late-life obesity and the risk of dementia: cardiovascular health study. Arch Neurol 2009; 66: 336–342.

25. Qizilbash, Gregson, Johnson, et al. BMI and risk of dementia in two million people over two decades: a retrospective cohort study. Lancet Diabetes Endocrinol 2015; 3: 431–436.

26. Kornblith, Bahorik, Boscardin, et al. Association of race and ethnicity with incidence of dementia among older adults. J Am Med Assoc 2022; 327: 1488–1495.

27. Shiekh, Cadogan, Lin, et al. Ethnic differences in dementia risk: a systematic review and meta-analysis. J Alzheimers Dis 2021; 80(1): 337–355.

28. Chen, Wang, Chen. An efficient data integration scheme to synthesize information from multiple secondary outcomes into parameter inference in main analysis. Biometrics 2022. In Press.

