Estimating trends in the incidence rate with interval censored data and time-dependent covariates

Abstract

We propose a multiple imputation method for estimating the incidence rate with interval censored data and time-dependent (and/or time-independent) covariates. The method has two stages. First, we use a semi-parametric G-transformation model to estimate the cumulative baseline hazard function and the effects of the time-dependent (and/or time-independent) covariates on the interval censored infection times. Second, we derive the participant's unique cumulative distribution function and impute infection times conditional on the covariate values. To assess performance, we simulated infection times from a Cox proportional hazards model and induced interval censoring by varying the testing rate, e.g., participants test 100%, 75%, 50% of the time, etc. We then compared the incidence rate estimates from our G-imputation approach with single random-point and mid-point imputation. Our G-imputation approach gave more accurate incidence rate estimates and appropriate standard errors for models with time-independent covariates only, time-dependent covariates only, and a mixture of time-dependent and time-independent covariates across various testing rates. We demonstrate, for the first time, a multiple imputation approach for incidence rate estimation with interval censored data and time-dependent (and/or time-independent) covariates.

Keywords

Interval censoring, mid-point, multiple imputation, Cox proportional hazard model, incidence rate, HIV

1 Introduction

The gold-standard for estimating the incidence rate is to periodically test a cohort of uninfected participants for new infection events.¹ One limitation of this approach, however, is that we rarely (if ever) observe the exact time of infection. Instead, we know only that the infection event occurs within an interval censored by the latest-negative and earliest-positive test times. Our uncertainty about the timing of the infection event will be proportional to the length of the censoring interval, which will increase if tests are missed or scheduled at lengthy time intervals (i.e., every 12 or 24 months), as is often the case in large, population-based HIV incidence cohorts.^2,3 For example, in what year does the seroconversion event occur if a participant tests HIV-negative in June 2016, misses the scheduled test in July 2017, and tests HIV-positive in June 2018? Because the censoring interval extends across one or more years, we cannot identify the aggregating interval (i.e., the year) in which the infection event truly occurs. Extended interval censoring poses methodological challenges for accurately estimating the rate of new infections over time.⁴

A common solution to the interval censoring problem is to impute an infection time. After right-censoring the data, the incidence rate can then be estimated using standard statistical methods.^5–9 However, popular deterministic approaches, which impute an infection time at the mid-point or end-point of the censoring interval, can lead to artefactual trends in the incidence rates.⁴ A better approach is to multiply impute a single infection time from a uniform distribution, which gives estimates close enough to the true incidence rate and accounts for variation across the imputed values.⁴ Nevertheless, the single random-point method assumes that the hazard of infection is constant across the censored interval, thus discarding potentially useful information about the timing of the infection event. For example, in sub-Saharan Africa, we know that the time to HIV acquisition is shorter for younger women (aged 19–25 years) when compared with their demographic counterparts.^10–13 Use of covariate data could reduce uncertainty about the infection time and improve the accuracy of our incidence rate estimates.

In recent years, several methods have been developed to infer the effects of covariates on interval censored infection times.^9,14 Hsu et al.¹⁵ proposed a Cox proportional hazards model with time-independent covariates to identify the nearest neighbors of infection. They then derived a non-parametric distribution (analogous to a Kaplan-Meier estimator) from this neighborhood and imputed the infection times. Pan¹⁶ also developed a multiple imputation approach for failure times using a Cox proportional hazards model with time-independent covariates, similar to Chen and Sun.¹⁷ In these model-based approaches, the aim was to estimate the regression parameters for interval censored data, with empirical demonstrations limited to the case of one or more time-independent covariates. To date, these models have not been extended to estimate the incidence rate with interval censored data and time-dependent covariates.

To estimate the incidence rate with interval censored data, we propose a multiple imputation approach that is a function of the participant's time-dependent and/or time-independent covariates. Our imputation approach has two stages. For stage 1, we use a semi-parametric G-transformation model to estimate the cumulative baseline hazard function and the effects of covariates on the interval censored infection times. The G-transformation model was recently developed by Zeng et al.^18,19 and advances existing proportional hazard methods by accommodating interval censored data and time-dependent covariates. For stage 2, we derive the participant's unique cumulative distribution function and impute the infection times conditional on the covariate values, which is the main contribution of our study. We undertook a simulation study to evaluate the performance of our G-imputation model for incidence rate estimation under various testing rates. For our empirical demonstration, we used interval censored data from one of sub-Saharan Africa's largest and longest-running HIV surveillance systems to estimate the annual incidence rate conditional on time-independent (sex) and time-dependent (age, marital status, and community HIV prevalence) covariates.

2 Methods

We propose a method for estimating trends in the incidence rate with interval censored data and time-dependent and/or time-independent covariates. Consider an incidence cohort study with $i = 1, \dots, n$ participants. Let K denote the number of test times and let $U_{k}$ denote the kth test time, where $0 = U_{1} < U_{2} < \dots < U_{K}$ ( $k = 1, \dots, K$ ). We set $K \geq 2$ to indicate that the n participants are tested at least twice for the infection event. Denote the infection time by T, where T could be the number of days, months, or years since the earliest-negative test time ( $U_{1}$ ). Participants must test negative at $U_{1}$ (hence $T > 0$ ). If L and R are the latest-negative and earliest-positive test times respectively, then $(L, R]$ is the smallest interval that brackets T. Thus, $L = \max \{U_{k} : U_{k} < T\}$ and $R = \min \{U_{k} : U_{k} \geq T\}$ . We say that T is censored into a non-zero interval such that no additional information is given on the timing of the infection event.¹⁴
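The definitions of L and R above can be illustrated with a minimal sketch. The function name `censoring_interval` and the convention of returning `None` for R when no test occurs at or after T are assumptions made for this illustration only.

```python
from bisect import bisect_left


def censoring_interval(test_times, T):
    """Return (L, R) so that (L, R] is the smallest interval bracketing T.

    test_times: sorted attended test times, with test_times[0] the first
    (negative) test. L = max{U_k : U_k < T}, R = min{U_k : U_k >= T}.
    R is None when no test occurs at or after T (illustrative convention).
    """
    idx = bisect_left(test_times, T)  # index of the first U_k >= T
    if idx == 0:
        raise ValueError("T must be later than the first (negative) test")
    L = test_times[idx - 1]
    R = test_times[idx] if idx < len(test_times) else None
    return L, R
```

For instance, with annual tests at 0, 1, 2, 3 and T = 1.3, the interval is (1, 2]; if the test at year 1 is missed, the interval widens to (0, 2].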

We use a two-stage approach to impute a value for T conditional on the participant's covariate data. For the first stage, we estimate the cumulative baseline hazard function and the effects of the covariates on the infection risk. Previously, from a proportional hazards framework, this could only be done for interval censored data with time-independent covariates.¹⁸ Zeng et al.¹⁹ have proposed nonparametric maximum likelihood estimation of a broad class of semiparametric models that allow for one or more possibly time-dependent covariates. Under the semiparametric transformation model, the cumulative hazard function for T conditional on $Z (\cdot)$ can be written as

$$Λ (t, Z) = G \left( \int_{0}^{t} e^{β^{T} Z (s)} \, d Λ (s) \right) \qquad (1)$$

where β is a d-vector of regression parameters, $Z (\cdot)$ is a d-vector of possibly time-dependent covariates, $Λ (\cdot)$ is an unknown increasing function, and $G (\cdot)$ is a strictly increasing transformation function. The choice $G (x) = x$ yields the proportional hazards model. Later, Zeng et al.¹⁸ developed an EM-type algorithm for estimating the regression parameters β and the cumulative baseline hazard $Λ (s)$ from interval censored data without imputing the failure times. Through extensive numerical studies, Zeng et al.¹⁸ demonstrated that their algorithm is fast, stable, and has good convergence properties.
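As a worked special case (a standard identity included for orientation, not an additional result of this study), take $G (x) = x$ in equation (1) with $Λ (0) = 0$. The first expression below is for a time-independent covariate vector Z. The second is for a single binary covariate with coefficient β that switches from 0 to 1 at a hypothetical change time $t_{0}$, assuming $Λ (\cdot)$ is continuous at $t_{0}$:

$$Λ (t, Z) = e^{β^{T} Z} Λ (t), \qquad Λ (t, Z) = \begin{cases} Λ (t), & t < t_{0} \\ Λ (t_{0}) + e^{β} \{Λ (t) - Λ (t_{0})\}, & t \geq t_{0} \end{cases}$$

The first expression is the usual Cox form. The second shows that only the baseline hazard increments accrued after the switch are scaled by $e^{β}$.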

For the second stage, we use β and $Λ (s)$ to impute the infection times conditional on $Z (\cdot)$ . This procedure is performed only for participants who test positive during the observation period. To make this explicit in the notation, let $i^{+} = 1, \dots, n^{+}$ index the participants who test positive and $i^{-} = 1, \dots, n^{-}$ index those who do not, where $n = n^{+} + n^{-}$ . We write the cumulative distribution function of $T_{i^{+}}$ as

$$F_{i^{+}} (t; Z_{i^{+}}, β) = \Pr \{T_{i^{+}} < t \mid Z_{i^{+}} (s), 0 \leq s < t, β\} = 1 - e^{- Λ (t; Z_{i^{+}}, β)} \qquad (2)$$

for $L_{i^{+}} < t \leq R_{i^{+}}$ . Specifically, $Λ (t; Z_{i^{+}}, β)$ is a participant-specific cumulative hazard function that is a right-continuous, non-decreasing step function having knots at $u_{1} \leq u_{2} \leq \dots \leq u_{m}$ with $u_{1} \geq 0$ and $u_{m} \leq U$ . We distinguish the participant-specific cumulative hazard function $Λ (t; Z_{i^{+}}, β)$ from $Λ (s)$ , which is the cumulative baseline hazard function. We estimate $F_{i^{+}}$ in (2) with

$${\hat{F}}_{i^{+}} (t; Z_{i^{+}}, \hat{β}) = 1 - e^{- G \left( \sum_{j : u_{j} \leq t} e^{{\hat{β}}^{T} Z_{i^{+}} (u_{j})} [\hat{Λ} (u_{j}) - \hat{Λ} (u_{j - 1})] \right)} \qquad (3)$$

where $\hat{β}$ is the regression parameter estimate of β and $\hat{Λ} (u_{j})$ is the estimate of the cumulative baseline hazard value $Λ (u_{j})$ at the jth knot. $\hat{β}$ and $\hat{Λ} (u_{j})$ are estimated from (1). Note that ${\hat{F}}_{i^{+}} (t; Z_{i^{+}}, \hat{β}) = {\hat{F}}_{i^{+}} (u_{j}; Z_{i^{+}}, \hat{β})$ for any $t \in [u_{j}, u_{j + 1})$ ; that is, the estimate in equation (3) is also a right-continuous, non-decreasing step function with knots $u_{1} \leq u_{2} \leq \dots \leq u_{m}$ . Denote the set of knots that belong to the censored interval by $ʊ_{i^{+}} = \{u_{j}, 1 \leq j \leq m : L_{i^{+}} < u_{j} \leq R_{i^{+}}\}$ . Then, we sample $T_{i^{+}}^{g}$ from the distribution induced by ${\hat{F}}_{i^{+}} (t; Z_{i^{+}}, \hat{β})$ on $ʊ_{i^{+}}$ , such that

$$\Pr (T_{i^{+}}^{g} = u_{j} \mid \hat{β}) = {\hat{F}}_{i^{+}} (u_{j}; Z_{i^{+}}, \hat{β}) - {\hat{F}}_{i^{+}} (u_{j - 1}; Z_{i^{+}}, \hat{β}) \quad \text{for } u_{j} \in ʊ_{i^{+}} \qquad (4)$$

Thus, the probability of sampling an infection time, $Pr (T_{i^{+}}^{g} = u_{j} | \hat{β})$ , within the censored interval is a function of the jump sizes of the cumulative distribution function, ${\hat{F}}_{i^{+}} (u_{j}; Z_{i^{+}}, \hat{β})$ .
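The second stage can be sketched as follows, assuming the first stage has already supplied the knots, the estimated cumulative baseline hazard at each knot, and the coefficient estimates. The function names, the argument layout, the convention that the baseline hazard before the first knot is zero, and the rescaling of the jump sizes so that they sum to one within the censored interval are assumptions of this illustration and are not taken from the IntCens software.

```python
import math
import random


def imputation_probs(knots, Lam_hat, z_at_knots, beta, L, R, G=lambda x: x):
    """Jump sizes of F_hat (equation (3)) at the knots in (L, R], rescaled to sum to 1.

    knots: u_1 <= ... <= u_m; Lam_hat[j]: estimated baseline cumulative hazard at u_j;
    z_at_knots[j]: covariate vector Z(u_j); beta: coefficient estimates.
    """
    cum = 0.0       # running sum inside G(.)
    prev_lam = 0.0  # baseline cumulative hazard before the first knot (assumed 0)
    prev_F = 0.0
    support, jumps = [], []
    for u, lam, z in zip(knots, Lam_hat, z_at_knots):
        linear = sum(b * zj for b, zj in zip(beta, z))
        cum += math.exp(linear) * (lam - prev_lam)
        F = 1.0 - math.exp(-G(cum))
        if L < u <= R:
            support.append(u)
            jumps.append(F - prev_F)  # jump size, as in equation (4)
        prev_lam, prev_F = lam, F
    total = sum(jumps)
    if total <= 0:
        raise ValueError("no probability mass inside (L, R]")
    return support, [j / total for j in jumps]


def impute_once(support, probs, rng=random):
    """Draw one imputed infection time from the discrete distribution."""
    return rng.choices(support, weights=probs, k=1)[0]
```

Calling `impute_once` repeatedly with the same `support` and `probs` yields the multiple imputations for one participant. Because the covariates enter through `z_at_knots`, two participants with the same censoring interval can receive different imputation distributions.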

2.1 Simulation study

We undertook a simulation analysis to evaluate the performance of our G-imputation model. Using the methods of Austin²⁰ and Bender et al.,²¹ we generated infection times from three Cox proportional hazards models: Model (1) with two time-independent covariates, $Z_{1} \sim Unif (0, 1)$ and $Z_{2} \sim Bernoulli (0.5)$ ; Model (2) with one time-dependent covariate only, $Z_{3} (t) = I (t \geq t_{0}) \times Bernoulli (0.3)$ , where $t_{0}$ is the time of change from an unexposed to an exposed state and $I (\cdot)$ is an indicator function; and Model (3) with one time-independent covariate ( $Z_{1}$ or $Z_{2}$ ) and one time-dependent covariate ( $Z_{3}$ ). We chose an exponential distribution for the infection events with rate $λ = 0.2$ , which resulted in a mean infection time of approximately 4 years since the earliest negative test time.
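The data-generating step can be sketched by inverting the cumulative hazard, which is the general idea behind the cited simulation methods. The coefficient values passed as `beta1`, `beta2`, and `beta3` are hypothetical, since the text does not report them, and the function names are illustrative only.

```python
import math
import random

LAMBDA = 0.2  # baseline exponential rate stated in the text


def sim_time_independent(beta1, beta2, rng):
    """Model (1): invert H(t) = LAMBDA * exp(beta'Z) * t at a standard exponential draw."""
    z1 = rng.random()                      # Z1 ~ Unif(0, 1)
    z2 = 1 if rng.random() < 0.5 else 0    # Z2 ~ Bernoulli(0.5)
    e = -math.log(1.0 - rng.random())      # standard exponential draw
    t = e / (LAMBDA * math.exp(beta1 * z1 + beta2 * z2))
    return t, z1, z2


def sim_time_dependent(beta3, t0, rng):
    """Model (2): Z3(t) = I(t >= t0) * Bernoulli(0.3), piecewise-linear cumulative hazard."""
    exposed = rng.random() < 0.3
    e = -math.log(1.0 - rng.random())
    if not exposed or e < LAMBDA * t0:
        # Infection occurs while the hazard is still at the baseline rate
        return e / LAMBDA, exposed
    # Infection occurs after the switch, where the hazard is LAMBDA * exp(beta3)
    return t0 + (e - LAMBDA * t0) / (LAMBDA * math.exp(beta3)), exposed
```

Model (3) would combine the two by multiplying both hazard pieces by the time-independent term.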

To simulate a closed incidence cohort, we set $n = 500$ and $K = 10$ for test times $U_{k}$ ( $k = 1, \dots, K$ ). We partitioned the 10-year observation period into one-year intervals with test $U_{k}$ scheduled at the end of the kth year. Let $\tilde{U} = \{U_{1}, \dots, U_{K}\}$ , with $0 = U_{1} < U_{2} < \dots < U_{K}$ forming intervals $(U_{1}, U_{2}], (U_{2}, U_{3}], \dots, (U_{K - 1}, U_{K}]$ . Write $\tilde{δ} = \{δ_{1}, \dots, δ_{K}\}$ to denote the infection event where $δ_{k} = I (\lceil T \rceil = U_{k})$ . Under the assumption that participants test at every $U_{k}$ , the full dataset is a random sample of n participants denoted by $D_{f} = ({\tilde{U}}_{i}, {\tilde{δ}}_{i}, Z_{i})$ with ${\tilde{U}}_{i} = (U_{i 1}, \dots, U_{i K_{i}})$ and ${\tilde{δ}}_{i} = (δ_{i 1}, \dots, δ_{i K_{i}})$ . We refer to $D_{f}$ as the full (or gold-standard) dataset because there is no interval censoring and the infection times are known. We right-censored the data at the infection time ( $T$ ) or the latest-negative test time ( $L$ ) if the infection time occurred after the observation period (i.e., $T > U_{K}$ ). We calculated the true incidence rate for the kth year as $θ_{k} = \sum_{i = 1}^{n} {\tilde{δ}}_{ik}$ .

In the real world, participants miss their scheduled test dates. Thus, we induced interval censoring on the infection times by varying the probability p ( $0 \leq p \leq 1$ ) that a participant attends a scheduled test. Because tests are missed with a probability that does not depend on the infection time or the covariates, the tests are missing completely at random (MCAR), and participants will have a random number ( $K_{i}$ ) of tests with $K_{i} \in \{2, \dots, 10\}$ . We generated four datasets with a high testing rate ( $0.8 < p \leq 1$ , average censored interval length ≈ 1.5 years), medium testing rate ( $0.6 < p \leq 0.8$ , average censored interval length ≈ 2.0 years), low testing rate ( $0.4 < p \leq 0.6$ , average censored interval length ≈ 2.5 years), and a very-low testing rate ( $0.25 < p \leq 0.4$ , average censored interval length ≈ 3.5 years). For each interval censored dataset, we obtained the ith participant's latest-negative and earliest-positive test times.
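A minimal sketch of how interval censoring could be induced is given below. It assumes, for illustration only, that the baseline test at time 0 is always attended, that follow-up tests are scheduled at the end of years 1 to 10, and that each follow-up test is attended independently with probability p. It does not enforce the minimum of two tests per participant, and the function name is hypothetical.

```python
import random


def observed_interval(T, p, n_years=10, rng=random):
    """Return (L, R) for infection time T when each annual test is attended w.p. p.

    R is None if no attended test falls at or after T (no positive test observed).
    """
    attended = [0] + [k for k in range(1, n_years + 1) if rng.random() < p]
    negatives = [u for u in attended if u < T]    # tests before infection
    positives = [u for u in attended if u >= T]   # tests at or after infection
    L = max(negatives)                            # latest-negative test time
    R = min(positives) if positives else None     # earliest-positive test time
    return L, R
```

With p = 1 every interval is one year wide; lowering p widens the intervals on average, which is the mechanism behind the four testing-rate scenarios.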

We next imputed the infection times, denoted by $T^{*},$ within the censoring interval, where * represents our G-imputation model ( $g)$ , the mid-point imputation ( $m)$ method, or the single random-point ( $r)$ method. For our G-imputation model, we imputed $T_{i^{+}}^{g}$ as described in equation (4); see also the Supplement. For the mid-point method, we imputed $T_{i^{+}}^{m} = \frac{L_{i^{+}} + R_{i^{+}}}{2}$ . For the random-point method, we imputed a single infection time $T_{i^{+}}^{r}$ from a uniform distribution bounded by $(L_{i^{+}}, R_{i^{+}}]$ . We right-censored the data at the imputed infection times or latest-negative tests and calculated the annual incidence rate: ${\hat{θ}}_{k}^{*} = \sum_{i = 1}^{n} {\tilde{δ}}_{ik}$ .
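The two comparison methods and the annual event count can be sketched as follows. The function names are illustrative, and the event count assumes one-year intervals with test times at whole years, as in the simulation design.

```python
import math
import random


def impute_midpoint(L, R):
    """Mid-point imputation: (L + R) / 2."""
    return (L + R) / 2.0


def impute_random_point(L, R, rng=random):
    """Single random-point imputation: one uniform draw from (L, R]."""
    # rng.random() lies in [0, 1), so the result lies in (L, R]
    return R - (R - L) * rng.random()


def events_per_year(times, n_years=10):
    """Count events with ceiling(T) = k for k = 1..n_years."""
    counts = [0] * n_years
    for t in times:
        k = math.ceil(t)
        if 1 <= k <= n_years:
            counts[k - 1] += 1
    return counts
```

Every participant with the same (L, R] receives the same mid-point value, whereas the random-point draws are spread over the interval.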

We undertook the simulation as follows:

Step 1: Generate the infection times T and the full dataset, $D_{f} = ({\tilde{U}}_{i}, {\tilde{δ}}_{i}, Z_{i})$ . Calculate $θ_{k}$ .

Step 2: Generate the interval censored datasets $D_{high}$ , $D_{medium}$ , $D_{low}$ , and $D_{very - low}$ .

Step 3: Impute $T_{i^{+}}^{m}$ once for the mid-point method and obtain ${\hat{θ}}_{k}^{m}$ . Impute $T_{i^{+}}^{g}$ once and obtain ${\hat{θ}}_{k}^{g}$ ; repeat this imputation 50 times and obtain the average of the estimates and the average of the standard errors. Do the same for $T_{i^{+}}^{r}$ and ${\hat{θ}}_{k}^{r}$ .

Step 4: Repeat Steps 1 to 3 a total of 500 times and obtain the empirical average of ${\hat{θ}}_{k}^{*}$ and the empirical standard deviation $S D_{{\hat{θ}}_{k}^{*}}$ ; also obtain the empirical average of the standard errors $S E_{{\hat{θ}}_{k}^{*}}$ . Calculate the mean-squared error (MSE) for the three imputation approaches and the four interval censored datasets.

2.2 Empirical study

To empirically demonstrate our G-imputation model, we used interval censored data from the Africa Health Research Institute (AHRI), located in the KwaZulu-Natal province of South Africa. Since 2004, field-workers have visited households every 12 months to identify eligible participants for HIV testing. Prior to 2007, women aged 15–49 years and men aged 15–54 years were included in the HIV survey, after which eligibility was extended to cover all residents aged >15 years. Approximately 80% of participants agreed to at least one HIV test between 2005 and 2015. Details of the surveillance system and the open HIV incidence cohort are provided elsewhere.²²

We selected sex as a time-independent covariate, and four time-dependent covariates: age (mean-centered), age-squared (mean-centered), marital status, and the HIV prevalence of the participant's surrounding community. We constructed the time-varying community HIV prevalence measure using a geospatial method from previous work.²³ Previous research has shown that the five covariates are strongly associated with the risk of HIV acquisition in our study area.^{12,13,24–26} We obtained estimates of β and $Λ (\cdot)$ with IntCens software.²⁷ After imputing the infection times and right censoring the data, we calculated the incidence rate estimates and their standard errors for each year of the observation period. We used Rubin's rules to obtain the 95% confidence intervals for the G-imputation and the single random-point methods. We undertook all computations in R (version 3.4.4).
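The pooling step can be sketched with the standard form of Rubin's rules. This is a generic illustration rather than the authors' R code; the function name is hypothetical, and the returned degrees of freedom are the classical large-sample value that would be used with a t reference distribution to form the confidence interval.

```python
from statistics import mean, variance


def rubin_pool(estimates, std_errors):
    """Pool M imputation-specific estimates and standard errors with Rubin's rules.

    Returns (pooled estimate, pooled standard error, degrees of freedom).
    """
    M = len(estimates)
    q_bar = mean(estimates)                    # pooled point estimate
    W = mean(se ** 2 for se in std_errors)     # within-imputation variance
    B = variance(estimates)                    # between-imputation variance (M - 1 denominator)
    total = W + (1 + 1 / M) * B                # total variance
    if B > 0:
        df = (M - 1) * (1 + W / ((1 + 1 / M) * B)) ** 2
    else:
        df = float("inf")
    return q_bar, total ** 0.5, df
```

For each calendar year, `estimates` would hold the incidence rate from each imputed dataset and `std_errors` the corresponding standard errors.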

3 Results

3.1 Simulation analysis

Table 1 shows the simulation MSE results for the single random-point and mid-point imputation methods. The MSE results are compared with estimates from a G-imputation model with two time-independent covariates (Figure S1), one time-dependent covariate (Figure S2), and one time-dependent and one time-independent covariate (Figure 1). Results show that the G-imputation model was more accurate for a high, medium, low, and very-low testing rate, and produced smaller MSEs when compared with the single random-point and mid-point methods (see Figure 1 and Table S1 of the Supplement). The empirical average and empirical standard deviation of the estimates and the empirical average of the standard error estimates are shown in Table S1 of the Supplement.

Figure 1.

Simulation study results showing the estimated number of new infections from our G-imputation model with one time-independent and one time-dependent covariate. Results are compared with the single random-point and mid-point imputation approaches under a high (80–100%), medium (60–80%), low (40–60%), and very-low (25–40%) testing rate. IC is the average length of the censored interval. (a) High testing rate, IC ≈ 1.5 years. (b) Medium testing rate, IC ≈ 2.0 years. (c) Low testing rate, IC ≈ 2.5 years. (d) Very-low testing rate, IC ≈ 3.5 years.

Table 1.

Mean-squared errors for the simulation study, which compares our G-imputation model with the single random-point and mid-point imputation approaches under a high (80–100%), medium (60–80%), low (40–60%), and very-low (25–40%) testing rate.

Monitoring rate	High	Medium	Low	Very-low
Censored interval length	≈1.5 years	≈2 years	≈2.5 years	≈3.5 years
G-model: two time-independent covariates	0.25	0.54	0.74	1.29
Random-point	1.10	3.51	4.64	5.66
Mid-point	2.83	4.96	7.96	11.19
G-model: one time-dependent covariate	0.15	0.53	0.69	1.18
Random-point	0.90	3.00	4.03	5.01
Mid-point	2.81	4.45	7.37	10.61
G-model: one time-dependent and one time-independent covariate	0.13	0.40	0.76	1.27
Random-point	0.98	3.04	4.09	5.02
Mid-point	2.70	4.50	7.31	10.59

3.2 Empirical analysis

An empirical analysis of the AHRI data shows that there were 2898 (15.5%) seroconversion events among the 18,670 repeat-testers between 2005 and 2016. The mean observation time was 4.46 (standard deviation [SD]: 3.24) years and the median length of the censored interval was 2.78 (interquartile range [IQR]: 1.27–5.04) years. Results for the semi-parametric G-transformation model are shown in Table 2: all covariates except mean-centered age were statistically associated with the time to infection (at the 0.001 level). Figure 2 and Table S2 show the incidence rate estimates, standard errors, and 95% confidence intervals for the G-imputation model, single random-point, and mid-point imputation methods. The G-imputation model results show that the annual HIV incidence rate increased from 2.62 events per 100 person-years in 2005 to 3.92 events per 100 person-years in 2010, and then remained stable at around 4 events per 100 person-years from 2011 to 2016. Estimates from the single random-point method are close to the G-model estimates from 2005 to 2013, with a slight decline in the incidence rate in the last three years of the observation period. The mid-point method concentrates the imputed infection events at the middle of the observation period and therefore shows an increase and then decrease in the incidence rate over time.⁴

Figure 2.

Annual HIV incidence rate (2005–2016) estimates from (a) our G-imputation model, (b) single random-point imputation, and (c) mid-point imputation using interval censored data from the AHRI surveillance area. The solid line represents the smoothed incidence rate estimates with 95% confidence intervals and the crosses are the estimated incidence rates at each year. (a) G-model imputation. (b) Single random-point imputation. (c) Mid-point imputation.

Table 2.

G-transformation model results for the empirical analysis. Exponentiated proportional hazards ratios (HR) and 95% confidence intervals for the risk of HIV acquisition conditional on age, sex, marital status, and community HIV prevalence.

Covariate	HR	Lower 95% CI	Upper 95% CI	P-value
Age (mean-centered)	0.99	0.92	1.08	0.851
Age-squared (mean-centered)	0.74	0.70	0.78	<0.001
Female vs. male	2.13	1.96	2.32	<0.001
Married vs. unmarried	0.64	0.56	0.72	<0.001
Community HIV prevalence	1.24	1.19	1.30	<0.001

The interval censored data are from a cohort of 18,673 participants who were annually tested for HIV between 2005 and 2016. All covariates are time-dependent except for sex. HRs represent a 10-year increase in age and age-squared and a 10% increase in the HIV prevalence of the participant's surrounding community.

4 Discussion

We have described a multiple imputation approach for estimating the incidence rate with interval censored data and time-dependent covariates. Interval censoring arises when the infection event is known only to occur between the latest-negative and earliest-positive test times, which is an unavoidable problem in incidence cohorts with periodic testing. Previous research has shown that ad hoc methods for imputing the infection time can lead to artefactual trends in the incidence rate once the censoring interval extends across one or more aggregating intervals.⁴ To reduce bias, we propose a two-stage multiple imputation approach. First, we estimate the effects of covariates (time-dependent and/or time-independent) on the interval censored infection times using a semi-parametric G-transformation model proposed by Zeng et al.¹⁸ Second, we use the estimated regression parameters and the estimated cumulative baseline hazard to derive a unique cumulative distribution function from the participant's covariate values, from which we multiply impute the infection times. As far as we know, our G-imputation model is the first to use one or more time-dependent covariates to estimate trends in the incidence rate.

We undertook a simulation analysis to assess the performance of our G-imputation model. To do this, we generated infection times from covariate data with a 100% testing rate (the gold-standard). We reduced the testing rate to create interval censored data and then imputed the infection times using three G-models: one with two time-independent covariates, one with one time-dependent covariate, and one with one time-dependent and one time-independent covariate. For comparison, we imputed mid-point and single random-point infection times and estimated the incidence rates for the three methods. Simulation results show that the G-imputation model was accurate for time-dependent and/or time-independent covariates under a high, medium, low, and very-low testing rate. G-model imputation also produced smaller MSEs than the mid-point and single random-point imputation approaches. The mid-point approach performs poorly when tests are missed; and while single random-point imputation is comparatively more accurate,⁴ it assumes that the hazard of infection is independent of covariates. Previously, Hsu et al.,¹⁵ Pan,¹⁶ and Chen and Sun¹⁷ used a proportional hazards approach to infer the effects of time-independent covariates on the interval censored failure times. Using the recent work of Zeng et al.,^18,19 we extend this approach to infer the infection times as a function of time-dependent covariates.

Our G-imputation model treats the unobserved infection time, T, as a missing data point for which a value is imputed conditional on the participant's time-dependent and/or time-independent covariate values, Z. However, from the perspective of the G-transformation model proposed by Zeng et al.,¹⁸ missing data assumptions are not required for T. The G-transformation model assumes that the number of test times, denoted by K, is random and that there exists a random sequence of test times, $U_{1} < \dots < U_{K}$ , and $\tilde{U_{i}} = (U_{i 1}, U_{i 2}, \dots, U_{iK})$ for participants $i = 1, \dots, n$ . Further, it is supposed that ( $\tilde{U}, K)$ are independent of T conditional on Z. Zeng et al.¹⁸ write that the sequence of test times “may not be completely observed and, in fact, need not be for the purpose of inference. We only need to know the values of $L_{i}$ [latest-negative time] and $R_{i}$ [earliest-positive time] since the other [test] times do not contribute to the likelihood.” Nevertheless, assumptions are needed for missing data on $\tilde{U}$ , possibly missing covariate values at L and R, and for missing data due to self-selection into the study cohort.

For our G-imputation approach, we assume that the tests $\tilde{U}$ are missing at random (MAR). By MAR, we mean that the probability of missing data on $\tilde{U}$ is unrelated to $\tilde{U}$ after adjusting for covariates in the analysis.²⁸ This assumption applies to intermittent missingness, when the testing sequence has gaps (as illustrated in section 1), or monotone missingness, when the sequence of tests terminates prematurely (i.e., drop-out), or possibly a mixture of both.^29,30 Covariate data could be associated with patterns of missed tests in several ways. For example, in sub-Saharan Africa, higher HIV acquisition risk is associated with mobile and employed men,²⁵ who are more likely to miss their scheduled test dates. Or, young women may drop out of the study cohort for fear of AIDS-related stigma.³¹ If sex, age, mobility, and stigma covariates exist, then they can explain systematic differences between the observed and missing tests and the data can be treated as missing completely at random (MCAR).^32–34 However, if the pattern of missed tests is beyond the researcher's control, or if the distribution of missingness is unknown, then MAR is only an assumption.³⁵ MAR is often difficult to verify in practice.²⁸ The researcher should therefore carefully review the nature of the missing data and consider the plausibility of missing data assumptions.

Our G-imputation approach requires that covariate values exist at L and R, where (L, R] is the smallest interval that brackets T. If L and R exist, then the G-transformation method proposed by Zeng et al.¹⁸ has several options for extrapolating the time-dependent covariate values across the interval (L, R]. For our analyses, we chose the default extrapolation setting, which is to use the participant's covariate value nearest to $R_{i}$ . However, participants may sometimes refuse to provide their covariate data at the time of the infection test. If the covariate values are MCAR, then listwise deletion (i.e., analysis of the available data only) can be used provided the G-transformation model assumptions are satisfied. The reduced sample will be a random subsample of the original sample with unbiased estimates and slightly larger standard errors.²⁸ Even if the covariates are missing not at random (MNAR), it has been shown, theoretically and by simulation, that listwise deletion can give unbiased estimates for all or some of the parameters.^28,36–38 Alternatively, the next closest L or R with a non-missing covariate value could be selected, but this choice will increase uncertainty about the unobserved infection time⁴ and lead to slightly larger standard errors (see Table S1). Again, the nature of the missing data will need to be carefully reviewed by the researcher.

Missing data on $\tilde{U}$ can also be the result of participants self-selecting into the study cohort. Individuals who are at high risk of infection, for example, may refuse to participate and test for the infection event. Further, the inclusion criteria of two or more tests (with a first negative result) may exclude individuals who are both at a higher risk of infection and unlikely to test regularly. In these two cases, the incidence rate will be biased downward since a non-zero proportion of high-risk individuals are missing from the study cohort. Given self-selection, the analysis of the available data under either MAR or MNAR assumptions could lead to biased estimates.³⁹ The researcher should address potential issues pertaining to selection bias prior to designing the study and collecting the data.³²

Our G-imputation approach has some limitations. First, covariate data are required, which may be unavailable or too costly to collect. In this case, single random-point imputation is a suitable candidate for estimating trends in the incidence rate.⁴ Second, the baseline cumulative distribution function could have large jumps due to data sparseness, which is most likely to occur at the end of the observation period where there is attrition or testing fatigue. As proposed by Hsu et al.,¹⁵ a smoothing function could be fit to the cumulative distribution function to reduce the jump sizes. Third, our imputation approach depends on assumptions that are relevant to the proportional hazards framework, since the semi-parametric G-transformation model of Zeng et al.¹⁸ is used to estimate the effects of the covariates on the interval censored infection times. A key assumption is that the model is correctly specified. We advise the careful selection of covariates based on existing knowledge of the study sample or on findings from studies undertaken in comparable contexts. In our empirical example, we selected covariates based on previous studies in the AHRI surveillance area showing that age and sex,^12,13,23,25 marital status,^10,25,40 and community HIV prevalence²³ are significantly associated with the time to HIV acquisition (see Table 2).

We acknowledge that several interval censoring methods have been developed within the proportional hazards framework.^41–47 However, there is no clear guidance on how these methods can be used to estimate the incidence rate, and in which situations they should be applied.⁴⁸ We are aware of one method for incidence rate estimation with interval censored data; however, covariate data were not used.⁴⁹ In this study, we have proposed a method for inferring the time to infection conditional on one or more time-dependent covariates. Our G-imputation model therefore contributes to an emerging methodological framework for estimating the incidence rate with interval censored data.

Supplemental Material

Supplemental Material for Estimating trends in the incidence rate with interval censored data and time-dependent covariates by Alain Vandormael, Frank Tanser, Diego Cuadros and Adrian Dobra in Statistical Methods in Medical Research

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by NIH grants (R01HD084233 and R01AI124389) from the National Institute of Child Health and Human Development (NICHD). Funding for the Demographic Surveillance Information System and Population-based HIV Survey was received from the Wellcome Trust, UK. FT and AV were supported by South African MRC Flagship (MRC-RFA-UFSP-01–2013/UKZN HIVEPI). FT was supported by a UK Academy of Medical Sciences Newton Advanced Fellowship (NA1501061).

Supplemental material

Supplemental material for this article is available online.

References

1. UNAIDS. UNAIDS quarterly update on HIV epidemiology (Q1). Geneva: Joint United Nations Programme on HIV/AIDS, 2010.

2. Larmarange, Mossong, Bärnighausen, et al. Participation dynamics in population-based longitudinal HIV surveillance in rural South Africa. PLoS One 2015; 10: e0123345.

3. Cawley, Wringe, Isingo, et al. Low rates of repeat HIV testing despite increased availability of antiretroviral therapy in rural Tanzania: findings from 2003–2010. PLoS One 2013; 8: e62212.

4. Vandormael, Dobra, Bärnighausen, et al. Incidence rate estimation, periodic testing and the limitations of the mid-point imputation approach. Int J Epidemiol 2018; 47: 236–345.

5. Lindsey, Ryan. Methods for interval-censored data. Stat Med 1998; 17: 219–238.

6. Odell, Anderson, D'Agostino. Maximum likelihood estimation for interval-censored data using a Weibull-based accelerated failure time model. Biometrics 1992; 48: 951–959.

7. Dorey, Little RJA, Schenker. Multiple imputation for threshold-crossing data with interval censoring. Stat Med 1993; 12: 1589–1603.

8. Law, Brookmeyer. Effects of mid-point imputation on the analysis of doubly censored data. Stat Med 1992; 11: 1569–1578.

9. Jianguo. The statistical analysis of interval-censored failure time data. Germany: Springer Science and Business Media, 2007.

10. Bärnighausen, Hosegood, Timaeus, et al. The socioeconomic determinants of HIV incidence: evidence from a longitudinal, population-based study in rural South Africa. AIDS 2007; 21(Suppl 7): S29–38.

11. Tanser, Bärnighausen, Grapsa, et al. High coverage of ART associated with decline in risk of HIV acquisition in rural KwaZulu-Natal, South Africa. Science 2013; 339: 966–971.

12. Vandormael, Newell M-L, Bärnighausen, et al. Use of antiretroviral therapy in households and risk of HIV acquisition in rural KwaZulu-Natal, South Africa, 2004–12: a prospective cohort study. Lancet Glob Heal 2014; 2: e209–215.

13. Akullian, Bershteyn, Klein, et al. Sexual partnership age-pairings and risk of HIV acquisition in rural South Africa: a population-based cohort study. AIDS 2017; 31: 1755–1764.

14. Zhang Z, Sun J. Interval censoring. Stat Methods Med Res 2010; 19(1): 53–70.

15. Hsu, Taylor JMG, Murray, et al. Multiple imputation for interval censored data with auxiliary variables. Stat Med 2007; 26: 769–781.

16. Pan. A multiple imputation approach to Cox regression with interval-censored data. Biometrics 2000; 56: 199–203.

17. Chen, Sun. A multiple imputation approach to the analysis of interval-censored failure time data with the additive hazards model. Comput Stat Data Anal 2010; 54: 1109–1116.

18. Zeng, Mao, Lin. Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika 2016; 103: 253–271.

19. Zeng, Lin. Efficient estimation of semiparametric transformation models for counting processes. Biometrika 2006; 93: 627–640.

20. Austin. Generating survival times to simulate Cox proportional hazards models with time-varying covariates. Stat Med 2012; 31: 3946–3958.

21. Bender, Augustin, Blettner. Generating survival times to simulate Cox proportional hazards models. Stat Med 2005; 24: 1713–1723.

22. Tanser, Hosegood, Bärnighausen, et al. Cohort profile: Africa centre demographic information system (ACDIS) and population-based HIV survey. Int J Epidemiol 2008; 37: 956–962.

23. Tanser, Vandormael, Cuadros, et al. Effect of population viral load on prospective HIV incidence in a hyper-endemic rural South African community: a population-based cohort study. Sci Transl Med 2017; 9: eaam8012.

24. Zaidi, Grapsa, Tanser, et al. Dramatic increases in HIV prevalence after scale-up of antiretroviral treatment: a longitudinal population-based HIV surveillance study in rural KwaZulu-Natal. AIDS 2013; 27: 2301.

25. Dobra, Bärnighausen, Vandormael, et al. Space-time migration patterns and risk of HIV acquisition in rural South Africa. AIDS 2017; 31: 137–145.

26. Tomita, Vandormael, Bärnighausen, et al. Social disequilibrium and the risk of HIV acquisition: a multilevel study in rural KwaZulu-Natal, South Africa. JAIDS 2017; 75: 164–174.

27. Zeng D, Mao L, Lin DY. Danyu Lin's Homepage: IntCens. http://dlin.web.unc.edu/software/intcens/ (accessed 22 May 2018).

28. Paul. Missing data. London: Sage, 2002.

29. Diggle, Kenward. Informative drop-out in longitudinal data analysis. Appl Stat 1994; 43: 49.

30. Wærsted, Børnick, Twisk JWR, et al. Simple descriptive missing data indicators in longitudinal studies with attrition, intermittent missing data and a high number of follow-ups. BMC Res Notes 2018; 11: 123.

31. Visser, Makin, Vandormael, et al. HIV/AIDS stigma in a South African community. AIDS Care 2009; 21: 197–206.

32. Bell, Fairclough. Practical and statistical issues in missing data for longitudinal patient-reported outcomes. Stat Methods Med Res 2014; 23: 440–459.

33. Little RJA. Modeling the drop-out mechanism in repeated-measures studies. J Am Stat Assoc 1995; 90: 1112–1121.

34. Bhaskaran, Smeeth. What is the difference between missing completely at random and missing at random? Int J Epidemiol 2014; 43: 1336–1339.

35. Schafer, Graham. Missing data: our view of the state of the art. Psychol Methods 2002; 7: 147–177.

36. White, Carlin. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med 2010; 29: 2920–2931.

37. Moreno-Betancur, Lee, Leacy, et al. Canonical causal diagrams to guide the treatment of missing data in epidemiologic studies. Am J Epidemiol 2018; 187: 2705–2715.

38. Collins, Schafer, Kam C-M. A comparison of restrictive strategies in modern missing data procedures. Psychol Methods 2001; 6: 330–351.

39. Westreich. Berkson's bias, and selection bias, and missing data. Epidemiology 2012; 23: 159–164.

40. Ott, Bärnighausen, Tanser, et al. Age-gaps in sexual partnerships: seeing beyond ‘sugar daddies’. AIDS 2011; 25: 861–863.

41. Boruvka, Cook. A Cox-Aalen model for interval-censored data. Scand J Stat 2015; 42: 414–426.

42. Seaman, Bird. Proportional hazards model for interval-censored failure times and time-dependent covariates: application to hazard of HIV infection of injecting drug users in prison. Stat Med 2001; 20: 1855–1870.

43. Goggins, Finkelstein. A proportional hazards model for multivariate interval-censored failure time data. Biometrics 2000; 56: 940–943.

44. Kooperberg, Clarkson. Hazard regression with interval-censored data. Biometrics 1997; 53: 1485–1494.

45. Joly, Commenges, Helmer, et al. A penalized likelihood approach for an illness–death model with interval-censored data: application to age-specific incidence of dementia. Biostatistics 2002; 3: 433–443.

46. Bebchuk, Betensky. Multiple imputation for simple estimation of the hazard function based on interval censored data. Stat Med 2000; 19: 405–419.

47. Sun, Kim, Sun. Regression analysis of doubly censored failure time data using the additive hazards model. Biometrics 2004; 60: 637–643.

48. Singh, Totawattage. The statistical analysis of interval-censored failure time data with applications. Open J Stat 2013; 3: 155–166.

49. Deng, Diggle, Cheesbrough. Estimating incidence rates using exact or interval-censored data with an application to hospital-acquired infections. Stat Med 2012; 31: 963–977.
