Regression analysis of misclassified current status data with potentially unknown test accuracy

Abstract

Current status data are frequently encountered in many real life cross-sectional epidemiological, demographic, and medical studies, where each subject is examined only once, and the failure time of interest is never exactly observed but known to be either smaller or larger than the examination time for each subject by evaluating the failure status. Consequently, current status data are a mixture of left-censored and right-censored observations for the failure times of all subjects with or without covariates. In some real life studies, the test or diagnosis that is used to determine the failure status may be error-prone, and this leads to misclassified failure status for some or all subjects. The resulting data are referred to as misclassified current status data in the literature. In this paper, we study regression analysis of misclassified current status data and propose a novel estimation approach under the proportional odds model. Specifically, monotone splines are adopted to approximate the baseline odds function, and an efficient expectation-maximization algorithm is developed based on a data augmentation involving exponential and Poisson latent variables. An extension of the proposed method is also developed to account for the unknown test accuracy. The proposed method is shown to have excellent estimation performance in our simulation studies and is illustrated by an application to uterine fibroid data.

Keywords

Current status data, EM algorithm, misclassification, monotone spline, proportional odds model

1. Introduction

Current status data frequently arise in cross-sectional studies in many fields such as epidemiology, demography, and clinical trials. In such studies, subjects are observed only once because of limited resources or the study design, and the failure time of interest is never exactly observed but is known to be smaller or larger than the examination time depending on the failure status of each subject. Thus, the failure time is either left-censored or right-censored for each subject, and this type of data, a mixture of left-censored and right-censored observations of the failure times, is referred to as current status data or case I interval-censored data in the literature.¹

In addition, in many such studies, the test used for checking failure status (e.g. disease diagnosis) is imperfect and subject to error, which leads to inaccurate or misclassified failure (e.g. disease) status. For example, the Iowa Hygienic Laboratory collected urine specimens and screened for chlamydia and reported a sensitivity of 0.947 and a specificity of 0.989 for the Aptima Combo 2 Assay.² Similarly, viral disease antibody tests, such as those for HIV or HPV, may yield false-negative results if conducted shortly after infection.³ In our motivating data from the Right from the Start (RFTS) study, pregnant women were given an ultrasound examination to test for the existence of fibroids. In a substudy of RFTS, the sonographers did not receive enough training and missed some fibroids due to inaccurate measuring.^4,5 In all these situations, the failure status is misclassified and inaccurate, leading to misclassified current status data.^3,6,7 This type of data adds ambiguity and complication to conventional current status data and is thus more challenging to analyze. In this article, we study regression analysis of misclassified current status data, with the goal of estimating the covariate effects on the failure time as well as the survival functions for different subgroups.

There are numerous research works on regression analysis of current status data in the literature. Examples of existing works on univariate current status data include Huang⁸ and McMahan et al.⁹ based on the proportional hazards (PH) model, Rossini and Tsiatis¹⁰ and Wang and Dunson⁵ using the proportional odds (PO) model, Lin et al.¹¹ and Martinussen and Scheike¹² under the additive hazards (AH) model, Tian and Cai¹³ under the accelerated failure time model, and Sun and Sun¹⁴ and Lu et al.¹⁵ under the linear transformation models, among many others. There are also many approaches developed for current status data with additional complications, including Lam and Xue¹⁶ and Ma¹⁷ considering a cured subgroup, Zhang et al.¹⁸ and Ma et al.¹⁹ on informative censoring, and Dunson and Dinse²⁰ and Chen et al.²¹ on multiple failure times, among others. More recently, Yu et al.²² proposed a sieve maximum likelihood approach based on Bernstein polynomials for informatively censored current status data, where the failure time and the correlated censoring time are assumed to follow jointly a Copula model and marginally linear transformation models, and Zhang et al.²³ proposed a graphical proportional hazards model with undirected Markov Random Field for left-truncated current status data subject to informative censoring. It is worth noting that most existing approaches for analyzing interval-censored data can be also used to analyze current status data since current status data are just a special case of general interval-censored data.

Research on misclassified current status data is relatively limited. McKeown and Jewell³ were the first to study such data, proposing an adjusted pool-adjacent-violators algorithm for the nonparametric maximum likelihood estimate (NPMLE) in the case of known sensitivity and specificity and further extending their method to handle time-dependent sensitivity and specificity and regression analysis in the presence of covariates. Sal y Rosas and Hughes²⁴ proposed a modified iterative convex minorant (MICM) algorithm for the NPMLE in the one-sample problem, established a hypothesis test based on the MICM estimators for a two-sample problem, and developed an estimation approach for regression analysis of such data under the PH model. Wang and Dunson⁵ developed a fully Bayesian approach for analyzing such data under the PO model and extended their approach to allow an unknown misclassification rate. More recently, Li et al.⁶ studied regression analysis of such data under linear transformation models, Fang et al.²⁵ under the probit model, and Wang et al.⁷ under the AH model, all producing efficient expectation-maximization (EM) algorithms based on well-calibrated data augmentations. Under the AH model with time-dependent covariates, Li et al.²⁶ introduced a simulation-extrapolation (SIMEX) approach, in which they estimated parameters using simulated data with progressively amplified misclassification levels and then extrapolated the estimated results back to the case without misclassification. Except for Wang and Dunson,⁵ all the approaches in the aforementioned papers were developed under the assumption that the test accuracy is known.

In this article, we develop a computationally efficient estimation approach under the PO model. Although Li et al.⁶ studied the same topic under the linear transformation model, which includes the PO model as a special case, their work approximates the unspecified baseline odds function with a piecewise constant function, which involves a large number of parameters and thus limits the computational efficiency. In contrast, we adopt monotone splines to approximate the baseline odds function with far fewer parameters while retaining adequate modeling flexibility. Our proposed method via an EM algorithm has several computational advantages: it is robust to the initial values, converges quickly, is easy to implement, and allows simple and direct calculation of the variance estimate. Moreover, the proposed approach is extended to handle the cases where the test accuracy, such as the sensitivity and specificity, may be unknown. Both versions of the proposed method show excellent performance in our simulation studies, and they are further illustrated in a real application to uterine fibroid data, which motivated our study.

The rest of the paper is organized as follows. Section 2 introduces the data structure, observed likelihood, PO model, and monotone splines. Section 3 presents a data augmentation and the detailed derivation of our EM algorithm when the sensitivity and specificity of the test are known. Section 4 generalizes our method to the case of unknown sensitivity and specificity of the test. Section 5 evaluates and compares the proposed method with some existing methods in a simulation study, and Section 6 provides an illustration with fibroid data analysis. Section 7 gives some discussions.

2. Data, model, and likelihood

2.1. The observed data and likelihood

Consider a cross-sectional study, where each subject is examined only once. Let $T$ denote the failure time of interest, which is never observed directly in the study. Let $C$ denote the examination or censoring time and $X$ a $p \times 1$ vector of covariates. Define $Δ = I (T \leq C)$ as the censoring indicator or the true failure status of $T$ at the examination time $C$. If $Δ$ were available for all subjects, one would have conventional current status data available for the failure time. Let $S (t | X)$ be the survival function of the failure time $T$ given covariates $X$. In this article, we assume that the failure time $T$ and censoring time $C$ are conditionally independent given covariates $X$. Further, the distribution of the censoring time does not contain any parameters associated with the failure time model.

In this article, we consider a general situation where neither the failure time $T$ nor its true status $Δ$ is observed due to the use of an imperfect test or diagnosis. Let $Y$ be the test outcome or the observed failure status at the examination time $C$. Let $α = P (Y = 1 ∣ Δ = 1)$ and $β = P (Y = 0 ∣ Δ = 0)$ denote the sensitivity and specificity of the test, respectively. It is assumed that the observed failure status (i.e. the test result) is independent of covariates and the censoring time given the true failure status. In other words, the sensitivity and specificity of the test are constants and are not affected by the covariates or the censoring time.

Suppose that there are $n$ independent subjects in the study, and let $\{(T_{i}, C_{i}, X_{i}, Δ_{i}, Y_{i}), i = 1, 2, \dots, n\}$ be i.i.d. realizations of $(T, C, X, Δ, Y)$. Neither the failure times $T_{i}$'s nor their true status $Δ_{i}$'s are observed, and the observed data are $D = \{(Y_{i}, C_{i}, X_{i}), i = 1, 2, \dots, n\}$. This type of survival data are referred to as misclassified current status data in the literature.^6,25 The resulting observed likelihood, based on the observed data $D$, takes the following form

$$L_{obs} = \prod_{i = 1}^{n} \{α - (α + β - 1) S (C_{i} ∣ X_{i})\}^{Y_{i}} \{1 - α + (α + β - 1) S (C_{i} ∣ X_{i})\}^{1 - Y_{i}}$$

by dropping some multiplicative terms that involve the densities of $C_{i}$. The two multiplicative terms in this observed likelihood are the two conditional probabilities $P (Y_{i} = 1 ∣ C_{i}, X_{i})$ and $P (Y_{i} = 0 ∣ C_{i}, X_{i})$, and their forms are derived using the law of total probability.
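As a small numerical illustration of these two factors, the sketch below evaluates the positive-test probability and the observed log-likelihood for given survival probabilities. The function names `prob_positive` and `observed_loglik` and the example inputs are assumptions made for illustration only; they are not taken from the authors' code.

```python
import math

def prob_positive(alpha, beta, surv):
    """P(Y = 1 | C, X) = alpha - (alpha + beta - 1) * S(C | X)."""
    return alpha - (alpha + beta - 1.0) * surv

def observed_loglik(y, surv, alpha, beta):
    """Observed log-likelihood, up to terms involving the density of C.

    y    : list of 0/1 test results
    surv : list of S(C_i | X_i) values, each strictly between 0 and 1
    """
    total = 0.0
    for yi, si in zip(y, surv):
        p1 = prob_positive(alpha, beta, si)
        # Each subject contributes log P(Y_i = y_i | C_i, X_i).
        total += math.log(p1) if yi == 1 else math.log(1.0 - p1)
    return total

# With a perfect test (alpha = beta = 1) the probability reduces to 1 - S.
print(prob_positive(1.0, 1.0, 0.7))
# With an imperfect test the probability is pulled toward the error rates.
print(prob_positive(0.95, 0.90, 0.7))
```

Setting `alpha = beta = 1` recovers the usual current status likelihood, which is a convenient sanity check on any implementation.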

2.2. PO model and monotone splines

In this paper, we assume that the failure time follows the PO model motivated by its appealing properties. First, the PO model is a popular and flexible semiparametric model in the survival literature. It allows the baseline odds function to be completely unspecified. Second, the covariates have a proportional effect on the odds of failure at any specific time, and each regression parameter can be interpreted as the log odds ratio of failure due to 1 unit increase in the corresponding covariate while holding all others at fixed levels. This nice interpretation is well accepted by practitioners who are not familiar with survival analysis. Under the PO model, the survival function is

$$S (t | X) = \{1 + Λ_{0} (t) \exp (X^{'} θ)\}^{- 1},$$

where $Λ_{0} (t) = \{1 - S (t ∣ X = 0)\} / S (t | X = 0)$ is the baseline odds function at time $t$.

The baseline odds function $Λ_{0} (\cdot)$ is an unspecified non-negative non-decreasing function, which contributes a semiparametric nature to the PO model. While this is flexible, its infinite dimension has brought a great challenge in estimation both theoretically and computationally. For the purpose of dimension reduction, we adopt the monotone splines of Ramsay²⁷ for $Λ_{0} (\cdot)$ following McMahan et al.⁹ and Wang and Wang,²⁸

$$Λ_{0} (t) = \sum_{l = 1}^{L} γ_{l} I_{l} (t),$$

where $I_{l} (\cdot)$'s are the integrated spline basis functions, and $γ_{l}$'s are non-negative spline coefficients. These spline basis functions are essentially non-decreasing piecewise polynomials ranging from 0 to 1.

To construct the basis functions, one needs to specify the knots and the degree. The degree controls the overall smoothness of the basis functions, with 1 for linear, 2 for quadratic, and 3 for cubic functions. The knots and their placement largely determine the shapes of the basis functions. These basis functions are obtained based on an iterative algorithm once the knots and degree are specified.²⁷ The number of basis functions $L$ is determined by $L = m + d - 2$ , where $m$ is the total number of knots and $d$ is the degree. Intuitively, taking a large number of knots yields more modeling flexibility but may cause overfitting in addition to a greater computational burden.
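To make the construction concrete, the sketch below builds the integrated spline basis by numerically integrating M-splines defined through Ramsay's recursion. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, the knots are assumed to be sorted and distinct with the first and last acting as boundary knots, and the integral is computed with three-point Gauss-Legendre quadrature on each knot interval (exact for the polynomial pieces when the degree is at most 6).

```python
import math

# Three-point Gauss-Legendre nodes and weights on [-1, 1].
_GAUSS3 = [(-math.sqrt(0.6), 5.0 / 9.0), (0.0, 8.0 / 9.0), (math.sqrt(0.6), 5.0 / 9.0)]

def m_spline(x, i, k, t):
    """M-spline M_i(x | k) of order k on the extended knot sequence t."""
    if k == 1:
        if t[i] <= x < t[i + 1]:
            return 1.0 / (t[i + 1] - t[i])
        return 0.0
    width = t[i + k] - t[i]
    if width == 0.0:
        return 0.0  # empty support caused by repeated boundary knots
    left = (x - t[i]) * m_spline(x, i, k - 1, t)
    right = (t[i + k] - x) * m_spline(x, i + 1, k - 1, t)
    return k * (left + right) / ((k - 1) * width)

def i_spline_basis(x, knots, degree):
    """Return [I_1(x), ..., I_L(x)] with L = len(knots) + degree - 2."""
    a, b = knots[0], knots[-1]
    x = min(max(x, a), b)
    # Repeat each boundary knot so that it appears `degree` times in total.
    t = [a] * (degree - 1) + list(knots) + [b] * (degree - 1)
    n_basis = len(knots) + degree - 2
    # Integrate piece by piece so each piece is a single polynomial.
    breaks = [u for u in knots if u < x] + [x]
    values = []
    for i in range(n_basis):
        area = 0.0
        for lo, hi in zip(breaks[:-1], breaks[1:]):
            half, mid = 0.5 * (hi - lo), 0.5 * (hi + lo)
            area += half * sum(w * m_spline(mid + half * z, i, degree, t)
                               for z, w in _GAUSS3)
        values.append(area)
    return values

knots = [0.0, 1.0, 2.0, 3.0]   # m = 4 knots, including the two boundary knots
basis = i_spline_basis(1.5, knots, 3)
print(len(basis))               # L = m + d - 2 basis functions
print([round(v, 4) for v in basis])
```

Each basis function rises from 0 at the lower boundary to 1 at the upper boundary, so any non-negative combination $\sum_{l} γ_{l} I_{l} (t)$ is automatically non-decreasing.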

It is well documented in the literature that the estimation performance of existing methods based on monotone splines is generally robust to the number of knots and the degree specified for the monotone splines; such works include Ramsay,²⁷ Lin and Wang,²⁹ McMahan et al.,⁹ Wang et al.,³⁰ and Wang and Wang,²⁸ among others. Based on this phenomenon, one can simply use a small or moderate number of knots if an estimation procedure is time-consuming. In particular, Ramsay²⁷ recommended using only a few knots, for example, just one knot at the median or three knots at the three quartiles. We recommend using degree 2 or 3 to ensure adequate smoothness of the target function. Following the strategy recommended in the literature,^9,31 among others, we suggest implementing our method multiple times with different spline specifications and then using a model selection criterion such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) to determine the optimal specification for a particular data analysis. This strategy is illustrated in our data analysis in Section 6.

3. The proposed approach

3.1. Data augmentation

In this section, we focus on the situation where the sensitivity ($α$) and specificity ($β$) of the test are known. Let $κ = (θ^{'}, γ^{'})^{'}$ denote the unknown parameters, where $γ = (γ_{1}, γ_{2}, \dots, γ_{L})^{'}$ are the spline coefficients. With the use of monotone splines, the survival function becomes $S_{κ} (t | X) = \{1 + \sum_{l = 1}^{L} γ_{l} I_{l} (t) \exp (X^{'} θ)\}^{- 1}$, and the observed likelihood function becomes

$$L_{obs} (κ) = \prod_{i = 1}^{n} \{α - (α + β - 1) S_{κ} (C_{i} | X_{i})\}^{Y_{i}} \{1 - α + (α + β - 1) S_{κ} (C_{i} | X_{i})\}^{1 - Y_{i}} . \tag{1}$$

Even though the observed likelihood in (1) contains only a finite number of unknown parameters, in our experience direct optimization of this likelihood runs into numerical problems, largely due to the non-negativity constraints on $γ$ and the complex form of the likelihood. To tackle this problem, we introduce the following data augmentation with exponential and Poisson latent variables and seek to develop a robust estimation approach via an EM algorithm.

The first stage of our data augmentation takes advantage of the relationship between the observed status $Y_{i}$ and the true status $Δ_{i}$ for each $i$ . Treating $Δ_{i}$ ’s as observed, the resulting augmented likelihood is

$$L_{aug1} (κ) = \prod_{i = 1}^{n} \{1 - S_{κ} (C_{i} | X_{i})\}^{Δ_{i}} \{S_{κ} (C_{i} | X_{i})\}^{1 - Δ_{i}} P (Y_{i} ∣ Δ_{i}), \tag{2}$$

where

$$P (Y_{i} ∣ Δ_{i}) = \{α^{Δ_{i}} (1 - β)^{(1 - Δ_{i})}\}^{Y_{i}} \{(1 - α)^{Δ_{i}} β^{(1 - Δ_{i})}\}^{(1 - Y_{i})}$$

for $i = 1, \dots, n$. It can be verified that summing the augmented likelihood (2) over the possible values of the $Δ_{i}$'s leads to the observed likelihood (1). In the special case $α = β = 1$, this likelihood reduces to the observed likelihood for conventional current status data without misclassification. The rest of the data augmentation is motivated by McMahan et al.⁹

In the second stage of our data augmentation, we introduce a latent variable $ξ_{i}$ following an Exponential distribution with mean $1$ for subject $i$ , and the resulting augmented likelihood takes the following form

$$\begin{aligned} L_{aug2} (κ) = \prod_{i = 1}^{n} & \left[1 - \exp \left\{- \sum_{l = 1}^{L} γ_{l} I_{l} (C_{i}) \exp (X_{i}^{'} θ) ξ_{i}\right\}\right]^{Δ_{i}} \\ & \times \exp \left\{- \sum_{l = 1}^{L} γ_{l} I_{l} (C_{i}) \exp (X_{i}^{'} θ) ξ_{i} (1 - Δ_{i})\right\} \exp (- ξ_{i}) P (Y_{i} ∣ Δ_{i}) . \end{aligned} \tag{3}$$

Integrating out the $ξ_{i}$'s in this augmented likelihood leads to equation (2).
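The integration step can be checked directly. Writing $μ_{i} = \sum_{l = 1}^{L} γ_{l} I_{l} (C_{i}) \exp (X_{i}^{'} θ)$ (a shorthand introduced here only for this check), the unit-mean exponential density $\exp (- ξ_{i})$ gives, for the two possible values of $Δ_{i}$,

$$\begin{aligned} \int_{0}^{\infty} e^{- μ_{i} ξ_{i}} e^{- ξ_{i}} \, d ξ_{i} & = \frac{1}{1 + μ_{i}} = S_{κ} (C_{i} | X_{i}), \\ \int_{0}^{\infty} \left(1 - e^{- μ_{i} ξ_{i}}\right) e^{- ξ_{i}} \, d ξ_{i} & = 1 - \frac{1}{1 + μ_{i}} = 1 - S_{κ} (C_{i} | X_{i}), \end{aligned}$$

which are exactly the factors attached to $Δ_{i} = 0$ and $Δ_{i} = 1$ in the first-stage augmented likelihood.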

In the third stage, we introduce Poisson latent variables to further simplify the augmented likelihood (3). For subject $i$, we introduce $Z_{i} \sim P \{ξ_{i} \sum_{l = 1}^{L} γ_{l} I_{l} (C_{i}) \exp (X_{i}^{'} θ)\}$ and enforce the relationship $Δ_{i} = 1_{(Z_{i} > 0)}$, where $P (a)$ denotes the Poisson distribution with mean $a$. Further, we decompose $Z_{i}$ as a sum of conditionally independent Poisson latent variables $Z_{i l}$'s such that

$$Z_{i} = \sum_{l = 1}^{L} Z_{i l}, \quad Z_{i l} \sim P \{γ_{l} I_{l} (C_{i}) \exp (X_{i}^{'} θ) ξ_{i}\}$$

for $l = 1, \dots, L$. The resulting augmented likelihood based on the $ξ_{i}$'s and $Z_{i l}$'s is

$$L_{c} (κ) = \prod_{i = 1}^{n} \left[\prod_{l = 1}^{L} P \{Z_{i l} ∣ γ_{l} I_{l} (C_{i}) \exp (X_{i}^{'} θ) ξ_{i}\}\right] \exp (- ξ_{i}) P (Y_{i} ∣ Δ_{i}), \tag{4}$$

subject to the constraints $\sum_{l = 1}^{L} Z_{i l} > 0$ when $Δ_{i} = 1$ and $\sum_{l = 1}^{L} Z_{i l} = 0$ when $Δ_{i} = 0$ for $i = 1, \dots, n$. Here $P (Z ∣ a)$ denotes the probability mass function of a Poisson random variable $Z$ with mean $a$. The likelihood (4) has an appealing form with multiplicative terms of Poisson probability mass functions and other terms. This likelihood will serve as the complete data likelihood to develop our EM algorithm.

3.2. Derivation of the EM

We now describe the detailed derivation of our EM algorithm. The E-step involves taking the expectation of $\log L_{c} (κ)$ in (4) with respect to all latent variables $Δ_{i}$'s, $ξ_{i}$'s, and $Z_{i l}$'s conditioning on the observed data and the current parameter $κ^{(d)} = (θ^{' (d)}, γ^{' (d)})^{'}$ at the $d$th iteration. This yields the following $Q$ function

$$Q (κ ∣ κ^{(d)}, D) = \sum_{i = 1}^{n} \sum_{l = 1}^{L} \{E (Z_{i l}) (\log γ_{l} + X_{i}^{'} θ) - E (ξ_{i}) γ_{l} I_{l} (C_{i}) \exp (X_{i}^{'} θ)\},$$

after dropping some additive terms that are functions of $κ^{(d)}$ but free of $κ$. In the above $Q$ function, all the expectations are taken with respect to all latent variables given the observed data $D$, and they are functions of the current parameters $κ^{(d)}$ but free of $κ$. These conditional expectations have explicit forms as follows,

$$E (ξ_{i}) = \frac{E (Z_{i}) + 1}{1 + Λ_{0}^{(d)} (C_{i}) \exp (X_{i}^{'} θ^{(d)})}, \tag{5}$$

$$E (Z_{i}) = \{Λ_{0}^{(d)} (C_{i}) \exp (X_{i}^{'} θ^{(d)}) + 1\} E (Δ_{i}), \tag{6}$$

$$E (Z_{i l}) = \frac{E (Z_{i}) \cdot γ_{l}^{(d)} I_{l} (C_{i})}{\sum_{l = 1}^{L} γ_{l}^{(d)} I_{l} (C_{i})}, \tag{7}$$

$$E (Δ_{i}) = \frac{Λ_{0}^{(d)} (C_{i}) \exp (X_{i}^{'} θ^{(d)}) α^{Y_{i}} (1 - α)^{1 - Y_{i}}}{Λ_{0}^{(d)} (C_{i}) \exp (X_{i}^{'} θ^{(d)}) α^{Y_{i}} (1 - α)^{1 - Y_{i}} + (1 - β)^{Y_{i}} β^{1 - Y_{i}}},$$

for $i = 1, \dots, n$ and $l = 1, \dots, L$.
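The conditional expectations above are cheap to evaluate for each subject. The following sketch computes them for one subject; the function name `e_step` and its argument layout are illustrative assumptions rather than the authors' implementation, with `eta` standing for the linear predictor $X_{i}^{'} θ^{(d)}$ and `basis` for the values $I_{l} (C_{i})$.

```python
import math

def e_step(y, eta, alpha, beta, gamma, basis):
    """Conditional expectations for one subject at the current parameters.

    y     : observed test result (0 or 1)
    eta   : linear predictor X_i' theta
    gamma : current spline coefficients
    basis : [I_1(C_i), ..., I_L(C_i)], with a positive weighted sum
    """
    weights = [g * b for g, b in zip(gamma, basis)]
    lam = sum(weights)                       # Lambda_0(C_i) under the splines
    mu = lam * math.exp(eta)                 # odds of failure by time C_i
    pos = mu * (alpha if y == 1 else 1.0 - alpha)
    neg = (1.0 - beta) if y == 1 else beta
    e_delta = pos / (pos + neg)              # posterior probability of failure
    e_z = (mu + 1.0) * e_delta               # equation (6)
    e_xi = (e_z + 1.0) / (1.0 + mu)          # equation (5)
    e_zl = [e_z * w / lam for w in weights]  # equation (7)
    return e_delta, e_z, e_xi, e_zl

# Example: two basis functions, imperfect test, positive result.
print(e_step(1, 0.0, 0.9, 0.8, [1.0, 1.0], [0.5, 0.5]))
```

With a perfect test and a positive result, `e_delta` equals 1, so the expectations reduce to those for conventional current status data.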

In the M-step, we obtain $κ^{(d + 1)}$ by maximizing $Q (κ ∣ κ^{(d)}, D)$ with respect to $κ$. For this purpose, we first consider the derivatives of $Q (κ ∣ κ^{(d)}, D)$ with respect to $θ$ and the $γ_{l}$'s. Solving the equations $\partial Q (κ ∣ κ^{(d)}, D) / \partial γ_{l} = 0$ for $l = 1, \dots, L$ leads to a closed-form solution for the $γ_{l}$'s,

$$γ_{l}^{* (d)} (θ) = \frac{\sum_{i = 1}^{n} E (Z_{i l})}{\sum_{i = 1}^{n} I_{l} (C_{i}) \exp (X_{i}^{'} θ) E (ξ_{i})}, \quad l = 1, \dots, L . \tag{8}$$

Replacing $γ_{l}$ with $γ_{l}^{* (d)} (θ)$ for each $l$ in $\partial Q (κ ∣ κ^{(d)}, D) / \partial θ = 0$ leads to the following estimating equation for $θ$:

$$\sum_{i = 1}^{n} \left[E (Z_{i}) - E (ξ_{i}) \left\{\sum_{l = 1}^{L} \frac{I_{l} (C_{i}) \sum_{i^{'} = 1}^{n} E (Z_{i^{'} l})}{\sum_{i^{'} = 1}^{n} I_{l} (C_{i^{'}}) \exp (X_{i^{'}}^{'} θ) E (ξ_{i^{'}})}\right\} \exp (X_{i}^{'} θ)\right] X_{i} = 0 . \tag{9}$$

Let $θ^{(d + 1)}$ be the solution of the estimating equation (9), and we take $γ_{l}^{(d + 1)} = γ_{l}^{* (d)} (θ^{(d + 1)})$ based on equation (8) for $l = 1, \dots, L$. It can be shown that $κ^{(d + 1)} = (θ^{(d + 1)^{'}}, γ^{(d + 1)^{'}})^{'}$ is the unique global maximizer of $Q (κ ∣ κ^{(d)}, D)$, and the proof is sketched in Section A of the Supplementary file.

Here is a summary of our proposed EM algorithm.

Step 0. Initialize $κ^{(d)} = (θ^{{(d)}^{'}}, γ^{{(d)}^{'}})^{'}$ for $d = 0$.

Step 1. Obtain $θ^{(d + 1)}$ by solving the estimating equation (9).

Step 2. Obtain $γ_{l}^{(d + 1)} = γ_{l}^{* (d)} (θ^{(d + 1)})$ based on equation (8), for $l = 1, \dots, L$. Increase $d$ by $1$.

Step 3. Repeat Steps 1 and 2 until convergence.
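The steps above can be sketched in a few lines. The code below is a simplified illustration under loudly stated assumptions, not the authors' implementation: it handles a single scalar covariate, solves the estimating equation by bisection on an assumed bracket $[-10, 10]$ rather than by Newton's method, and in the demo uses degree-1 ramp functions as a stand-in monotone basis. All function and variable names are hypothetical.

```python
import math
import random

def em_iteration(y, x, basis, theta, gamma, alpha, beta):
    """One EM update (E-step, then Steps 1 and 2) for one scalar covariate."""
    n, L = len(y), len(gamma)
    e_xi, e_z, e_zl = [], [], []
    for i in range(n):
        w = [gamma[l] * basis[i][l] for l in range(L)]
        lam = sum(w)                              # assumed strictly positive
        mu = lam * math.exp(x[i] * theta)
        pos = mu * (alpha if y[i] == 1 else 1.0 - alpha)
        neg = (1.0 - beta) if y[i] == 1 else beta
        ez = (mu + 1.0) * pos / (pos + neg)       # equation (6)
        e_z.append(ez)
        e_xi.append((ez + 1.0) / (1.0 + mu))      # equation (5)
        e_zl.append([ez * wl / lam for wl in w])  # equation (7)
    num = [sum(e_zl[i][l] for i in range(n)) for l in range(L)]

    def gamma_star(th):                           # equation (8)
        return [num[l] / sum(basis[i][l] * math.exp(x[i] * th) * e_xi[i]
                             for i in range(n)) for l in range(L)]

    def score(th):                                # left side of equation (9)
        g = gamma_star(th)
        return sum((e_z[i] - e_xi[i] * math.exp(x[i] * th)
                    * sum(g[l] * basis[i][l] for l in range(L))) * x[i]
                   for i in range(n))

    lo, hi = -10.0, 10.0                          # assumed bracket for the root
    for _ in range(60):                           # score is decreasing in theta
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    theta_new = 0.5 * (lo + hi)
    return theta_new, gamma_star(theta_new)

def obs_loglik(y, x, basis, theta, gamma, alpha, beta):
    """Observed log-likelihood (1), used here only to monitor the iterations."""
    ll = 0.0
    for i in range(len(y)):
        lam = sum(g * b for g, b in zip(gamma, basis[i]))
        s = 1.0 / (1.0 + lam * math.exp(x[i] * theta))
        p1 = alpha - (alpha + beta - 1.0) * s
        ll += math.log(p1 if y[i] == 1 else 1.0 - p1)
    return ll

# Toy data: Lambda_0(t) = t, one binary covariate, alpha = beta = 0.9.
rng = random.Random(1)
alpha, beta, true_theta = 0.9, 0.9, 1.0
x, y, basis = [], [], []
for _ in range(200):
    xi = 1.0 if rng.random() < 0.5 else 0.0
    c = rng.uniform(0.1, 3.0)
    u = 1.0 - rng.random()
    t = (1.0 / u - 1.0) * math.exp(-true_theta * xi)
    delta = 1 if t <= c else 0
    y.append(1 if rng.random() < (alpha if delta else 1.0 - beta) else 0)
    x.append(xi)
    basis.append([min(max(c - k, 0.0), 1.0) for k in range(3)])  # ramp basis

theta, gamma = 0.0, [1.0, 1.0, 1.0]               # Step 0
history = [obs_loglik(y, x, basis, theta, gamma, alpha, beta)]
for _ in range(25):                               # Step 3
    theta, gamma = em_iteration(y, x, basis, theta, gamma, alpha, beta)
    history.append(obs_loglik(y, x, basis, theta, gamma, alpha, beta))
print(round(theta, 3), [round(g, 3) for g in gamma])
```

Monitoring `history` is a useful check: under the EM construction the observed log-likelihood should not decrease from one iteration to the next, and the updated spline coefficients stay non-negative without any explicit constraint handling.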

This EM algorithm has attractive computational features. First, it is easy to implement because each iteration involves only solving a low-dimensional system of equations for the regression parameters, which can be done with the Newton–Raphson algorithm or an existing statistics package, and updating the spline coefficients in simple explicit forms. Second, the EM algorithm is robust to the initial values and converges quickly, owing to the following facts: (a) the conditional expectations have simple closed forms, (b) $κ^{(d + 1)}$ is the unique global maximizer at each EM iteration, and (c) the explicit expressions of the $γ_{l}^{(d + 1)}$'s automatically satisfy the non-negativity constraints. All these features result from the use of the monotone splines, the proposed data augmentation, and the convenient form of the complete data likelihood.

Let $\hat{κ} = ({\hat{θ}}^{'}, {\hat{γ}}^{'})^{'}$ denote the converged value of the EM sequence. The covariance matrix of $\hat{κ}$ can be estimated by the inverse of the observed information matrix evaluated at $\hat{κ}$,

$$\widehat{\mathrm{var}} (\hat{κ}) = \left(- \frac{\partial^{2} l_{obs} (κ)}{\partial κ \partial κ^{'}} \Big|_{κ = \hat{κ}}\right)^{- 1},$$

where $l_{obs} (κ)$ is the logarithm of the observed likelihood based on the monotone splines in (1). All of the second derivatives of the log-likelihood involved in the observed information matrix have explicit expressions and are presented in Section B.1 of the Supplementary file. This method is easy to implement due to the closed forms of the second derivatives and is shown to have excellent performance in estimating the covariance matrix of $\hat{κ}$.

4. Extension to handle unknown test accuracy

In this section, we consider the regression problem in the case where the sensitivity and specificity of the imperfect test are unknown. Estimating these quantities is needed, and is possible when validation data or retesting data are available.

Motivated by the fibroid data, we consider a general situation that involves two parts of data: a validation subset using a test with known sensitivity $a_{0}$ and specificity $b_{0}$, and another subset using an imperfect test with unknown sensitivity $a_{1}$ and specificity $b_{1}$. Note that the test for the validation data can be a perfect test as in the fibroid data application, and our framework also allows an imperfect test for the validation data with $a_{0} < 1$ and/or $b_{0} < 1$.

Keeping the same notation as in Section 3, we further let $G_{i}$ denote the indicator of study group for subject $i$, with $G_{i} = 0$ if subject $i$ is in the validation subset and $G_{i} = 1$ if subject $i$ is in the non-validation subset. For notational convenience, define the individual sensitivity $α_{i}$ and specificity $β_{i}$ for subject $i$ as follows: $α_{i} = a_{0}$ and $β_{i} = b_{0}$ if $G_{i} = 0$, and $α_{i} = a_{1}$ and $β_{i} = b_{1}$ if $G_{i} = 1$.

Let $κ = (θ^{'}, γ^{'}, a_{1}, b_{1})^{'}$ denote the new vector of unknown parameters in this situation. Based on the complete data likelihood (4), we obtain the following $Q$ function

$$\begin{aligned} Q (κ ∣ κ^{(d)}, D) & = \sum_{i = 1}^{n} \sum_{l = 1}^{L} \{E (Z_{i l}) (\log γ_{l} + X_{i}^{'} θ) - E (ξ_{i}) γ_{l} I_{l} (C_{i}) \exp (X_{i}^{'} θ)\} \\ & \quad + \sum_{i = 1}^{n} [E (Δ_{i}) Y_{i} \log α_{i} + E (Δ_{i}) (1 - Y_{i}) \log (1 - α_{i}) \\ & \quad + \{1 - E (Δ_{i})\} (1 - Y_{i}) \log β_{i} + \{1 - E (Δ_{i})\} Y_{i} \log (1 - β_{i})] \\ & = H_{1} (θ, γ ∣ κ^{(d)}) + H_{2} (a_{1}, b_{1} ∣ κ^{(d)}) \end{aligned}$$

up to some additive constants that do not involve the parameters of interest $κ$, where $H_{1}$ is the first term and is a function of $θ$ and $γ$ only and does not contain $a_{1}$ or $b_{1}$, while $H_{2}$ is a function of $a_{1}$ and $b_{1}$ only and does not contain $θ$ or $γ$. Maximizing the $Q$ function with respect to $κ$ is equivalent to maximizing $H_{1}$ with respect to $θ$ and $γ$ and maximizing $H_{2}$ with respect to $a_{1}$ and $b_{1}$, because $H_{1}$ and $H_{2}$ share no parameters. In the above $Q$ function, the conditional expectation $E (Δ_{i})$ takes the following form

$$E (Δ_{i}) = \frac{\sum_{l = 1}^{L} γ_{l}^{(d)} I_{l} (C_{i}) \exp (X_{i}^{'} θ^{(d)}) α_{i}^{Y_{i}} (1 - α_{i})^{1 - Y_{i}}}{\sum_{l = 1}^{L} γ_{l}^{(d)} I_{l} (C_{i}) \exp (X_{i}^{'} θ^{(d)}) α_{i}^{Y_{i}} (1 - α_{i})^{1 - Y_{i}} + (1 - β_{i})^{Y_{i}} β_{i}^{1 - Y_{i}}}$$

for subject $i$, and the expressions of the $E (ξ_{i})$'s, $E (Z_{i})$'s, and $E (Z_{i l})$'s are the same as in equations (5) to (7) in Section 3.

Since $H_{1}$ has the same form as the $Q$ function in Section 3.2, maximizing $H_{1}$ with respect to $θ$ and $γ$ can be done in the exact manner as in Section 3.2. Let

$$a_{1}^{(d + 1)} = \frac{\sum_{i : G_{i} = 1} E (Δ_{i}) Y_{i}}{\sum_{i : G_{i} = 1} E (Δ_{i})} \quad \text{and} \quad b_{1}^{(d + 1)} = \frac{\sum_{i : G_{i} = 1} \{1 - E (Δ_{i})\} (1 - Y_{i})}{\sum_{i : G_{i} = 1} \{1 - E (Δ_{i})\}} . \tag{10}$$

It is straightforward to show that $(a_{1}^{(d + 1)}, b_{1}^{(d + 1)})^{'}$ is the unique global maximizer of $H_{2} (a_{1}, b_{1} ∣ κ^{(d)})$ and, further, that $κ^{(d + 1)} = (θ^{(d + 1)^{'}}, γ^{(d + 1)^{'}}, a_{1}^{(d + 1)}, b_{1}^{(d + 1)})^{'}$ is the unique global maximizer of $Q (κ ∣ κ^{(d)}, D)$. Thus, our new EM algorithm in this case only needs to modify the previous EM algorithm in Section 3.2 by adding an extra step that updates $a_{1}^{(d + 1)}$ and $b_{1}^{(d + 1)}$ in closed form based on equation (10) and by replacing the global sensitivity $α$ and specificity $β$ with the individual sensitivities $α_{i}$'s and specificities $β_{i}$'s.
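The extra update step amounts to two weighted proportions computed over the non-validation subset. A minimal sketch follows, with a hypothetical function name and toy inputs chosen only for illustration.

```python
def update_accuracy(e_delta, y, group):
    """Closed-form update of (a_1, b_1), using only subjects with G_i = 1.

    e_delta : conditional expectations E(Delta_i) from the E-step
    y       : observed test results (0 or 1)
    group   : study-group indicators G_i (0 = validation, 1 = non-validation)
    """
    num_a = den_a = num_b = den_b = 0.0
    for ed, yi, gi in zip(e_delta, y, group):
        if gi != 1:
            continue                      # validation subjects do not enter
        num_a += ed * yi                  # expected true positives
        den_a += ed                       # expected failures
        num_b += (1.0 - ed) * (1 - yi)    # expected true negatives
        den_b += 1.0 - ed                 # expected non-failures
    return num_a / den_a, num_b / den_b

a1, b1 = update_accuracy([0.8, 0.2, 0.5, 1.0], [1, 0, 1, 0], [1, 1, 1, 0])
print(round(a1, 4), round(b1, 4))
```

Because both updates are ratios of non-negative sums, the new values automatically lie in $[0, 1]$.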

Let $\hat{κ}$ denote the converged value of the EM sequence $κ^{(d)}$'s. The covariance matrix of $\hat{κ}$ can be estimated by taking the inverse of the observed information matrix evaluated at $\hat{κ}$, and the detailed expressions of the second derivatives involved in the observed information matrix can be found in Section B.2 of the Supplementary file.

5. Simulation studies

Extensive simulations were conducted to evaluate the proposed method for misclassified current status data with known and unknown test accuracy. In the first simulation study, we evaluated our method when the sensitivity and specificity of the test are known. The survival time $T_{i}$ was generated from the following PO model

$$S (t ∣ X_{i 1}, X_{i 2}) = \{1 + Λ_{0} (t) \exp (θ_{1} X_{i 1} + θ_{2} X_{i 2})\}^{- 1}, \quad i = 1, \dots, n,$$

where $X_{i 1}$ is a Bernoulli random variable with a probability of success $0.5$, and $X_{i 2}$ is a normal random variable with mean 0 and variance 0.25. The true baseline odds function was set to be $Λ_{0} (t) = t^{3} + t$, and the true values of $θ_{1}$ and $θ_{2}$ took on $1$ or $- 1$. The censoring time $C_{i}$ was generated independently from an exponential distribution with mean 2. The true censoring indicator or the testing status was determined by $Δ_{i} = I (T_{i} \leq C_{i})$. The sensitivity and specificity of the test were specified to be $α = β = 1$, $0.95$, $0.90$, and $0.85$, respectively, in four different simulation setups. The observed test result $Y_{i}$ was generated from $Bernoulli (α)$ if $Δ_{i} = 1$ and from $Bernoulli (1 - β)$ if $Δ_{i} = 0$ for a general setting when $α = β < 1$. In total, there were 16 different configurations with different true values of regression parameters and test accuracy. For each setup, we generated 500 datasets with a sample size $n = 500$ in each dataset.
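The data-generating mechanism described above can be sketched as follows. This is an illustrative reconstruction, not the authors' simulation code: the function names are hypothetical, failure times are drawn by inverse-transform sampling from the PO model, and the equation $t^{3} + t = v$ is solved by bisection.

```python
import math
import random

def solve_baseline(v):
    """Solve t**3 + t = v for t >= 0 by bisection (the map is increasing)."""
    lo, hi = 0.0, max(1.0, v)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if mid ** 3 + mid < v:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate(n, theta1, theta2, alpha, beta, rng):
    """Generate n records (y, c, x1, x2, delta) of misclassified current status data."""
    data = []
    for _ in range(n):
        x1 = 1 if rng.random() < 0.5 else 0
        x2 = rng.gauss(0.0, 0.5)                  # standard deviation 0.5, variance 0.25
        u = 1.0 - rng.random()                    # uniform on (0, 1]
        # S(T | X) = U implies Lambda_0(T) = (1/U - 1) * exp(-X' theta).
        odds = (1.0 / u - 1.0) * math.exp(-(theta1 * x1 + theta2 * x2))
        t = solve_baseline(odds)
        c = rng.expovariate(0.5)                  # exponential with mean 2
        delta = 1 if t <= c else 0                # true failure status
        p_pos = alpha if delta == 1 else 1.0 - beta
        y = 1 if rng.random() < p_pos else 0      # error-prone test result
        data.append((y, c, x1, x2, delta))
    return data

sample = simulate(500, 1.0, -1.0, 0.95, 0.95, random.Random(2024))
print(sum(rec[0] for rec in sample), sum(rec[4] for rec in sample))
```

The true status `delta` is returned only so that the amount of misclassification can be inspected; an analysis of the simulated data would use `y`, `c`, and the covariates alone.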

In this simulation study, we implemented our method with known accuracy described in Section 3. For the monotone spline specifications, we took the order to be 3 and used 4 equally spaced knots within the minimum and maximum of the censoring times for each data set. The initial values of the regression parameters $θ_{i}$ ’s were generated from $U (- 1, 1)$ , and those of the spline coefficients $γ_{l}$ ’s were generated from $U (0, 1)$ in our simulation. The convergence of the EM algorithm was claimed when the maximum of the absolute changes of regression parameters and the relative change in the observed log-likelihood were both less than $0.0001$ .

For comparison purposes, we also implemented the approach of Li et al.⁶ by running their R code on our simulated data. This comparison is reasonable as their approach is based on a general transformation model, which includes the PO model as a special case. Table 1 presents the simulation results from the two competing methods in the cases of $α = β = 1$, $0.95$, $0.90$, and $0.85$, respectively. The summarized results include the bias (Bias), defined as the average of the 500 point estimates minus the true value; the sample standard deviation (SD) of the 500 point estimates; the estimated standard error (ESE), obtained by averaging the 500 estimated standard errors; the mean squared error (MSE) based on the 500 point estimates; the 95% coverage probability (CP95) based on the 500 95% Wald confidence intervals for each regression parameter; and the average computation time in seconds per data set across all parameter configurations.
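For concreteness, the summary measures can be computed from the replicated estimates as in the sketch below. The function name is hypothetical, the MSE is taken as the average squared deviation from the true value, and the Wald interval uses the normal quantile 1.96; both choices are assumptions about details the text does not spell out.

```python
import math

def summarize(estimates, std_errors, truth):
    """Bias, SD, ESE, MSE, and CP95 for one parameter across replicates."""
    m = len(estimates)
    mean = sum(estimates) / m
    bias = mean - truth
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (m - 1))
    ese = sum(std_errors) / m
    mse = sum((e - truth) ** 2 for e in estimates) / m
    # Share of 95% Wald intervals (estimate +/- 1.96 * SE) covering the truth.
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if e - 1.96 * s <= truth <= e + 1.96 * s)
    return {"Bias": bias, "SD": sd, "ESE": ese, "MSE": mse, "CP95": covered / m}

print(summarize([0.9, 1.1, 1.0], [0.1, 0.1, 0.1], 1.0))
```

A well-calibrated variance estimator should give an ESE close to the SD and a CP95 close to 0.95.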

Table 1.
Estimation results on regression parameters from the proposed method and Li et al. (2020) based on 500 data sets each with sample size $n = 500$ when the sensitivity $α$ and specificity $β$ are known.

| $α = β$ | $θ_{1}$ | $θ_{2}$ | Est | Proposed: Bias | Proposed: SD | Proposed: ESE | Proposed: MSE | Proposed: CP95 | Proposed: Time | Li et al. (2020): Bias | Li et al. (2020): SD | Li et al. (2020): ESE | Li et al. (2020): MSE | Li et al. (2020): CP95 | Li et al. (2020): Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | −1 | −1 | ${\hat{θ}}_{1}$ | −0.003 | 0.250 | 0.251 | 0.063 | 0.954 | 0.73 | 0.079 | 0.256 | 0.241 | 0.072 | 0.926 | 123.25 |
| | | | ${\hat{θ}}_{2}$ | 0.000 | 0.252 | 0.255 | 0.063 | 0.950 | | 0.087 | 0.261 | 0.245 | 0.076 | 0.914 | |
| | −1 | 1 | ${\hat{θ}}_{1}$ | −0.011 | 0.245 | 0.275 | 0.060 | 0.958 | 0.75 | 0.077 | 0.248 | 0.240 | 0.067 | 0.934 | 123.23 |
| | | | ${\hat{θ}}_{2}$ | 0.000 | 0.246 | 0.261 | 0.061 | 0.966 | | −0.086 | 0.259 | 0.245 | 0.075 | 0.924 | |
| | 1 | −1 | ${\hat{θ}}_{1}$ | −0.006 | 0.219 | 0.228 | 0.048 | 0.966 | 1.01 | −0.046 | 0.238 | 0.227 | 0.059 | 0.948 | 20.53 |
| | | | ${\hat{θ}}_{2}$ | 0.018 | 0.231 | 0.232 | 0.054 | 0.948 | | 0.064 | 0.249 | 0.212 | 0.066 | 0.904 | |
| | 1 | 1 | ${\hat{θ}}_{1}$ | −0.001 | 0.232 | 0.228 | 0.054 | 0.958 | 1.00 | −0.034 | 0.249 | 0.228 | 0.063 | 0.942 | 21.78 |
| | | | ${\hat{θ}}_{2}$ | −0.022 | 0.225 | 0.232 | 0.051 | 0.958 | | −0.061 | 0.240 | 0.213 | 0.061 | 0.912 | |
| 0.95 | −1 | −1 | ${\hat{θ}}_{1}$ | 0.018 | 0.299 | 0.297 | 0.090 | 0.958 | 0.92 | 0.066 | 0.291 | 0.285 | 0.089 | 0.928 | 110.47 |
| | | | ${\hat{θ}}_{2}$ | −0.019 | 0.305 | 0.305 | 0.093 | 0.966 | | 0.030 | 0.303 | 0.292 | 0.093 | 0.938 | |
| | −1 | 1 | ${\hat{θ}}_{1}$ | −0.006 | 0.290 | 0.299 | 0.084 | 0.964 | 0.92 | 0.038 | 0.286 | 0.287 | 0.083 | 0.942 | 117.27 |
| | | | ${\hat{θ}}_{2}$ | 0.007 | 0.310 | 0.305 | 0.096 | 0.952 | | −0.044 | 0.299 | 0.293 | 0.091 | 0.940 | |
| | 1 | −1 | ${\hat{θ}}_{1}$ | −0.006 | 0.268 | 0.265 | 0.072 | 0.956 | 1.29 | −0.031 | 0.278 | 0.285 | 0.078 | 0.938 | 16.95 |
| | | | ${\hat{θ}}_{2}$ | −0.001 | 0.253 | 0.270 | 0.064 | 0.978 | | 0.030 | 0.258 | 0.254 | 0.068 | 0.956 | |
| | 1 | 1 | ${\hat{θ}}_{1}$ | −0.009 | 0.265 | 0.266 | 0.070 | 0.944 | 1.42 | −0.034 | 0.277 | 0.277 | 0.078 | 0.932 | 17.37 |
| | | | ${\hat{θ}}_{2}$ | −0.024 | 0.261 | 0.271 | 0.068 | 0.962 | | −0.055 | 0.272 | 0.249 | 0.077 | 0.906 | |
| 0.90 | −1 | −1 | ${\hat{θ}}_{1}$ | −0.013 | 0.337 | 0.356 | 0.114 | 0.962 | 1.17 | 0.035 | 0.331 | 0.324 | 0.111 | 0.924 | 85.35 |
| | | | ${\hat{θ}}_{2}$ | −0.013 | 0.363 | 0.359 | 0.132 | 0.950 | | 0.037 | 0.355 | 0.335 | 0.127 | 0.934 | |
| | −1 | 1 | ${\hat{θ}}_{1}$ | −0.011 | 0.348 | 0.349 | 0.121 | 0.946 | 1.17 | 0.033 | 0.344 | 0.324 | 0.120 | 0.922 | 89.71 |
| | | | ${\hat{θ}}_{2}$ | −0.007 | 0.357 | 0.358 | 0.128 | 0.960 | | −0.062 | 0.347 | 0.334 | 0.124 | 0.934 | |
| | 1 | −1 | ${\hat{θ}}_{1}$ | −0.006 | 0.293 | 0.309 | 0.086 | 0.960 | 1.56 | −0.024 | 0.304 | 0.372 | 0.093 | 0.930 | 16.06 |
| | | | ${\hat{θ}}_{2}$ | 0.024 | 0.298 | 0.315 | 0.089 | 0.962 | | 0.046 | 0.300 | 0.320 | 0.092 | 0.944 | |
| | 1 | 1 | ${\hat{θ}}_{1}$ | −0.024 | 0.293 | 0.308 | 0.087 | 0.964 | 2.22 | −0.035 | 0.296 | 0.362 | 0.089 | 0.948 | 14.84 |
| | | | ${\hat{θ}}_{2}$ | −0.015 | 0.280 | 0.314 | 0.078 | 0.976 | | −0.030 | 0.284 | 0.315 | 0.082 | 0.932 | |
| 0.85 | −1 | −1 | ${\hat{θ}}_{1}$ | −0.004 | 0.420 | 0.417 | 0.177 | 0.952 | 1.39 | 0.041 | 0.408 | 0.361 | 0.169 | 0.916 | 42.08 |
| | | | ${\hat{θ}}_{2}$ | −0.040 | 0.415 | 0.429 | 0.174 | 0.966 | | 0.015 | 0.401 | 0.379 | 0.161 | 0.944 | |
| | −1 | 1 | ${\hat{θ}}_{1}$ | 0.017 | 0.408 | 0.413 | 0.167 | 0.948 | 1.67 | 0.051 | 0.394 | 0.364 | 0.158 | 0.932 | 46.61 |
| | | | ${\hat{θ}}_{2}$ | −0.002 | 0.417 | 0.426 | 0.174 | 0.940 | | −0.042 | 0.406 | 0.382 | 0.167 | 0.924 | |
| | 1 | −1 | ${\hat{θ}}_{1}$ | −0.046 | 0.344 | 0.360 | 0.120 | 0.954 | 1.80 | −0.052 | 0.356 | 0.429 | 0.129 | 0.914 | 13.26 |
| | | | ${\hat{θ}}_{2}$ | 0.063 | 0.343 | 0.363 | 0.121 | 0.960 | | 0.072 | 0.345 | 0.361 | 0.124 | 0.908 | |
| | 1 | 1 | ${\hat{θ}}_{1}$ | −0.047 | 0.338 | 0.360 | 0.116 | 0.948 | 1.78 | −0.061 | 0.344 | 0.437 | 0.122 | 0.926 | 13.16 |
| | | | ${\hat{θ}}_{2}$ | −0.031 | 0.340 | 0.365 | 0.117 | 0.970 | | −0.048 | 0.339 | 0.368 | 0.117 | 0.942 | |

			Proposed method	Li et al. (2020)
$α = β = 1$
−1	−1	${\hat{θ}}_{1}$	−0.003	0.250	0.251	0.063	0.954	0.73	0.079	0.256	0.241	0.072	0.926	123.25
		${\hat{θ}}_{2}$	0.000	0.252	0.255	0.063	0.950		0.087	0.261	0.245	0.076	0.914
−1	1	${\hat{θ}}_{1}$	−0.011	0.245	0.275	0.060	0.958	0.75	0.077	0.248	0.240	0.067	0.934	123.23
		${\hat{θ}}_{2}$	0.000	0.246	0.261	0.061	0.966		−0.086	0.259	0.245	0.075	0.924
1	−1	${\hat{θ}}_{1}$	−0.006	0.219	0.228	0.048	0.966	1.01	−0.046	0.238	0.227	0.059	0.948	20.53
		${\hat{θ}}_{2}$	0.018	0.231	0.232	0.054	0.948		0.064	0.249	0.212	0.066	0.904
1	1	${\hat{θ}}_{1}$	−0.001	0.232	0.228	0.054	0.958	1.00	−0.034	0.249	0.228	0.063	0.942	21.78
		${\hat{θ}}_{2}$	−0.022	0.225	0.232	0.051	0.958		−0.061	0.240	0.213	0.061	0.912
$α = β = 0.95$
−1	−1	${\hat{θ}}_{1}$	0.018	0.299	0.297	0.090	0.958	0.92	0.066	0.291	0.285	0.089	0.928	110.47
		${\hat{θ}}_{2}$	−0.019	0.305	0.305	0.093	0.966		0.030	0.303	0.292	0.093	0.938
−1	1	${\hat{θ}}_{1}$	−0.006	0.290	0.299	0.084	0.964	0.92	0.038	0.286	0.287	0.083	0.942	117.27
		${\hat{θ}}_{2}$	0.007	0.310	0.305	0.096	0.952		−0.044	0.299	0.293	0.091	0.940
1	−1	${\hat{θ}}_{1}$	−0.006	0.268	0.265	0.072	0.956	1.29	−0.031	0.278	0.285	0.078	0.938	16.95
		${\hat{θ}}_{2}$	−0.001	0.253	0.270	0.064	0.978		0.030	0.258	0.254	0.068	0.956
1	1	${\hat{θ}}_{1}$	−0.009	0.265	0.266	0.070	0.944	1.42	−0.034	0.277	0.277	0.078	0.932	17.37
		${\hat{θ}}_{2}$	−0.024	0.261	0.271	0.068	0.962		−0.055	0.272	0.249	0.077	0.906
$α = β = 0.90$
−1	−1	${\hat{θ}}_{1}$	−0.013	0.337	0.356	0.114	0.962	1.17	0.035	0.331	0.324	0.111	0.924	85.35
		${\hat{θ}}_{2}$	−0.013	0.363	0.359	0.132	0.950		0.037	0.355	0.335	0.127	0.934
−1	1	${\hat{θ}}_{1}$	−0.011	0.348	0.349	0.121	0.946	1.17	0.033	0.344	0.324	0.120	0.922	89.71
		${\hat{θ}}_{2}$	−0.007	0.357	0.358	0.128	0.960		−0.062	0.347	0.334	0.124	0.934
1	−1	${\hat{θ}}_{1}$	−0.006	0.293	0.309	0.086	0.960	1.56	−0.024	0.304	0.372	0.093	0.930	16.06
		${\hat{θ}}_{2}$	0.024	0.298	0.315	0.089	0.962		0.046	0.300	0.320	0.092	0.944
1	1	${\hat{θ}}_{1}$	−0.024	0.293	0.308	0.087	0.964	2.22	−0.035	0.296	0.362	0.089	0.948	14.84
		${\hat{θ}}_{2}$	−0.015	0.280	0.314	0.078	0.976		−0.030	0.284	0.315	0.082	0.932
$α = β = 0.85$
−1	−1	${\hat{θ}}_{1}$	−0.004	0.420	0.417	0.177	0.952	1.39	0.041	0.408	0.361	0.169	0.916	42.08
		${\hat{θ}}_{2}$	−0.040	0.415	0.429	0.174	0.966		0.015	0.401	0.379	0.161	0.944
−1	1	${\hat{θ}}_{1}$	0.017	0.408	0.413	0.167	0.948	1.67	0.051	0.394	0.364	0.158	0.932	46.61
		${\hat{θ}}_{2}$	−0.002	0.417	0.426	0.174	0.940		−0.042	0.406	0.382	0.167	0.924
1	−1	${\hat{θ}}_{1}$	−0.046	0.344	0.360	0.120	0.954	1.80	−0.052	0.356	0.429	0.129	0.914	13.26
		${\hat{θ}}_{2}$	0.063	0.343	0.363	0.121	0.960		0.072	0.345	0.361	0.124	0.908
1	1	${\hat{θ}}_{1}$	−0.047	0.338	0.360	0.116	0.948	1.78	−0.061	0.344	0.437	0.122	0.926	13.16
		${\hat{θ}}_{2}$	−0.031	0.340	0.365	0.117	0.970		−0.048	0.339	0.368	0.117	0.942

As seen in Table 1, our proposed method yields smaller bias than the approach of Li et al.⁶ in most parameter configurations, while the latter produces a slightly smaller SD than the former in some configurations. However, careful investigation suggests that the ESEs from the approach of Li et al.⁶ may underestimate the true standard errors, because their ESEs are often smaller than the corresponding SDs and the resulting CP95s fall noticeably below the nominal value 0.95, especially in the cases of $α = β = 1$ . In contrast, for our method, the ESEs are well aligned with the SDs, and the CP95s are close to 0.95 in all configurations. The comparison in terms of MSE, which takes into account both bias and variance, also slightly favors our proposed method over the approach of Li et al.⁶ in Table 1. In addition, the proposed method is more computationally efficient than the approach of Li et al.,⁶ with an overall average running time of 1.30 seconds per data set for the proposed method and 54.49 seconds per data set for the approach of Li et al.⁶ in this simulation. Based on all these comparison results, it is fair to conclude that the proposed method outperforms the approach of Li et al.⁶ in terms of overall estimation accuracy and computational efficiency. All R code was run on a macOS system with an Apple M2 chip and 16 GB of memory.
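The overall running times quoted above can be checked directly against the Time columns of Table 1. The short Python sketch below simply averages the 16 per-configuration values copied from the table; the variable names are illustrative only.

```python
# Per-configuration average times (seconds per data set) copied from Table 1,
# in table order: alpha = beta = 1, 0.95, 0.90, 0.85
proposed = [0.73, 0.75, 1.01, 1.00, 0.92, 0.92, 1.29, 1.42,
            1.17, 1.17, 1.56, 2.22, 1.39, 1.67, 1.80, 1.78]
li_et_al = [123.25, 123.23, 20.53, 21.78, 110.47, 117.27, 16.95, 17.37,
            85.35, 89.71, 16.06, 14.84, 42.08, 46.61, 13.26, 13.16]

avg_proposed = sum(proposed) / len(proposed)
avg_li = sum(li_et_al) / len(li_et_al)
speedup = avg_li / avg_proposed   # how many times faster the proposed method is on average

print(f"proposed: {avg_proposed:.2f} s, Li et al.: {avg_li:.2f} s, ratio: {speedup:.1f}")
```

Up to rounding, the two averages agree with the reported 1.30 and 54.49 seconds, a ratio of roughly 42.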

Our method also produces good estimation results for the baseline cumulative distribution function (CDF) $F_{0}$ . Figure 1 presents the true and estimated baseline CDFs over the same range (0.1, 2.5) across all parameter configurations in the first simulation study. These configurations allow for a progressive examination of how increasing levels of misclassification affect the estimation performance of the proposed method. The plots display an overall good agreement between the estimated and true baseline CDFs across all four configurations of $(θ_{1}, θ_{2})$ in all the panels of Figure 1. It is also clear from Figure 1 that the estimated baseline CDF deviates slightly more from the true curve as the test accuracy decreases.

Figure 1.

True and four estimated baseline cumulative distribution functions across different parameter configurations of $(θ_{1}, θ_{2})$ in different scenarios with different values of sensitivity $α$ and specificity $β$ .

An additional simulation study was conducted to compare the adopted variance estimation method with Louis’ method³² and a method based on outer product gradients,³³ and the comparison results presented in Section C.1 of the Supplementary file support the validity and the efficiency of the proposed method in obtaining the variance estimate of $\hat{κ}$ .

In the second simulation study, we evaluated the estimation performance of the proposed method when the test accuracy parameters were unknown. For this simulation, each dataset consisted of mixed data, with one subset of validation data and one subset of non-validation data. Specifically, the validation data consisted of 250 observations from a test with known sensitivity $a_{0}$ and specificity $b_{0}$ , and the non-validation data consisted of 250 observations from an imperfect test with sensitivity $a_{1}$ and specificity $b_{1}$ in each dataset. Three scenarios were considered for different accuracies of the test for the validation data: (1) a perfect test with $a_{0} = b_{0} = 1$ , (2) an imperfect test with $a_{0} = b_{0} = 0.95$ , and (3) an imperfect test with $a_{0} = b_{0} = 0.85$ . In each scenario, we further considered three cases with $a_{1} = b_{1} = 0.95$ , $0.90$ , and $0.85$ for the non-validation data. The data generation mechanism was the same as in the first simulation study except for the mixture feature, and the proposed method with unknown sensitivity and specificity described in Section 4 was implemented, with other features such as the monotone spline specification, initial values, and convergence criterion the same as in the first simulation study.
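The mixture feature described above can be sketched as follows. This Python snippet is only an illustration of the misclassification step, not the authors' R code: the function names, the seed, and the way the true statuses are drawn (a hypothetical 40% prevalence rather than the paper's failure-time model) are all assumptions.

```python
import random

def observe_status(true_status, sensitivity, specificity, rng):
    """Return the reported status Y given the true status under an imperfect test."""
    if true_status == 1:
        return 1 if rng.random() < sensitivity else 0   # detected w.p. sensitivity
    return 0 if rng.random() < specificity else 1        # correctly negative w.p. specificity

def make_mixed_data(true_statuses, a0, b0, a1, b1, n_validation, seed=0):
    """Apply the validation test to the first n_validation subjects, the other test to the rest."""
    rng = random.Random(seed)
    data = []
    for i, delta in enumerate(true_statuses):
        validated = i < n_validation
        a, b = (a0, b0) if validated else (a1, b1)
        data.append({"validated": validated, "delta": delta,
                     "y": observe_status(delta, a, b, rng)})
    return data

# Hypothetical true statuses for 500 subjects; in the paper they come from T <= C
status_rng = random.Random(1)
deltas = [1 if status_rng.random() < 0.4 else 0 for _ in range(500)]
mixed = make_mixed_data(deltas, a0=1.0, b0=1.0, a1=0.90, b1=0.90, n_validation=250)
```

With a perfect validation test ($a_{0} = b_{0} = 1$), the reported status equals the true status for every validated subject, while the remaining 250 subjects may be misclassified.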

Tables 2 and 3 present the estimation results based on the mixed data in the cases of $a_{0} = b_{0} = 1$ and $a_{0} = b_{0} = 0.85$ , respectively, and the results for the case of $a_{0} = b_{0} = 0.95$ are summarized in Table C.2 of the Supplementary file. Examining the results in these tables, we conclude that the proposed method works well in estimating the regression parameters and the test accuracy parameters since the biases are close to $0$ , the SDs are closely aligned with the ESEs, and the CP95s are close to the nominal value 0.95 for all parameters in all configurations. Within each of these three tables, the estimated standard errors of the regression parameters become larger as the test accuracy decreases for the non-validation data, which is expected due to the increased level of uncertainty. Comparing the results across these tables, the estimated standard errors also become larger as the test accuracy for the validation data decreases, which is also reasonable.

Table 2.

Estimation results from the proposed method with unknown sensitivity and specificity based on mixed data in the second simulation study when the test of the validation data is perfect with $a_{0} = b_{0} = 1$ and the true sensitivity and specificity of the test for non-validation data are $a_{1} = b_{1} = 0.95$ , $0.90$ , and $0.85$ .

			$a_{1} = b_{1} = 0.95$				$a_{1} = b_{1} = 0.90$				$a_{1} = b_{1} = 0.85$
$θ_{1}$	$θ_{2}$	Est	Bias	SD	ESE	CP95	Bias	SD	ESE	CP95	Bias	SD	ESE	CP95
−1	−1	${\hat{θ}}_{1}$	−0.018	0.288	0.289	0.940	−0.009	0.294	0.298	0.954	0.003	0.305	0.308	0.934
		${\hat{θ}}_{2}$	−0.014	0.303	0.293	0.946	−0.010	0.305	0.302	0.964	−0.023	0.315	0.316	0.958
		${\hat{a}}_{1}$	0.000	0.068	0.121	0.984	0.012	0.086	0.115	0.982	0.014	0.102	0.112	0.962
		${\hat{b}}_{1}$	−0.002	0.034	0.035	0.964	0.002	0.039	0.039	0.936	0.001	0.042	0.041	0.930
−1	1	${\hat{θ}}_{1}$	−0.002	0.268	0.286	0.974	0.011	0.291	0.298	0.960	−0.006	0.296	0.313	0.946
		${\hat{θ}}_{2}$	0.005	0.276	0.289	0.970	−0.007	0.312	0.303	0.940	−0.006	0.320	0.318	0.950
		${\hat{a}}_{1}$	−0.006	0.074	0.115	0.974	0.011	0.087	0.117	0.974	0.010	0.096	0.113	0.972
		${\hat{b}}_{1}$	0.000	0.032	0.035	0.968	0.001	0.036	0.039	0.962	0.001	0.042	0.041	0.938
1	−1	${\hat{θ}}_{1}$	−0.002	0.243	0.258	0.968	−0.021	0.263	0.268	0.956	−0.006	0.257	0.278	0.964
		${\hat{θ}}_{2}$	−0.002	0.258	0.262	0.956	0.029	0.265	0.272	0.940	0.025	0.293	0.283	0.940
		${\hat{a}}_{1}$	0.010	0.049	0.077	0.996	0.014	0.066	0.078	0.974	0.019	0.071	0.077	0.970
		${\hat{b}}_{1}$	−0.004	0.041	0.048	0.980	0.000	0.048	0.049	0.944	0.005	0.054	0.053	0.926
1	1	${\hat{θ}}_{1}$	−0.015	0.243	0.256	0.952	−0.027	0.253	0.265	0.962	−0.020	0.268	0.277	0.956
		${\hat{θ}}_{2}$	0.002	0.247	0.261	0.962	−0.008	0.270	0.272	0.952	−0.043	0.278	0.281	0.944
		${\hat{a}}_{1}$	0.010	0.050	0.076	0.980	0.017	0.068	0.078	0.964	0.018	0.071	0.076	0.964
		${\hat{b}}_{1}$	−0.003	0.041	0.047	0.980	0.003	0.047	0.050	0.962	0.007	0.054	0.053	0.940

Table 3.

Estimation results from the proposed method with unknown sensitivity and specificity based on mixed data in the second simulation study when the test of the validation data is imperfect with sensitivity and specificity $a_{0} = b_{0} = 0.85$ and the true sensitivity and specificity of the test for non-validation data are $a_{1} = b_{1} = 0.95$ , $0.90$ , and $0.85$ , respectively.

			$a_{1} = b_{1} = 0.95$				$a_{1} = b_{1} = 0.90$				$a_{1} = b_{1} = 0.85$
$θ_{1}$	$θ_{2}$	Est	Bias	SD	ESE	CP95	Bias	SD	ESE	CP95	Bias	SD	ESE	CP95
−1	−1	${\hat{θ}}_{1}$	−0.007	0.358	0.385	0.960	0.023	0.373	0.415	0.968	0.022	0.440	0.446	0.934
		${\hat{θ}}_{2}$	−0.017	0.356	0.396	0.970	−0.019	0.403	0.427	0.948	0.001	0.404	0.454	0.958
		${\hat{a}}_{1}$	0.006	0.068	0.144	0.982	0.021	0.092	0.146	0.968	0.028	0.104	0.138	0.952
		${\hat{b}}_{1}$	0.001	0.034	0.043	0.990	0.003	0.041	0.045	0.954	0.006	0.047	0.048	0.932
−1	1	${\hat{θ}}_{1}$	−0.028	0.357	0.397	0.956	−0.004	0.393	0.410	0.948	0.011	0.411	0.444	0.962
		${\hat{θ}}_{2}$	0.040	0.375	0.408	0.954	−0.031	0.402	0.415	0.942	0.006	0.457	0.452	0.954
		${\hat{a}}_{1}$	0.003	0.070	0.147	0.990	0.023	0.086	0.143	0.976	0.029	0.109	0.139	0.960
		${\hat{b}}_{1}$	−0.001	0.034	0.042	0.986	0.005	0.041	0.046	0.966	0.008	0.046	0.050	0.964
1	−1	${\hat{θ}}_{1}$	−0.025	0.294	0.337	0.968	−0.059	0.321	0.360	0.950	−0.022	0.359	0.394	0.962
		${\hat{θ}}_{2}$	0.018	0.309	0.343	0.972	0.068	0.322	0.364	0.950	0.003	0.358	0.396	0.966
		${\hat{a}}_{1}$	0.022	0.045	0.101	0.992	0.032	0.069	0.100	0.978	0.029	0.082	0.096	0.962
		${\hat{b}}_{1}$	−0.002	0.039	0.057	0.986	0.008	0.054	0.060	0.968	0.007	0.062	0.062	0.928
1	1	${\hat{θ}}_{1}$	−0.026	0.302	0.339	0.950	−0.034	0.322	0.366	0.942	−0.050	0.326	0.382	0.956
		${\hat{θ}}_{2}$	−0.026	0.312	0.347	0.950	−0.078	0.334	0.375	0.934	−0.029	0.354	0.389	0.956
		${\hat{a}}_{1}$	0.023	0.047	0.098	0.994	0.033	0.067	0.106	0.974	0.034	0.079	0.098	0.972
		${\hat{b}}_{1}$	0.002	0.042	0.059	0.990	0.006	0.049	0.061	0.976	0.009	0.056	0.061	0.958

An additional simulation study was conducted to assess the importance of the validation data when estimating the test accuracy parameters. Tables C.3 and C.4 in the Supplementary file show the estimation results when the test is perfect for the validation data and the proportion of validation data is 30% and 0% of the whole data, respectively. The estimation performance deteriorates as the proportion of the validation data decreases. In particular, when no validation data are available, the estimation results are not satisfactory, with substantial bias, misaligned SD and ESE, and/or low coverage probability for the regression parameters and the test accuracy parameters in some configurations of Table C.4. This suggests that one cannot estimate the test accuracy parameters accurately without the validation data, especially when the test accuracy is low. This makes sense as the test accuracy usually depends only on the technology used or the features of the test itself.

6. Fibroid data application

RFTS was a multi-center cohort study of early pregnancy, initiated in 2001 and conducted in North Carolina, Tennessee, and Texas. The enrollment criteria required participants to be at least 18 years old, to enroll by 12 6/7 weeks of gestation based on the last menstrual period, not to use assisted reproductive technology, to intend to carry the pregnancy to term, to speak English or Spanish, and not to plan to move within the next 18 months.⁴ Our data include 4496 women from the period 2001−2007, all of whom completed an endovaginal ultrasound examination during their pregnancy. In the ultrasound examination, each identified leiomyoma was measured three times, and its maximum size was recorded.⁴ A woman was classified as fibroid positive if she had any detected leiomyoma greater than 5 millimeters. This study involved three separately funded sub-studies, RFTS 1, 2, and 3. It was reported that the ultrasonographers in RFTS 1 did not receive adequate training and likely missed some fibroids, leading to an under-reporting problem.⁴,⁵ The under-reporting problem occurred only in RFTS 1 and not in the other two sub-studies.

In our analysis, the response variable is taken to be the onset time of fibroids, and we are interested in estimating the covariate effects on the onset time as well as the survival functions of the onset times for different subgroups. Four covariates are considered in this analysis: race (White or Black), parity status (whether a participant has given birth before), age of menarche (when a participant had her first period), and obesity status (measured by body mass index greater than 30). In the data, the onset time of fibroids was never observed for any participant. However, the fibroid status, that is, whether at least one fibroid was detected at the ultrasound examination, was recorded, and this indicates that the fibroid onset time is either left- or right-censored at the examination time. Due to the under-reporting problem, the fibroid status was subject to misclassification in the first sub-study, leading to misclassified current status data for the fibroid onset time. The accuracy of the test/diagnosis is unknown and needs to be estimated in this analysis.

Let $T_{i}$ be the fibroid onset time, $C_{i}$ the age at the ultrasound examination, $Y_{i}$ the reported binary fibroid status, and $Δ_{i}$ the true fibroid status for subject $i$ for $i = 1, \dots, n$ . Let $G_{i}$ be the binary indicator of whether subject $i$ participated in RFTS 1. Since there was no misclassification problem in RFTS 2 and 3, we have $Δ_{i} = Y_{i}$ for all subjects with $G_{i} = 0$ . Consequently, the sensitivity $a_{0}$ and specificity $b_{0}$ of the test in both RFTS 2 and 3 are equal to one. However, for the subjects in RFTS 1, the fibroid statuses are potentially misclassified because of the under-reporting problem. Following the notation in Section 4, let $a_{1} = P (Y_{i} = 1 | Δ_{i} = 1, G_{i} = 1)$ be the sensitivity of the ultrasound examination in RFTS 1; it is anticipated that $a_{1}$ is less than $1$ and needs to be estimated. Note that the specificity of the test in RFTS 1 is $b_{1} = P (Y_{i} = 0 | Δ_{i} = 0, G_{i} = 1) = 1$ , since under-reporting only causes existing fibroids to be missed and does not produce false positives.
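To make the role of $a_{g}$ and $b_{g}$ concrete, the probability of a reported positive status can be written out by the law of total probability. The display below is a sketch under two assumptions that are not spelled out in this section: $Δ_{i} = I(T_{i} \le C_{i})$ , and the reported status depends on $(C_{i}, X_{i})$ only through $Δ_{i}$ and $G_{i}$ . Here $X_{i}$ is the covariate vector and $F(\cdot | X_{i})$ the conditional CDF of $T_{i}$ .

```latex
\begin{aligned}
P(Y_{i} = 1 \mid C_{i}, X_{i}, G_{i} = g)
  &= P(Y_{i} = 1 \mid Δ_{i} = 1, G_{i} = g)\, P(Δ_{i} = 1 \mid C_{i}, X_{i}) \\
  &\quad + P(Y_{i} = 1 \mid Δ_{i} = 0, G_{i} = g)\, P(Δ_{i} = 0 \mid C_{i}, X_{i}) \\
  &= a_{g}\, F(C_{i} \mid X_{i}) + (1 - b_{g})\, \{1 - F(C_{i} \mid X_{i})\}.
\end{aligned}
```

With $a_{0} = b_{0} = 1$ this reduces to $F(C_{i} | X_{i})$ for RFTS 2 and 3, and with $b_{1} = 1$ it reduces to $a_{1} F(C_{i} | X_{i})$ for RFTS 1, so under-reporting scales the probability of a reported positive by the sensitivity.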

Let $X_{i} = (X_{i 1}, X_{i 2}, X_{i 3}, X_{i 4})$ be the covariate vector for subject $i$ , where $X_{i 1}$ is a binary race variable with 1 for African American and 0 for white, $X_{i 2}$ denotes parity with 1 for having given birth before and 0 otherwise, $X_{i 3}$ is the standardized age of menarche, and $X_{i 4}$ is the obesity status with 1 for a body mass index greater than 30 and 0 otherwise. The following PO model was assumed in the analysis:

$$S (t | X_{i 1}, X_{i 2}, X_{i 3}, X_{i 4}) = \{1 + Λ_{0} (t) \exp (θ_{1} X_{i 1} + θ_{2} X_{i 2} + θ_{3} X_{i 3} + θ_{4} X_{i 4})\}^{- 1} .$$
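As a small numerical illustration of this model, the Python sketch below evaluates the survival function for two covariate profiles and confirms that, under the PO model, the ratio of the odds of onset by time $t$ equals $\exp(θ_{1})$ whatever the baseline odds. The function name `po_survival` and the baseline odds value 0.5 are hypothetical; the coefficients are the $m = 3$ estimates reported in Table 4.

```python
import math

def po_survival(baseline_odds, theta, x):
    """S(t | x) = 1 / (1 + Lambda_0(t) * exp(theta' x)) under the PO model."""
    linear_predictor = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + baseline_odds * math.exp(linear_predictor))

theta = [1.580, -0.395, -0.190, 0.091]   # estimates for m = 3 from Table 4
baseline_odds = 0.5                       # hypothetical value of Lambda_0(t) at some t

s_white = po_survival(baseline_odds, theta, [0, 0, 0, 0])
s_black = po_survival(baseline_odds, theta, [1, 0, 0, 0])

# Odds of onset by time t are (1 - S) / S; their ratio does not depend on Lambda_0(t)
odds_ratio = ((1 - s_black) / s_black) / ((1 - s_white) / s_white)
print(round(s_white, 4), round(s_black, 4), round(odds_ratio, 3))
```

Changing `baseline_odds` changes both survival probabilities but leaves `odds_ratio` unchanged, which is the defining feature of the model.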

The proposed method in Section 4 was applied to this dataset, taking $m \in \{3, 4, \dots, 12\}$ equally spaced knots within the range of the examination times and degree 3 for the monotone spline specification. In the implementation of the proposed method, we took random initial values in the same manner as in the simulation study and set the convergence criterion to be a maximum change in all parameters of less than $10^{- 4}$ .

Table 4 presents the estimated regression and sensitivity parameters (Est) and their corresponding standard errors (SE) as well as the AIC from the proposed method using different numbers $m$ of equally spaced knots. As seen in Table 4, the estimation results are very close across different $m$ values, suggesting that the proposed method is robust to the number of knots. This phenomenon has also been observed in other work using monotone splines, such as McMahan et al.⁹ and Wang and Wang²⁸ among others. The PO model with $m = 3$ equally spaced knots yields the smallest AIC and is thus taken as the final model for our analysis. Table C.5 in the Supplementary file presents the estimation results from the proposed method when using different sets of random initial values, and the estimation results are nearly identical across different trials, which further confirms the robustness of our method to the initial values. Table C.6 in the Supplementary file shows the estimation results from the proposed method treating both the sensitivity $a_{1}$ and specificity $b_{1}$ of the test in RFTS 1 as unknown, and the estimated specificity is ${\hat{b}}_{1} = 0.997$ , supporting our setting of $b_{1} = 1$ in the current data analysis.

Table 4.
Estimated covariate effects (Est) and their corresponding standard errors (SE) from the proposed method using different numbers ($m$) of equally spaced knots.

	Race		Parity		AgeMenarche		Obesity		$a_{1}$
$m$	Est	SE	Est	SE	Est	SE	Est	SE	Est	SE	AIC
3	1.580	0.149	$- 0.395$	0.125	$- 0.190$	0.062	0.091	0.150	0.636	0.061	2128.26
4	1.579	0.150	$- 0.395$	0.126	$- 0.192$	0.062	0.102	0.151	0.629	0.059	2129.24
5	1.579	0.150	$- 0.397$	0.126	$- 0.190$	0.062	0.088	0.150	0.638	0.063	2132.29
6	1.575	0.154	$- 0.395$	0.124	$- 0.191$	0.062	0.104	0.150	0.630	0.054	2134.02
7	1.582	0.151	$- 0.393$	0.125	$- 0.190$	0.062	0.099	0.151	0.635	0.058	2136.25
8	1.586	0.153	$- 0.392$	0.126	$- 0.191$	0.062	0.105	0.151	0.634	0.060	2137.63
9	1.573	0.149	$- 0.378$	0.128	$- 0.189$	0.062	0.108	0.150	0.633	0.061	2141.17
10	1.577	0.148	$- 0.384$	0.128	$- 0.191$	0.062	0.110	0.151	0.626	0.061	2141.76
11	1.585	0.145	$- 0.387$	0.129	$- 0.191$	0.062	0.100	0.151	0.639	0.061	2142.70
12	1.584	0.145	$- 0.391$	0.130	$- 0.191$	0.062	0.097	0.151	0.633	0.061	2145.26

The results also include the estimation of the sensitivity $a_{1}$ and the AIC in these cases.
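The knot selection by AIC can be reproduced from the last column of Table 4. The following Python sketch copies the AIC values and the race estimates from the table; the dictionary names are illustrative only.

```python
# AIC and race-effect estimates by number of knots m, copied from Table 4
aic_by_knots = {3: 2128.26, 4: 2129.24, 5: 2132.29, 6: 2134.02, 7: 2136.25,
                8: 2137.63, 9: 2141.17, 10: 2141.76, 11: 2142.70, 12: 2145.26}
race_by_knots = {3: 1.580, 4: 1.579, 5: 1.579, 6: 1.575, 7: 1.582,
                 8: 1.586, 9: 1.573, 10: 1.577, 11: 1.585, 12: 1.584}

best_m = min(aic_by_knots, key=aic_by_knots.get)      # smallest AIC wins
race_spread = max(race_by_knots.values()) - min(race_by_knots.values())

print(best_m, round(race_spread, 3))
```

The spread of the race estimate across all ten knot choices is small relative to its standard error of about 0.15, consistent with the robustness to the number of knots noted for Table 4.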

Table 5 presents the estimation results from the proposed method with the optimal spline specification together with the results from two competing approaches: Wang and Dunson⁵ and McMahan et al.,⁹ using the same spline specification under the PO model. Wang and Dunson⁵ accounted for the under-reporting problem in the data analysis but from a Bayesian perspective. The approach of McMahan et al.⁹ is based on an EM algorithm for analyzing conventional current status data under the PO model but cannot be used to analyze misclassified current status data. Here the results from McMahan et al.⁹ were actually obtained based on the data in RFTS 2 only, where there were no misclassifications. That approach is equivalent to the proposed method with both sensitivity and specificity equal to $1$ in Section 3.

Table 5.

Estimation results of the fibroid data analysis from the proposed EM method, the Bayesian approach of Wang and Dunson⁵ (Wang & Dunson), and the approach of McMahan et al.⁹ (MWT) under the PO model.

	Proposed		Wang & Dunson		MWT
Covariate	Point	95% CI	Point	95% CI	Point	95% CI
Race	1.580	(1.286, 1.872)	1.446	(1.159, 1.750)	1.605	(1.211, 2.000)
Parity	−0.395	(−0.641, −0.150)	−0.544	(−0.776, −0.315)	−0.340	(−0.660, −0.021)
AgeMenarche	−0.190	(−0.312, −0.068)	−0.182	(−0.297, −0.064)	−0.168	(−0.321, −0.015)
Obesity	0.091	(−0.203, 0.384)	0.035	(−0.254, 0.320)	0.119	(−0.275, 0.513)
$a_{1}$	0.636	(0.516, 0.755)	0.595	(0.488, 0.710)

Presented results include the point estimates (MLEs or posterior means) of the covariate effects and sensitivity $a_{1}$ and their corresponding 95% confidence or credible intervals (95% CI). The results of MWT were based on the data in RFTS 2 only.

As seen in Table 5, the three competing methods yield comparable estimation results and lead to the same conclusions. Three covariates, race, parity, and age of menarche, have a significant effect on the development of fibroids, while obesity does not. To be specific, black women have a higher risk of developing uterine fibroids than white women, and the odds of developing fibroids for black women are about $e^{1.580} \approx 4.85$ times those for white women when controlling for the other covariates. Women who have given birth before have about $1 - e^{- 0.395} \approx 32.6\%$ lower odds of developing fibroids than women who have never given birth when controlling for the other covariates. Additionally, the estimated sensitivity of the ultrasound examination is 0.636 from our analysis, which confirms a serious under-reporting issue in RFTS 1.
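The effect sizes quoted above follow from exponentiating the estimated coefficients. The Python sketch below redoes this arithmetic and also forms 95% Wald intervals from the $m = 3$ estimates and standard errors in Table 4; the variable names are illustrative, and the use of the normal quantile 1.96 is an assumption about how the reported intervals were obtained.

```python
import math

# Point estimates and standard errors for m = 3 knots, copied from Table 4
estimates = {"Race": (1.580, 0.149), "Parity": (-0.395, 0.125),
             "AgeMenarche": (-0.190, 0.062), "Obesity": (0.091, 0.150)}

wald_intervals = {}
for name, (est, se) in estimates.items():
    wald_intervals[name] = (est - 1.96 * se, est + 1.96 * se)   # 95% Wald interval
    lower, upper = wald_intervals[name]
    print(f"{name}: odds ratio {math.exp(est):.3f}, 95% CI ({lower:.3f}, {upper:.3f})")

odds_ratio_race = math.exp(1.580)              # odds of onset, black vs. white women
odds_reduction_parity = 1 - math.exp(-0.395)   # relative reduction in odds with prior birth
```

The resulting intervals agree with the Proposed column of Table 5 to within rounding of the standard errors, and only the obesity interval contains zero.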

Figure 2 plots the estimated cumulative incidence functions together with their 95% pointwise confidence bands for black and white women who had their menarche at 12.6 years old, never gave birth before, and were non-obese (i.e. with $X_{2} = X_{3} = X_{4} = 0$ ). In calculating the pointwise confidence bands, we computed the variance estimates of $\hat{F} (t | X)$ based on the estimated covariance matrix $\hat{var} (\hat{κ})$ using the delta method for each time $t$ and covariate vector $X$ . It is clear from Figure 2 that the African American women have a significantly higher risk of developing fibroids than the corresponding white women with the same conditions.
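The delta-method step can be sketched as follows. This Python example is a deliberately simplified stand-in for the paper's calculation: it treats $\log Λ_{0}(t)$ at a fixed $t$ as a single parameter rather than a function of spline coefficients, and the baseline odds, the covariance matrix, and the function name `cif_with_band` are all hypothetical.

```python
import math

def cif_with_band(log_baseline_odds, theta, x, cov, z=1.96):
    """Delta-method Wald band for F(t | x) = expit(log Lambda_0(t) + theta' x).

    cov is the covariance matrix of (log Lambda_0(t), theta_1, ..., theta_p).
    """
    eta = log_baseline_odds + sum(t * xi for t, xi in zip(theta, x))
    f = 1.0 / (1.0 + math.exp(-eta))
    # Gradient of F with respect to (log Lambda_0(t), theta_1, ..., theta_p)
    grad = [f * (1.0 - f) * d for d in [1.0, *x]]
    var = sum(grad[i] * cov[i][j] * grad[j]
              for i in range(len(grad)) for j in range(len(grad)))
    half_width = z * math.sqrt(var)
    return f, max(0.0, f - half_width), min(1.0, f + half_width)

# Hypothetical inputs: baseline odds 0.5 at time t, race effect 1.580 (Table 4),
# and a made-up 2 x 2 covariance matrix for (log Lambda_0(t), theta_1)
cov = [[0.04, -0.01],
       [-0.01, 0.0222]]
f_hat, lower, upper = cif_with_band(math.log(0.5), [1.580], [1], cov)
print(round(f_hat, 3), round(lower, 3), round(upper, 3))
```

Repeating the call over a grid of time points, with the corresponding baseline odds and covariance at each point, would trace out pointwise bands of the kind shown in Figure 2.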

Figure 2.

Estimates of the cumulative incidence of fibroids as well as their 95% Wald confidence interval bands for African American women (dashed) and for white women (solid) who had their menarche at 12.6 years old, never gave birth before, and were non-obese.

7. Discussion

This paper proposes a new estimation approach for analyzing misclassified current status data under the popular PO model. Specifically, the adoption of monotone splines for the baseline odds function significantly reduces the number of parameters while maintaining adequate flexibility. A computationally efficient EM algorithm is developed based on a novel data augmentation with exponential and Poisson latent variables. Our EM algorithm is shown to be robust to the initial values, easy to implement, and fast to converge. Our method is further extended to allow unknown sensitivity and specificity of the test or diagnosis when additional validation data, for which the associated test accuracy is known, are available. The proposed method can accommodate scenarios in which the test for the validation data is either perfect or imperfect, making it more practical for real-life testing data. The proposed method has demonstrated excellent estimation performance in both the known and unknown test accuracy cases in our simulations. A real application to uterine fibroid data is used to illustrate the proposed method. The R code for the proposed method is available upon request.

One potential limitation of our method is that it requires validation data when the test accuracy is unknown. When there are no validation data, as reported in Table C.4 of the Supplementary file, the estimation performance is not good, especially when the test accuracy is low. Another potential limitation is that the method may rely heavily on the PO model assumption, and developing appropriate model diagnostics based on our method is important and much-needed future work.

Future research effort will be devoted to extending the proposed method to more flexible semiparametric survival models, such as the generalized odds rate hazards model³⁴ and the linear transformation models, for analyzing misclassified current status data, with known or unknown test accuracy. Another direction of our future research is to extend our approach to regression analysis of misclassified multivariate current status data.

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802261456076 for Regression analysis of misclassified current status data with potentially unknown test accuracy by Zhixin Chen and Lianming Wang in Statistical Methods in Medical Research

Footnotes

Acknowledgments

The authors would like to thank the Associate Editor and three anonymous reviewers for their insightful and constructive comments that have greatly improved the quality of the manuscript.

ORCID iDs

Zhixin Chen

Lianming Wang

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

Sun

. The statistical analysis of interval-censored failure time data. New York: Springer, 2006.

Wang

McMahan

Gallagher

, et al. Semiparametric group testing regression models. Biometrika 2014; 101: 587–598.

McKeown

Jewell

. Misclassification of current status data. Lifetime Data Anal 2010; 16: 215–230.

Laughlin

Baird

Savitz

, et al. Prevalence of uterine leiomyomas in the first trimester of pregnancy: an ultrasound-screening study. Obstet Gynecol 2009; 113: 630–635.

Wang

Dunson

. Semiparametric bayes’ proportional odds models for current status data with underreporting. Biometrics 2011; 67: 1111–1118.

Sun

. Regression analysis of misclassified current status data. J Nonparametr Stat 2020; 32: 1–19.

Wang

Zhao

. Additive hazards regression for misclassified current status data. Commun Math Stat 2025; 13: 507–526.

Huang

. Efficient estimation for the proportional hazards model with interval censoring. Ann Stat 1996; 24: 540–568.

McMahan

Wang

Tebbs

. Regression analysis for current status data using the em algorithm. Stat Med 2013; 32: 4452–4466.

10.

Rossini

Tsiatis

. A semiparametric proportional odds regression model for the analysis of current status data. J Am Stat Assoc 1996; 91: 713–721.

11.

Lin

Oakes

Ying

. Additive hazards regression with current status data. Biometrika 1998; 85: 289–298.

12.

Martinussen

Scheike

. Efficient estimation in additive hazards regression with current status data. Biometrika 2002; 89: 649–658.

13.

Tian

Cai

. On the accelerated failure time model for current status and interval censored data. Biometrika 2006; 93: 329–342.

14.

Sun

. Semiparametric linear transformation models for current status data. Can J Stat 2005; 33: 85–96.

15.

Liu

. Efficient estimation of a linear transformation model for current status data via penalized splines. Stat Methods Med Res 2020; 29: 3–14.

16.

Lam

Xue

. A semiparametric regression cure model with current status data. Biometrika 2005; 92: 573–586.

17.

. Additive risk model for current status data with a cured subgroup. Ann Inst Stat Math 2011; 63: 117–134.

18.

Zhang

Sun

. Statistical analysis of current status data with informative observation times. Stat Med 2005; 24: 1399–1407.

19.

Sun

. Sieve maximum likelihood regression analysis of dependent current status data. Biometrika 2015; 102: 731–738.

20.

Dunson

Dinse

. Bayesian models for multivariate current status data with informative censoring. Biometrics 2002; 58: 79–88.

21.

Chen

Tong

Sun

. A frailty model approach for regression analysis of multivariate current status data. Stat Med 2009; 28: 3424–3436.

22.

Zhang

. Copula-based analysis of dependent current status data with semiparametric linear transformation model. Lifetime Data Anal 2024; 30: 742–775.

23.

Zhang

Zhao

Wang

. Regression analysis of a graphical proportional hazards model for informatively left-truncated current status data. Lifetime Data Anal 2025; 31: 498–542.

24.

Sal

Rosas

Hughes

. Nonparametric and semiparametric analysis of current status data subject to outcome misclassification. Stat Commun Infect Dis 2011; 3(1): 7.

25. Fang, Sun, et al. Semiparametric probit regression model with misclassified current status data. Stat Med 2023; 42: 4440–4457.

26. Tian, et al. A simulation-extrapolation approach for regression analysis of misclassified current status data with the additive hazards model. Stat Med 2021; 40: 6309–6320.

27. Ramsay. Monotone regression splines in action. Stat Sci 1988; 3: 425–441.

28. Wang. Regression analysis of arbitrarily censored survival data under the proportional odds model. Stat Med 2021; 40: 3724–3739.

29. Lin, Wang. A semiparametric probit model for case 2 interval-censored failure time data. Stat Med 2010; 29: 972–981.

30. Wang, McMahan, Hudgens, et al. A flexible, computationally efficient method for fitting the proportional hazards model to interval-censored data. Biometrics 2016; 72: 222–231.

31. Withana Gamage, McMahan, Wang. A flexible parametric approach for analyzing arbitrarily censored data that are potentially subject to left truncation under the proportional hazards model. Lifetime Data Anal 2023; 29: 188–212.

32. Louis. Finding the observed information matrix when using the EM algorithm. J R Stat Soc Ser B: Stat Methodol 1982; 44: 226–233.

33. Withana Gamage, Chaudari, McMahan, et al. An extended proportional hazards model for interval-censored data subject to instantaneous failures. Lifetime Data Anal 2020; 26: 158–182.

34. Banerjee, Chen, Dey, et al. Bayesian analysis of generalized odds-rate hazards models for survival data. Lifetime Data Anal 2007; 13: 241–260.
