Empirical likelihood inference for the area under the receiver operating characteristic (ROC) curve with verification biased data

Abstract

In medical diagnostic studies, the area under the receiver operating characteristic curve (AUC) is a widely used metric that captures a continuous test’s overall ability to discriminate between diseased and non-diseased individuals across all possible cutoffs. However, in practice, disease status is sometimes only partially verified, introducing verification bias that undermines the validity of AUC estimation. While numerous methods address bias correction for AUC estimation, approaches that directly construct confidence intervals for the AUC remain limited. This paper proposes two robust methods for constructing bias-corrected confidence intervals for the AUC under the missing-at-random assumption: one based on bootstrap resampling and the other on empirical likelihood. Both approaches accommodate missing disease verification by leveraging the bias-corrected ROC estimators introduced by Alonzo and Pepe. Extensive simulation studies and real-world data analyses demonstrate that our proposed methods yield valid and precise interval estimates for the AUC under various clinically relevant settings.

Keywords

Receiver operating characteristic (ROC) curve, area under the ROC curve (AUC), verification bias, missing-at-random, empirical likelihood, diagnostic tests

1. Introduction

Diagnostic tests have become indispensable tools in modern medicine, playing a pivotal role in patient care. Their typically rapid, inexpensive, and non-invasive nature allows for broad application across large populations. However, the ability of different diagnostic tests to discriminate between healthy and diseased individuals can vary substantially. Consequently, rigorous assessment of diagnostic accuracy is essential before their adoption into clinical practice. The area under the receiver operating characteristic curve (AUROC, or AUC) is a widely used summary measure of a test’s ability to distinguish between diseased and non-diseased subjects. Mathematically, the AUC represents the probability that the score of a randomly selected diseased individual exceeds that of a randomly selected non-diseased individual, assuming that higher scores indicate a greater likelihood of disease.¹ Typical AUC values range from 0.5 (no discriminatory ability) to 1.0 (perfect discrimination). Values below 0.5 usually suggest that the positive and negative labels have been reversed.^2,3

Estimation of the AUC requires knowledge of the true disease status, determined by a reference standard (commonly referred to as the gold standard), which is the most reliable method currently available for confirming the presence or absence of the target condition. However, such tests are often expensive, labor-intensive, or invasive. As a result, in some situations, not all subjects who undergo the diagnostic test have their true disease status verified through a fully accurate gold standard test. For example, during the COVID-19 pandemic, when medical resources were limited, not everyone was able to undergo nucleic acid amplification testing, which is considered the gold standard for diagnosing COVID-19. In other words, the labels representing the true disease status of some subjects are partially missing. Simply ignoring this missingness and analyzing only subjects with verified disease status may lead to verification bias.⁴ To address verification bias, the missing-at-random (MAR) assumption⁵ is often adopted. It posits that the probability of a subject being verified does not depend on the true disease status and is only a function of observed data. The MAR assumption is considered plausible and feasible when data are missing by design.⁶ For instance, a clinician may choose to administer the costly gold standard test only to patients classified as high-risk based on their initial diagnostic test results. Because patients at minimal risk are more likely to have their true disease status missing, the application of existing complete-data approaches under the MAR assumption may result in biased inference and loss of efficiency.

Several verification bias-corrected approaches have been proposed for estimating the AUC of a diagnostic test under the MAR assumption. Alonzo and Pepe⁷ proposed and compared several imputation- and reweighting-based bias-corrected estimators of the ROC curve for continuous tests. He et al.⁸ proposed a $U$ -statistic- and inverse probability weighting (IPW)-based method to directly estimate the AUC under verification bias, deriving closed-form expressions for the estimator and its variance. Adimari and Chiogna⁹ proposed a fully non-parametric method for estimating the AUC of a continuous test under verification bias based on nearest-neighbor imputation and generic smooth regression models. Gu et al.¹⁰ proposed a Bayesian approach for estimating the ROC curve and AUC from continuous test data, employing a rank-based likelihood with Gibbs sampling. Hai and Qin¹¹ developed and evaluated several direct AUC estimation methods, using different approaches to estimate sensitivity and specificity, and recommended the semi-parametric efficient-mean score imputation (SPE–MSI) combination-based AUC estimator. See Nakas et al.,¹² Stahlmann et al.,¹³ Umemneku Chikere et al.,¹⁴ and To¹⁵ for a comprehensive review of ROC analysis under verification bias. However, while numerous methods address bias correction for AUC point estimation, to our knowledge, interval estimation methods for the AUC with verification bias remain limited. Our goal in this study is to develop new confidence intervals for the AUC with verification biased data under the MAR assumption.

Empirical likelihood (EL), as an alternative to bootstrap for constructing non-parametric confidence regions, was introduced by Owen.^16,17 It provides a flexible non-parametric likelihood framework and has been widely used for statistical inference in a variety of settings.^18–20 Key features of empirical likelihood include its automatic determination of the shape and orientation of a confidence region from the data and its implicit studentization, carried out through internal optimization without requiring explicit covariance estimation.²¹ Based on the mean-like form of the Wilcoxon–Mann–Whitney estimator, Qin and Zhou²² proposed an EL approach for inference on the AUC, which was shown to have good small-sample performance. Qin and Wang²³ proposed an imputation-based empirical likelihood method for constructing confidence intervals for the AUC with missing-completely-at-random data, offering strong small-sample performance while preserving the original data distribution. Inspired by their work, we propose EL-based confidence intervals for the AUC for continuous tests under the MAR assumption.

The remainder of the paper is organized as follows. In Section 2, we briefly review existing methods for estimating the sensitivity and specificity of a continuous test under verification bias. In Section 3, we motivate and develop our proposed confidence intervals for the AUC. Simulation studies and sensitivity analyses are presented in Section 4, followed by a real-world data study in Section 5. Finally, we conclude the paper with a discussion in Section 6.

2. Existing estimators for the sensitivity and specificity with verification bias

Consider a two-phase design. In the first phase, a continuous screening test is conducted on all subjects. Let $T$ denote the test result and $D$ denote the true disease status of a subject, where $D = 1$ indicates that the subject has the disease and $D = 0$ indicates that the subject does not have the disease. In the second phase, a subset of subjects undergoes a more accurate diagnostic test to confirm the disease status. Let $V = 1$ if the subject has the true disease status verified, and $V = 0$ otherwise. Let $A$ be a vector of observed covariates of the subject that may be associated with both $D$ and $V$ , with no missing values. Without loss of generality, suppose that larger values of $T$ are more indicative of disease and a test result exceeding the value $c$ (a cut-off value of the test) suggests a positive diagnosis of the disease. All methods reviewed in this section assume that verification of disease status is conditionally independent of the true disease status given the test result and the covariates (MAR). The decision to verify the subject’s true disease status depends on the true disease status only through $T$ and $A$ . That is

\begin{aligned} P (V = 1 | T, A, D) = P (V = 1 | T, A) . \end{aligned}

In a sample of size $n$ , let $S_{i} \equiv (T_{i}, A_{i}, V_{i}, D_{i})$ denote the observed value for $(T, A, V, D)$ for the $i$ -th subject, $i = 1, \dots, n$ . If $V_{i} = 0$ , $D_{i}$ is missing. If all subjects have their disease status verified, that is, $V_{i}$ = 1, $i = 1, 2, \dots, n$ , we have a complete dataset. For any cutoff point $c$ of the test result, the sensitivity and the specificity of the test can be consistently estimated by

\begin{aligned} \widehat{\mathrm{Sen}}_{\mathrm{Full}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} \geq c) D_{i}}{\sum_{i = 1}^{n} D_{i}}, \end{aligned}

\begin{aligned} \widehat{\mathrm{Spe}}_{\mathrm{Full}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} < c) (1 - D_{i})}{\sum_{i = 1}^{n} (1 - D_{i})}, \end{aligned}

respectively.
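As a small illustration of the complete-data formulas above, the following Python sketch computes both quantities at one cutoff. The function name `full_sens_spec` and the toy data are hypothetical and chosen only for this example.

```python
def full_sens_spec(t, d, c):
    """Complete-data sensitivity and specificity at cutoff c.

    t: test results; d: 0/1 disease labels, all verified.
    """
    n_dis = sum(d)
    n_non = len(d) - n_dis
    # Diseased subjects with T >= c, over all diseased subjects.
    sens = sum(1 for ti, di in zip(t, d) if di == 1 and ti >= c) / n_dis
    # Non-diseased subjects with T < c, over all non-diseased subjects.
    spec = sum(1 for ti, di in zip(t, d) if di == 0 and ti < c) / n_non
    return sens, spec


# Hypothetical toy data: six fully verified subjects.
t = [0.2, 0.5, 0.9, 1.4, 1.8, 2.1]
d = [0, 0, 1, 0, 1, 1]
print(full_sens_spec(t, d, c=1.0))
```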

In practice, only a subset of the $n$ subjects has their disease status verified. The above estimates for the sensitivity and the specificity cannot be used and need to be corrected. Alonzo and Pepe⁷ introduced several methods, including the full imputation (FI) method, the MSI method, the IPW method, and the SPE method, to estimate the sensitivity and the specificity. We briefly review these methods below.

2.1. Full imputation

One imputation-based approach to estimate the prevalence of the disease in a two-phase design is to use full imputation (FI) over the distribution $P (D | T, A)$ , that is, FI imputes the probability of disease for all subjects in the study as a function of $(T_{i}, A_{i})$ ’s. The FI estimator of the disease prevalence is

\hat{P} (D = 1) = \frac{1}{n} \sum_{i = 1}^{n} {\hat{ρ}}_{i},

where ${\hat{ρ}}_{i}$ is an estimate of $ρ_{i} = P (D_{i} = 1 | T_{i}, A_{i})$, obtained by using, for example, a logistic or probit regression model. By the MAR assumption, the disease model $P (D = 1 | T, A)$ can be estimated by using the verified sample. Then, the FI estimators of the sensitivity and the specificity are given by

\begin{aligned} \widehat{\mathrm{Sen}}_{\mathrm{FI}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} \geq c) {\hat{ρ}}_{i}}{\sum_{i = 1}^{n} {\hat{ρ}}_{i}}, \end{aligned}

(1)

\begin{aligned} \widehat{\mathrm{Spe}}_{\mathrm{FI}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} < c) (1 - {\hat{ρ}}_{i})}{\sum_{i = 1}^{n} (1 - {\hat{ρ}}_{i})} . \end{aligned}

(2)

2.2. Mean score imputation

MSI is another imputation-based approach used for estimating the prevalence of the disease in two-phase studies. In contrast to FI, MSI only imputes disease status for subjects who are not in the verification sample and uses the observed disease status for those who are in the verification sample. Then, the MSI estimators of the sensitivity and the specificity are given by

\begin{aligned} \widehat{\mathrm{Sen}}_{\mathrm{MSI}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} \geq c) \{ V_{i} D_{i} + (1 - V_{i}) {\hat{ρ}}_{i} \}}{\sum_{i = 1}^{n} \{ V_{i} D_{i} + (1 - V_{i}) {\hat{ρ}}_{i} \}}, \end{aligned}

(3)

\begin{aligned} \widehat{\mathrm{Spe}}_{\mathrm{MSI}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} < c) \{ V_{i} (1 - D_{i}) + (1 - V_{i}) (1 - {\hat{ρ}}_{i}) \}}{\sum_{i = 1}^{n} \{ V_{i} (1 - D_{i}) + (1 - V_{i}) (1 - {\hat{ρ}}_{i}) \}} . \end{aligned}

(4)

Again, the MAR assumption implies that data from the verification sample can be used to obtain a valid estimate ${\hat{ρ}}_{i}$ for $ρ_{i}$.

2.3. Inverse probability weighting

An IPW estimator that weights each observation in the verification sample by the inverse of the sampling fraction (i.e. the probability that the subject was selected for verification) is a reweighting-based approach used to estimate the prevalence of the disease in a two-phase design.^24,25 The IPW estimators of the sensitivity and the specificity are given by

\begin{aligned} \widehat{\mathrm{Sen}}_{\mathrm{IPW}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} \geq c) V_{i} D_{i} {\hat{π}}_{i}^{- 1}}{\sum_{i = 1}^{n} V_{i} D_{i} {\hat{π}}_{i}^{- 1}}, \end{aligned}

(5)

\begin{aligned} \widehat{\mathrm{Spe}}_{\mathrm{IPW}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} < c) V_{i} (1 - D_{i}) {\hat{π}}_{i}^{- 1}}{\sum_{i = 1}^{n} V_{i} (1 - D_{i}) {\hat{π}}_{i}^{- 1}}, \end{aligned}

(6)

where ${\hat{π}}_{i}$ is an estimate of $π_{i} = P (V_{i} = 1 | T_{i}, A_{i})$. The IPW method corrects for the biased sampling by weighting the observed value by the inverse of the probability that the subject was verified.

2.4. Semi-parametric efficient approach

Gao et al.²⁶ and Alonzo et al.²⁷ independently derived the following SPE estimators of the sensitivity and the specificity

\begin{aligned} \widehat{\mathrm{Sen}}_{\mathrm{SPE}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} \geq c) \{ V_{i} D_{i} {\hat{π}}_{i}^{- 1} - (V_{i} - {\hat{π}}_{i}) {\hat{ρ}}_{i} {\hat{π}}_{i}^{- 1} \}}{\sum_{i = 1}^{n} \{ V_{i} D_{i} {\hat{π}}_{i}^{- 1} - (V_{i} - {\hat{π}}_{i}) {\hat{ρ}}_{i} {\hat{π}}_{i}^{- 1} \}}, \end{aligned}

(7)

\begin{aligned} \widehat{\mathrm{Spe}}_{\mathrm{SPE}} (c) & = \frac{\sum_{i = 1}^{n} I (T_{i} < c) \{ V_{i} (1 - D_{i}) {\hat{π}}_{i}^{- 1} - (V_{i} - {\hat{π}}_{i}) (1 - {\hat{ρ}}_{i}) {\hat{π}}_{i}^{- 1} \}}{\sum_{i = 1}^{n} \{ V_{i} (1 - D_{i}) {\hat{π}}_{i}^{- 1} - (V_{i} - {\hat{π}}_{i}) (1 - {\hat{ρ}}_{i}) {\hat{π}}_{i}^{- 1} \}} . \end{aligned}

(8)

The estimators above are considered to be semi-parametric because parametric models need to be specified for both the disease model $P (D | T, A)$ and the verification model $P (V | T, A)$, but the joint distribution $P (D, T, A)$ is still unknown. Combining both the imputation approach (using ${\hat{ρ}}_{i}$) and the reweighting approach (using ${\hat{π}}_{i}^{- 1}$), these SPE estimators have the attractive attribute that they are “doubly robust” in that they are consistent if either $π_{i}$ or $ρ_{i}$ is estimated consistently, that is, either the verification model or the disease model (but not both) can be incorrectly specified and consistency is still guaranteed. Therefore, they are recommended by Alonzo et al.²⁷ to assess the accuracy of a continuous screening test in two-phase studies. However, it is worth noting that SPE estimates may not be range-respecting, that is, they could fall outside the interval $(0, 1)$. This happens because the quantities $V_{i} D_{i} {\hat{π}}_{i}^{- 1} - (V_{i} - {\hat{π}}_{i}) {\hat{ρ}}_{i} {\hat{π}}_{i}^{- 1}$ and $V_{i} (1 - D_{i}) {\hat{π}}_{i}^{- 1} - (V_{i} - {\hat{π}}_{i}) (1 - {\hat{ρ}}_{i}) {\hat{π}}_{i}^{- 1}$ can be negative.¹⁵
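The four corrected estimators in (1) to (8) share one form: a ratio of weighted sums in which only the per-subject weights differ. The Python sketch below makes that structure explicit. It assumes that the fitted values ${\hat{ρ}}_{i}$ and ${\hat{π}}_{i}$ are already available (model fitting is not shown), and the names `class_weights` and `sens_spec`, as well as the toy inputs, are hypothetical.

```python
def class_weights(method, v, d, rho, pi):
    """Return (non-diseased weight, diseased weight) for one subject.

    v: 1 if verified, else 0; d: disease status (ignored when v == 0);
    rho: estimate of P(D=1 | T, A); pi: estimate of P(V=1 | T, A).
    """
    vd1 = d if v == 1 else 0          # V * D
    vd0 = (1 - d) if v == 1 else 0    # V * (1 - D)
    if method == "FI":
        return 1 - rho, rho
    if method == "MSI":
        return vd0 + (1 - v) * (1 - rho), vd1 + (1 - v) * rho
    if method == "IPW":
        return vd0 / pi, vd1 / pi
    if method == "SPE":
        return (vd0 / pi - (v - pi) * (1 - rho) / pi,
                vd1 / pi - (v - pi) * rho / pi)
    raise ValueError(f"unknown method: {method}")


def sens_spec(method, t, v, d, rho, pi, c):
    """Bias-corrected sensitivity and specificity at cutoff c."""
    w = [class_weights(method, *row) for row in zip(v, d, rho, pi)]
    w_non = [a for a, _ in w]
    w_dis = [b for _, b in w]
    sens = sum(wd for ti, wd in zip(t, w_dis) if ti >= c) / sum(w_dis)
    spec = sum(wn for ti, wn in zip(t, w_non) if ti < c) / sum(w_non)
    return sens, spec


# Hypothetical toy inputs; None marks an unverified disease status.
t = [0.2, 0.5, 0.9, 1.4, 1.8, 2.1]
v = [1, 0, 1, 1, 0, 1]
d = [0, None, 1, 0, None, 1]
rho = [0.1, 0.2, 0.5, 0.6, 0.8, 0.9]
pi = [0.5, 0.5, 0.5, 0.8, 0.8, 0.8]
for m in ("FI", "MSI", "IPW", "SPE"):
    print(m, sens_spec(m, t, v, d, rho, pi, c=1.0))
```

When every subject is verified and ${\hat{π}}_{i} = 1$, the MSI, IPW, and SPE weights all reduce to $D_{i}$ and $1 - D_{i}$, so the sketch returns the complete-data estimates.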

3. Confidence intervals for the AUC with verification bias

3.1. AUC estimation with verification bias

We start with the point estimators for the AUC with verification bias. Let $X_{i} = T_{i} | {D_{i} = 0}$ , $i = 1, 2, \dots, n_{0}$ , and $Y_{j} = T_{j} | {D_{j} = 1}$ , $j = 1, 2, \dots, n_{1}$ , be observations from the non-diseased group $X$ and the diseased group $Y$ , respectively. $n = n_{0} + n_{1}$ is the total number of observations. Let $F$ and $G$ be the cumulative distribution functions of $X$ and $Y$ , respectively. So, we have

\begin{aligned} F (t) & = P (X < t) = P (T_{i} < t | D_{i} = 0) \equiv \mathrm{Spe} (t), \\ G (t) & = P (Y < t) = P (T_{j} < t | D_{j} = 1) = 1 - E_{G} [I (T_{j} \geq t | D_{j} = 1)] \equiv 1 - \mathrm{Sen} (t) . \end{aligned}

Observe that the AUC $δ$ can be expressed as

\begin{aligned} δ & = P (X < Y) \\ & = P (T_{j} \geq T_{i} | D_{j} = 1, D_{i} = 0) \\ & = E_{F} \{ E_{G} [I (T_{j} \geq T_{i} | T_{i}, D_{j} = 1, D_{i} = 0)] \} \\ & = E_{F} [1 - G (T_{i})] = \int_{- \infty}^{\infty} (1 - G (t)) d F (t) . \end{aligned}

(9)

With verification biased data, using weights $({\hat{w}}_{a, i}^{N}, {\hat{w}}_{a, j}^{D})$, $a \in \{ \mathrm{FI}, \mathrm{MSI}, \mathrm{IPW}, \mathrm{SPE} \}$, we get estimates for $F (t)$ and $G (t)$, respectively,

{\hat{F}}_{a} (t) = \frac{\sum_{i = 1}^{n} I (T_{i} < t) {\hat{w}}_{a, i}^{N}}{\sum_{i = 1}^{n} {\hat{w}}_{a, i}^{N}}, \quad {\hat{G}}_{a} (t) = \frac{\sum_{j = 1}^{n} I (T_{j} < t) {\hat{w}}_{a, j}^{D}}{\sum_{j = 1}^{n} {\hat{w}}_{a, j}^{D}},

(10)

where the weights $({\hat{w}}_{a, i}^{N}, {\hat{w}}_{a, j}^{D})$ are defined as follows

\begin{aligned} {\hat{w}}_{\mathrm{FI}, i}^{N} & = 1 - {\hat{ρ}}_{i}, \\ {\hat{w}}_{\mathrm{FI}, i}^{D} & = {\hat{ρ}}_{i}, \\ {\hat{w}}_{\mathrm{MSI}, i}^{N} & = V_{i} (1 - D_{i}) + (1 - V_{i}) (1 - {\hat{ρ}}_{i}), \\ {\hat{w}}_{\mathrm{MSI}, i}^{D} & = V_{i} D_{i} + (1 - V_{i}) {\hat{ρ}}_{i}, \\ {\hat{w}}_{\mathrm{IPW}, i}^{N} & = V_{i} (1 - D_{i}) {\hat{π}}_{i}^{- 1}, \\ {\hat{w}}_{\mathrm{IPW}, i}^{D} & = V_{i} D_{i} {\hat{π}}_{i}^{- 1}, \\ {\hat{w}}_{\mathrm{SPE}, i}^{N} & = V_{i} (1 - D_{i}) {\hat{π}}_{i}^{- 1} - (V_{i} - {\hat{π}}_{i}) (1 - {\hat{ρ}}_{i}) {\hat{π}}_{i}^{- 1}, \\ {\hat{w}}_{\mathrm{SPE}, i}^{D} & = V_{i} D_{i} {\hat{π}}_{i}^{- 1} - (V_{i} - {\hat{π}}_{i}) {\hat{ρ}}_{i} {\hat{π}}_{i}^{- 1} . \end{aligned}

Therefore, we can directly get a bias-corrected estimator for the AUC

\begin{aligned} {\hat{δ}}_{a} & = \int_{- \infty}^{\infty} (1 - {\hat{G}}_{a} (t)) d {\hat{F}}_{a} (t) \\ & = \frac{\sum_{i = 1}^{n} (1 - {\hat{G}}_{a} (T_{i})) {\hat{w}}_{a, i}^{N}}{\sum_{i = 1}^{n} {\hat{w}}_{a, i}^{N}} \\ & = \frac{\sum_{i, j = 1}^{n} I (T_{i} \leq T_{j}) {\hat{w}}_{a, i}^{N} {\hat{w}}_{a, j}^{D}}{\sum_{i, j = 1}^{n} {\hat{w}}_{a, i}^{N} {\hat{w}}_{a, j}^{D}} . \end{aligned}

(11)

Hai and Qin¹¹ proposed this direct estimator ${\hat{δ}}_{a}$ for the AUC but did not provide a variance estimate for ${\hat{δ}}_{a}$ . He et al.⁸ proposed an asymptotic variance estimate for the IPW-based AUC estimator (i.e. ${\hat{δ}}_{a}$ with $a = I P W$ ) under the assumption that the probability $π_{i}$ of verification is known. However, in practice, the true value of $π_{i}$ is unknown, and explicitly estimating the variances of this estimator for the AUC is still an open question.
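The double-sum form of (11) translates directly into code. The Python sketch below is a minimal illustration that takes the two weight vectors as given; the function name `weighted_auc` and the toy data are hypothetical. With complete-data weights ($1 - D_{i}$ and $D_{i}$) it reduces to the usual Wilcoxon–Mann–Whitney proportion.

```python
def weighted_auc(t, w_non, w_dis):
    """Bias-corrected AUC estimate in the double-sum form of (11).

    t: test results for all n subjects;
    w_non, w_dis: non-diseased and diseased weights for the same subjects.
    """
    num = 0.0
    for ti, wi in zip(t, w_non):
        for tj, wj in zip(t, w_dis):
            if ti <= tj:              # indicator I(T_i <= T_j)
                num += wi * wj
    # The double sum of products in the denominator factorises.
    return num / (sum(w_non) * sum(w_dis))


# Hypothetical fully verified toy data: weights are 1 - D and D.
t = [0.2, 0.5, 0.9, 1.4, 1.8, 2.1]
d = [0, 0, 1, 0, 1, 1]
print(weighted_auc(t, [1 - di for di in d], d))
```

The quadratic loop is kept for clarity; for large $n$ a sort-based implementation would be faster.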

We may apply the bootstrap method to estimate the variance of ${\hat{δ}}_{a}$ . The procedure for computing the bootstrap variance is summarized in the following steps:

Step 1. Draw a bootstrap sample of size $n$ from the original sample $S_{i} \equiv (T_{i}, A_{i}, V_{i}, D_{i}), i = 1, \dots, n$ .

Step 2. Calculate a bootstrap copy ${\hat{δ}}_{a}^{*}$ of ${\hat{δ}}_{a}$ using the bootstrap sample.

Step 3. Repeat the first two steps $B$ times to obtain the set of bootstrap replications of ${\hat{δ}}_{a}$, $\{ {\hat{δ}}_{a, b}^{*} : b = 1, 2, \dots, B \}$ .

Then, the bootstrap variance estimator $V^{*}$ is defined by

V^{*} = \frac{1}{B - 1} \sum_{b = 1}^{B} ({\hat{δ}}_{a, b}^{*} - {\bar{\hat{δ}}}_{a}^{*})^{2},

where ${\bar{\hat{δ}}}_{a}^{*} = (1 / B) \sum_{b = 1}^{B} {\hat{δ}}_{a, b}^{*}$.

After obtaining such an appropriate variance estimate, we can construct bootstrap confidence intervals for the AUC $δ$ .

The $100 (1 - α)\%$ level bootstrap interval, called the BC interval, for $δ$ is defined by

({\bar{\hat{δ}}}_{a}^{*} - z_{1 - α / 2} \sqrt{V^{*}}, {\bar{\hat{δ}}}_{a}^{*} + z_{1 - α / 2} \sqrt{V^{*}}) .

We also have a $100 (1 - α)\%$ level bootstrap percentile interval, called the BP interval, for $δ$,

({\hat{δ}}_{a, b}^{* ([B α / 2])}, {\hat{δ}}_{a, b}^{* ([B (1 - α / 2)])}),

where ${\hat{δ}}_{a, b}^{* ([B α / 2])}$ and ${\hat{δ}}_{a, b}^{* ([B (1 - α / 2)])}$ are the $(α / 2)$-th and $(1 - α / 2)$-th quantiles of the bootstrap replications ${\hat{δ}}_{a, b}^{*}$, $b = 1, \dots, B$, respectively.
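The bootstrap steps and the two intervals above can be sketched as follows. This is a simplified Python illustration, not the rocvb implementation: to stay self-contained it uses a complete-data Mann–Whitney estimator as a stand-in for ${\hat{δ}}_{a}$, whereas the procedure in the text recomputes the bias-corrected estimator (including the fitted disease and verification models) on every bootstrap sample. The function names, the simulated data, and the order-statistic indexing convention for the percentile interval are assumptions.

```python
import random
import statistics
from statistics import NormalDist


def mann_whitney_auc(records):
    """Complete-data AUC from (score, label) pairs (stand-in estimator)."""
    non = [t for t, d in records if d == 0]
    dis = [t for t, d in records if d == 1]
    wins = sum(1 for x in non for y in dis if x <= y)
    return wins / (len(non) * len(dis))


def bootstrap_intervals(records, estimator, B=500, alpha=0.05, seed=1):
    """Return the BC (normal-type) and BP (percentile) bootstrap intervals."""
    rng = random.Random(seed)
    n = len(records)
    # Steps 1-3: resample subjects with replacement and re-estimate B times.
    reps = sorted(
        estimator([records[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    center = statistics.fmean(reps)      # mean of the bootstrap replications
    sd = statistics.stdev(reps)          # square root of V*, divisor B - 1
    z = NormalDist().inv_cdf(1 - alpha / 2)
    bc = (center - z * sd, center + z * sd)
    # Order statistics at [B * alpha / 2] and [B * (1 - alpha / 2)].
    lo_idx = max(int(B * alpha / 2) - 1, 0)
    hi_idx = min(int(B * (1 - alpha / 2)) - 1, B - 1)
    bp = (reps[lo_idx], reps[hi_idx])
    return bc, bp


# Hypothetical data: 30 non-diseased and 30 diseased scores.
rng = random.Random(0)
data = [(rng.gauss(0, 1), 0) for _ in range(30)]
data += [(rng.gauss(1, 1), 1) for _ in range(30)]
bc, bp = bootstrap_intervals(data, mann_whitney_auc)
print("BC:", bc)
print("BP:", bp)
```

Note that the BC interval is not constrained to $[0, 1]$, while the BP interval inherits the range of the bootstrap replications.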

3.2. Empirical likelihood estimation for the AUC with verification bias

Pepe and Cai²⁸ defined the following placement value

U = 1 - F (Y) .

$U$ essentially marks the placement of $Y$ within the non-diseased distribution. It is easily seen that

E_{G} (1 - U) = E_{G} (F (Y)) = P (X < Y) = δ,

that is,

E_{G} (1 - U - δ) = 0.

Noticing this relationship between the AUC and the placement value $U$, we can define an empirical likelihood for the AUC.

Let $U_{j} = 1 - F (T_{j})$ . With verification bias, $G$ is unknown but can be estimated by ${\hat{G}}_{a}$ . So, an empirical likelihood for $δ$ can be defined as

\begin{aligned} L_{a} (δ) = \sup \left\{ \prod_{j = 1}^{n} p_{j} : p_{j} \geq 0, \sum_{j = 1}^{n} p_{j} = 1, \sum_{j = 1}^{n} \frac{p_{j} {\hat{w}}_{a, j}^{D}}{\sum_{k = 1}^{n} {\hat{w}}_{a, k}^{D}} (1 - U_{j} - δ) = 0 \right\} . \end{aligned}

(12)

However, $L_{a} (δ)$ cannot be computed due to the unknown values of $U_{j}$ ’s (since $F$ is unknown) and the presence of verification bias. But we can estimate $U_{j}$ by

{\hat{U}}_{j, a} = \frac{\sum_{i = 1}^{n} I (T_{i} \geq T_{j}) {\hat{w}}_{a, i}^{N}}{\sum_{i = 1}^{n} {\hat{w}}_{a, i}^{N}},

(13)

and get the following estimated EL for $δ$

{\hat{L}}_{a} (δ) = \sup \left\{ \prod_{j = 1}^{n} p_{j} : p_{j} \geq 0, \sum_{j = 1}^{n} p_{j} = 1, \sum_{j = 1}^{n} p_{j} \left[ \frac{{\hat{w}}_{a, j}^{D}}{\sum_{k = 1}^{n} {\hat{w}}_{a, k}^{D}} (1 - {\hat{U}}_{j, a} - δ) \right] = 0 \right\} .

(14)

The corresponding empirical log-likelihood ratio statistic for the AUC is

{\hat{l}}_{a} (δ) = 2 \sum_{j = 1}^{n} \log \left\{ 1 + λ_{v} \left[ \frac{{\hat{w}}_{a, j}^{D}}{\sum_{k = 1}^{n} {\hat{w}}_{a, k}^{D}} (1 - {\hat{U}}_{j, a} - δ) \right] \right\},

(15)

where $λ_{v}$ is the solution of

\begin{aligned} \frac{1}{n} \sum_{j = 1}^{n} \frac{({\hat{w}}_{a, j}^{D} / \sum_{k = 1}^{n} {\hat{w}}_{a, k}^{D}) (1 - {\hat{U}}_{j, a} - δ)}{1 + λ_{v} [({\hat{w}}_{a, j}^{D} / \sum_{k = 1}^{n} {\hat{w}}_{a, k}^{D}) (1 - {\hat{U}}_{j, a} - δ)]} = 0. \end{aligned}

(16)

Under some regularity conditions (see Appendix A in the supplemental material), we can show that the asymptotic distribution of ${\hat{l}}_{a} (δ_{0})$ is a scaled $χ^{2}$ distribution with one degree of freedom, where $δ_{0}$ is the true value of the AUC. That is, as $n \to \infty$,

ω {\hat{l}}_{a} (δ_{0}) \overset{L}{\longrightarrow} χ_{1}^{2},

(17)

where $ω$ is an unknown scale constant. We provide a proof sketch in Appendix A of the supplemental material (also see Qin and Zhou,²² Qin et al.,²⁹ Dong and Tian³⁰).
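To make (15) and (16) concrete, the Python sketch below evaluates the empirical log-likelihood ratio for a given $δ$ by solving the score equation for $λ_{v}$ with bisection. It is only an illustration under stated assumptions: the names `el_terms` and `el_log_ratio` are hypothetical, the weights and placement values are taken as given, and bisection is one of several possible solvers (the rocvb package relies on emplik, whose internals are not reproduced here).

```python
import math


def el_terms(w_dis, u_hat, delta):
    """Terms (w_j / sum_k w_k) * (1 - U_j - delta) entering (15) and (16)."""
    total = sum(w_dis)
    return [(w / total) * (1 - u - delta) for w, u in zip(w_dis, u_hat)]


def el_log_ratio(g, iters=200):
    """2 * sum(log(1 + lam * g_j)), where lam solves sum g_j/(1 + lam*g_j) = 0."""
    g_min, g_max = min(g), max(g)
    if not (g_min < 0 < g_max):
        return math.inf               # zero lies outside the hull of the terms
    # lam must keep every 1 + lam * g_j strictly positive.
    lo, hi = -1 / g_max, -1 / g_min
    lo += 1e-10 * abs(lo)
    hi -= 1e-10 * abs(hi)

    def score(lam):
        return sum(x / (1 + lam * x) for x in g)

    # score is strictly decreasing in lam, so bisection finds the root.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return 2 * sum(math.log1p(lam * x) for x in g)


# Hypothetical inputs: equal diseased weights and four placement estimates.
w_dis = [1.0, 1.0, 1.0, 1.0]
u_hat = [0.1, 0.3, 0.2, 0.4]
for delta in (0.70, 0.75, 0.80):
    print(delta, el_log_ratio(el_terms(w_dis, u_hat, delta)))
```

In this toy example the weighted mean of $1 - {\hat{U}}_{j, a}$ is 0.75, so the statistic is essentially zero at $δ = 0.75$ and grows as $δ$ moves away from it.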

In order to construct confidence intervals for the AUC based on (17), we propose the following bootstrap procedure:

Step 1. Compute ${\hat{δ}}_{a}$ and ${\hat{U}}_{j, a}$ ’s using $S_{i} \equiv (T_{i}, A_{i}, V_{i}, D_{i})$ , $i = 1, \dots, n$ from (11) and (13), respectively.

Step 2. Draw a bootstrap sample $S_{i}^{*}$ ’s of size $n$ from $S_{i}$ ’s.

Step 3. Compute the bootstrap copy ${\hat{U}}_{j, a}^{*}$ ’s of ${\hat{U}}_{j, a}$ ’s from (13) using the bootstrap sample $S_{i}^{*}$ ’s.

Step 4. Compute the bootstrap copy ${\hat{l}}_{a}^{*} ({\hat{δ}}_{a})$ of ${\hat{l}}_{a} ({\hat{δ}}_{a})$ from (15) and (16) using ${\hat{δ}}_{a}$ obtained from step 1 and bootstrap copy ${\hat{U}}_{j, a}^{*}$ ’s from step 3.

Step 5. Repeat steps 2 to 4 $B$ times to obtain $B$ bootstrap copies ${\hat{l}}_{a, b}^{*} ({\hat{δ}}_{a})$ ’s, $b = 1, \dots, B$ .

Step 6. Compute $χ_{α}$ as the $(1 - α)$ -th sample quantile of the $B$ bootstrap copies ${\hat{l}}_{a, b}^{*} ({\hat{δ}}_{a})$ ’s, $b = 1, \dots, B$ .

Step 7. Compute the median $M_{a}$ of the $B$ bootstrap copies ${\hat{l}}_{a, b}^{*} ({\hat{δ}}_{a})$ ’s, $b = 1, \dots, B$ . Since the median of $χ_{1}^{2}$ is approximately $(7 / 9)^{3}$ (the Wilson–Hilferty approximation), $ω$ can be estimated by $\hat{ω} = (7 / 9)^{3} / M_{a}$ .

Then, two $(1 - α)$ level hybrid EL (HEL) confidence intervals for the AUC can be defined as the sets of values of $δ$ satisfying the following inequalities

\begin{aligned} \mathrm{CI}_{\mathrm{HEL1}} (δ) & = \{ δ : {\hat{l}}_{a} (δ) \leq χ_{α} \}, \end{aligned}

\begin{aligned} \mathrm{CI}_{\mathrm{HEL2}} (δ) & = \{ δ : \hat{ω} {\hat{l}}_{a} (δ) \leq χ_{1}^{2} (α) \}, \end{aligned}

where $χ_{1}^{2} (α)$ is the $(1 - α)$ -th quantile of the $χ^{2}$ distribution with one degree of freedom. The HEL1 interval is obtained by direct bootstrap quantile calibration of the empirical likelihood ratio statistic, whereas HEL2 is constructed under a scaled-$χ_{1}^{2}$ approximation to the empirical likelihood ratio and uses bootstrap median matching to estimate the unknown scale parameter $ω$. Therefore, HEL1 is a fully bootstrap-calibrated procedure, while HEL2 combines bootstrap estimation with a $χ_{1}^{2}$ distribution-based approximation.
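The calibration in the last two bootstrap steps and the inversion of the two inequalities can be sketched as below. This Python illustration treats the statistic ${\hat{l}}_{a} (δ)$ as a given function `ell` and the bootstrap copies as a given list, so it shows only the calibration and root-finding logic. All names are hypothetical, the quadratic stand-in for `ell` and the idealised bootstrap copies are used purely for demonstration, and the bisection assumes `ell` increases as $δ$ moves away from ${\hat{δ}}_{a}$ on either side.

```python
from statistics import NormalDist, median


def chi2_1_quantile(p):
    """Quantile of chi-square(1), the square of a standard normal variable."""
    return NormalDist().inv_cdf((1 + p) / 2) ** 2


def boundary(ell, inside, outside, crit, iters=60):
    """Bisection for the point between inside and outside where ell = crit."""
    for _ in range(iters):
        mid = (inside + outside) / 2
        if ell(mid) <= crit:
            inside = mid
        else:
            outside = mid
    return inside


def hel_intervals(ell, delta_hat, boot_stats, alpha=0.05):
    """HEL1 and HEL2 intervals from bootstrap copies of the EL statistic."""
    stats = sorted(boot_stats)
    B = len(stats)
    # HEL1: compare ell directly with the bootstrap (1 - alpha) quantile.
    crit1 = stats[min(int(B * (1 - alpha)), B - 1)]
    # HEL2: median matching for the scale, then a chi-square(1) quantile.
    omega_hat = (7 / 9) ** 3 / median(stats)
    crit2 = chi2_1_quantile(1 - alpha) / omega_hat
    out = {}
    for name, crit in (("HEL1", crit1), ("HEL2", crit2)):
        out[name] = (boundary(ell, delta_hat, 0.0, crit),
                     boundary(ell, delta_hat, 1.0, crit))
    return out


# Stand-in statistic (hypothetical): quadratic in delta around 0.8.
def ell(delta):
    return ((delta - 0.8) / 0.05) ** 2


# Idealised bootstrap copies: an evenly spaced grid of chi-square(1) quantiles.
B = 1000
boot = [chi2_1_quantile((k + 0.5) / B) for k in range(B)]
print(hel_intervals(ell, 0.8, boot))
```

As a numerical aside, $(7 / 9)^{3} \approx 0.4705$, while the exact $χ_{1}^{2}$ median is about $0.4549$, which `chi2_1_quantile(0.5)` reproduces.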

To make our proposed methods accessible for practitioners, an R package rocvb implementing these approaches is developed and available on GitHub. It is built upon the R package emplik developed by Zhou.³¹

4. Simulation studies

In this section, extensive simulation studies are conducted to evaluate the finite sample performance and robustness of the proposed confidence intervals for the AUC in the presence of verification bias at the 95% nominal level. The simulation setup is similar to that of Alonzo and Pepe⁷ and Hai and Qin.¹¹ Briefly, $D$ is generated as a dichotomous variable indicating whether a random variable $Z = Z_{1} + Z_{2} \sim N (0, 1)$ is greater than a threshold $h$ , where $Z_{1} \sim N (0, 0.5)$ and $Z_{2} \sim N (0, 0.5)$ are independent. To investigate the performance of the proposed confidence intervals under various prevalence, we select different $h$ to make the prevalence of disease equal to $0.1$ , $0.3$ , $0.5$ , and $0.8$ , respectively. $T$ and $A$ are generated from $T = ν_{1} Z_{1} + τ_{1} Z_{2} + ϵ_{1}$ and $A = ν_{2} Z_{1} + τ_{2} Z_{2} + ϵ_{2}$ , where $ϵ_{1} \sim N (0, 0.25)$ , $ϵ_{2} \sim N (0, 0.25)$ , and $ϵ_{1}$ and $ϵ_{2}$ are independent. To assess tests with varying inherent diagnostic potential, we choose values of $ν_{1}$ and $τ_{1}$ such that the true AUC of test $T$ ranges from 0.6 to 0.96. The detailed parameter setting in the study can be found in Table 1. For covariate $A$ we use $ν_{2} = τ_{2} = 1$ in this paper if not specified.

Table 1.
Parameter settings in simulation studies.

Prevalence	AUC	$ν_{1}$	$τ_{1}$	AUC	$ν_{3}$	$τ_{3}$
0.1	0.608	0.1	0.1	0.644	0.1	0.05
	0.699	0.3	0.1	0.739	0.25	0.05
	0.839	1	0.1	0.838	0.5	0.1
	0.960	1	1	0.960	6	2
0.3	0.592	0.1	0.1	0.600	0.1	0.05
	0.672	0.3	0.1	0.728	0.4	0.05
	0.803	1	0.1	0.822	1	0.1
	0.941	1	1	0.949	6	2
0.5	0.589	0.1	0.1	0.632	0.2	0.05
	0.666	0.3	0.1	0.726	0.5	0.05
	0.795	1	0.1	0.838	1	0.2
	0.936	1	1	0.949	6	2
0.8	0.597	0.1	0.1	0.638	0.25	0.05
	0.681	0.3	0.1	0.731	0.6	0.05
	0.815	1	0.1	0.841	1.2	0.2
	0.948	1	1	0.962	6	2

AUC: area under the receiver operating characteristic curve.

To introduce verification bias under the MAR assumption, $V$ is generated as a Bernoulli random variable with $P (V = 1) = 1$ for subjects with $T > t^{(q)}$ and $P (V = 1) = 1 - q$ for the rest, where $0 < q < 1$ , and $t^{(q)}$ is the $100 q$ -th quantile of the distribution of $T$ . Here we use $q = 0.6$ . This verification mechanism results in an average of 64% of the subjects receiving disease verification. In all simulation studies, $1000$ random samples are generated from the underlying distributions with sample sizes $n = 50, 100, 150$ , respectively, to obtain coverage probabilities and average lengths of the proposed confidence intervals. For the bootstrap-based methods, we use $B = 1000$ . We evaluate the performance of the proposed intervals using each of the four bias-corrected estimators (i.e. ${\hat{δ}}_{F I}, {\hat{δ}}_{M S I}, {\hat{δ}}_{I P W}$ , and ${\hat{δ}}_{S P E}$ ) and compare their overall performance across 42 clinically relevant settings using boxplots, in order of the best approach (by coverage probability) on the left side. For the sake of brevity, we include only the boxplots in the main text. The detailed numerical results are available in Appendix A (see Supplemental material).
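The data-generating mechanism described above can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the paper: the second argument of $N (\cdot, \cdot)$ is read as a variance (so that $Z_{1} + Z_{2} \sim N (0, 1)$), the threshold $h$ is taken as the standard normal quantile that gives the target prevalence, the sample quantile of $T$ replaces the population quantile $t^{(q)}$, and the function name `simulate` is hypothetical.

```python
import random
from statistics import NormalDist


def simulate(n, prevalence, nu1, tau1, nu2=1.0, tau2=1.0, q=0.6, seed=0):
    """Generate (T, A, V, D_observed) records plus the true D for all subjects."""
    rng = random.Random(seed)
    h = NormalDist().inv_cdf(1 - prevalence)   # P(Z1 + Z2 > h) = prevalence
    sd_z = 0.5 ** 0.5                          # variance 0.5
    sd_eps = 0.25 ** 0.5                       # variance 0.25
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, sd_z), rng.gauss(0, sd_z)
        d = int(z1 + z2 > h)
        t = nu1 * z1 + tau1 * z2 + rng.gauss(0, sd_eps)
        a = nu2 * z1 + tau2 * z2 + rng.gauss(0, sd_eps)
        rows.append((t, a, d))
    # Sample 100q-th percentile of T (assumption: stands in for t^(q)).
    t_q = sorted(r[0] for r in rows)[round(q * n) - 1]
    records, true_d = [], []
    for t, a, d in rows:
        # Verify with probability 1 above t_q and 1 - q otherwise (MAR).
        v = 1 if t > t_q else int(rng.random() < 1 - q)
        records.append((t, a, v, d if v == 1 else None))
        true_d.append(d)
    return records, true_d


records, true_d = simulate(n=200, prevalence=0.3, nu1=1.0, tau1=0.1)
print(sum(r[2] for r in records) / len(records))   # share of verified subjects
```

With $q = 0.6$ the expected verified share is $0.4 + 0.6 \times 0.4 = 0.64$, matching the 64% quoted above.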

4.1. Correct models

First, we evaluate the performance of the proposed confidence intervals using the correct disease and verification models. In order to apply FI, MSI, and SPE estimators, a parametric model for the disease probabilities $ρ_{i}$ ’s is required to be specified. It has been shown in Alonzo and Pepe⁷ that a probit model that is linear in $T$ and $A$ is a correct model under the above settings. We apply the same probit model for $ρ_{i}$ ’s in our correct model study

ρ_{i} = P (D_{i} = 1 | T_{i}, A_{i}) = Φ (α + β T_{i} + γ A_{i}), i = 1, \dots, n .

(18)

For the correct verification model $π_{i}$, empirical estimates of the verification probabilities yield

{\hat{π}}_{i} = \begin{cases} 1, & T_{i} > t^{(q)}, \\ \frac{\sum_{k = 1}^{n} V_{k} I (T_{k} \leq t^{(q)})}{\sum_{k = 1}^{n} I (T_{k} \leq t^{(q)})}, & T_{i} \leq t^{(q)}, \end{cases}

(19)

where $q = 0.6$ and $i = 1, \dots, n$.
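The empirical verification probabilities in (19) amount to a two-group proportion. A minimal Python sketch, assuming the threshold $t^{(q)}$ is supplied by the user (the name `estimate_pi` and the toy data are hypothetical):

```python
def estimate_pi(t, v, t_q):
    """Empirical verification probabilities as in (19).

    Subjects above t_q are always verified; for the rest, the estimate is
    the observed share of verified subjects among those with T <= t_q.
    """
    low = [vi for ti, vi in zip(t, v) if ti <= t_q]
    frac = sum(low) / len(low)
    return [1.0 if ti > t_q else frac for ti in t]


# Hypothetical toy data: the three highest scores lie above the threshold.
t = [0.1, 0.4, 0.6, 0.8, 1.2, 1.5, 1.9]
v = [1, 0, 0, 1, 1, 1, 1]
print(estimate_pi(t, v, t_q=1.0))
```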

We report coverage probabilities and average widths of the proposed confidence intervals in Figure 1 (boxplots) and Table 4 (in Appendix A.2, see supplemental material). When both the disease and verification models are correctly specified, HEL2-FI and HEL2-MSI (the HEL2 interval using the FI and MSI estimators, respectively) deliver the strongest overall performance, achieving coverage closest to the nominal level while maintaining relatively narrow intervals. HEL1-FI and HEL1-MSI intervals show slight over-coverage and yield wider intervals. BP-SPE and BC-SPE follow closely, providing adequate coverage and interval width, with BP intervals generally showing a modest advantage over BC intervals in coverage. In contrast, all IPW-based intervals perform the weakest. Across all methods, interval precision improves with increasing sample size and with larger true AUC values.

Figure 1.

Coverage probabilities and average widths of the proposed 95% level CIs: correct model. (HEL2-FI denotes HEL2 interval using FI estimator, etc. The same applies to Figures 2–4. The proposed intervals are sorted by coverage probability, from the best (left) to the worst (right). The same applies to Figures 2–4.) CI: confidence interval; FI: full imputation.

Figure 2.

Coverage probabilities and average widths of the proposed 95% level CIs: misspecified disease model 1. CI: confidence interval.

4.2. Misspecified models

While we have demonstrated the effectiveness of the proposed confidence intervals when both the disease model and the verification model are correctly specified, practical scenarios often involve uncertainty. In practice, we may not always know the true disease mechanism or the verification mechanism. To assess the robustness of our confidence intervals, we introduce model misspecification in this subsection. We follow simulation settings similar to those used by Alonzo and Pepe⁷ and Hai and Qin,¹¹ which include both a misspecified disease model and a misspecified verification model.

4.2.1. Misspecified disease model

We consider two types of misspecified disease models. For the first one, recall that disease status is determined by $D = I (Z_{1} + Z_{2} > h)$ , and $T$ and $A$ are linear combinations of $Z_{1}$ and $Z_{2}$ . In the setting where $A$ contains information only about $Z_{2}$ (i.e. $ν_{2} = 0$ and $τ_{2} = 1$ ), the probit model $ρ_{a} = P (D = 1 | A) = Φ (α + β A)$ which is linear in $A$ is misspecified because the disease process depends on aspects of $T$ that are not captured by $A$ . This is our first misspecified disease model.

We then explore a second misspecified disease model, noting that the probit model (18) is correct only when the underlying distribution of the test result $T$ is normal. However, this assumption does not always hold in practice, leading to our second misspecified disease model. Specifically, here we set $Z_{1} \sim Lognormal (0, 0.5)$ and $Z_{2} \sim Lognormal (0, 0.5)$ which are independent. Similarly, let $D = I (Z_{1} + Z_{2} > h)$ and $T = ν_{3} Z_{1} + τ_{3} Z_{2} + ϵ_{1}$ and $A = ν_{2} Z_{1} + τ_{2} Z_{2} + ϵ_{2}$ , where $ϵ_{1} \sim N (0, 0.25)$ , and $ϵ_{2} \sim N (0, 0.25)$ are independent. This setting results in test results in both the diseased and non-diseased groups having skewed distributions. We select appropriate values of $ν_{3}$ and $τ_{3}$ such that the true AUC of test $T$ ranges from 0.6 to 0.96. The detailed parameter settings in the study can be found in Table 1.

We report coverage probabilities and average widths of the proposed confidence intervals using boxplots in Figures 2 and 3 and Tables 5 and 6 (in Appendix A.2, see supplemental material). In both studies, we use the correct verification model (19). In Figure 2 and Table 5 (in Appendix A.2, see supplemental material), we observe that when a relevant covariate is omitted in the disease model, all intervals using FI and MSI estimators fail to maintain adequate coverage. This is not surprising, as FI and MSI estimators rely on imputing missing disease status through a parametric disease model, making their performance heavily dependent on correct model specification. In contrast, SPE-based intervals show expected robustness by maintaining strong coverage and outperform IPW-based intervals, despite the SPE estimator’s partial dependence on the misspecified disease model. Among SPE-based methods, BP-SPE, BC-SPE, HEL1-SPE, and HEL2-SPE achieve comparable coverage, with BP-SPE and BC-SPE intervals generally being more precise.
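The two summaries reported throughout this section can be computed from a collection of simulated intervals as in the minimal sketch below. The helper name is hypothetical, and the sketch assumes that coverage is the proportion of intervals containing the true AUC and that width is the upper limit minus the lower limit.

```python
def coverage_and_width(intervals, true_auc):
    """Return (empirical coverage, average width) for a list of (lower, upper) pairs."""
    covered = sum(1 for lo, hi in intervals if lo <= true_auc <= hi)
    widths = [hi - lo for lo, hi in intervals]
    return covered / len(intervals), sum(widths) / len(widths)
```

In a simulation study, `intervals` would hold one interval per Monte Carlo replicate for a given method and setting.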

Figure 3.

Coverage probabilities and average widths of the proposed 95% level CIs: misspecified disease model 2. CI: confidence interval.

In Figure 3 and Table 6 (in Appendix A.2, see supplemental material), we observe that when test results follow a skewed mixture of lognormal distributions, HEL2-FI and HEL2-MSI intervals using a probit disease model maintain performance comparable to that observed under the correctly specified model scenario. The HEL2-SPE interval also demonstrates reliable coverage performance, although it is generally wider. HEL1 intervals again show slight over-coverage and increased width, while BP and BC intervals exhibit slight under-coverage. Overall, these results highlight the strong robustness of our proposed HEL methods relative to traditional bootstrap intervals under disease model misspecification.

4.2.2. Misspecified verification model

To introduce the misspecified verification model, we consider the setting where $D$, $T$, and $A$ are simulated as outlined in Section 4.1, but $V$ is simulated such that the true verification process depends on some covariate $C$ that is not included in the verification model. Specifically, $V$ is generated from

P (V = 1) = \begin{cases} 1, & T + C > t_{c}^{(q)} \\ 1 - q, & T + C \leq t_{c}^{(q)} \end{cases}

(20)

where $C = Z_{3} + ϵ_{3}$, $Z_{3} \sim N (0, 1)$, $ϵ_{3} \sim N (0, 0.25)$, and $t_{c}^{(q)}$ is the $100 q$-th quantile of the distribution of $T + C$. Then, the misspecified verification model (19) and the correct disease model (18) are used to obtain ${\hat{π}}_{i}$ and ${\hat{ρ}}_{i}$, respectively.
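As a rough illustration of the verification mechanism in (20), the sketch below simulates $V$ from a given list of test results. The function names are hypothetical, 0.25 is assumed to be a variance, and the population quantile $t_{c}^{(q)}$ is replaced by a sample quantile of $T + C$, which is a simplification rather than the procedure used in the study.

```python
import random


def empirical_quantile(values, q):
    """Simple order-statistic estimate of the q-th quantile."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, int(q * len(ordered))))
    return ordered[idx]


def simulate_verification(t_values, q, seed=0):
    """Simulate V so that verification depends on T + C, with C unobserved.

    Returns (v, c_values). Assumption: the sample quantile of T + C
    stands in for the population quantile t_c^{(q)}.
    """
    rng = random.Random(seed)
    sd_eps = 0.25 ** 0.5  # assumes N(0, 0.25) denotes variance 0.25
    c_values = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, sd_eps) for _ in t_values]
    score = [t + c for t, c in zip(t_values, c_values)]
    cutoff = empirical_quantile(score, q)
    v = []
    for s in score:
        p = 1.0 if s > cutoff else 1.0 - q  # verification probability in (20)
        v.append(1 if rng.random() < p else 0)
    return v, c_values
```

Subjects with $T + C$ above the cutoff are always verified, and the remaining subjects are verified with probability $1 - q$, so a fitted verification model that omits $C$ is misspecified by construction.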

We report coverage probabilities and average widths of the proposed confidence intervals using boxplots in Figure 4 and Table 7 (in Appendix A.2, see supplemental material). When the verification model is misspecified, HEL2-FI and HEL2-MSI intervals once again yield the best coverage while preserving relatively narrow widths. BP-SPE achieves similar coverage but at the cost of wider intervals. HEL1-FI and HEL1-MSI continue to exhibit slight over-coverage and produce comparatively wider intervals.

Figure 4.

Coverage probabilities and average widths of the proposed 95% level CIs: misspecified verification model. CI: confidence interval.

In summary, across all four simulation scenarios, HEL2 intervals generally exhibit the strongest performance. In particular, HEL2-FI and HEL2-MSI perform best when the disease model is correctly specified, achieving high coverage with relatively narrow intervals. HEL2-SPE intervals provide reliable coverage across all scenarios, though occasionally at the expense of precision. HEL1 intervals typically show slight over-coverage with wider intervals. Overall, HEL2-FI and HEL2-MSI are recommended when the disease model is correctly specified, whereas HEL2-SPE intervals are preferable otherwise.

Figure 5.

Coverage probabilities and average widths of the proposed 95% level CIs: misspecified disease and verification model. CI: confidence interval.

Figure 6.

Coverage probabilities of the proposed 95% level CIs: sensitivity analysis, $n = 100$. CI: confidence interval.

Figure 7.

Coverage probabilities of the proposed 95% level CIs: sensitivity analysis, $n = 200$. CI: confidence interval.

Figure 8.

Coverage probabilities of the proposed 95% level CIs: sensitivity analysis, $n = 500$. CI: confidence interval.

4.3. Sensitivity analyses

So far, our proposed approaches rely on two key assumptions: that disease status is MAR and that at least one of the disease or verification models is correctly specified. To assess the robustness of our methods when either assumption is violated, we conduct a sensitivity analysis.

We first examine the scenario in which both the disease model $ρ$ and the verification model $π$ are misspecified, combining the first misspecified disease model from Section 4.2.1 with the misspecified verification model from Section 4.2.2. Figure 5 presents coverage probabilities and average interval widths for the proposed confidence intervals using boxplots. Under this double-misspecification setting, we can see that intervals based on the SPE estimator continue to outperform those based on the FI, MSI, and IPW estimators, despite SPE being the only method that depends on both models. Notably, HEL2-SPE achieves the most accurate coverage while maintaining relatively narrow interval widths.

Next, we consider the scenario in which the MAR assumption is violated, meaning that the verification status $V$ also depends on the subjects’ true disease status $D$. We follow a setup similar to Liu and Zhou,³² where the verification status is generated from

\begin{aligned} P (V = 1) = \frac{e^{T + A + v_{α} D}}{1 + e^{T + A + v_{α} D}}, \end{aligned}

where $A \sim Unif (- 1, 1)$ and $T, D, ρ, π$ are obtained as in the correctly specified model in Section 4.1. The parameter $v_{α}$ controls the extent to which verification depends on the true disease status. The odds ratio of being verified for diseased versus non-diseased subjects is $e^{v_{α}}$. We examine $v_{α} \in \{- 1, - 0.5, 0, 0.5, 1, 1.5, 2\}$. When $v_{α} = 0$, the problem reduces to MAR; otherwise, the missingness is missing-not-at-random (MNAR), also called non-ignorable (NI). Negative values of $v_{α}$ indicate that diseased subjects are less likely to be verified, whereas positive values indicate that diseased subjects are more likely to be verified compared with healthy subjects. In this analysis, we use a setting with a disease prevalence of 0.3 and a true AUC of 0.803. For each configuration, 1000 random samples are generated from the underlying distributions with sample sizes $n = 100, 200, 500$. The resulting coverage probabilities of the proposed confidence intervals are shown in Figures 6 to 8. For the sake of brevity, we put the associated results for average width in Figures 10 to 12 (Appendix A.3, see supplemental material).
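The role of $v_{α}$ as a log odds ratio can be checked numerically. The short sketch below, with hypothetical function names, evaluates the logistic verification probability at fixed $(T, A)$ for $D = 1$ and $D = 0$ and forms the ratio of the two odds.

```python
import math


def verification_prob(t, a, d, v_alpha):
    """P(V = 1) under the logistic model with linear predictor T + A + v_alpha * D."""
    eta = t + a + v_alpha * d
    return 1.0 / (1.0 + math.exp(-eta))  # equals e^eta / (1 + e^eta)


def odds(p):
    """Odds corresponding to a probability p in (0, 1)."""
    return p / (1.0 - p)


def verification_odds_ratio(t, a, v_alpha):
    """Odds of verification for D = 1 relative to D = 0 at fixed (T, A)."""
    p_diseased = verification_prob(t, a, 1, v_alpha)
    p_healthy = verification_prob(t, a, 0, v_alpha)
    return odds(p_diseased) / odds(p_healthy)
```

For any fixed $(T, A)$ the ratio equals $e^{v_{α}}$ up to floating-point error, so the grid of $v_{α}$ values above corresponds to odds ratios from $e^{-1}$ to $e^{2}$.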

When the sample size is small ($n = 100$), the performance of HEL intervals remains stable even when verification heavily depends on the true disease status. When the sample size is large ($n = 500$), a clear pattern emerges in which coverage decreases as verification becomes more dependent on disease status. However, coverage remains acceptable (greater than 0.8) when verification depends moderately on the true disease status (when $| v_{α} | \leq 1$, corresponding to odds ratios between approximately 0.37 and 2.7). Overall, the sensitivity analysis shows that our proposed methods, particularly the HEL2 intervals, demonstrate reasonable robustness under mild to moderate violations of the assumptions.

5. Real data analysis

In this section, we apply the proposed methods to a real diagnostic test dataset, the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. WDBC was created by Dr William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian from the University of Wisconsin in 1995.^33,34 In this dataset, features were computed from a digitized image of a fine needle aspirate of breast masses, and the verified diagnosis (malignant or benign) of breast masses was recorded. There are 569 instances of breast masses, each with 32 attributes, including their IDs, true diagnoses, and 30 real-valued input features. There are no missing values in the dataset. Here, we construct a subset of the WDBC data that resembles data that would be obtained from an unbalanced sample of diseased and non-diseased subjects, with 344 benign subjects and 86 malignant subjects, so the disease prevalence is 0.2.

Since each feature can be regarded as an independent continuous-scale diagnostic test for breast cancer and we have the verified disease status $D$ for all subjects, we can compute their AUCs with the full data and select features with reasonable AUCs to assess the performance and robustness of the proposed estimation methods. In this study, we choose four features as our biomarkers $T$: mean texture (biomarker I: field 4 in the dataset, AUC = 0.781), mean smoothness (biomarker II: field 7 in the dataset, AUC = 0.708), mean concavity (biomarker III: field 9 in the dataset, AUC = 0.936), and largest concavity (biomarker IV: field 29 in the dataset, AUC = 0.918). We also select mean compactness (field 8 in the dataset, AUC = 0.86) as the covariate $A$. Figure 9 shows the density distributions of the original measurements of the four biomarkers, with the malignant group plotted as a solid line and the benign group as a dashed line. Biomarker measurements are first Box–Cox transformed to improve normality and then standardized to the same scale for comparability before analysis.
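With fully verified data, the AUC of a feature can be estimated by the usual Mann–Whitney statistic, that is, the proportion of diseased and non-diseased pairs in which the diseased subject has the higher value. The sketch below is a minimal illustration with a hypothetical function name. Counting ties as one half is a common convention that we assume here, and it is not necessarily how the AUCs quoted above were computed.

```python
def empirical_auc(diseased, healthy):
    """Mann-Whitney estimate of P(T_diseased > T_healthy), with ties counted as 1/2."""
    total = 0.0
    for x in diseased:
        for y in healthy:
            if x > y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(diseased) * len(healthy))
```

Here `diseased` and `healthy` would hold the measurements of one biomarker in the malignant and benign groups, respectively.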

Figure 9.

Distributions of original measurements of biomarkers: malignant (solid line), benign (dashed line).

We generate verification status similarly to Alonzo and Pepe.⁷ Let the true verification model be a logit model linear in $T$ and $A$, that is, $logit \{P (V = 1)\} = T + A + k$, where $k$ is a parameter used to adjust the verification proportion. Here we use $k = 1$ and $k = - 1$, resulting in an average of approximately 66% and 33% of the subjects receiving disease status verification, respectively. For the disease model, we fit the same probit model (18) as in Section 4.1, even though the true disease model $ρ_{i}$ is unknown in this setting. Moreover, we compare CI estimates obtained through our proposed bias-corrected approaches against CIs produced by existing R packages that ignore verification bias. We examine two R packages, pROC³⁵ and rocbc,³⁶ which calculate the AUC and its associated CIs in the complete-case (CC) framework, using only data from subjects with verified disease status. Specifically, the rocbc package employs the Box–Cox transformation, and the pROC package computes the variance of the AUC as defined by DeLong et al.³⁷ using the algorithm of Sun and Xu.³⁸ The results are presented in Tables 2 and 3.
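To make the role of $k$ concrete, the following sketch simulates verification indicators from a logit model with linear predictor $T + A + k$. The function name is hypothetical, and the inputs are assumed to be the standardized biomarker and covariate values.

```python
import math
import random


def simulate_logit_verification(t_values, a_values, k, seed=0):
    """Draw V_i ~ Bernoulli(expit(T_i + A_i + k)) for each subject."""
    rng = random.Random(seed)
    v = []
    for t, a in zip(t_values, a_values):
        p = 1.0 / (1.0 + math.exp(-(t + a + k)))  # inverse logit
        v.append(1 if rng.random() < p else 0)
    return v
```

Larger values of `k` raise every subject's verification probability, so the verified proportion increases with `k`. A complete-case analysis would then keep only the subjects with `v == 1`.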

Table 2.

WDBC data analysis: 95% CIs for AUC, verification proportion $\approx$ 66%.

Biomarker	AUC $^{a}$	pROC $^{b}$	rocbc $^{b}$	$a$	BC $^{c}$	BP $^{c}$	HEL1	HEL2
I	0.781	(0.659, 0.782)	(0.654, 0.773)	FI	(0.724, 0.822)	(0.722, 0.82)	(0.716, 0.816)	(0.72, 0.813)
				MSI	(0.727, 0.828)	(0.725, 0.826)	(0.718, 0.823)	(0.723, 0.82)
				IPW	(0.696, 0.812)	(0.693, 0.81)	(0.697, 0.818)	(0.702, 0.815)
				SPE	(0.742, 0.849)	(0.741, 0.847)	(0.732, 0.823)	(0.731, 0.823)
II	0.708	(0.565, 0.706)	(0.566, 0.7)	FI	(0.628, 0.762)	(0.626, 0.762)	(0.617, 0.755)	(0.62, 0.753)
				MSI	(0.634, 0.77)	(0.633, 0.77)	(0.633, 0.766)	(0.636, 0.764)
				IPW	(0.647, 0.79)	(0.643, 0.787)	(0.613, 0.78)	(0.615, 0.778)
				SPE	(0.704, 1)	(0.723, 1)	(0.61, 0.869)	(0.533, 1)
III	0.936	(0.889, 0.952)	(0.876, 0.941)	FI	(0.885, 0.958)	(0.884, 0.955)	(0.876, 0.95)	(0.877, 0.95)
				MSI	(0.911, 0.965)	(0.909, 0.963)	(0.9, 0.959)	(0.902, 0.959)
				IPW	(0.921, 0.968)	(0.919, 0.967)	(0.905, 0.962)	(0.906, 0.961)
				SPE	(0.922, 0.969)	(0.92, 0.966)	(0.914, 0.967)	(0.915, 0.966)
IV	0.918	(0.847, 0.922)	(0.82, 0.903)	FI	(0.853, 0.932)	(0.853, 0.93)	(0.845, 0.925)	(0.846, 0.924)
				MSI	(0.882, 0.945)	(0.88, 0.942)	(0.876, 0.939)	(0.875, 0.939)
				IPW	(0.902, 0.954)	(0.9, 0.952)	(0.877, 0.945)	(0.873, 0.947)
				SPE	(0.905, 0.954)	(0.903, 0.952)	(0.891, 0.946)	(0.889, 0.946)

WDBC: Wisconsin Diagnostic Breast Cancer; CI: confidence interval; AUC: area under the receiver operating characteristic curve; FI: full imputation; MSI: mean score imputation; IPW: inverse probability weighting; SPE: semi-parametric efficient.

$^{a}$ Estimated AUC of the biomarker using the full dataset with no missingness.

$^{b}$ Confidence intervals given by the R packages pROC and rocbc using only data from subjects with verified disease status.

$^{c}$ For the BC and BP methods, we use $B = 1000$ bootstrap replicates.

Table 3.

WDBC data analysis: 95% CIs for AUC, verification proportion $\approx$ 33%.

Biomarker	AUC	pROC	rocbc	$a$	BC	BP	HEL1	HEL2
I	0.781	(0.537, 0.72)	(0.539, 0.714)	FI	(0.678, 0.817)	(0.676, 0.813)	(0.664, 0.807)	(0.662, 0.808)
				MSI	(0.679, 0.821)	(0.676, 0.815)	(0.667, 0.808)	(0.661, 0.812)
				IPW	(0.636, 0.865)	(0.625, 0.853)	(0.584, 0.841)	(0.613, 0.831)
				SPE	(0.69, 0.912)	(0.693, 0.912)	(0.689, 0.856)	(0.709, 0.839)
II	0.708	(0.512, 0.702)	(0.509, 0.688)	FI	(0.626, 0.816)	(0.617, 0.811)	(0.605, 0.8)	(0.601, 0.802)
				MSI	(0.634, 0.823)	(0.624, 0.818)	(0.628, 0.811)	(0.626, 0.812)
				IPW	(0.621, 0.846)	(0.61, 0.833)	(0.576, 0.85)	(0.603, 0.834)
				SPE	(0.597, 1)	(0.594, 1)	(0.624, 1)	(0.571, 1)
III	0.936	(0.819, 0.934)	(0.81, 0.92)	FI	(0.845, 0.962)	(0.841, 0.955)	(0.83, 0.947)	(0.829, 0.947)
				MSI	(0.863, 0.968)	(0.859, 0.96)	(0.832, 0.953)	(0.838, 0.952)
				IPW	(0.903, 0.978)	(0.898, 0.971)	(0.752, 0.978)	(0.836, 0.966)
				SPE	(0.906, 0.987)	(0.904, 0.983)	(0.816, 1)	(0.854, 1)
IV	0.918	(0.75, 0.887)	(0.734, 0.869)	FI	(0.806, 0.924)	(0.8, 0.918)	(0.789, 0.915)	(0.796, 0.911)
				MSI	(0.824, 0.934)	(0.813, 0.926)	(0.811, 0.923)	(0.818, 0.92)
				IPW	(0.883, 0.957)	(0.877, 0.952)	(0.838, 0.952)	(0.846, 0.95)
				SPE	(0.92, 1)	(0.92, 1)	(0.83, 1)	(0.791, 1)

Tables 2 and 3 compare each interval with the AUC estimated from the full data. When the verification proportion is around 66%, the rocbc intervals miss the full-data AUC for biomarkers I, II, and IV, and the pROC interval misses it for biomarker II. When the verification proportion drops to 33%, both complete-case methods miss the full-data AUC for all four biomarkers. In contrast, our proposed bias-corrected approaches largely still provide intervals that cover it. For normal-like populations (biomarkers I and II), MSI- and FI-based intervals perform comparably well and are more precise than IPW-based intervals. The BC, BP, HEL1, and HEL2 methods yield similar results, with HEL2 showing a modest precision advantage. In skewed populations (biomarkers III and IV), MSI-based intervals remain robust while FI-based intervals become unreliable (Table 3, biomarker IV). IPW-based intervals now surpass MSI-based intervals in precision, especially when the verification proportion is low. SPE-based intervals, however, perform erratically. While adequate for biomarkers I, III, and IV, they fail or only succeed by sacrificing precision for biomarker II. This aligns with the results from the simulation studies, where we have seen greater variability in average width for SPE-based intervals. Despite this, we note that the BP-SPE and BC-SPE intervals exhibit superior precision in skewed populations (biomarkers III and IV).
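The comparison for the complete-case packages can be reproduced directly from the entries of Table 2. The sketch below uses a hypothetical helper and assumes that an interval counts as correct when it contains the AUC estimated from the full data.

```python
# Values copied from Table 2 (verification proportion about 66%).
FULL_DATA_AUC = {"I": 0.781, "II": 0.708, "III": 0.936, "IV": 0.918}
PROC_66 = {"I": (0.659, 0.782), "II": (0.565, 0.706),
           "III": (0.889, 0.952), "IV": (0.847, 0.922)}
ROCBC_66 = {"I": (0.654, 0.773), "II": (0.566, 0.7),
            "III": (0.876, 0.941), "IV": (0.82, 0.903)}


def misses(intervals, targets):
    """List the biomarkers whose interval does not contain the target AUC."""
    return [name for name, (lo, hi) in intervals.items()
            if not lo <= targets[name] <= hi]
```

Calling `misses(ROCBC_66, FULL_DATA_AUC)` and `misses(PROC_66, FULL_DATA_AUC)` lists the biomarkers for which each package's interval excludes the full-data AUC.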

6. Conclusion and discussion

The main contribution of this paper is to propose four bias-corrected confidence intervals for the AUC under the MAR assumption: two bootstrap-based approaches and two empirical likelihood (EL)-based approaches. These methods are designed to accommodate partial verification settings, in which only a subset of subjects has confirmed disease status, by leveraging the bias-corrected ROC estimators proposed by Alonzo and Pepe.⁷ Through extensive simulation studies and a real-world data application, we demonstrate that the proposed HEL2 methods generally outperform other methods, and SPE-based intervals exhibit strong robustness under model misspecification. Based on our findings, when the distributions of test results in both disease and non-disease groups are approximately normal or can be made approximately normal through appropriate transformations (e.g. Box–Cox), the HEL2-MSI interval is recommended. Conversely, when normality assumptions are violated and no transformation is feasible, the HEL2-SPE method can serve as a reliable alternative. These recommendations aim to support practitioners in choosing appropriate interval estimation strategies for the AUC in real-world diagnostic studies affected by verification bias.

We conclude the paper with a discussion. First, while we recommend the use of SPE-based intervals and emphasize their “double robustness” to model misspecification, it is important to note that the SPE estimator of the ROC curve may exhibit non-monotonic behavior because both the weights ${\hat{w}}_{SPE, i}^{N}$ and ${\hat{w}}_{SPE, i}^{D}$ can be negative. Consequently, the resulting SPE-based confidence intervals may perform erratically, as observed in our real data analysis. To address this issue, isotonic regression³⁹ can be applied to enforce monotonicity in the estimated ROC curve, thereby improving the reliability of the interval estimates. Second, based on the simulation results, the HEL2 methods generally outperform the HEL1 methods in our study. We emphasize, however, that this is an empirical finding rather than a theoretical guarantee. The HEL2 procedure relies on a scaled-$χ^{2}$ approximation, which requires estimating an unknown scale factor. The quality of this approximation may depend on the adequacy of the scaled-$χ^{2}$ fit in finite samples.
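One standard way to carry out the isotonic regression step is the pool-adjacent-violators algorithm. The sketch below is a generic least-squares version with a hypothetical function name. It is not an implementation taken from this paper, and it assumes the ROC estimate has been evaluated on an increasing grid of false positive rates with strictly positive weights.

```python
def pava_nondecreasing(values, weights=None):
    """Pool-adjacent-violators fit of a nondecreasing sequence (weighted least squares)."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block stores [mean, total weight, number of points]
    for y, w in zip(values, weights):
        blocks.append([float(y), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

Applied to estimated true positive rates ordered by false positive rate, the function returns the closest nondecreasing sequence in the least-squares sense, which removes local dips caused by negative weights.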

Next, we discuss some limitations of our work and outline directions for future research. While our proposed approaches primarily address two-class classification problems, an increasingly relevant scenario involves diagnostic testing for diseases that progress through multiple stages (e.g. Parkinson’s disease and Alzheimer’s disease screening). In such contexts, a multi-class framework is required, where the ROC surface and the volume under the surface, as generalizations of the ROC curve and the AUC, provide a more comprehensive measure of a test’s ability to distinguish among three or more disease states.^15,40–42 It would be interesting to investigate how our approaches adapt to and perform in these multi-class diagnostic settings.

Another limitation of this work lies in its reliance on the MAR assumption for addressing verification bias. In clinical settings, especially when data are not missing by design, disease verification might depend on unobserved factors that are related to disease status, leading to missing-not-at-random (MNAR, or non-ignorable, NI) missingness. While the sensitivity analysis has shown that our approaches demonstrate reasonable robustness under mild dependence on disease status, methods are still needed for settings with stronger dependence. Some unified frameworks that accommodate both the NI and MAR mechanisms have been proposed, demonstrating that MAR can be viewed as a special case within the broader NI paradigm; see Alonzo,⁴³ Rotnitzky et al.,⁴⁴ Fluss et al.,⁴⁵ Liu and Zhou,³² Yu et al.,⁴⁶ and To.⁴⁷ This theoretical unification suggests promising directions for extending our MAR-based approach within these more general frameworks.

Finally, while the AUC provides a measure of a test’s overall diagnostic ability across all possible cutoff points, another key metric in ROC analysis, the Youden index, offers a direct assessment of the maximum correct classification rate a test can achieve. More importantly, it serves as a criterion for identifying the optimal cutoff point that yields this maximum performance. A promising direction for future research is to extend our EL-based approaches to construct bias-corrected confidence intervals for the Youden index. Some work has already addressed this problem,^48–50 and further exploration could enhance the precision and interpretability of diagnostic decision-making.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802261455678 - Supplemental material for Empirical likelihood inference for the area under the receiver operating characteristic (ROC) curve with verification biased data

Supplemental material, sj-pdf-1-smm-10.1177_09622802261455678 for Empirical likelihood inference for the area under the receiver operating characteristic (ROC) curve with verification biased data by Shirui Wang, Shuangfei Shi and Gengsheng Qin in Statistical Methods in Medical Research

Footnotes

ORCID iD

Gengsheng Qin

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

1. Bamber. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 1975; 12: 387–415.

2. Hanley, McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143: 29–36.

3. Goodenough, Rossmann, Lusted. Radiographic applications of receiver operating characteristic (ROC) curves. Radiology 1974; 110: 89–95.

4. Begg, Greenes. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983; 39: 207–215.

5. Little, Rubin. Statistical analysis with missing data. Hoboken, NJ: John Wiley & Sons, 2019.

6. Dong, Peng. Principled missing data methods for researchers. Springerplus 2013; 2: 1–7.

7. Alonzo, Pepe. Assessing accuracy of a continuous screening test in the presence of verification bias. J R Stat Soc C 2005; 54: 173–190.

8. Lyness, McDermott. Direct estimation of the area under the receiver operating characteristic curve in the presence of verification bias. Stat Med 2009; 28: 361–376.

9. Adimari, Chiogna. Nonparametric verification bias-corrected inference for the area under the ROC curve of a continuous scale diagnostic test. Stat Interface 2017; 10: 629–641.

10. Ghosal, Kleiner. Bayesian ROC curve estimation under verification bias. Stat Med 2014; 33: 5081–5096.

11. Hai, Qin. Direct estimation of the area under the receiver operating characteristic curve with verification biased data. Stat Med 2020; 39: 4789–4820.

12. Nakas, Bantis, Gatsonis. ROC analysis under verification bias. In: ROC analysis for classification and prediction in practice. London: Chapman and Hall/CRC, 2023, pp.168–170.

13. Stahlmann, Reitsma, Zapf. Missing values and inconclusive results in diagnostic studies—a scoping review of methods. Stat Methods Med Res 2023; 32: 1842–1855.

14. Umemneku Chikere, Wilson, Graziadio, et al. Diagnostic test evaluation methodology: a systematic review of methods employed to evaluate diagnostic tests in the absence of gold standard—an update. PLoS ONE 2019; 14: e0223832.

15. Statistical evaluation of diagnostic tests under verification bias. PhD Thesis, Università degli studi di Padova, Italy, 2017.

16. Owen. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 1988; 75: 237–249.

17. Owen. Empirical likelihood ratio confidence regions. Ann Stat 1990; 18: 90–120.

18. Fan, Liang, Shen. Penalized empirical likelihood for high-dimensional partially linear varying coefficient model with measurement errors. J Multivar Anal 2016; 147: 183–201.

19. Zhang, Fan. Empirical likelihood inference for time-varying coefficient autoregressive models. Random Matrices Theory Appl 2024; 13: 2450015.

20. Liu, Zhao. A review of recent advances in empirical likelihood. WIREs Comput Stat 2023; 15: e1599.

21. Cui, Chen. Empirical likelihood confidence region for parameter in the errors-in-variables models. J Multivar Anal 2003; 84: 101–115.

22. Qin, Zhou. Empirical likelihood inference for the area under the ROC curve. Biometrics 2006; 62: 613–622.

23. Qin, Wang. Imputation-based empirical likelihood inference for the area under the ROC curve with missing data. Stat Interface 2012; 5: 319–329.

24. Horvitz, Thompson. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 1952; 47: 663–685.

25. Robins, Rotnitzky, Zhao. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 1994; 89: 846–866.

26. Gao, Hui, Hall, et al. Estimating disease prevalence from two-phase surveys with non-response at the second phase. Stat Med 2000; 19: 2101–2114.

27. Alonzo, Pepe, Lumley. Estimating disease prevalence in two-phase studies. Biostatistics 2003; 4: 313–326.

28. Pepe, Cai. The analysis of placement values for evaluating discriminatory measures. Biometrics 2004; 60: 528–535.

29. Qin, Davis, Jing. Empirical likelihood-based confidence intervals for the sensitivity of a continuous-scale diagnostic test at a fixed level of specificity. Stat Methods Med Res 2011; 20: 217–231.

30. Dong, Tian. Confidence interval estimation for sensitivity to the early diseased stage based on empirical likelihood. J Biopharm Stat 2015; 6: 1215–1233.

31. Zhou. emplik: empirical likelihood ratio for censored/truncated data. R package version 1.3-1. Vienna, Austria: R Foundation for Statistical Computing, 2023. Available at: https://CRAN.R-project.org/package=emplik.

32. Liu, Zhou. A model for adjusting for nonignorable verification bias in estimation of the ROC curve and its area with likelihood-based approach. Biometrics 2010; 66: 1119–1128.

33. Street, Wolberg, Mangasarian. Nuclear feature extraction for breast tumor diagnosis. Biomed Image Process Biomed Visual 1993; 1905: 861–870.

34. Wolberg, Mangasarian, Street. Breast Cancer Wisconsin (Diagnostic) [dataset]. 1993. UCI Machine Learning Repository. Available at: https://doi.org/10.24432/C5DW2B.

35. Robin, Turck, Hainard, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011; 12: 77.

36. Bantis, Brewer, Nakas, et al. Statistical inference for Box–Cox based receiver operating characteristic curves. Stat Med 2024; 43: 6099–6122.

37. DeLong, Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988; 44: 837–845.

38. Sun. Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Process Lett 2014; 21: 1389–1393.

39. Robertson, Wright, Dykstra. Order restricted statistical inference. New York: Wiley, 1988.

40. Shi, Qin. Direct estimation of volume under the ROC surface with verification bias. J Biopharm Stat 2024; 34: 553–581.

41. Nakas, Yiannoutsos. Ordered multiple-class ROC analysis with continuous measurements. Stat Med 2004; 23: 3437–3449.

42. Chiogna, Adimari. Nonparametric estimation of ROC surfaces under verification bias. Revstat Stat J 2020; 18: 697–720.

43. Alonzo. Verification bias: impact and methods for correction when assessing accuracy of diagnostic tests. REVSTAT Stat J 2014; 12: 67–83.

44. Rotnitzky, Faraggi, Schisterman. Doubly robust estimation of the area under the receiver-operating characteristic curve in the presence of verification bias. J Am Stat Assoc 2005; 101: 1276–1288.

45. Fluss, Reiser, Faraggi, et al. Estimation of the ROC curve under verification bias. Biom J 2009; 51: 475–490.

46. Kim, Park. Estimation of area under the ROC curve under nonignorable verification bias. Stat Sin 2018; 28: 2149.

47. Chiogna, Adimari, et al. Estimation of the volume under the ROC surface in presence of nonignorable verification bias. Stat Methods Appl 2019; 28: 695–722.

48. Wang, Shi, Qin. Interval estimation for the Youden index of a continuous diagnostic test with verification biased data. Stat Methods Med Res 2025; 34: 796–811.

49. Shi, Wang, Qin. Interval estimation for three-class Youden index with verification bias. J Biopharm Stat 2025; 1: 1–22.

50. Jia, Wang, Qin. Methodological approaches for the estimation of confidence intervals on partial Youden index under verification bias. Pharm Stat 2026; 25: e70079.

51. Schisterman, Rotnitzky. Estimation of the mean of a K-sample U-statistic with missing outcomes and auxiliaries. Biometrika 2001; 88: 713–725.

52. Datta, Bandyopadhyay, Satten. Inverse probability of censoring weighted U-statistics for right-censored data with an application to testing hypotheses. Scand J Stat 2010; 37: 680–700.
