Inference about the ratio of age-standardized rates between two overlapping populations

Abstract

We develop a robust bias-corrected method of inference about the ratio of age-standardized rates (RASR) for comparing the age-standardized rate (ASR) between a subpopulation and the whole population. Unlike previous methods, the proposed approach does not rely on the proportional age-distribution (PAD) assumption, which is often unrealistic in many situations. Like an existing approach, the method corrects for bias resulting from sampling errors when using sample-based population estimates, instead of census-based populations, as the denominators for estimating ASRs. This broadens the applicability of the proposed method in studying cancer risk factors beyond the basic demographic characteristics. The robust bias-corrected estimator of the RASR and the associated variance estimator and confidence intervals are derived. We show empirically that the proposed RASR estimator performs significantly better than the existing estimator, which relies on the PAD assumption, especially when the latter assumption fails. Specifically, the proposed RASR estimator significantly reduces the bias without increasing the variance. On the other hand, when the PAD assumption holds, our RASR estimator performs similarly to the existing estimator. The proposed method has also shown highly desirable performance when at-risk population estimates used for calculating ASRs are subject to sampling errors. We also show empirically that the proposed variance estimator performs satisfactorily. A real-data application is discussed.

Keywords

bias correction, cancer rate, confidence intervals, correlation, ratio of ASRs, sampling error, variance estimation

1. Introduction

Age-standardized rates (ASRs) of cancer incidence and mortality are essential metrics for monitoring the burden of cancer in a population, for example Sherman et al.¹ Cancer surveillance often involves comparisons of ASRs between a subgroup and its parent group to highlight specific groups with higher risks of cancer. This helps guide resource allocations to ensure prevention and treatment efforts are directed where they are most needed. For example, if a racial group shows an elevated incidence of a specific type of cancer compared to the overall population, public health campaigns can be tailored to raise awareness, improve screening rates, and promote preventive measures within that racial community.² If a specific region exhibits higher rates of lung cancer, this might prompt investigations into environmental factors such as air quality and smoking rates, leading to localized public health initiatives.³

A standard approach to comparing ASRs assumes that the two populations under comparison are mutually exclusive and independent.⁴ This assumption simplifies the estimation process but may introduce bias when used to compare overlapping groups. Tiwari et al.⁵ addressed this issue in the context of comparing a state with the U.S. While Tiwari’s method explicitly incorporated the correlation between the two groups, it simplified the estimation by assuming that both populations have the same age distribution, that is, the proportional age-distribution (PAD) assumption. Age-standardization adjusts for differences in age structure, ensuring that observed differences in cancer rates are not simply due to variations in age distribution but reflect true differences in risk. The PAD simplification, however, not only defeats the purpose of age-standardization but also overlooks important distinctions among populations. Because of this limitation, official cancer surveillance systems, such as the Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute (NCI), have not yet incorporated Tiwari’s method for comparing ASRs between overlapping groups in their published rate ratio statistics. A recent study by Zhu et al.⁶ built upon Tiwari’s method by incorporating spatial autocorrelation when comparing geographic areas, yet it shares the same limitation. Additionally, because it relies on Bayesian calculations, it is not easily integrated into widely used cancer statistics software, such as SEER*Stat, which favors estimators available in closed-form expressions.⁷

A further obstacle in accurately identifying elevated ASRs is the presence of sampling errors within the population denominators used for ASR estimation. This issue arises when populations of interest are not available from decennial censuses, necessitating the use of sample surveys to obtain population data with detailed characteristics, such as immigration status, for example, Jiang et al.⁸ While sampling errors have been accounted for in the inference of ASRs and in the ratio of ASRs for non-overlapping populations,⁸⁻¹⁰ there remains an absence of methodologies for comparing ASRs between correlated populations. This shortcoming has limited the comprehensiveness of official cancer statistics released by federal agencies.

In this article, we develop a robust statistical approach that addresses both challenges simultaneously, that is, (i) the overlap between the populations being compared, and (ii) the presence of sampling errors in the ASRs. The rest of the article is organized as follows. Section 2 introduces notation and reviews the existing methods. In Section 3, we consider strategies of estimation without the PAD assumption; in particular, we propose a robust bias-corrected ratio estimator that does not require the PAD assumption. In Section 4, we carry out extensive simulation studies to evaluate the accuracy of the proposed ratio estimator by assessing its relative bias and coefficient of variation in comparison with two existing estimators and the newly developed alternative estimators. We consider the case of no sampling error in the population data (i.e. the setting of Tiwari et al., 2010) and the case where the population data involve sampling errors. In both cases, we consider situations where the PAD assumption holds and where it does not hold. In Section 5, we derive a variance estimator for the proposed robust bias-corrected ratio estimator and evaluate its precision empirically. A real-life example is considered in Section 6. Discussion and concluding remarks are offered in Section 7. Technical details, as well as additional simulation results, are deferred to the Supplementary Material.

2. Notations and existing methods

The ASR is defined as a weighted average of the component rates of different age groups, that is,

$$ \hat{R} = \sum_{j = 1}^{J} w_{j} \frac{X_{j}}{N_{j}}, \tag{1} $$

where $X_{j}$ is the incidence (e.g. cancer) count for the $j$th age group, $w_{j}$ is the age standardization weight, and $N_{j}$ is the population size for the $j$th age group, $1 \leq j \leq J$, with $J$ being the total number of age standardization groups under consideration.
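As a minimal sketch of definition (1), the weighted sum can be computed directly. The function name `age_standardized_rate` and the three-group numbers below are hypothetical and chosen for illustration only; they are not data from this article.

```python
def age_standardized_rate(weights, counts, populations):
    """Weighted average of age-specific rates X_j / N_j, as in definition (1)."""
    if not (len(weights) == len(counts) == len(populations)):
        raise ValueError("weights, counts and populations must have equal length")
    return sum(w * x / n for w, x, n in zip(weights, counts, populations))


# Hypothetical three-age-group example (illustrative numbers only).
w = [0.5, 0.3, 0.2]           # standardization weights, summing to 1
x = [10, 30, 60]              # incidence counts X_j
n = [10_000, 10_000, 10_000]  # population sizes N_j
asr = age_standardized_rate(w, x, n)
print(f"ASR per 100,000: {asr * 1e5:.1f}")
```

The same function applies unchanged when the $N_{j}$ are survey estimates rather than census counts; only the statistical properties of the result differ.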

The naive ratio of ASRs (RASR) is defined as

$$ {\hat{θ}}_{N} = \frac{{\hat{R}}_{1}}{{\hat{R}}_{2}}, \tag{2} $$

where ${\hat{R}}_{1}$ and ${\hat{R}}_{2}$ are the ASRs of the two populations being compared;⁴ for example, foreign-born vs native-born Hispanics.⁸ The latter authors introduced a bias-corrected version of the RASR and showed that it significantly reduced the bias. The authors also developed a variance estimator for the bias-corrected RASR. Both the point estimator and the variance estimator have been integrated into the National Cancer Institute’s Survey-Based Population-Adjusted Rate Calculator (SPARC).

An important assumption underlying both the naive RASR, (2), and the bias-corrected RASR of Jiang et al.⁸ is that the ASRs of the two populations being compared, ${\hat{R}}_{1}$ and ${\hat{R}}_{2}$, are independent. This may be unrealistic in many scenarios, particularly where the comparison population is a subset of the reference population. Such a dependent situation has been previously considered in the literature. For example, Tiwari et al.⁵ proposed a method under what they called the PAD assumption, that is,

$$ \frac{N_{S, j}}{N_{Ω, j}} = P_{S}, \quad 1 \leq j \leq J, \tag{3} $$

where $S$ denotes the subpopulation, $Ω$ the whole population, and $J$ the number of age groups. Here, $P_{S}$ is a population parameter that may be estimated using data at the whole population level.⁵ Equation (3) implies that

$$ \frac{N_{S^{c}, j}}{N_{Ω, j}} = 1 - P_{S}, \quad 1 \leq j \leq J, \tag{4} $$

where $S^{c}$ represents the complement of $S$ within $Ω$. Note that we use $S$ instead of $X$, which was used in Tiwari et al.⁵ for the subpopulation; the letter $X$ is reserved for the numerator counts in our case. As in Tiwari et al.,⁵ we can now write the ASR, for $U = S$, $S^{c}$, or $Ω$, as follows:

$$ R_{U} = \sum_{j = 1}^{J} w_{j} \frac{X_{U, j}}{N_{U, j}} . \tag{5} $$

By (3)–(5), the ASR for $Ω$ can be shown, as in Tiwari et al.,⁵ to be

$$ R_{Ω} = P_{S} R_{S} + (1 - P_{S}) R_{S^{c}} . \tag{6} $$

Following Jiang et al.,⁸ sec. 2.2, the basic distributional assumption on $X_{U, j}$ is that $X_{U, j} | N_{U, j} \sim Poisson (r_{U, j} N_{U, j})$ , where $r_{U, j}$ is the population rate of cancer incidence for the $j$ th age-group of population $U$ , $1 \leq j \leq J$ . The population ASR for population $U$ is thus $r_{U} = \sum_{j = 1}^{J} w_{j} r_{U, j}$ for $U = S, Ω$ . The quantity of interest for the ratio inference is $θ = r_{S} / r_{Ω}$ , which we call the population RASR.

Tiwari et al. (2010) considered point and interval estimation of $θ$ under the PAD assumption. Their point estimator of $θ$, ${\hat{θ}}_{T}$, is defined as

$$ {\hat{θ}}_{T} = \frac{R_{S}}{R_{Ω}} = \frac{R_{S} / R_{S^{c}}}{{\hat{P}}_{S} (R_{S} / R_{S^{c}}) + (1 - {\hat{P}}_{S})} . \tag{7} $$

From a practical standpoint, the PAD assumption may not be realistic, even approximately, in some health-related situations. In essence, the assumption says that the subpopulation makes up the same proportion of the whole population in every age group, which there is often no reason to expect. For example, it is known that mammography rates (e.g. the proportion of women having had a mammogram over the past year¹¹) are highest in certain middle-age groups. If the subpopulation consists of women who have had a mammogram and the whole population consists of all women, then the PAD assumption is clearly not reasonable.

It should be noted that Tiwari et al. (2010) assumes there is no sampling error in the denominator of the ASR estimates, in which case one has $E (R_{U}) = r_{U}$ , $U = S, Ω$ , hence $θ = E (R_{S}) / E (R_{Ω})$ . Still, the RASR, (2), is a biased estimator for $θ$ , even if the numerator and denominator ASRs are both unbiased estimators of their corresponding population ASRs, for example, Cochran.¹² Furthermore, in practice, the denominators of the ASRs often involve sampling errors. In such cases, the RASR (2) is, of course, also biased. Our unified approach to estimating $θ$ applies to both cases, that is, the case with no sampling errors in the population denominators and the case with sampling errors in the population denominators. Specifically, our robust bias-corrected estimator, described in Section 3, involves estimates of the sampling variances. In the case of no sampling errors, those estimates are taken as zeros, resulting in a simplified expression for the estimator. More importantly, we investigate the problem without making the PAD assumption.
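To see why a ratio of unbiased estimators is itself biased, the standard second-order Taylor (delta-method) expansion gives a useful sketch. Here $A$ and $B$ are generic estimators with means $a$ and $b \neq 0$; this notation is introduced for illustration only and is not taken from the derivation in the Supplementary Material.

```latex
E\left( \frac{A}{B} \right) \approx \frac{a}{b}
  - \frac{\mathrm{Cov}(A, B)}{b^{2}}
  + \frac{a \, \mathrm{Var}(B)}{b^{3}} .
```

With $A = R_{S}$ and $B = R_{Ω}$, the last two terms approximate the leading bias of the naive ratio. The covariance term does not vanish here because $S$ is contained in $Ω$, which is the feature that a correction assuming independence would miss.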

3. Estimation without PAD assumption

Recall that $X_{S, j}$ and $X_{S^{c}, j}$ are independent such that

$$ X_{U, j} | N_{U, j} \sim \mathrm{Poisson} (r_{U, j} N_{U, j}), \quad U = S, S^{c}, \tag{8} $$

where the $r_{U, j}$’s are unknown rates. It then follows that $X_{Ω, j}$, being the sum of $X_{S, j}$ and $X_{S^{c}, j}$, is also a Poisson random variable. The population RASR can now be expressed as $θ = E (R_{S}) / E (R_{Ω})$. An estimator of $θ$ is called an RASR estimator.

We shall explore two strategies of estimating $θ$ without the PAD assumption. The first strategy still follows the age-proportionate set-up of Tiwari et al.,⁵ but without assuming that the proportions are equal across age groups, and with a bias-correction extension to account for sampling errors. The second strategy abandons the proportionate set-up altogether; instead, it derives a bias-corrected version of the naive RASR, (2), which takes into consideration the correlation between the subpopulation and the whole population, as well as the sampling errors.

3.1. Estimation along the age-proportionate set-up

Independent random variables are, in general, easier to handle. Thus, a natural idea is to try to express the quantity of interest, which involves correlated ratios, as a function of independent random variables. Under the age-proportionate set-up,

$$ \frac{N_{S, j}}{N_{Ω, j}} = P_{S, j}, \quad 1 \leq j \leq J, \tag{9} $$

where $P_{S, j}$ denotes the population ratio between $S$ and $Ω$ for the $j$th age group. Note that $N_{Ω, j} = N_{S, j} + N_{S^{c}, j}$. Tiwari et al.⁵ proposed the PAD assumption, which means that

$$ P_{S, j} = P_{S}, \quad 1 \leq j \leq J . \tag{10} $$

The latter authors assumed that $P_{S}$ is known. The authors also assumed that the population totals, $N_{S, j}$ and $N_{S^{c}, j}$, are known, that is, they do not involve sampling errors.

Under the PAD assumption, we have $N_{S^{c}, j} / N_{Ω, j} = 1 - P_{S}$. It can be shown that (6) holds. Let $ϕ = E (R_{S}) / E (R_{S^{c}})$. We can now express $θ$ as

$$ θ = \frac{ϕ}{P_{S} ϕ + 1 - P_{S}} . \tag{11} $$

Note that, while $θ$ involves correlated ratios, $ϕ$ involves independent ratios only ($S$ and $S^{c}$ are independent). Jiang et al.⁸ developed a bias-corrected estimator of $ϕ$, ${\hat{ϕ}}_{bc}$, that allows sampling errors in the denominators. We can then estimate $θ$ by (11) with $ϕ$ replaced by ${\hat{ϕ}}_{bc}$, if $P_{S}$ is known.
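The mapping (11) from $ϕ$ to $θ$ is simple enough to sketch in a few lines. The function name `theta_from_phi` is hypothetical; the inputs $P_{S} = 0.5$ and $ϕ \in \{1.0, 1.2, 1.5, 2.0\}$ are the values used for the PAD scenario in Section 4.1.

```python
def theta_from_phi(phi, p_s):
    """Population RASR theta implied by phi under the PAD assumption, as in (11)."""
    return phi / (p_s * phi + 1.0 - p_s)


P_S = 0.5  # common subpopulation share under PAD
thetas = {phi: round(theta_from_phi(phi, P_S), 4) for phi in (1.0, 1.2, 1.5, 2.0)}
for phi, theta in thetas.items():
    print(f"phi = {phi:.1f} -> theta = {theta:.4f}")
```

Because $θ$ is an increasing function of $ϕ$ that is bounded above by $1 / P_{S}$, a subpopulation making up a large share of the whole population can never show a large RASR, whatever its rate relative to the complement.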

If $P_{S}$ is unknown and the true population sizes are available, it can be estimated by

$$ {\hat{P}}_{S} = \frac{\sum_{j = 1}^{J} N_{S, j}}{\sum_{j = 1}^{J} (N_{S, j} + N_{S^{c}, j})} . \tag{12} $$

If the $N_{j}$’s are not available but are instead estimated from a sample survey, hence subject to sampling error, the estimator (12) is biased; in fact, it is biased even if the $N_{j}$’s are known, for example, Cochran.¹² A bias-corrected substitution estimator of $θ$, under the PAD assumption but with unknown $P_{S}$, is given by

$$ {\hat{θ}}_{*} = \frac{{\hat{ϕ}}_{bc}}{{\hat{P}}_{S, bc} {\hat{ϕ}}_{bc} + 1 - {\hat{P}}_{S, bc}}, \tag{13} $$

where ${\hat{P}}_{S, bc}$ is a bias-corrected estimator of $P_{S}$ whose expression is given in Section A.1 of the Supplementary Material.

To extend the method beyond the PAD assumption, we continue with (9) but without further assuming (10). Writing $R_{U, j} = X_{U, j} / N_{U, j}$ for the rate in the $j$th age group of population $U$, it is easy to show that

$$ R_{Ω} = \sum_{j = 1}^{J} w_{j} P_{S, j} R_{S, j} + \sum_{j = 1}^{J} w_{j} (1 - P_{S, j}) R_{S^{c}, j} . \tag{14} $$

Thus, we have the following expression:

$$ \tilde{θ} \equiv \frac{R_{S}}{R_{Ω}} = \frac{\sum_{j = 1}^{J} w_{j} R_{S, j}}{\sum_{j = 1}^{J} w_{j} P_{S, j} R_{S, j} + \sum_{j = 1}^{J} w_{j} (1 - P_{S, j}) R_{S^{c}, j}} . \tag{15} $$

A bias-corrected version of $\tilde{θ}$ via the method of Jiang et al.,⁸ denoted by ${\tilde{θ}}_{bc}$, is derived in Section A.2 of the Supplementary Material.
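The decomposition (14) follows from a one-line argument within each age group. The sketch below takes $R_{U, j} = X_{U, j} / N_{U, j}$ as the age-specific rate, which is our reading of the notation in (5) and (14), and uses $X_{Ω, j} = X_{S, j} + X_{S^{c}, j}$ together with (9):

```latex
\begin{aligned}
R_{Ω, j} &= \frac{X_{S, j} + X_{S^{c}, j}}{N_{Ω, j}} \\
         &= \frac{N_{S, j}}{N_{Ω, j}} \cdot \frac{X_{S, j}}{N_{S, j}}
          + \frac{N_{S^{c}, j}}{N_{Ω, j}} \cdot \frac{X_{S^{c}, j}}{N_{S^{c}, j}} \\
         &= P_{S, j} R_{S, j} + (1 - P_{S, j}) R_{S^{c}, j} .
\end{aligned}
```

Multiplying by $w_{j}$ and summing over $j$ gives (14). When $P_{S, j} = P_{S}$ for all $j$, the proportions factor out of both sums and (14) reduces to (6).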

An alternative substitution estimator can be obtained by replacing $R_{U, j}$ in (15) with its bias-corrected version, ${\hat{R}}_{U, j, bc}$, $U = S, S^{c}, Ω$ (Jiang et al.⁹). This leads to

$$ {\tilde{θ}}_{*} = \frac{{\hat{R}}_{S, bc}}{{\hat{R}}_{Ω, bc}} = \frac{\sum_{j = 1}^{J} w_{j} {\hat{R}}_{S, j, bc}}{\sum_{j = 1}^{J} w_{j} {\hat{P}}_{S, j, bc} {\hat{R}}_{S, j, bc} + \sum_{j = 1}^{J} w_{j} (1 - {\hat{P}}_{S, j, bc}) {\hat{R}}_{S^{c}, j, bc}} . \tag{16} $$

3.2. Estimation not along the age-proportionate set-up

From the development in Section 3.1, it is evident that the derivation and accuracy of the $θ$ estimator rely strongly on whether the $P_{S, j}$ ’s are known, or on how they are estimated if they are unknown. In this section, we take a different approach that does not involve the $P_{S, j}$ ’s and hence does not depend on the PAD assumption.

Note that the naive RASR, (2), is an estimator of $θ$, which can be written, in the new notation after expanding along the age standardization groups, as

$$ \hat{θ} = \frac{{\hat{R}}_{S}}{{\hat{R}}_{Ω}} = \frac{\sum_{j = 1}^{J} w_{j} (X_{S, j} / {\hat{N}}_{S, j})}{\sum_{j = 1}^{J} w_{j} \{ (X_{S, j} + X_{S^{c}, j}) / ({\hat{N}}_{S, j} + {\hat{N}}_{S^{c}, j}) \}} . \tag{17} $$

We now produce a bias-corrected version of $\hat{θ}$ that takes into consideration the correlation in ASRs between the subpopulation and the whole population in (17). Using techniques similar to those employed by Jiang et al.,⁸ but with a more careful evaluation of the covariance term, which vanishes under independence, we obtain the following bias-corrected version of $\hat{θ}$ (see Section A.2 of the Supplementary Material):

$$ {\hat{θ}}_{bc} = \hat{θ} - \frac{\hat{b}}{{\hat{R}}_{Ω}} + \frac{{\hat{R}}_{S} \hat{f} + {\hat{g}}_{13}}{{\hat{R}}_{Ω}^{2}} - \frac{{\hat{R}}_{S} {\hat{g}}_{3}}{{\hat{R}}_{Ω}^{3}} . \tag{18} $$

Expression (18) involves several new quantities, whose definitions are given below:

$$ \hat{b} = \sum_{j = 1}^{J} w_{j} \left( \frac{{\hat{R}}_{S, j} {\hat{V}}_{S, j}}{{\hat{N}}_{S, j}^{2}} \right), \quad \hat{f} = \sum_{j = 1}^{J} w_{j} \left( \frac{{\hat{R}}_{Ω, j} {\hat{V}}_{Ω, j}}{{\hat{N}}_{Ω, j}^{2}} \right), \tag{19} $$

$$ {\hat{g}}_{3} = \sum_{j = 1}^{J} w_{j}^{2} \left( \frac{{\hat{R}}_{Ω, j}}{{\hat{N}}_{Ω, j}} + \frac{{\hat{R}}_{Ω, j}^{2}}{{\hat{N}}_{Ω, j}^{2}} {\hat{V}}_{Ω, j} \right), \quad {\hat{g}}_{13} = \sum_{j = 1}^{J} w_{j}^{2} \widehat{Cov} ({\hat{R}}_{S, j}, {\hat{R}}_{Ω, j}) . \tag{20} $$

In (19), ${\hat{V}}_{S, j}$ and ${\hat{V}}_{Ω, j}$ are estimators of $var ({\hat{N}}_{S, j})$ and $var ({\hat{N}}_{Ω, j})$, respectively,⁸ and

$$ \widehat{Cov} ({\hat{R}}_{S, j}, {\hat{R}}_{Ω, j}) = \frac{{\hat{R}}_{S, j}}{{\hat{N}}_{Ω, j}} + \frac{{\hat{R}}_{S, j} {\hat{R}}_{Ω, j}}{{\hat{N}}_{S, j} {\hat{N}}_{Ω, j}} {\hat{V}}_{S, j} . $$

Note. Expression (18) is derived for the situation where the population data involve sampling errors. In the “ideal” case where the population data do not involve sampling errors, one simply needs to replace $\hat{V}$ s, which are estimates of the sampling variances, with zeros. This results in a simplified expression of ${\hat{θ}}_{bc}$ since $\hat{b} = \hat{f} = 0$ .
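Expressions (17)–(20) translate into a short routine. The following is a sketch under stated assumptions rather than the authors' implementation: the age-specific rates are taken to be ${\hat{R}}_{S, j} = X_{S, j} / {\hat{N}}_{S, j}$ and ${\hat{R}}_{Ω, j} = (X_{S, j} + X_{S^{c}, j}) / {\hat{N}}_{Ω, j}$, as suggested by (17), and the function name, argument names, and input numbers are hypothetical.

```python
def rasr_bias_corrected(w, x_s, x_sc, n_s, n_sc, v_s=None, v_omega=None):
    """Return (naive RASR, bias-corrected RASR) following (17)-(20).

    v_s and v_omega are the estimated sampling variances of the population
    estimates for S and Omega; omitting them treats the populations as known.
    """
    J = len(w)
    v_s = [0.0] * J if v_s is None else v_s
    v_omega = [0.0] * J if v_omega is None else v_omega

    n_om = [a + b for a, b in zip(n_s, n_sc)]                  # N-hat for Omega
    r_s = [x / n for x, n in zip(x_s, n_s)]                    # assumed R-hat_{S,j}
    r_om = [(a + b) / n for a, b, n in zip(x_s, x_sc, n_om)]   # assumed R-hat_{Omega,j}

    R_s = sum(wj * r for wj, r in zip(w, r_s))
    R_om = sum(wj * r for wj, r in zip(w, r_om))
    theta_naive = R_s / R_om

    # Quantities in (19) and (20), term by term.
    b = sum(w[j] * r_s[j] * v_s[j] / n_s[j] ** 2 for j in range(J))
    f = sum(w[j] * r_om[j] * v_omega[j] / n_om[j] ** 2 for j in range(J))
    g3 = sum(
        w[j] ** 2 * (r_om[j] / n_om[j] + r_om[j] ** 2 / n_om[j] ** 2 * v_omega[j])
        for j in range(J)
    )
    g13 = sum(
        w[j] ** 2
        * (r_s[j] / n_om[j] + r_s[j] * r_om[j] / (n_s[j] * n_om[j]) * v_s[j])
        for j in range(J)
    )

    # Expression (18).
    theta_bc = (
        theta_naive
        - b / R_om
        + (R_s * f + g13) / R_om ** 2
        - R_s * g3 / R_om ** 3
    )
    return theta_naive, theta_bc


# Hypothetical two-age-group inputs (illustrative only).
w = [0.5, 0.5]
x_s, x_sc = [20, 40], [10, 20]
n_s, n_sc = [10_000, 10_000], [10_000, 10_000]

naive0, bc0 = rasr_bias_corrected(w, x_s, x_sc, n_s, n_sc)
naive1, bc1 = rasr_bias_corrected(
    w, x_s, x_sc, n_s, n_sc, v_s=[2.5e5, 2.5e5], v_omega=[5.0e5, 5.0e5]
)
print(naive0, bc0, bc1)
```

In this toy input the ratio ${\hat{R}}_{S, j} / {\hat{R}}_{Ω, j}$ is the same in both age groups, so without sampling variances the two remaining correction terms in (18) cancel and the corrected value coincides with the naive one; the correction becomes visible once nonzero variances are supplied.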

4. Simulation studies: comparing RASR estimators

To gain insight into the relative accuracy of the various RASR estimators proposed, we carry out a series of simulation studies to compare the magnitudes of bias reduction across several estimators. These include the naive estimator, ${\hat{θ}}_{N}$, given by (2), which accounts for neither population overlap nor sampling errors; the existing Tiwari estimator, ${\hat{θ}}_{T}$, given by (7), which accounts for population overlap but assumes the PAD; and three estimators that account for population overlap without assuming the PAD, that is, the bias-corrected estimator following Jiang et al.⁸ with proportion estimation, ${\tilde{θ}}_{bc}$, the alternative substitution estimator with proportion estimation, ${\tilde{θ}}_{*}$, given by (16), and the bias-corrected estimator without proportion estimation, ${\hat{θ}}_{bc}$, given by (18). A summary of the estimators involved is provided in Table 1.

Table 1.
Summary of RASR estimators.

Estimator	Pop. Overlap	PAD	Proportion est.	Sampling error	Equation Ref.
${\hat{θ}}_{N}$ (Naive)	No	No	No	No	(2)
${\hat{θ}}_{T}$ (Tiwari)	Yes	Yes	Yes	No	(7)
${\tilde{θ}}_{*}$	Yes	No	Yes	Yes	(16)
${\tilde{θ}}_{bc}$	Yes	No	Yes	Yes	Suppl. 1.2
${\hat{θ}}_{bc}$	Yes	No	No	Yes	(18)

Pop. Overlap: whether there is overlap in the numerator/denominator populations; PAD: whether the PAD assumption is made; Proportion est.: whether estimation of the PAD ratio is needed; Sampling error: whether sampling errors are allowed in the denominators of the ASRs; Equation Ref.: the equation number or location where the estimator is defined.

Three subsections of simulation studies are presented below. Sections 4.1 and 4.2 focus on the bias, or bias-correction, properties of different estimators. In particular, we show that the proposed bias-corrected estimator, ${\hat{θ}}_{bc}$ , has superior power in terms of the bias-correction over the other estimators. On the other hand, we also would like to make sure that the bias-correction is not at the expense of increasing the variance. The latter is demonstrated in Section 4.3 in our extended simulation studies, when variance (in terms of coefficient of variation), and coverage probability, are also taken into consideration.

4.1. Simulation studies with sampling errors

The first set of simulation studies evaluates the bias reduction when the population totals involve sampling errors. We consider a scenario where the PAD assumption holds and a scenario where it does not hold. Tables 2 and 3 report the percentage relative bias (%RB) based on $K = 10{,}000$ simulation runs. The %RB of an estimator, $\check{θ}$, is defined as

$$ \% RB = 100 \times \frac{E (\check{θ} - θ)}{θ}, $$

where $θ$ is the true parameter value and $E$ is the simulated expectation (i.e. the average over the simulation runs).
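The %RB is straightforward to compute from a vector of simulated estimates. The sketch below uses a hypothetical helper name and three made-up estimates purely to show the arithmetic; in the studies reported here the average runs over all $K$ simulation runs.

```python
from statistics import fmean


def percent_relative_bias(estimates, theta_true):
    """100 * (average of the estimates - true value) / true value."""
    return 100.0 * (fmean(estimates) - theta_true) / theta_true


# Made-up estimates around a true value of 1.0 (illustrative only).
rb = percent_relative_bias([1.01, 0.99, 1.03], 1.0)
print(f"%RB = {rb:.2f}")
```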

Table 2.
%RB of different RASR estimators when PAD does not hold.

		True value of $θ$
Sampling method	Estimator	1.0000	1.0809	1.1760	1.2896
SRS	${\hat{θ}}_{N}$	0.3432	0.3631	0.3690	0.3425
SRS	${\hat{θ}}_{T}$	0.0686	0.4958	0.9953	1.4967
SRS	${\tilde{θ}}_{*}$	0.3332	0.3418	0.3467	0.3026
SRS	${\tilde{θ}}_{bc}$	0.3326	0.3429	0.3404	0.3030
SRS	${\hat{θ}}_{bc}$	−0.0160	−0.0027	−0.0005	−0.0325
STR	${\hat{θ}}_{N}$	0.4587	0.3164	0.3832	0.3452
STR	${\hat{θ}}_{T}$	0.1969	0.4076	0.9816	1.5389
STR	${\tilde{θ}}_{*}$	0.4485	0.2993	0.3623	0.3058
STR	${\tilde{θ}}_{bc}$	0.4477	0.2971	0.3540	0.3063
STR	${\hat{θ}}_{bc}$	0.0983	−0.0486	0.0126	−0.0289

Table 3.

%RB of different RASR estimators when PAD holds.

		True value of $θ$
Sampling method	Estimator	1.0000	1.0909	1.2000	1.3333
SRS	${\hat{θ}}_{N}$	0.3350	0.2861	0.3664	0.3054
SRS	${\hat{θ}}_{T}$	−0.0008	−0.0540	0.0390	−0.0245
SRS	${\hat{θ}}_{*}$	−0.0009	−0.0528	0.0391	−0.0243
SRS	${\tilde{θ}}_{*}$	0.3302	0.2839	0.3646	0.3048
SRS	${\tilde{θ}}_{b c}$	0.3302	0.2840	0.3647	0.3049
SRS	${\hat{θ}}_{b c}$	−0.0206	−0.0681	0.0112	−0.0495
STR	${\hat{θ}}_{N}$	0.4156	0.1758	0.3100	0.3822
STR	${\hat{θ}}_{T}$	0.0677	−0.1721	−0.0513	0.0359
STR	${\hat{θ}}_{*}$	0.0684	−0.1717	−0.0514	0.0363
STR	${\tilde{θ}}_{*}$	0.4115	0.1727	0.3078	0.3816
STR	${\tilde{θ}}_{b c}$	0.4115	0.1728	0.3079	0.3817
STR	${\hat{θ}}_{b c}$	0.0600	−0.1790	−0.0460	0.0265

The population data are simulated using either simple random sampling (SRS) or stratified random sampling (STR) as the sampling procedure, for example, Lohr.¹³ The same sampling scheme is used for both subpopulations, $S$ and $S^{c}$, and the sample for the whole population, $Ω$, is the combination of the two subpopulation samples. The population size for $S$ is set to $N_{S} = 50{,}000$ across all $J$ age groups. For the situation where the PAD assumption does not hold, we set the true $P_{S, j}$ to be a sequence from 0.3 to 0.7; for the situation where the PAD assumption holds, we set $P_{S}$ to be 0.5.

As it is easier to control the true ratio $ϕ$ between $R_{S}$ and $R_{S^{c}}$ than to control $θ$ directly [recall $θ = E (R_{S}) / E (R_{Ω})$ and $ϕ = E (R_{S}) / E (R_{S^{c}})$], the true ratio $ϕ$ is set to one of four values: 1.0, 1.2, 1.5, and 2.0. These values result in true values of $θ$ of 1.0, 1.08, 1.18, and 1.29, respectively, when PAD does not hold, and 1.0, 1.09, 1.2, and 1.33 when PAD holds. We consider the same $J = 19$ age groups as in Jiang et al. (2022), with the same age-adjusting weights $w_{j}$ according to the 2000 standard population. We summarize the steps of simulating the data as follows.

Simulation steps: In the SRS setting, the following procedure is applied independently to both the subpopulation $S$ and the complementary population $S^{c}$, given the value of $E (R) = r$, the population size $N$, and the population age structure $P_{age}$. We set $r_{S}$ through its relationship with $ϕ = E (R_{S}) / E (R_{S^{c}}) = r_{S} / r_{S^{c}}$, with $r_{S^{c}}$ fixed at 0.001, which approximates the crude incidence rate of female breast cancer in the United States, that is, about 100 per 100,000 persons. For subpopulation $S$, define $N_{j} = [N_{S} w_{j}]$ for $1 \leq j \leq J - 1$, where $[x]$ denotes the integer part of $x$, and let $N_{J} = N_{S} - \sum_{j = 1}^{J - 1} N_{j}$. The values $N_{1}, \dots, N_{J}$ are fixed. We then construct the population age structure $P_{age} = \{a_{1}, \dots, a_{N_{S}}\}$ such that $a_{i} = 1$ for $1 \leq i \leq N_{1}$; $a_{i} = 2$ for $N_{1} + 1 \leq i \leq N_{1} + N_{2}$; $\dots$; and $a_{i} = J$ for $N_{1} + \dots + N_{J - 1} + 1 \leq i \leq N_{S}$.

For each fixed $j$ ($j = 1, \dots, J$), the following steps are performed: (i) Define $y_{i}$ such that $y_{i} = 1$ if $a_{i} = j$, and $y_{i} = 0$ otherwise, so that $N_{j} = \sum_{i = 1}^{N_{S}} y_{i}$; (ii) Generate $X_{j} \sim Poisson (r_{S} N_{j})$; (iii) Draw $n = 0.01 N_{S}$ samples without replacement from $\{y_{i} : 1 \leq i \leq N_{S}\}$ and compute the sample mean $\bar{y}$, where 0.01 is the sampling fraction; (iv) Compute the estimate ${\hat{N}}_{j} = N_{S} \bar{y}$; (v) Estimate the sampling variance $V_{S, j}$ via its design-based estimator:

$$ {\hat{V}}_{S, j} = \frac{N_{S} {\hat{N}}_{S, j} (N_{S} - {\hat{N}}_{S, j})}{N_{S} - 1} \left( \frac{1}{n_{j}} - \frac{1}{N_{S}} \right), $$

where $n_{j}$ is the sample size [i.e. the $n$ in step (iii) above]. For $S^{c}$, the procedure is similar.
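Steps (iii)–(v) of the SRS procedure can be sketched for a single age group as follows. This is an illustration under assumptions, not the simulation code used for the tables: the function name, the seed, and the group size of 3,500 are hypothetical, and the Poisson draw of step (ii) is omitted because it does not affect the population estimate.

```python
import random


def srs_population_estimate(n_total, n_group, sampling_fraction=0.01, rng=None):
    """Estimate one age group's size from a simple random sample.

    Units 0 .. n_group - 1 are treated as belonging to the age group
    (indicator y_i = 1); all other units have y_i = 0.
    """
    rng = rng or random.Random()
    n = round(sampling_fraction * n_total)

    # Step (iii): draw n units without replacement and average the indicators.
    sampled = rng.sample(range(n_total), n)
    y_bar = sum(1 for i in sampled if i < n_group) / n

    # Step (iv): scale the sample mean up to a population total.
    n_hat = n_total * y_bar

    # Step (v): design-based variance estimator, written as in the text.
    v_hat = (n_total * n_hat * (n_total - n_hat) / (n_total - 1)) * (
        1.0 / n - 1.0 / n_total
    )
    return n_hat, v_hat


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
n_hat, v_hat = srs_population_estimate(50_000, 3_500, rng=rng)
print(n_hat, v_hat ** 0.5)
```

With a 1% sampling fraction each sampled unit represents 100 population units, so the estimate moves in steps of 100 and its standard error is of the order of several hundred for a group of this size.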

In the STR setting, the procedure is similar except that the population is first stratified into $H = 4$ strata. The strata are created as follows. Generate $N_{S}$ independent random variables $ξ_{1}, \dots, ξ_{N_{S}}$ from $P (ξ_{i} = h) = h / 10$ for $h = 1, 2, 3, 4$. Let $S_{h} = \{i : ξ_{i} = h, 1 \leq i \leq N_{S}\}$ and denote $N_{st, h} = | S_{h} |$, where $| \cdot |$ denotes cardinality. Again, for each fixed $j$ ($j = 1, \dots, J$), generate $X_{j}$ as in the SRS case. To obtain ${\hat{N}}_{j}$, proceed as follows for each stratum $h = 1, \dots, 4$: (i) extract the subpopulation $P_{h} = \{y_{i} : i \in S_{h}\}$; (ii) draw $n_{h} = 0.01 N_{st, h}$ samples without replacement from $P_{h}$, with the same sampling fraction of 0.01 used for all strata; (iii) compute the sample mean ${\bar{y}}_{h}$. Finally, compute ${\hat{N}}_{j} = \sum_{h = 1}^{H} N_{st, h} {\bar{y}}_{h}$. The sampling variance $V_{S, j}$ is also estimated via the design-based estimator:

$$ {\hat{V}}_{S, j} = \sum_{h = 1}^{H} \left( 1 - \frac{n_{h}}{N_{st, h}} \right) N_{st, h}^{2} \frac{\sum_{i \in S_{h}} {(y_{i} - {\bar{y}}_{h})}^{2}}{n_{h} (n_{h} - 1)} . $$

Note that the above definition is for a fixed $j$, hence the subscript $j$ is used.
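The stratified estimate and its variance estimator can be sketched in the same spirit. One assumption is made explicit here: the sum of squared deviations is taken over the sampled units in each stratum, which is the usual design-based form. The function name and the tiny two-stratum input are hypothetical.

```python
def stratified_total_estimate(stratum_sizes, stratum_samples):
    """Stratified estimate of a population total and its estimated variance.

    stratum_sizes[h] is N_{st,h}; stratum_samples[h] holds the sampled
    0/1 indicators y_i drawn from stratum h.
    """
    n_hat = 0.0
    v_hat = 0.0
    for big_n, ys in zip(stratum_sizes, stratum_samples):
        n_h = len(ys)
        y_bar = sum(ys) / n_h
        # Sample variance within the stratum (sum over sampled units: assumption).
        s2 = sum((y - y_bar) ** 2 for y in ys) / (n_h - 1)
        n_hat += big_n * y_bar
        v_hat += (1.0 - n_h / big_n) * big_n ** 2 * s2 / n_h
    return n_hat, v_hat


# Tiny made-up example with two strata (illustrative only).
sizes = [100, 200]
samples = [[1, 0, 0, 1], [0, 0, 1, 0]]
n_hat, v_hat = stratified_total_estimate(sizes, samples)
print(n_hat, v_hat)
```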

Based on the simulation results in Table 2, it is evident that when the PAD assumption does not hold and the true $P_{S}$ is unknown, ${\tilde{θ}}_{*}$ and ${\tilde{θ}}_{bc}$ reduce some of the bias compared to ${\hat{θ}}_{N}$ but remain more biased than ${\hat{θ}}_{bc}$. For $θ = 1.0$, ${\hat{θ}}_{T}$ outperformed all estimators except ${\hat{θ}}_{bc}$; however, for the other $θ$ values, ${\hat{θ}}_{T}$ had (much) higher %RB than the other estimators. In all cases, ${\hat{θ}}_{bc}$ exhibits the lowest %RB in absolute value among the five estimators and has a relatively simple expression.

From the simulation results in Table 3, where the PAD assumption holds and the true $P_{S}$ is unknown, ${\hat{θ}}_{T}$ and all three newly proposed estimators outperform ${\hat{θ}}_{N}$. The %RBs of ${\tilde{θ}}_{bc}$ and ${\tilde{θ}}_{*}$ are quite close to each other and only slightly better than that of the naive estimator ${\hat{θ}}_{N}$. ${\hat{θ}}_{bc}$ outperforms all the other estimators except ${\hat{θ}}_{T}$; the performance of these two is very similar, with each better than the other under some scenarios.

Additional simulation studies are presented in Section 4.3.

4.2. Simulation studies without sampling errors

In this subsection, we consider an “ideal” situation where the population data of the ASRs do not involve sampling errors. This is the case considered by Tiwari et al. (2010). Similar to Tables 2 and 3, we present the corresponding results when the denominators of the ASRs do not involve sampling errors. In light of the superior performance observed for ${\hat{θ}}_{bc}$, we focus on comparing three estimators, ${\hat{θ}}_{N}$, ${\hat{θ}}_{T}$, and ${\hat{θ}}_{bc}$. Tables 4 and 5 present the results for the same population size of $N_{S} =$ 50,000. We further evaluate the impact of a larger population size of $N_{S} =$ 100,000, which is approximately the size of a mid-sized county in the United States. Results are presented in Tables 6 and 7.

Table 4.
%RB of different RASR estimators when PAD does not hold (no sampling errors in the denominators): $N_{S} =$ 50,000.

	True value of $θ$
Estimator	0.4260	1.0000	1.0809	1.1909	1.3167	1.5078
${\hat{θ}}_{N}$	−0.0198	−0.0122	−0.0264	−0.0204	−0.0439	0.0553
${\hat{θ}}_{T}$	−2.7438	−0.0703	−0.1727	−0.3477	−0.6083	2.7703
${\hat{θ}}_{bc}$	0.0046	−0.0122	−0.0260	−0.0195	−0.0426	0.0506

Table 5.

%RB of different RASR estimators when PAD holds (no sampling errors in the denominators): $N_{S} =$ 50,000.

	True value of $θ$
Estimator	0.4000	1.0000	1.0909	1.2000	1.3333	1.6000
${\hat{θ}}_{N}$	−0.1146	−0.0581	0.0171	−0.0038	−0.0265	0.0469
${\hat{θ}}_{T}$	−0.1146	−0.0581	0.0171	−0.0038	−0.0265	0.0469
${\hat{θ}}_{bc}$	−0.1146	−0.0581	0.0171	−0.0038	−0.0265	0.0469

Table 6.

%RB of different RASR estimators when PAD does not hold (no sampling errors in the denominators): $N_{S} =$ 100,000.

	True value of $θ$
Estimator	0.4260	1.0000	1.0809	1.1909	1.3167	1.5078
${\hat{θ}}_{N}$	0.0465	0.0012	0.0714	0.0385	−0.0159	0.0187
${\hat{θ}}_{T}$	−2.7953	0.0625	0.5605	0.9734	1.5417	2.7271
${\hat{θ}}_{bc}$	0.0345	0.0012	0.0697	0.0350	−0.0113	0.0163

Table 7.

%RB of different RASR estimators when PAD holds (no sampling errors in the denominators): $N_{S} =$ 100,000.

	True value of $θ$
Estimator	0.4000	1.0000	1.0909	1.2000	1.3333	1.6000
${\hat{θ}}_{N}$	0.0386	0.0018	0.0239	0.0838	−0.0745	0.0266
${\hat{θ}}_{T}$	0.0386	0.0018	0.0239	0.0838	−0.0745	0.0266
${\hat{θ}}_{bc}$	0.0386	0.0018	0.0239	0.0838	−0.0745	0.0266

When PAD holds, the results for the three estimators agree to the fourth decimal place. (There are differences beyond the fourth decimal place, so the results are not exactly the same.) This shows that, when there is no sampling error in the denominators and the PAD holds, the different estimation strategies, including bias correction, yield essentially no improvement over one another. On the other hand, when the PAD assumption does not hold, both ${\hat{θ}}_{N}$ and ${\hat{θ}}_{bc}$ significantly outperform ${\hat{θ}}_{T}$ across all scenarios; however, ${\hat{θ}}_{bc}$ performs only slightly better than ${\hat{θ}}_{N}$.

4.3. Extended simulation studies with sampling errors

In this subsection, we expand the simulation study presented in Section 4.1 by exploring the scenario presented in Jiang et al.⁹ In this scenario, the PAD assumption does not hold, and $\hat{N}$ is directly generated from a distribution governed by a parameter $ρ$ , which basically is the coefficient of variation (CV) of the simulated ${\hat{N}}_{j}$ s. These CVs may be identical or distinct for $S$ and $S^{c}$ . The results are presented in Table 8.

Table 8.
Performance of RASR estimators (sampling errors in the population data).

			$ρ = 0.05$			$ρ = 0.2$
$N_{S}$	$θ$	Est.	%RB	Var	CV	%RB	Var	CV
50k	0.4260	${\hat{θ}}_{N}$	0.0452	0.0057	0.1768	2.3026	0.0064	0.1840
50k	0.4260	${\hat{θ}}_{T}$	−2.8559	0.0057	0.1831	−2.4699	0.0063	0.1909
50k	0.4260	${\tilde{θ}}_{*}$	0.0823	0.0057	0.1831	2.3981	0.0063	0.1909
50k	0.4260	${\tilde{θ}}_{bc}$	0.0709	0.0057	0.1768	2.3932	0.0064	0.1840
50k	0.4260	${\hat{θ}}_{bc}$	−0.0465	0.0057	0.1768	0.4442	0.0062	0.1840
50k	1.5079	${\hat{θ}}_{N}$	0.1726	0.0012	0.0230	2.2223	0.0042	0.0419
50k	1.5079	${\hat{θ}}_{T}$	2.7865	0.0014	0.0244	2.8064	0.0041	0.0410
50k	1.5079	${\tilde{θ}}_{*}$	0.1689	0.0014	0.0245	2.1931	0.0040	0.0410
50k	1.5079	${\tilde{θ}}_{bc}$	0.1627	0.0012	0.0230	2.1712	0.0042	0.0422
50k	1.5079	${\hat{θ}}_{bc}$	0.0527	0.0012	0.0230	0.3368	0.0040	0.0418
100k	0.4260	${\hat{θ}}_{N}$	0.2485	0.0028	0.1235	2.2450	0.0033	0.1313
100k	0.4260	${\hat{θ}}_{T}$	−2.6817	0.0028	0.1283	−2.7106	0.0033	0.1376
100k	0.4260	${\tilde{θ}}_{*}$	0.2715	0.0028	0.1283	2.3540	0.0033	0.1376
100k	0.4260	${\tilde{θ}}_{bc}$	0.2628	0.0028	0.1235	2.3207	0.0033	0.1313
100k	0.4260	${\hat{θ}}_{bc}$	0.1444	0.0028	0.1235	0.3723	0.0032	0.1313
100k	1.5079	${\hat{θ}}_{N}$	0.1069	0.0007	0.0172	2.1879	0.0036	0.0389
100k	1.5079	${\hat{θ}}_{T}$	2.7062	0.0008	0.0184	2.7821	0.0034	0.0377
100k	1.5079	${\tilde{θ}}_{*}$	0.1013	0.0008	0.0184	2.3245	0.0034	0.0377
100k	1.5079	${\tilde{θ}}_{bc}$	0.0994	0.0007	0.0172	2.1387	0.0037	0.0392
100k	1.5079	${\hat{θ}}_{bc}$	−0.0113	0.0007	0.0172	0.3046	0.0034	0.0388

Although our purpose is to develop bias-corrected estimators, we would also like to ensure that the bias correction does not significantly increase the variance; otherwise, there may be a concern about a “bias-variance trade-off”. Performance in terms of bias correction is measured by the %RB. As for the variance, note that the quantities we are dealing with are rates, which may themselves be small. To take this into account, we consider the CV, which is the standard deviation relative to the mean. The CV of an estimator, $\overset{ˇ}{θ}$, is defined as

CV = \frac{\sqrt{var (\overset{ˇ}{θ})}}{| E (\overset{ˇ}{θ}) |},

where E and var denote the simulated mean and variance, respectively.
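As a minimal illustration of how these two performance measures can be computed from simulation output, the following Python sketch takes an array of simulated estimates and returns the %RB and the CV. The function name `rb_and_cv` and the toy inputs are hypothetical. The %RB is assumed here to be 100 times the difference between the simulated mean and the true value, divided by the true value, and the simulated variance is assumed to be the sample variance.

```python
import numpy as np

def rb_and_cv(estimates, theta_true):
    """Return (%RB, CV) for an array of simulated estimates."""
    est = np.asarray(estimates, dtype=float)
    mean = est.mean()              # simulated mean, E
    var = est.var(ddof=1)          # simulated variance (sample variance; assumption)
    pct_rb = 100.0 * (mean - theta_true) / theta_true  # assumed %RB definition
    cv = np.sqrt(var) / abs(mean)  # CV as defined above
    return pct_rb, cv

# Toy usage with artificial "estimates" centered slightly above the true value.
rng = np.random.default_rng(0)
sims = rng.normal(loc=0.43, scale=0.05, size=10_000)
pct_rb, cv = rb_and_cv(sims, theta_true=0.4260)
```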

Once more, it is evident that ${\hat{θ}}_{bc}$ outperforms all other estimators in terms of %RB. Notably, the naive estimator ${\hat{θ}}_{N}$ performs relatively better than the other estimators, except ${\hat{θ}}_{bc}$, particularly when $θ$ is small. Among all of the estimators, ${\hat{θ}}_{T}$ exhibits the largest %RB in absolute value, which can be attributed to the violation of the PAD assumption. In terms of the CV, all estimators perform similarly for a given value of $N_{S}$.

Next, we consider a simulation setting similar to that reported in Table 2 but utilizing the same true $θ$ values as in Table 8. The results are reported in Table 9.

Table 9.

Performance of RASR estimators under SRS/STR setting (sampling errors in the population data).

			SRS			STR
$N_{S}$	$θ$	Est.	%RB	Var	CV	%RB	Var	CV
50K	0.4260	${\hat{θ}}_{N}$	0.1239	0.0057	0.1776	0.5335	0.0056	0.1746
50K	0.4260	${\hat{θ}}_{T}$	−2.9896	0.0058	0.1841	−2.6279	0.0056	0.1810
50K	0.4260	${\tilde{θ}}_{*}$	0.1682	0.0058	0.1840	0.5866	0.0056	0.1810
50K	0.4260	${\tilde{θ}}_{bc}$	0.1641	0.0057	0.1775	0.5713	0.0056	0.1745
50K	0.4260	${\hat{θ}}_{bc}$	−0.1998	0.0057	0.1775	0.2049	0.0056	0.1745
50K	1.5079	${\hat{θ}}_{N}$	0.3956	0.0015	0.0254	0.3498	0.0012	0.0224
50K	1.5079	${\hat{θ}}_{T}$	2.7186	0.0017	0.0265	2.6781	0.0013	0.0231
50K	1.5079	${\tilde{θ}}_{*}$	0.3598	0.0017	0.0265	0.3209	0.0013	0.0231
50K	1.5079	${\tilde{θ}}_{bc}$	0.3437	0.0015	0.0253	0.2981	0.0011	0.0224
50K	1.5079	${\hat{θ}}_{bc}$	0.0215	0.0015	0.0253	−0.0241	0.0011	0.0223
100k	0.4260	${\hat{θ}}_{N}$	0.2653	0.0028	0.1236	0.2237	0.0028	0.1246
100k	0.4260	${\hat{θ}}_{T}$	−2.7235	0.0028	0.1281	−2.7805	0.0029	0.1291
100k	0.4260	${\tilde{θ}}_{*}$	0.3011	0.0028	0.1281	0.2541	0.0029	0.1291
100k	0.4260	${\tilde{θ}}_{bc}$	0.2882	0.0028	0.1236	0.2456	0.0028	0.1245
100k	0.4260	${\hat{θ}}_{bc}$	0.1073	0.0028	0.1236	0.0644	0.0028	0.1245
100k	1.5079	${\hat{θ}}_{N}$	0.1808	0.0008	0.0182	0.1863	0.0006	0.0156
100k	1.5079	${\hat{θ}}_{T}$	2.7077	0.0008	0.0190	2.7000	0.0006	0.0162
100k	1.5079	${\tilde{θ}}_{*}$	0.1674	0.0009	0.0190	0.1681	0.0006	0.0162
100k	1.5079	${\tilde{θ}}_{bc}$	0.1570	0.0008	0.0182	0.1625	0.0006	0.0156
100k	1.5079	${\hat{θ}}_{bc}$	−0.0021	0.0008	0.0182	0.0033	0.0006	0.0156

A similar pattern is observed: in most cases, ${\hat{θ}}_{bc}$ is the best performer, and ${\hat{θ}}_{T}$ performs the worst in terms of %RB in all cases. Also, note that ${\tilde{θ}}_{*}$ and ${\tilde{θ}}_{bc}$ have very similar performance. In terms of CV, all estimators perform similarly under the same $N_{p_{1}}$.

The only exception, in which ${\hat{θ}}_{bc}$ did not perform the best in terms of %RB, is the SRS case with $N_{S} =$ 50K and $θ = 0.4260$. There are two possible explanations. First, ${\hat{θ}}_{bc}$ is more effective in terms of bias correction when the bias of the naive estimator, ${\hat{θ}}_{N}$, is relatively large. In this case, the naive estimator actually performed quite well, and so did ${\tilde{θ}}_{*}$ and ${\tilde{θ}}_{bc}$. This led to ${\hat{θ}}_{bc}$ over-correcting the bias, resulting in a negative %RB that is larger in absolute value than those of ${\hat{θ}}_{N}$, ${\tilde{θ}}_{*}$ and ${\tilde{θ}}_{bc}$. Second, the CVs of all five estimators are about the same, and they are all relatively large in this case. As a consequence, the estimators are relatively unstable, which can affect their empirical performance.

See Section A.4 of the Supplementary Material for additional simulation results.

5. Variance estimator for ${\hat{θ}}_{bc}$

In view of the simulation results in the previous section, the preferred RASR estimator is ${\hat{θ}}_{bc}$ , the bias-corrected RASR estimator developed without the PAD assumption. Thus, in this section, we further develop a variance estimator for ${\hat{θ}}_{bc}$ to provide a measure of uncertainty. Using a similar derivation method to that in Jiang et al.,⁸ an approximately unbiased estimator of $var ({\hat{θ}}_{bc})$ can be obtained as

\hat{Var} ({\hat{θ}}_{bc}) = \frac{{\hat{g}}_{1} + {\hat{θ}}^{2} {\hat{g}}_{3} - 2 \hat{θ} {\hat{g}}_{13}}{{\hat{R}}_{Ω}^{2}} . \tag{21}

See Section A.3 of the Supplementary Material for the derivation. Again, in the case of no sampling errors in the denominators, expression (21) can be simplified.
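Once its ingredients are available, expression (21) is simple to evaluate. The sketch below assumes that the plug-in quantities ${\hat{g}}_{1}$, ${\hat{g}}_{3}$, ${\hat{g}}_{13}$, $\hat{θ}$ and ${\hat{R}}_{Ω}$ have already been computed as described in the Supplementary Material; the function name `var_theta_bc` and its argument names are hypothetical and serve only to show the arithmetic.

```python
def var_theta_bc(g1, g3, g13, theta_hat, r_omega):
    """Evaluate expression (21) from precomputed plug-in quantities.

    All five inputs are assumed to be supplied by the user; this function
    does not derive them.
    """
    numerator = g1 + theta_hat**2 * g3 - 2.0 * theta_hat * g13
    return numerator / r_omega**2
```

The square root of the returned value gives the standard error used in the confidence intervals later in this section.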

We carry out another simulation study to investigate the performance of the proposed variance estimator, (21). Our focus remains on scenarios where the denominators of the ASRs involve sampling errors. Table 10 reports the %RB and CV of the estimator under the simulation setting of Table 8 (with the same number of simulation runs, $K = 10{,}000$). Note that the %RB and CV reported here pertain specifically to the variance estimator.

Table 10.

Performance of variance estimator (Table 8 setting).

			$θ = 0.4260$				$θ = 1.5079$
$N p_{1}$	$ρ_{1}$	$ρ_{2}$	%RB	CV	AL	CP	%RB	CV	AL	CP
50k	0.05	0.2	0.8401	0.1684	0.2921	0.9454	−2.5294	0.0863	0.1831	0.9465
50k	0.2	0.05	9.3000	0.2041	0.3156	0.9504	−4.8023	0.1776	0.2065	0.9485
100k	0.05	0.2	−1.7203	0.1261	0.2085	0.9537	−3.8913	0.0890	0.1576	0.9496
100k	0.2	0.05	12.1513	0.1582	0.2261	0.9613	−8.6372	0.1852	0.1795	0.9453

The results show that the %RB values are generally in the single digits or low double digits, which is considered satisfactory (see, for example, Jiang and Torabi¹⁴). The CV ranges from approximately 8.6% to 20.4%.

Table 11 reports the %RB and CV of the variance estimator under the simulation setting of Table 9 (with the same number of simulation runs, $K = 10{,}000$). Again, the %RB of the variance estimator of ${\hat{θ}}_{bc}$ is in the single digits in most cases, which is considered satisfactory; the exceptions are the two STR cases with $θ = 1.5079$, where the %RB is about 25% and 28%. The CV ranges from approximately 5% to 17%.

Table 11.

Performance of variance estimator (Table 9 setting).

		SRS				STR
$N p_{1}$	$θ$	%RB	CV	AL	CP	%RB	CV	AL	CP
50K	0.4260	−0.4012	0.1677	0.2941	0.9538	5.2070	0.0838	0.2945	0.9453
50K	1.5079	2.4295	0.1635	0.1532	0.9545	25.2044	0.0752	0.1533	0.9682
100k	0.4260	0.8827	0.1143	0.2072	0.9471	−0.4535	0.1125	0.2072	0.9479
100k	1.5079	−1.2699	0.0556	0.1067	0.9519	28.0175	0.0506	0.1068	0.9539

Based on the variance estimator, we construct a large-sample confidence interval for the RASR, $θ$, in the form of $[{\hat{θ}}_{bc} - z_{α / 2}\, s.e.({\hat{θ}}_{bc}), {\hat{θ}}_{bc} + z_{α / 2}\, s.e.({\hat{θ}}_{bc})]$, where $α$ is the level of significance, $z_{α / 2}$ is the upper $α / 2$ quantile of the standard normal distribution, and $s.e.({\hat{θ}}_{bc}) = {\hat{Var} ({\hat{θ}}_{bc})}^{1 / 2}$. The performance of the confidence interval is measured by the empirical (simulated) coverage probability (CP) and the average length of the interval over the simulations (AL). We report $CP = K^{- 1} \sum_{k = 1}^{K} 1_{(θ \in {CI}^{(k)})}$, where $1_{(θ \in {CI}^{(k)})}$ is the indicator that $θ$ falls within the confidence interval, ${CI}^{(k)}$, in the $k$th simulation run. For $α = 0.05$, that is, a 95% confidence interval, the simulation results (again based on $K = 10{,}000$ simulation runs) are also presented in Tables 10 and 11. The CP of the confidence interval is fairly close to the nominal level (0.95) in all cases; the AL decreases as the population size (hence the sample size) increases, which is reasonable. Additional results are deferred to Section B of the Supplementary Material.
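The interval and its two performance measures can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used for the simulations: the function names `wald_ci` and `coverage_and_length` are hypothetical, and the point estimates and variance estimates are assumed to be supplied by the user, one pair per simulation run.

```python
import numpy as np
from statistics import NormalDist

def wald_ci(theta_hat, var_hat, alpha=0.05):
    """Large-sample interval: theta_hat -/+ z_{alpha/2} * s.e."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 normal quantile
    se = np.sqrt(var_hat)
    return theta_hat - z * se, theta_hat + z * se

def coverage_and_length(theta_hats, var_hats, theta_true, alpha=0.05):
    """Empirical coverage probability (CP) and average length (AL) over K runs."""
    theta_hats = np.asarray(theta_hats, dtype=float)
    var_hats = np.asarray(var_hats, dtype=float)
    lower, upper = wald_ci(theta_hats, var_hats, alpha)
    covered = (lower <= theta_true) & (theta_true <= upper)  # indicator per run
    return covered.mean(), (upper - lower).mean()

# Toy usage: two runs, only the first interval contains the true value 1.0.
cp, al = coverage_and_length([1.0, 2.0], [0.01, 0.01], theta_true=1.0)
```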

The confidence interval simulation results suggest asymptotic normality of the proposed bias-corrected estimator, ${\hat{θ}}_{bc}$. Such a result can be derived following the standard asymptotic framework of finite population sampling; see, for example, Lohr¹³ (sec. 2.5), with further details in Fuller¹⁵ (ch. 1). Note that, because we are dealing with a finite population, the sample size cannot go to infinity without the population size also increasing. So, under the standard asymptotic framework, both the sample size and the associated population size are assumed to increase. Under this framework, asymptotic normality of sampling statistics, such as the sample mean or proportion, can be established. Because the ASR is a weighted sum of independent sample proportions, its asymptotic normality then follows. Asymptotic normality of the naive RASR (ratio of ASRs), ${\hat{θ}}_{N}$, can then be derived via the delta-method (e.g. Jiang 2022, ex. 4.4). The difference between ${\hat{θ}}_{bc}$ and ${\hat{θ}}_{N}$ is the bias-correction term. However, this term is of lower order, so it does not change the asymptotic normality result.
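The delta-method step can be made explicit for a generic ratio. The following is the standard first-order expansion, written with generic symbols $\hat{R}_{1}$ and $\hat{R}_{2}$ (introduced here for illustration only, not notation from this article) for two possibly correlated estimators with means $R_{1}$ and $R_{2} \neq 0$, and $θ = R_{1} / R_{2}$.

```latex
\frac{\hat{R}_{1}}{\hat{R}_{2}} - θ \approx \frac{1}{R_{2}} \left\{ (\hat{R}_{1} - R_{1}) - θ (\hat{R}_{2} - R_{2}) \right\},
\qquad
var \left( \frac{\hat{R}_{1}}{\hat{R}_{2}} \right) \approx \frac{var (\hat{R}_{1}) + θ^{2} \, var (\hat{R}_{2}) - 2 θ \, cov (\hat{R}_{1}, \hat{R}_{2})}{R_{2}^{2}} .
```

The leading term is linear in the centered estimators, which is why joint asymptotic normality of the two ASRs carries over to their ratio. The approximate variance also has the same algebraic form as expression (21).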

6. A real-data application

To demonstrate the performance of various RASR estimators in practice, we analyze 2018 U.S. mortality data for all cancer causes, collected by the National Center for Health Statistics’ National Vital Statistics System and accessed through the National Cancer Institute (NCI)’s SEER Program. Specifically, we focus on four scenarios that were inspired by empirical studies.¹⁶⁻¹⁹ They were chosen to reflect different group sizes and within-group age structures: (i) foreign-born Hispanic in California compared with New Mexico; (ii) foreign-born Hispanic in California and New Mexico compared with those in the entire U.S.; (iii) foreign-born non-Hispanic Asian and Pacific Islander (NHAPI) compared with all races in California and Hawaii; and (iv) US-born NHAPI in California and Hawaii compared with US-born NHAPI in the entire U.S.

The naive ASR estimator simply calculates the ASRs, treating the populations as fixed quantities and error-free. The naive RASR estimator is computed as a comparison. Here, we adopt some terms from the practitioners, referring to the numerator population as the “child group” and the denominator population as the “parent group.” All ASRs are per 100,000 person-years and are standardized to the 2000 U.S. standard population by 5-year age group, with the oldest age group being 85 years and older combined. Cancer sites were coded according to the International Statistical Classification of Diseases (10th revision). Annual populations and sampling errors of Hispanics and NHAPI by age group, sex, and nativity for California, Hawaii, New Mexico, and the U.S. were obtained from the NCI’s SEER Program.⁷ They were estimated using the 2018 1-year American Community Survey sample drawn from the University of Minnesota’s IPUMS-USA.²⁰ In addition to the 2018 data, we also examined the 2010 RASRs between Hispanic and all races among New Mexico foreign-born males to assess scenarios where population sampling errors are considerably larger.
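For readers who want to see the arithmetic, the naive calculation described above can be sketched as follows. The function names and all numbers are hypothetical: three toy age groups are used instead of the 19 groups of the application, and the standard-population counts are made up rather than taken from the 2000 U.S. standard population.

```python
import numpy as np

def asr_per_100k(deaths, population, std_pop):
    """Directly standardized rate per 100,000 person-years.

    Populations are treated as fixed and error-free (the naive approach).
    """
    deaths = np.asarray(deaths, dtype=float)
    population = np.asarray(population, dtype=float)
    std_pop = np.asarray(std_pop, dtype=float)
    weights = std_pop / std_pop.sum()  # standard-population weights
    return 1e5 * np.sum(weights * deaths / population)

def naive_rasr(child, parent, std_pop):
    """Ratio of the child-group ASR to the parent-group ASR."""
    return asr_per_100k(*child, std_pop) / asr_per_100k(*parent, std_pop)

# Toy data: (deaths, population) by age group; the parent contains the child.
std_pop = [1, 1, 2]
child = ([1, 2, 4], [1000, 1000, 1000])
parent = ([2, 2, 4], [2000, 2000, 2000])
ratio = naive_rasr(child, parent, std_pop)
```

In this toy example the child-group ASR is 275 and the parent-group ASR is 150 per 100,000, so the naive ratio is about 1.83.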

As shown in Table 12, the annual population estimates based on the American Community Survey (ACS) for the California and U.S. Hispanic and NHAPI populations are all highly precise across age groups, sex, and nativity. These estimates exhibit low CVs, indicating small standard errors relative to the population totals across all age groups used for age standardization, so the sampling errors appear to have limited impact on the estimation of ASRs. In comparison, Hawaii shows reduced precision in NHAPI population estimates, though the sampling error remains within a reasonable range. In contrast, the sampling errors for New Mexico are considerably higher across the scenarios, especially among the foreign-born Hispanic male population in 2010 (max CV = 0.90).

Table 12.
Distribution of population estimates, death counts, age-standardized rates, and corresponding maximum coefficients of variation (CVs) of the child group and parent group across 19 age groups by immigration status in California, Hawaii, and New Mexico, ACS 2010 and 2018.

		Child group				Parent group
State	Year	Death	Population	ASR	Max CV	Death	Population	ASR	Max CV
		Foreign-born Hispanic				Foreign-born All Races
CA	2018	27567	5334473	47.7557	0.0804	70323	11081029	45.6755	0.0447
NM	2018	957	161091	63.5834	0.3934	1457	215772	66.5075	0.3685
		Foreign-born Hispanic Male				Foreign-born All Races Male
CA	2010	10188	2809117	53.4243	0.0796	26113	5137983	56.2665	0.05456
NM	2010	388	84555	41.2995	0.7031	525	106973	51.9810	0.4544
		Foreign-born Hispanic of State				Foreign-born Hispanic of US
CA	2018	27567	5334473	47.7557	0.0804	92263	20579200	46.4466	0.0430
NM	2018	957	161091	63.5834	0.3934	92263	20579200	46.4466	0.0430
		Foreign-born NH-API				Foreign-born All Races
CA	2018	23669	3889404	38.7743	0.0748	70323	11081029	45.6755	0.0447
HI	2018	1950	242761	44.0348	0.2913	2241	292496	43.8961	0.2299
		US-born NH-API of State				US-born NH-API of US
CA	2018	4945	2359910	40.5941	0.0891	15389	7373465	43.7114	0.0548
HI	2018	5694	635217	59.2086	0.0894	15389	7373465	43.7114	0.0548

Figures 1 and 2 show the population proportions across age groups by state, gender, race, year, and nativity under the different scenarios. The age distribution of US-born NHAPI in Hawaii clearly differs from that of the U.S., as shown in Figure 1(b). This discrepancy indicates that the PAD assumption does not hold. There are several cases where the age distribution of the child group differs from that of the parent group within some age groups, such as California and the U.S. in the 2018 foreign-born Hispanic (State/US) scenario. This means that the PAD assumption may partially hold among certain age groups; still, it may not be reasonable to assume that it holds for all age groups. For the remaining cases, the trend of the age distribution is similar between the parent group and the child group, such as New Mexico and the U.S. in the 2018 foreign-born Hispanic (State/US) scenario, in which case the PAD assumption may be valid.

Figure 1.

Population age distribution across 19 age groups under different scenarios. (a) 2018 Foreign-born Hispanic (State/US) and (b) 2018 US-born NH-API (State/US).

Figure 2.

Population age distribution across 19 age groups under different scenarios. (a) 2018 Foreign-born in State (Hispanic/All), (b) 2010 Foreign-born Male in State (Hispanic/All) and (c) 2018 Foreign-born in State (NH-API/All).

Table 13 compares RASR estimates of mortality from all cancer causes under five scenarios using the naive estimator ( ${\hat{θ}}_{N}$ ), Tiwari’s estimator ( ${\hat{θ}}_{T}$ ), the bias-corrected estimator developed without the PAD assumption ( ${\hat{θ}}_{bc}$ ), and the estimators ${\hat{θ}}_{*}$ and ${\tilde{θ}}_{*}$. As expected, all five estimators produce very similar point estimates in California. Notably, in the 2010 Hispanic male scenario, the two PAD-dependent estimators, ${\hat{θ}}_{T}$ and ${\hat{θ}}_{*}$, show a slight over-estimation bias compared to the other estimators, including the naive estimator ${\hat{θ}}_{N}$. This is likely due to violation of the PAD assumption, as shown in Figure 2. Although the overlap between the child and parent groups is large, the sampling error is relatively small, so the differences among the other estimators are not substantial.

Table 13.

RASR estimates and lower/upper bounds of 95% confidence intervals, obtained via normal approximation after log-transformation.

			${\hat{θ}}_{N}$
Naive Estimator	Year	State	Est	Lower CI	Upper CI
Foreign-born in State(Hispanic/All)	2018	CA	1.0455	1.0302	1.0611
		NM	0.9560	0.8768	1.0417
Foreign-born Male in State(Hispanic/All)	2010	CA	0.9516	0.9277	0.9760
		NM	1.2155	1.0263	1.4346
Foreign-born Hispanic(state/US)	2018	CA	1.0282	1.0136	1.0430
		NM	1.3690	1.2792	1.4618
Foreign-born in State(NH-API/All)	2018	CA	0.8489	0.8360	0.8620
		HI	1.0032	0.9382	1.0723
US-born NH-API(state/US)	2018	CA	0.9287	0.8970	0.9612
		HI	1.3545	1.3111	1.3991
			${\hat{θ}}_{T}$			${\hat{θ}}_{*}$
PAD-dependent estimators	Year	State	Est	Lower CI	Upper CI	Est	Lower CI	Upper CI
Foreign-born in State	2018	CA	1.0441	1.0231	1.0650	1.0437	1.0228	1.0647
(Hispanic/All)		NM	0.9104	0.6918	1.1389	1.0332	0.9349	1.1133
Foreign-born Male in	2010	CA	0.9742	0.9417	1.0064	0.9733	0.9409	1.0055
State (Hispanic/All)		NM	1.0580	0.7977	1.1599	0.9776	0.7008	1.1441
Foreign-born Hispanic	2018	CA	1.0296	0.9993	1.0602	1.0291	0.9990	1.0598
(State/US)		NM	1.3687	1.1069	1.6574	1.3057	1.0573	1.6118
Foreign-born in State	2018	CA	0.8391	0.8167	0.8617	0.8389	0.8166	0.8616
(NH-API/All)		HI	0.9484	0.8830	1.0101	0.9579	0.8991	1.0105
US-born NH-API	2018	CA	0.9273	0.8680	0.9883	0.9255	0.8669	0.9864
(State/US)		HI	1.5146	1.4044	1.6324	1.5119	1.4025	1.6286
			${\tilde{θ}}_{*}$			${\hat{θ}}_{bc}$
PAD-free Estimators	Year	State	Est	Lower CI	Upper CI	Est	Lower CI	Upper CI
Foreign-born in State	2018	CA	1.0451	1.0176	1.0734	1.0447	1.0172	1.0730
(Hispanic/All)		NM	0.9941	0.9281	1.0649	0.9396	0.8380	1.0534
Foreign-born Male in	2010	CA	0.9506	0.9061	0.9972	0.9495	0.9078	0.9930
State (Hispanic/All)		NM	0.9673	0.8531	1.0993	0.8718	0.3783	2.0091
Foreign-born Hispanic	2018	CA	1.0278	0.9989	1.0575	1.0274	0.9983	1.0574
(State/US)		NM	1.3060	1.0865	1.5700	1.3054	1.0563	1.6133
Foreign-born in State	2018	CA	0.8488	0.8280	0.8702	0.8485	0.8292	0.8684
(NH-API/All)		HI	1.0161	0.9878	1.0468	1.0016	0.9598	1.0452
US-born NH-API	2018	CA	0.9275	0.8727	0.9857	0.9257	0.8714	0.9833
(State/US)		HI	1.3538	1.2920	1.4187	1.3508	1.2800	1.4256
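The caption of Table 13 states that the intervals are obtained via normal approximation after log-transformation. The exact formula is not given in this section, so the Python sketch below shows one common construction, assumed here for illustration: the delta method gives $s.e.(\log \hat{θ}) \approx s.e.(\hat{θ}) / \hat{θ}$, a symmetric interval is formed on the log scale, and the endpoints are exponentiated. The function name `log_wald_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def log_wald_ci(theta_hat, var_hat, alpha=0.05):
    """Confidence interval for a positive ratio, built on the log scale.

    Assumes theta_hat > 0 and that var_hat estimates var(theta_hat).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se_log = math.sqrt(var_hat) / theta_hat  # delta-method s.e. of log(theta_hat)
    half_width = z * se_log
    return theta_hat * math.exp(-half_width), theta_hat * math.exp(half_width)

# Toy usage with made-up inputs.
lower, upper = log_wald_ci(2.0, 0.04)
```

An interval of this kind is symmetric on the log scale, so the product of its endpoints equals $\hat{θ}^{2}$ and both endpoints are always positive. For example, the naive California 2018 row of Table 13 has $1.0302 \times 1.0611 \approx 1.0455^{2}$.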

Conversely, the impact of sampling error, coupled with the correlation between the child and parent groups, is evident in New Mexico. The RASR estimator stands out due to substantial bias in the child groups but relatively minor bias in the parent groups. As a result, the bias-corrected RASR estimates, which do not depend on the PAD assumption, are much lower than the naive estimates. Note that, based on our earlier study, we expect the two PAD-free bias-corrected estimators, ${\hat{θ}}_{bc}$ and ${\tilde{θ}}_{*}$, to be more accurate than the two PAD-dependent estimators, ${\hat{θ}}_{T}$ and ${\hat{θ}}_{*}$, across all scenarios, with ${\hat{θ}}_{bc}$ being the most accurate among them. Additionally, the magnitude of the correction due to the covariance term is very small.

It is also worth noting that the direction of the impact of violating the PAD assumption is unpredictable. For example, in the 2018 foreign-born NH-API/All scenario, the Tiwari estimator, ${\hat{θ}}_{T}$, is lower than the naive estimator in Hawaii; conversely, in the 2018 US-born NH-API State/US scenario, ${\hat{θ}}_{T}$ is much higher than the naive estimator in Hawaii. This difference arises even though, in both cases, the proportions of the young age groups in the child group are lower than those in the parent group, while the proportions of the old age groups are higher. Thus, the direction of the impact of the PAD violation cannot be predicted solely from the age distribution pattern; the rates within each age group for both the child group and the parent group may also affect the direction of the impact.

7. Discussion

In this article, we develop and thoroughly evaluate a novel rate ratio (RR) estimator designed for situations where the comparison group is a subset of the reference group, and where population totals used to estimate ASRs in both groups are subject to sampling errors. This estimator robustly identifies demographic groups or geographic regions with heightened cancer incidence or mortality risk. By utilizing the overall population as the reference group, it overcomes the challenge of pinpointing a well-defined independent “unexposed” group to represent the expected risk level in cancer risk assessment.

Specifically, this advancement extends the recent development of a bias-corrected rate ratio estimator for comparing two independent groups when the population data of both groups involve sampling errors. While it employs a similar two-stage framework to integrate sampling errors from finite population theory and Poisson-distributed variability from superpopulation theory, the new development also accounts for the correlation between the two groups being compared that arises from their nested nature, a crucial feature for ensuring high precision in risk assessments.

Another significant innovation is that the proposed estimators do not impose the PAD assumption for the age distribution structure as seen in existing methods. The PAD assumption can often be restrictive and unrealistic, especially in diverse populations where age distribution may vary significantly. By eliminating the need for this assumption, our proposed RR estimator provides a more flexible and realistic approach to estimating cancer risk.

Furthermore, it removes the sole reliance on census-based population data as the at-risk populations in estimating ASRs by allowing for the use of national survey-based population data, thus greatly broadening the scope of cancer risk assessment. This capability is particularly important in public health research where sample surveys are the sole data source for detailed cancer risk drivers, such as immigration status, education status, and cancer screening status. Accurately identifying nuanced and granular heterogeneities in disease rates is crucial for identifying and targeting public health interventions to achieve public health for all Americans, especially in resource-limited settings where efficient resource allocation is essential to maximize impact.

One limitation of this new RR estimator is that it is not suitable for comparing a specific age group with the entire population, as age has already been incorporated into the calculation of ASRs. For example, it cannot be used to evaluate the risk of early-onset cancers; that is, comparing the ASR for individuals under 50 years old with that of the entire population is not permissible. However, users can still use existing estimators, which assume independence, to compare individuals under 50 years with those over 50 years; for instance, the bias-corrected RR estimator of Jiang (2022) and the naive RR estimator, ${\hat{θ}}_{N}$, can be used when the population data do or do not involve sampling errors, respectively. Note that Jiang’s bias-corrected RR estimator was not evaluated in this study because it deals with two independent groups. An adaptation of this existing RR estimator to nested populations is evaluated as ${\tilde{θ}}_{*}$.

Finally, this new estimator is being incorporated into NCI’s SPARC (https://surveillance.cancer.gov/sparc/), a web-based calculator for assessing disparities in cancer mortality rates using survey sample-based populations. This methodological advancement aims to pinpoint areas of elevated cancer risk more accurately.

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802261459842 for Inference about the ratio of age-standardized rates between two overlapping populations by Jiangshan Zhang, Jiming Jiang and Mandi Yu in Statistical Methods in Medical Research

Footnotes

Funding

The authors received financial support for the research, authorship, and/or publication of this article: This work is partially supported by the National Cancer Institute.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

1. Sherman, Firth, Henley, et al. Annual report to the nation on the status of cancer, featuring state-level statistics after the onset of the COVID-19 pandemic. Cancer 2025; 131: e35833. DOI: 10.1002/cncr.35833

2. Grant, Yanguela, Odebunmi, et al. Systematic review of interventions addressing racial and ethnic disparities in cancer care and health outcomes. J Clin Oncol 2024; 42: 1563–1574.

3. Wheeler, Basch. Translating cancer surveillance data into effective public health interventions. JAMA 2017; 317: 365.

4. Fay. Approximate confidence intervals for rate ratios from directly standardized rates with sparse data. Commun Stat Theory Methods 1999; 28: 2141–2160.

5. Tiwari, Zou. Interval estimation for ratios of correlated age-adjusted rates. J Data Sci 2010; 8: 471–482.

6. Zhu, Pickle, Pearson. Confidence intervals for rate ratios between geographic units. Int J Health Geogr 2016; 15: 44.

7. SEER*Stat database: populations - total U.S. (2006–2021) (excl GA 2008–2009) by US born indicator, National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, 2025. http://www.seer.cancer.gov.

8. Jiang, Nguyen, et al. Inference about ratios of age-standardized rates with sampling errors in the population denominators for estimating both rates. Stat Med 2022; 41: 2052–2068.

9. Jiang, Feuer, et al. Inference about age-standardized rates with sampling errors in the denominators. Stat Methods Med Res 2020; 30: 535–548.

10. Liu, Gibson, et al. Assessing racial, ethnic, and nativity disparities in US cancer mortality using a new integrated platform. JNCI: J Natl Cancer Inst 2024; 116: 1145–1157.

11. Jiang, Jia, Chen. Maximum posterior estimation of random effects in generalized linear mixed models. Stat Sin 2001; 11: 97–120.

12. Cochran. Sampling techniques. 3rd ed. Wiley Series in Probability and Statistics, Nashville, TN: John Wiley & Sons, 1977.

13. Lohr. Sampling: design and analysis. Chapman and Hall/CRC, 2021.

14. Jiang, Torabi. Sumca: simple, unified, Monte-Carlo-assisted approach to second-order unbiased mean-squared prediction error estimation. J R Stat Soc Ser B: Stat Methodol 2020; 82(2): 467–485.

15. Fuller. Sampling statistics. Wiley, 2009.

16. Braun, Yang, Onaka, et al. Asian and Pacific Islander mortality differences in Hawaii. Biodemogr Soc Biol 1997; 44(3–4): 213–226.

17. Chang, Yang, Alfaro-Velcamp, et al. Disparities in liver cancer incidence by nativity, acculturation, and socioeconomic status in California Hispanics and Asians. Cancer Epidemiol Biomark Prevent 2010; 19(12): 3106–3118.

18. Pinheiro, Callahan, Gomez, et al. High cancer mortality for US-born Latinos: evidence from California and Texas. BMC Cancer 2017; 17(1).

19. Medina, Callahan, Morris, et al. Cancer mortality disparities among Asian American and Native Hawaiian/Pacific Islander populations in California. Cancer Epidemiol Biomark Prevent 2021; 30(7): 1387–1396.

20. Ruggles, Flood, Sobek, et al. IPUMS USA: Version 16.0, 2025. DOI: 10.18128/D010.V16.0.

