Improved and computationally stable estimation of relative risk regression with one binary exposure

Abstract

In medical statistics, when the effect of a binary risk factor on a binary response is of interest, relative risk is often the preferred measure due to its direct interpretation. However, statistical inference on this quantity is not as straightforward as for other measures of association, especially when further explanatory variables have to be taken into account. Starting from a review of available methods for inference on relative risk, this paper deals with small and moderate sample size settings, for which we show that classical approaches can be problematic. For this reason, we propose the use of improved estimation procedures, aiming at mean or median bias reduction of the maximum likelihood estimator. In particular, these methods are developed for a new alternative specification of a model recently proposed by Richardson et al., for which the estimation methods achieve higher computational stability. A real-data example and extensive simulation studies show that the proposed methods perform markedly better than the standard ones.

Keywords

Bias reduction, infinite estimate, relative risk, small sample, variation independence

1. Introduction

Many statistical studies investigate the influence of a group of explanatory variables on a binary response of interest. This necessity has led to the development of an enormous literature on the topic, with methodologies that differ mainly according to whether the final goal of the study is inference or prediction.^1,2 While, for prediction, the interpretability of the model usually fades into the background, when the main aim is inference the possibility of clear communication of the results is crucial and, in this context, logistic regression is by far the most popular statistical model.

Among the several reasons for the success of logistic regression is the fact that, unlike some other models that are discussed in this paper, the maximum likelihood estimator can be obtained with standard optimization techniques without imposing any constraint on the parameter space. In addition, several results are available to allow the data analyst to perform valid statistical inference in non-standard regimes such as in situations of small or moderate sample size,^3,4 but also in the opposite case, where the number of predictors and observations is huge.⁵ Despite these advantages, a possible drawback of logistic regression is that the effects of the explanatory variables are parametrized in terms of log-odds ratios, and the results can sometimes be difficult to communicate to a non-technical audience.⁶ Even though modelling odds ratios is, in some cases, an obligatory choice, as in case-control studies (see, for example, Rothman et al.⁷), there are many situations where other measures of association, such as the relative risk or the risk difference, can be directly estimated and give more interpretable results.⁸ In this paper, we focus on the relative risk and on some of the difficulties that this parametrization poses to statistical inference.

Several approaches are available in the literature to estimate the relative risk.^6,9 The majority of them have been designed to overcome the well-known numerical instability of log-binomial regression.¹⁰ Indeed, even though this model represents a natural way to parametrize relative risk using a generalized linear model, its parameter space is constrained and this can lead to serious computational problems.¹¹ As far as point estimation is concerned, some of these difficulties have been addressed by Donoghoe and Marschner,¹⁰ who implemented a number of strategies to maximize the log-likelihood function of the model. At the same time, statistical inference based on the maximum likelihood estimator can still be challenging when the estimate lies on the boundary of the parameter space, due to the non-invertibility of the estimated Fisher information. This limitation motivated a growing interest in the study and development of different estimation strategies. In particular, Zou¹² and Lumley et al.⁶ showed that consistent and asymptotically normal estimators of the relative risk can be obtained by replacing the binomial likelihood with the Poisson or the Gaussian ones. These strategies are particular instances of estimation based on unbiased estimating equations. Since in these cases the second Bartlett identity usually does not hold, the asymptotic variance of the estimators has to be obtained by using a sandwich estimator.¹³ A possible drawback of the unbiased estimating equations approach is that fitted probabilities greater than one can easily occur. As argued by many authors, this is usually not problematic in practical situations, since the focus is mainly on parameter estimation.⁶ However, as we illustrate in Section 2, in small sample regimes the presence of estimated probabilities outside the admissible range can have a strong impact on inference, since it can lead to negative estimated variances for some parameters.

Finally, when the main interest is in the effect of a binary exposure and its interactions with the other explanatory variables, an alternative and particularly attractive approach was proposed by Richardson et al.⁹ The novelty of their proposal lies in a different parameter specification, where the component of interest is the log-relative risk and the nuisance part is the log-odds product. In this way, the two components of the parameter of the model are variation independent, that is, the parameter space for each component does not depend on the value of the other. As a consequence, the computation of the maximum likelihood estimates and their standard errors is simplified. Moreover, since the model of Richardson et al.⁹ gives a genuine likelihood, the predicted probabilities are always positive and smaller than one.

Although it solves many difficulties regarding inference on relative risk, classical likelihood inference under the model developed by Richardson et al.⁹ still presents two main issues. First, it may be subject to a significant bias, especially with small sample sizes or a large number of explanatory variables. Second, as happens for logistic regression,¹⁴ infinite maximum likelihood estimates can occur. To mitigate both problems, in this paper we propose the use of the mean and median bias reduction techniques developed by Firth¹⁵ and Kenne Pagui et al.,¹⁶ respectively. These methods improve the finite sample properties of the maximum likelihood estimator by reducing its mean or median bias. In addition, as a useful side effect, they solve infinite estimate problems in many practical situations.¹⁷ Recent applications of bias reduction methods to other models of interest in medical statistics can be found in Kyriakou et al.,¹⁸ Kenne Pagui and Colosimo,¹⁹ and in Kenne Pagui et al.,²⁰ while some preliminary simulation results are available²¹ for the model of Richardson et al.⁹

However, as numerically illustrated in Section 7, direct application of mean and median bias reduction to the model of Richardson et al.⁹ may still be affected by convergence issues. For this reason, we propose to apply mean and median bias reduction to a new alternative specification of the nuisance parameter. This parametrization maintains the property of variation independence but leads to easier implementation and higher computational stability of improved estimation methods. An R²² implementation of the proposed models is given in the brrr function available on GitHub (https://github.com/Francesco16p/brrr).

The rest of the article is organized as follows. Section 2 introduces some of the problems regarding relative risk inference with a real-data example. Section 3 briefly reviews some approaches to inference on relative risk with a particular emphasis on the proposal of Richardson et al.⁹ Section 4 introduces the new alternative parametrization for the model of Richardson et al.⁹ and Section 5 describes the mean and median bias reduction methods developed by Firth¹⁵ and Kenne Pagui et al.¹⁶ Section 6 revisits the motivating example by applying the improved estimation methods to the model developed in Section 4. In Section 7, the performance of the bias-reduced estimators is evaluated through two simulation studies, with the second considering models with many nuisance parameters. Section 8 concludes the paper with a short discussion.

2. Motivating example

As a motivating example, we consider data from a clinical study conducted by Goorin et al.²³ and available in Table 1 of Kolassa and Tanner.³ The aim of the investigators was to identify significant prognostic variables to predict the probability of recovery, after three years, for 46 patients who were treated for nonmetastatic osteogenic sarcoma. Along with the health status, three other explanatory variables are available: sex, presence of lymphocytic infiltration (LI) and presence of any osteoid pathology (AOP).

Even in the more standard case in which one wants to make inference by assuming a logistic regression model, the moderate sample size and the presence of separation in the data represent two serious challenges. These problems led Hirji et al.²⁴ and Kolassa and Tanner³ to develop methods for small sample confidence regions. Here, we focus on the different problem of making inference on the effect of LI, controlled for the other explanatory variables and parametrized in terms of log-relative risk. Table 1 reports maximum likelihood estimates of the parameters obtained by assuming a log-binomial regression, a Poisson regression with robust estimation of standard errors and the model of Richardson et al.⁹ Poisson regression and the model of Richardson et al.⁹ give similar point estimates for the log-relative risk associated with LI, while the estimate for the log-binomial regression is smaller in absolute value. As regards uncertainty evaluation, the R package logbin¹⁰ does not provide any estimate for the standard errors, since the maximum likelihood estimate lies on the boundary of the parameter space. The other two methods give similar values for the standard error of the estimated log-relative risk associated with LI. However, both of them present some weaknesses. The sandwich estimate²⁵ for the Poisson regression makes use of the probabilities predicted by the model, which turn out to be greater than 1 for three patients (see Supplemental material SC2.2). Due to the limited sample size, we get a negative estimate for the variance of the intercept. This shows that the sandwich estimator can produce a negative variance estimate, which would render Wald-type inference unusable if it occurred for the parameter of interest. On the other hand, the model of Richardson et al.⁹ shares the same problems as logistic regression, with some of the estimates in the nuisance component being not finite.

Table 1.
Estimates of the log-relative risk associated with LI and of the other model parameters for the motivating example in Section 2.

Model	Estimate	Std error	$p$ -value
Log-binomial
Intercept	0.000	–	–
Sex	−0.288	–	–
AOP	−0.346	–	–
LI	−0.164	–	–
Poisson
Intercept	0.287	–	–
Sex	−0.399	0.184	0.031
AOP	−0.381	0.225	0.090
LI	−0.444	0.148	0.003
Richardson et al.⁹
Intercept (nuisance)	15.164	41.779	0.717
Sex (nuisance)	−8.947	41.013	0.827
AOP (nuisance)	−6.341	8.067	0.432
LI	−0.402	0.160	0.012

AOP: any osteoid pathology; LI: lymphocytic infiltration.

3. Modelling relative risk

Let $y_{1}, \dots, y_{n}$ be realizations of $n$ independent binary random variables $Y_{1}, \dots, Y_{n} .$ For each observation, there are $d + p + 1$ explanatory variables $x_{i}, z_{i}$ , and $t_{i},$ with $x_{i}^{T} = (x_{i 0}, \dots, x_{i d - 1}),$ $z_{i}^{T} = (z_{i 0}, \dots, z_{i p - 1}),$ where $x_{i 0} = z_{i 0} = 1, i = 1, \dots, n,$ and $t_{i}$ is a binary exposure. For later use, let $X$ denote the $n \times d$ design matrix with generic row $x_{i}^{T}$ and $Z$ the $n \times p$ design matrix with generic row $z_{i}^{T} .$ Moreover, we assume that the expected value of $Y_{i}, i = 1, \dots, n,$ depends on $x_{i}, z_{i},$ and $t_{i},$ and therefore we write $E (Y_{i}; x_{i}, z_{i}, t_{i}) = P (Y_{i} = 1; x_{i}, z_{i}, t_{i}) = π (x_{i}, z_{i}, t_{i}) = π_{i} .$ In the following, the parameter of interest is the logarithm of the relative risk associated with $t_{i},$ which is defined as $\log ({RR}_{i}) = \log (π_{i 1} / π_{i 0}),$ where $π_{i 1} = π (x_{i}, z_{i}, t_{i} = 1)$ and $π_{i 0} = π (x_{i}, z_{i}, t_{i} = 0) .$ We distinguish between $x_{i}$ and $z_{i}$ because we aim at models, where

$$\log ({RR}_{i}) = z_{i}^{T} γ \qquad (1)$$

with $γ$ a $p$-dimensional parameter vector. This means that the log-relative risk is only affected by $z_{i},$ while $x_{i}$ represents other potential confounding covariates.

Inference on $\log ({RR}_{i})$ is usually conducted through generalized linear models that assume

$$g (π_{i}) = η_{i} = x_{i}^{T} α_{1} + t_{i} z_{i}^{T} α_{2}, \quad i = 1, \dots, n,$$

where $g (\cdot)$ is an appropriately chosen link function, $α_{1}$ is a $d$-dimensional vector, which parametrizes the intercept and the main effects of the $d - 1$ confounding variables, and $α_{2}$ is a $p$-dimensional vector, which represents the effect of the binary exposure and its interactions with $z_{i}.$ The two most common choices are logistic and log-binomial regression.

With the former, $g (π_{i}) = \log \{π_{i} / (1 - π_{i})\},$ which implies

$$\log [π_{i 1} (1 - π_{i 0}) / \{π_{i 0} (1 - π_{i 1})\}] = z_{i}^{T} α_{2},$$

so that $α_{2}$ is an approximation of $γ$ in (1) when $π_{i 0}$ and $π_{i 1}$ are both close to 0. If this condition is not satisfied, the approximation can be very poor, even with moderate values of $π_{i 0}$ and $π_{i 1},$ as shown in McNutt et al.⁸
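The quality of the rare-outcome approximation can be checked numerically. The following minimal Python sketch compares the log-odds ratio with the log-relative risk for two pairs of risks; the function names and the numerical values of the risks are illustrative assumptions and are not taken from the paper or from McNutt et al.⁸

```python
import math

def log_rr(p0, p1):
    """Log-relative risk, log(p1 / p0)."""
    return math.log(p1 / p0)

def log_or(p0, p1):
    """Log-odds ratio, log[p1 (1 - p0) / {p0 (1 - p1)}]."""
    return math.log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))

# Hypothetical risks: a rare outcome and a common outcome,
# both with relative risk equal to 2.
rare = (0.01, 0.02)
common = (0.30, 0.60)

for p0, p1 in (rare, common):
    gap = log_or(p0, p1) - log_rr(p0, p1)
    print(f"p0={p0:.2f} p1={p1:.2f} "
          f"logRR={log_rr(p0, p1):.3f} logOR={log_or(p0, p1):.3f} gap={gap:.3f}")
```

With the rare outcome the two quantities nearly coincide, whereas with the common outcome the log-odds ratio is clearly larger than the log-relative risk, even though the relative risk is the same in both cases.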

On the other hand, log-binomial regression assumes $g (π_{i}) = \log (π_{i}),$ so that $\log ({RR}_{i}) = z_{i}^{T} α_{2}$ and $α_{2}$ coincides with $γ$ in (1). The main disadvantage of this model is that several constraints have to be set on the parameter space to ensure that probabilities are bounded between 0 and 1. From the computational point of view, these constraints lead to serious convergence issues, especially when the maximum likelihood estimate lies on the boundary of the parameter space.¹¹ Even though Donoghoe and Marschner¹⁰ developed a robust expectation–maximization strategy to obtain the maximum likelihood estimate for log-binomial regression, statistical inference with this model is still problematic. Indeed, when estimates are on the boundary of the parameter space, uncertainty evaluation cannot be done by exploiting standard first-order asymptotic arguments and ad-hoc methods are needed.²⁶ These difficulties led Zou¹² and Carter et al.²⁵ to consider a quasi-likelihood approach. In particular, the authors propose the use of the Poisson model in place of the Bernoulli one. With this choice, the model is clearly misspecified, but the estimating equation remains unbiased. A robust estimate of the variance of the coefficients is then obtained by employing a sandwich estimator. However, the higher computational stability of this method comes at a price. Estimated probabilities greater than 1 may occur because, in Poisson regression with a logarithmic link, the parameters of the model are free to vary in $\mathbb{R}^{d + p}.$

Outside the framework of generalized linear models, Richardson et al.⁹ developed a new method to make inference directly on $\log ({RR}_{i}).$ In particular, they propose the parametrization

$$η_{i 0} = \log [π_{i 0} π_{i 1} / \{(1 - π_{i 0}) (1 - π_{i 1})\}] \quad \text{and} \quad η_{i 1} = \log ({RR}_{i}), \qquad (2)$$

where the log-relative risk $η_{i 1}$ and the nuisance parameter $η_{i 0}$ are variation independent, that is, the pair $(η_{i 0}, η_{i 1})$ can take any value in $\mathbb{R}^{2}.$ The parameters are linked to the explanatory variables using two linear predictors, that is, $η_{i 0} = x_{i}^{T} β$ and $η_{i 1} = z_{i}^{T} γ,$ where $β = (β_{0}, \dots, β_{d - 1})^{T}$ and $γ = (γ_{0}, \dots, γ_{p - 1})^{T}$ with $(β^{T}, γ^{T})^{T} = θ \in \mathbb{R}^{d + p}.$ These assumptions imply that $π_{i} = π_{i 0} \exp (t_{i} η_{i 1}),$ where

$$\begin{aligned} π_{i 0} & = π_{i 0} (η_{i 0}, η_{i 1}) \\ & = \frac{- \{\exp (η_{i 1}) + 1\} \exp (η_{i 0}) + \sqrt{\exp (2 η_{i 0}) \{\exp (η_{i 1}) + 1\}^{2} + 4 \exp (η_{i 1} + η_{i 0}) \{1 - \exp (η_{i 0})\}}}{2 \exp (η_{i 1}) \{1 - \exp (η_{i 0})\}} \end{aligned} \qquad (3)$$

and the log-likelihood function is

$$ℓ (θ) = \sum_{i = 1}^{n} \{y_{i} \log (π_{i}) + (1 - y_{i}) \log (1 - π_{i})\} \qquad (4)$$

with the $π_{i}$'s expressed in terms of $θ.$

In this model, the maximum likelihood estimator $\hat{θ} = ({\hat{β}}^{T}, {\hat{γ}}^{T})^{T}$ is obtained by solving the likelihood equation $(\partial / \partial θ) ℓ (θ) = 0$ without any constraint on the parameter space. A major drawback of this proposal is the presence of $d$ nuisance parameters, which are not of primary interest but are needed to properly take into account the explanatory variables $x_{i}$ and $z_{i}$ that affect both $π_{i 0}$ and $π_{i 1}.$ In a situation with many explanatory variables, or when the number of observations is small, this could heavily affect the properties of the maximum likelihood estimator. In addition, expression (3) is not defined when $η_{i 0} = 0,$ because its denominator vanishes. Richardson et al.⁹ deal with this problem by imposing some controls in the estimation algorithm. Since these constraints could be problematic for the computational stability of the fitting method, in the next section we propose a different choice for the nuisance parameter, which leads to probabilities that are well-defined for all possible values of the parameters in the model.
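To make the singularity of expression (3) concrete, the following Python sketch evaluates $π_{i 0}$ directly from the formula. It is only an illustration under stated assumptions: the function names `pi0_richardson` and `to_eta` are hypothetical, no safeguard is applied at $η_{i 0} = 0,$ and the code does not reproduce the controls used by Richardson et al.⁹ or by any R package.

```python
import math

def pi0_richardson(eta0, eta1):
    """Baseline risk pi_{i0} from expression (3), evaluated naively.

    eta0 is the log-odds product and eta1 the log-relative risk.
    No safeguard is applied: the denominator is zero when eta0 == 0.
    """
    a = math.exp(eta1)  # relative risk
    b = math.exp(eta0)  # odds product
    num = -(a + 1.0) * b + math.sqrt(b * b * (a + 1.0) ** 2 + 4.0 * a * b * (1.0 - b))
    den = 2.0 * a * (1.0 - b)
    return num / den

def to_eta(p0, p1):
    """Map the risks (pi_{i0}, pi_{i1}) to (eta_{i0}, eta_{i1}) as in (2)."""
    eta0 = math.log(p0 * p1 / ((1.0 - p0) * (1.0 - p1)))
    eta1 = math.log(p1 / p0)
    return eta0, eta1

# Round trip on hypothetical risks: the formula recovers pi_{i0}.
eta0, eta1 = to_eta(0.2, 0.5)
print(pi0_richardson(eta0, eta1))

# At eta0 == 0 the naive evaluation fails with a division by zero.
try:
    pi0_richardson(0.0, 0.4)
except ZeroDivisionError:
    print("expression (3) is undefined at eta0 = 0")
```

Away from zero the formula inverts (2) as expected, while at $η_{i 0} = 0$ the evaluation breaks down, which is why some form of special handling is needed in an estimation algorithm based on (3).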

4. Alternative specification

In Section 3, we highlighted the fact that (3) is not defined when $η_{i 0} = 0.$ In order to overcome this problem, we propose a different nuisance parameter $η_{i 0}^{A},$ which allows us to obtain probabilities that are defined for every possible combination of the pair $(η_{i 0}^{A}, η_{i 1}).$ In particular, if we set

$$η_{i 0}^{A} = \log [π_{i 0} / \{(1 - π_{i 0}) (1 - π_{i 1})\}] \quad \text{and} \quad η_{i 1} = \log ({RR}_{i}), \qquad (5)$$

we have that $(η_{i 0}^{A}, η_{i 1})$ is still a one-to-one transformation of $(π_{i 0}, π_{i 1}),$ where the parameters $η_{i 0}^{A}$ and $η_{i 1}$ are variation independent. This fact can be clearly seen in Figure 1, since the red line intersects each one of the black curves at exactly one point. The analytical proof of this result is provided in Appendix A.1.

Figure 1.

Contour plot of $η_{i 0}^{A} .$ The red line represents the values of $(π_{i 0}, π_{i 1})$ for which $\log ({RR}_{i}) = \log (3 / 2) .$

In this alternative parametrization, the probabilities assumed by the model are equal to $π_{i} = π_{i 0} \exp (t_{i} η_{i 1})$ with

$$\begin{aligned} π_{i 0} & = π_{i 0}^{A} (η_{i 0}^{A}, η_{i 1}) \\ & = \frac{[1 + \exp (η_{i 0}^{A}) \{1 + \exp (η_{i 1})\}] - \sqrt{[1 + \exp (η_{i 0}^{A}) \{1 + \exp (η_{i 1})\}]^{2} - 4 \exp (η_{i 1} + 2 η_{i 0}^{A})}}{2 \exp (η_{i 1} + η_{i 0}^{A})} \end{aligned} \qquad (6)$$

and, since in (6) the denominator is never equal to 0, $π_{i}$ takes values between 0 and 1 for all $(η_{i 0}^{A}, η_{i 1}) \in \mathbb{R}^{2}.$ As with model (2), $η_{i 0}^{A}$ and $η_{i 1}$ are linked to the explanatory variables using two linear predictors, that is, $η_{i 0}^{A} = x_{i}^{T} β^{A}$ and $η_{i 1} = z_{i}^{T} γ,$ where $β^{A} = (β_{0}^{A}, β_{1}^{A}, \dots, β_{d - 1}^{A})^{T}$ and $γ = (γ_{0}, γ_{1}, \dots, γ_{p - 1})^{T}.$
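As a numerical check of expression (6), the following Python sketch evaluates $π_{i 0}$ under the alternative parametrization and verifies on a small grid that both risks stay strictly inside $(0, 1).$ The function names and the grid are illustrative assumptions; the code is a plain transcription of the formula and is not the implementation used in the brrr function. For very large values of the linear predictors a naive evaluation like this one may overflow or lose precision, so a production implementation would likely need a more careful formulation.

```python
import math

def pi0_alternative(eta0_a, eta1):
    """Baseline risk pi_{i0} from expression (6).

    eta0_a is the alternative nuisance parameter in (5) and eta1 the
    log-relative risk. The denominator 2 exp(eta1 + eta0_a) is always positive.
    """
    a = math.exp(eta1)
    c = math.exp(eta0_a)
    s = 1.0 + c * (1.0 + a)
    return (s - math.sqrt(s * s - 4.0 * a * c * c)) / (2.0 * a * c)

def to_eta_alternative(p0, p1):
    """Map the risks (pi_{i0}, pi_{i1}) to (eta_{i0}^A, eta_{i1}) as in (5)."""
    return math.log(p0 / ((1.0 - p0) * (1.0 - p1))), math.log(p1 / p0)

# Round trip on hypothetical risks.
print(pi0_alternative(*to_eta_alternative(0.2, 0.5)))

# Both risks remain in (0, 1) on a grid that includes eta0_a = 0.
grid = [k / 2.0 for k in range(-8, 9)]
inside = all(
    0.0 < pi0_alternative(e0, e1) < 1.0
    and 0.0 < pi0_alternative(e0, e1) * math.exp(e1) < 1.0
    for e0 in grid
    for e1 in grid
)
print(inside)
```

Unlike the evaluation of (3), no value of the nuisance parameter requires special treatment here, including $η_{i 0}^{A} = 0.$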

The log-likelihood function for $θ^{A} = ((β^{A})^{T}, γ^{T})^{T} \in \mathbb{R}^{d + p}$ is given by (4) with $π_{i} = π_{i 0} \exp (t_{i} η_{i 1})$ and $π_{i 0}$ expressed by (6). The score function $U = U (θ^{A}) = (\partial / \partial θ^{A}) ℓ (θ^{A})$ has the form $U = ((U_{β}^{A})^{T}, U_{γ}^{T})^{T}$ with components

$$U_{β}^{A} = X^{T} V^{- 1} D_{1}^{A η_{0}} (y - π) \quad \text{and} \quad U_{γ} = Z^{T} V^{- 1} D_{1}^{η_{1}} (y - π),$$

where $y = (y_{1}, \dots, y_{n})^{T},$ $π = (π_{1}, \dots, π_{n})^{T},$ $V$ is a diagonal matrix with entries $v_{i} = π_{i} (1 - π_{i}),$ $D_{1}^{A η_{0}}$ is a diagonal matrix with entries $d_{1, i}^{A η_{0}} = (\partial / \partial η_{i 0}^{A}) π_{i},$ $i = 1, \dots, n,$ and $D_{1}^{η_{1}}$ is a diagonal matrix with entries $d_{1, i}^{η_{1}} = (\partial / \partial η_{i 1}) π_{i},$ $i = 1, \dots, n.$ The expected information is given by

$$i (θ^{A}) = \begin{bmatrix} i_{β^{A} β^{A}} & i_{β^{A} γ} \\ i_{γ β^{A}} & i_{γ γ} \end{bmatrix} = \begin{bmatrix} X^{T} V^{- 1} (D_{1}^{A η_{0}})^{2} X & X^{T} V^{- 1} D_{1}^{A η_{0}} D_{1}^{η_{1}} Z \\ Z^{T} V^{- 1} D_{1}^{A η_{0}} D_{1}^{η_{1}} X & Z^{T} V^{- 1} (D_{1}^{η_{1}})^{2} Z \end{bmatrix}.$$

As in Richardson et al.,⁹ the maximum likelihood estimator ${\hat{θ}}^{A} = (({\hat{β}}^{A})^{T}, {\hat{γ}}^{T})^{T}$ is obtained by solving the likelihood equation without any constraint on the parameter space.
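The matrix expressions for the score and the expected information can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy data, the parameter values and the function names are hypothetical, and the derivatives of $π_{i}$ with respect to the linear predictors are approximated by central finite differences rather than by the analytical expressions given in the Supplemental material.

```python
import math

def pi0_alt(eta0_a, eta1):
    """Baseline risk from expression (6)."""
    a, c = math.exp(eta1), math.exp(eta0_a)
    s = 1.0 + c * (1.0 + a)
    return (s - math.sqrt(s * s - 4.0 * a * c * c)) / (2.0 * a * c)

def prob(eta0_a, eta1, t):
    """pi_i = pi_{i0} exp(t_i eta_{i1})."""
    return pi0_alt(eta0_a, eta1) * math.exp(t * eta1)

def dprob(eta0_a, eta1, t, h=1e-6):
    """Finite-difference stand-ins for the diagonal entries of the D_1 matrices."""
    d0 = (prob(eta0_a + h, eta1, t) - prob(eta0_a - h, eta1, t)) / (2.0 * h)
    d1 = (prob(eta0_a, eta1 + h, t) - prob(eta0_a, eta1 - h, t)) / (2.0 * h)
    return d0, d1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loglik(beta_a, gamma, X, Z, t, y):
    """Log-likelihood (4) under the alternative parametrization."""
    total = 0.0
    for xi, zi, ti, yi in zip(X, Z, t, y):
        p = prob(dot(xi, beta_a), dot(zi, gamma), ti)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def score_and_information(beta_a, gamma, X, Z, t, y):
    """Score (U_beta^A, U_gamma) and expected information, built row by row."""
    q = len(beta_a) + len(gamma)
    score = [0.0] * q
    info = [[0.0] * q for _ in range(q)]
    for xi, zi, ti, yi in zip(X, Z, t, y):
        e0, e1 = dot(xi, beta_a), dot(zi, gamma)
        p = prob(e0, e1, ti)
        d0, d1 = dprob(e0, e1, ti)
        v = p * (1.0 - p)
        # Gradient of pi_i with respect to (beta^A, gamma).
        g = [xk * d0 for xk in xi] + [zk * d1 for zk in zi]
        for r in range(q):
            score[r] += g[r] * (yi - p) / v
            for s in range(q):
                info[r][s] += g[r] * g[s] / v
    return score, info

# Hypothetical toy data: intercept plus one binary covariate in X, intercept only in Z.
X = [[1, 0], [1, 1], [1, 0], [1, 1], [1, 0], [1, 1]]
Z = [[1]] * 6
t = [0, 0, 1, 1, 0, 1]
y = [0, 1, 1, 1, 0, 0]
beta_a, gamma = [-0.5, 0.3], [0.4]

score, info = score_and_information(beta_a, gamma, X, Z, t, y)
print(score)
print(info)
```

A simple sanity check is to compare each score component with a finite-difference derivative of the log-likelihood; the two should agree up to numerical error.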

Even adopting the alternative parametrization (5), maximum likelihood estimation may be improved by using bias reduction methods, which are illustrated in the following section for a general statistical model.

5. Modified score equations

Let $ℓ (θ)$ be the log-likelihood function for a general parametric model with parameter $θ = (θ_{1}, \dots, θ_{q})^{T} \in Θ \subseteq R^{q} .$ Let $U_{r} = U_{r} (θ) = \partial ℓ (θ) / \partial θ_{r}$ be the $r$ th element of the score function, $U (θ),$ for $r = 1, \dots, q .$ The observed and the expected information are denoted, respectively, by $j (θ) = - \partial^{2} ℓ (θ) / (\partial θ \partial θ^{T})$ and $i (θ) = E_{θ} {j (θ)} .$ Firth¹⁵ and Kenne Pagui et al.¹⁶ show that it is possible to obtain estimators with smaller mean and median bias, compared to the maximum likelihood one, through a suitable modification of the score function. The papers by Kosmidis and Firth²⁷ and Kenne Pagui et al.²⁸ give alternative and more computationally attractive matrix expressions of these adjustment terms. In particular, the mean bias-reduced estimator ${\hat{θ}}^{*}$ proposed by Firth¹⁵ can be obtained by solving the equation

$$U^{*} (θ) = U (θ) + A^{*} (θ) = 0. \qquad (7)$$

The vector $A^{*} (θ)$ has elements $A_{r}^{*} (θ) = \mathrm{Tr} [i (θ)^{- 1} \{P_{r} (θ) + Q_{r} (θ)\}] / 2,$ where $\mathrm{Tr} (\cdot)$ is the trace operator, $P_{r} (θ) = E_{θ} \{U (θ) U (θ)^{T} U_{r} (θ)\}$ and $Q_{r} (θ) = - E_{θ} \{j (θ) U_{r} (θ)\}$ for $r = 1, \dots, q.$

Similarly, the median bias-reduced estimator, $\tilde{θ},$ is obtained by solving

$$\tilde{U} (θ) = U (θ) + \tilde{A} (θ) = 0 \qquad (8)$$

with $\tilde{A} (θ) = A^{*} (θ) - i (θ) {\tilde{F}}_{2}.$ The vector ${\tilde{F}}_{2}$ has entries ${\tilde{F}}_{2, r} = [i (θ)^{- 1}]_{r}^{T} F_{2 r}$ for $r = 1, \dots, q,$ where $[i (θ)^{- 1}]_{r}$ represents the $r$th column of $i (θ)^{- 1}$ and $F_{2 r}$ is a vector with elements $F_{2 s, r} = \mathrm{Tr} [h_{r} (θ) \{P_{s} (θ) / 3 + Q_{s} (θ) / 2\}]$ for $s = 1, \dots, q.$ Above, $h_{r} (θ) = [i (θ)^{- 1}]_{r} [i (θ)^{- 1}]_{r}^{T} / i^{r r} (θ)$ is a $q \times q$ matrix, where $i^{r r} (θ)$ is the $(r, r)$ element of $i (θ)^{- 1}.$

Under mild regularity conditions, ${\hat{θ}}^{*}$ has bias of order $O (n^{- 2}),$ that is, $E_{θ} ({\hat{θ}}^{*}) = θ + O (n^{- 2}).$ Since the bias is strictly tied to the parametrization, ${\hat{θ}}^{*}$ is equivariant only under linear transformations. The median bias-reduced estimator is component-wise median unbiased with an error of order $O (n^{- 3 / 2})$ in the continuous case, that is, $P_{θ} ({\tilde{θ}}_{r} \leq θ_{r}) = 1 / 2 + O (n^{- 3 / 2}).$ Moreover, $\tilde{θ}$ is equivariant under monotone component-wise transformations of $θ.$ Asymptotically, both ${\hat{θ}}^{*}$ and $\tilde{θ}$ have the same multivariate normal distribution as the maximum likelihood estimator, $N_{q} (θ, i (θ)^{- 1}).$ The quantities needed to obtain the mean and median bias-reduced estimators for the model with parametrization (5), as well as the ones for model (2), are given in the Supplemental material, Section S1. Moreover, all the improved methods for models (5) and (2), together with maximum likelihood fitting, are available in the R²² function brrr on GitHub (https://github.com/Francesco16p/brrr). Maximum likelihood fitting for the parametrization of Richardson et al.⁹ can also be performed using the R package brm (https://CRAN.R-project.org/package=brm).
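The adjusted score equations (7) and (8) can be worked through by hand in a much simpler setting than the models of this paper. The Python sketch below assumes $n$ independent Bernoulli observations with a single logit parameter $θ$ and total number of successes $y$; this toy model, the function names and the bisection solver are illustrative assumptions only. Here $U (θ) = y - n π,$ $i (θ) = n π (1 - π),$ $P (θ) = n π (1 - π) (1 - 2 π)$ and $Q (θ) = 0,$ so the mean adjustment is $(1 - 2 π) / 2$ and the median adjustment is $(1 - 2 π) / 6.$

```python
import math

def expit(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def adjusted_score(theta, y, n, kind):
    """Score of the toy one-parameter logit model, with optional adjustment.

    kind is "ml" (no adjustment), "mean" (equation (7)) or "median" (equation (8)).
    """
    p = expit(theta)
    u = y - n * p                                # score U(theta)
    info = n * p * (1.0 - p)                     # expected information i(theta)
    P = n * p * (1.0 - p) * (1.0 - 2.0 * p)      # E{U^3}, since q = 1
    Q = 0.0                                      # -E{j U} = 0: j is not random here
    a_mean = (P + Q) / (2.0 * info)              # A*(theta)
    if kind == "ml":
        return u
    if kind == "mean":
        return u + a_mean
    # With q = 1, h(theta) = 1 / i(theta).
    f2 = (P / 3.0 + Q / 2.0) / info              # F_2
    f2_tilde = f2 / info                         # F~_2
    return u + a_mean - info * f2_tilde          # U + A~(theta)

def solve(y, n, kind, lo=-30.0, hi=30.0):
    """Root of the (adjusted) score by bisection; returns +/- inf if there is no sign change."""
    f_lo, f_hi = adjusted_score(lo, y, n, kind), adjusted_score(hi, y, n, kind)
    if f_lo * f_hi > 0:
        return -math.inf if f_lo < 0 else math.inf
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if adjusted_score(mid, y, n, kind) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# No successes out of 10: the maximum likelihood estimate is infinite,
# while both adjusted equations have a finite root.
for kind in ("ml", "mean", "median"):
    print(kind, solve(0, 10, kind))
```

In this toy model the roots are available in closed form, $π = (y + 1 / 2) / (n + 1)$ for the mean bias-reduced estimator and $π = (y + 1 / 6) / (n + 1 / 3)$ for the median bias-reduced one, which gives a direct check of the numerical solution and shows how the adjustments keep the estimate finite when $y = 0.$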

6. The nonmetastatic osteogenic sarcoma dataset

Consider again the nonmetastatic osteogenic sarcoma dataset of Section 2. Here, we apply the mean and median bias-reduced estimators to model (5) and to the one of Richardson et al.⁹ The specification of the two models is $η_{i 1} = γ_{0}$ for the log-relative risk and

$$\begin{aligned} η_{i 0}^{A} & = β_{0}^{A} + β_{1}^{A} {sex}_{i} + β_{2}^{A} {AOP}_{i} \\ η_{i 0} & = β_{0} + β_{1} {sex}_{i} + β_{2} {AOP}_{i} \end{aligned}$$

for the nuisance parts. Table 2 shows that the conclusions which can be drawn about the log-relative risk $γ_{0}$ do not seem to be affected by the parametrization, supporting the view that the choice of one nuisance parametrization over the other can usually be made mainly on the basis of computational convenience. Moreover, as typically happens, the bias-reducing techniques give point estimates that are shrunk toward zero compared to the maximum likelihood ones. These techniques also appear to be effective in solving the infinite estimates problem in the nuisance part. Finally, in this example, standard errors are larger than those associated with maximum likelihood which, together with the shrunk estimates, produces $p$-values that are substantially larger.

Table 2.
Nonmetastatic osteogenic sarcoma dataset: Maximum likelihood (MLE), mean bias-reduced (MEAN) and median bias-reduced (MEDIAN) estimates of $γ_{0}$ in model (5) and in the model of Richardson et al.⁹

Model	Estimate of $γ_{0}$	Std error	p-value	Infinite estimates
Model (5) (MLE)	−0.402	0.160	0.012	YES
Model (5) (MEAN)	−0.320	0.195	0.102	NO
Model (5) (MEDIAN)	−0.325	0.189	0.085	NO
Richardson et al.⁹ (MLE)	−0.402	0.160	0.012	YES
Richardson et al.⁹ (MEAN)	−0.313	0.193	0.105	NO
Richardson et al.⁹ (MEDIAN)	−0.326	0.187	0.081	NO

7. Simulation studies

In this section, we present results from two simulation studies providing empirical evidence that mean and median bias reduction largely improve on maximum likelihood, both in terms of finite sample properties and in terms of computational stability. The simulation experiment of Section 7.1 corresponds to a standard setting with a fixed number of covariates and increasing sample size. Section 7.2 illustrates the results of simulation experiments with many nuisance parameters. In each simulation, we assume that the model is correctly specified. This means that data are simulated separately from model (5) and from the one of Richardson et al.⁹ For this reason, the results for the two models are in general not directly comparable. In the following, we reduce this discrepancy by relating the nuisance parameters in (5) to those in (2), so that the probabilities in the two simulations are approximately the same. This approximation is reported in Appendix A.2.

7.1. Setting 1: Fixed number of explanatory variables

In this first scenario, we assume

$$η_{i 1} = γ_{0} z_{i 0} + γ_{1} z_{i 1} + γ_{2} z_{i 2} + γ_{3} z_{i 3}$$

and

$$\begin{aligned} η_{i 0}^{A} & = β_{0}^{A} x_{i 0} + β_{1}^{A} x_{i 1} + β_{2}^{A} x_{i 2} + β_{3}^{A} x_{i 3} \\ η_{i 0} & = β_{0} x_{i 0} + β_{1} x_{i 1} + β_{2} x_{i 2} + β_{3} x_{i 3} \end{aligned}$$

where $i = 1, \dots, n,$ $z_{i 0} = x_{i 0} = 1,$ $z_{i 1} = x_{i 1}$ is a realization of a Poisson random variable with mean 1 and $z_{i 2} = x_{i 2},$ $z_{i 3} = x_{i 3}$ are drawn from two independent uniform random variables on $(0, 1).$ For half the observations in the sample the binary exposure factor $t_{i}$ is equal to 1. The log-relative risk vector is $γ = (1, - 0.5, 0.5, - 0.5)^{T},$ while $β^{A}$ is implicitly defined by the Taylor approximation reported in Appendix A.2 with $β = (1, - 0.5, 0.5, - 0.5)^{T},$ which, in turn, is the true parameter value when the data are generated from the model of Richardson et al.⁹ As for the sample size, we consider three different values, $n = 40, 80, 120,$ and, for each of them, we generate $10^{4}$ samples of the response variable holding the design matrix fixed.

For each model, the three estimators are evaluated in terms of estimated bias (BIAS), empirical probability of underestimation (PU), root mean squared error (RMSE) and empirical coverage of the 95% Wald-type confidence intervals (WALD). The values of PU and WALD are reported in percentages. For model (5), Figure 2 shows that the median bias-reduced estimator ${\tilde{γ}}^{A}$ presents values of PU that are systematically the closest to 50%, while ${\hat{γ}}^{* A}$ has the smallest BIAS. In addition, ${\tilde{γ}}^{A}$ performs better than ${\hat{γ}}^{A}$ in terms of BIAS, and the two bias-reduced estimators also have smaller values of RMSE. In general, the coverage of Wald-type confidence intervals is comparable for all the estimators. However, there are some cases in which the empirical coverage of the Wald-type confidence intervals for the maximum likelihood estimator is well below the nominal level, in particular when the corresponding estimator is severely biased. The differences among the three estimators become smaller with increasing $n,$ but they are not negligible even for $n = 120.$ In Figure 2, we only report the results for the estimators of $γ.$ Similar conclusions are drawn for the parameters in $β^{A}.$ Simulations under model (2) lead to analogous conclusions, as can be seen in Figure S1 in the Supplemental material. The results of each simulation are also displayed in Tables S1 and S2 in the Supplemental material.
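The four performance measures can be computed from a set of simulated estimates as in the following Python sketch. The function name `summarize`, the normal quantile used for the Wald interval and the toy inputs are assumptions made for illustration; the sketch handles a single scalar parameter and does not address how non-converged or infinite estimates are treated in the paper's simulations.

```python
import math

def summarize(estimates, std_errors, truth, z=1.959964):
    """BIAS, PU (%), RMSE and WALD coverage (%) for one scalar parameter.

    estimates and std_errors hold one value per simulated sample;
    z is the standard normal quantile for a 95% Wald-type interval.
    """
    n_sim = len(estimates)
    bias = sum(estimates) / n_sim - truth
    pu = 100.0 * sum(e < truth for e in estimates) / n_sim
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n_sim)
    wald = 100.0 * sum(
        abs(e - truth) <= z * s for e, s in zip(estimates, std_errors)
    ) / n_sim
    return {"BIAS": bias, "PU": pu, "RMSE": rmse, "WALD": wald}

# Hypothetical estimates and standard errors from four simulated samples.
toy = summarize([0.9, 1.1, 1.3, 0.8], [0.1, 0.1, 0.1, 0.1], truth=1.0)
print(toy)
```

A median unbiased estimator would give PU close to 50%, and a well-calibrated Wald interval would give WALD close to 95%, which is how the values in Figure 2 are read.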

Figure 2.

Finite sample properties of maximum likelihood (circle), mean bias-reduced estimator (triangle) and median bias-reduced estimator (square) of $γ$ under model (5) specified as in Section 7.1.

We now focus on the computational stability of the proposed methods. Table 3 reports the number of failed convergences of the maximum likelihood estimators and the bias-reduced ones in the two models. When $n = 40,$ the number of observations is small compared to the number of parameters of the model, often leading to infinite maximum likelihood estimates. In this setting, the bias-reduced estimators are more computationally stable than maximum likelihood and are also effective in preventing infinite estimates. With increasing $n,$ the number of failed convergences drops to zero for the maximum likelihood estimators and for all the estimators under the alternative parametrization. However, when the sample size increases and the parametrization proposed by Richardson et al.⁹ is adopted, the bias-reduced estimators have a higher number of failed convergences than the maximum likelihood estimator. This is not true for the alternative parametrization, which is also markedly more computationally stable. In our opinion, this is probably due to the corrections that have to be imposed when the denominator in (3) is close to 0. Since equations (7) and (8) are mathematically quite involved, the small perturbations which derive from the corrections lead to higher instability in the estimation algorithms.

Table 3.

Number of failed convergences for each estimator in the simulation study of Section 7.1.

Estimator	$n = 40$	$n = 80$	$n = 120$
${\hat{θ}}^{A}$	624	0	0
$\hat{θ}$	1312	9	0
${\hat{θ}}^{* A}$	31	0	0
${\hat{θ}}^{*}$	91	52	58
${\tilde{θ}}^{A}$	119	0	0
$\tilde{θ}$	319	42	55

7.2. Setting 2: Many nuisance parameters

In Section 7.1, the proposed methodologies were evaluated in settings where the number of observations is considerably larger than the dimension of the parameter space. Indeed, the mean and the median bias reduction methods of Firth¹⁵ and Kenne Pagui et al.¹⁶ are theoretically justified in the standard asymptotic regime where the number of parameters is fixed and the sample size tends to infinity. At the same time, in current applications, it is not uncommon to deal with models where the parameter dimension is of the same order as the sample size. Motivated by this fact, we investigate the behaviour of the proposed methods through a simulation study where the sample size is fixed and the number of parameters in the nuisance part is allowed to increase, see, for example, He et al.²⁹ In more detail, we consider four different sample sizes, $n = 40, 60, 80, 100,$ and five possible values for the number of coefficients in the nuisance component, that is, $d = 4, 8, 12, 16, 20.$ For each combination of $n$ and $d,$ we generate $10^{4}$ realizations of model (5) and model (2) with $η_{i 1} = γ_{0} z_{i 0}$ and

$$η_{i 0}^{A} = β_{0}^{A} x_{i 0} + \sum_{j = 1}^{d - 1} β_{j}^{A} x_{i j} \quad \text{or} \quad η_{i 0} = β_{0} x_{i 0} + \sum_{j = 1}^{d - 1} β_{j} x_{i j},$$

respectively, where $i = 1, \dots, n,$ $z_{i 0} = x_{i 0} = 1$ and $x_{i j},$ $j = 1, \dots, d - 1,$ are drawn from independent Bernoulli random variables with probability equal to $0.5.$ The binary exposure is equal to 1 for the first $n / 2$ observations and 0 otherwise. Throughout the study, the log-relative risk is set to $γ_{0} = 1.$

As in Section 7.1, the values of the parameters of the two models are chosen by exploiting the Taylor approximation reported in Appendix A.2, making the results from the two simulations approximately comparable. In particular, the parameters of model (5) are implicitly defined by taking $β_{0} = 0$ and $β_{j} = (- 1)^{j + 1} / d,$ $j = 1, \dots, d - 1,$ which, in turn, are the parameter values when the data are generated from the model proposed by Richardson et al.⁹ In the definition of $β,$ the rescaling factor $1 / d$ has been imposed, as in Sur and Candès,⁵ to prevent the possibility of having probabilities that converge to 0 or 1 as the number of parameters increases.
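The design just described can be sketched in a few lines. The snippet below builds the exposure indicator, the Bernoulli nuisance covariates and the coefficient vector $β$ for one combination of $n$ and $d,$ and evaluates the nuisance linear predictor $η_{i 0}.$ The function name `make_design` and the seed are hypothetical, and the step that generates the binary response from model (5) or model (2) is deliberately omitted, since it depends on the model equations and on the Taylor approximation of Appendix A.2, which are not reproduced here.

```python
import random


def make_design(n, d, seed=0):
    """Build exposure, nuisance covariates, beta and eta_{i0} for Setting 2."""
    rng = random.Random(seed)
    # Binary exposure: 1 for the first n/2 observations, 0 otherwise
    exposure = [1 if i < n // 2 else 0 for i in range(n)]
    # Nuisance covariates: intercept x_{i0} = 1 plus d - 1 Bernoulli(0.5) columns
    x = [[1] + [1 if rng.random() < 0.5 else 0 for _ in range(d - 1)]
         for _ in range(n)]
    # beta_0 = 0 and beta_j = (-1)^(j + 1) / d for j = 1, ..., d - 1
    beta = [0.0] + [(-1) ** (j + 1) / d for j in range(1, d)]
    # Nuisance linear predictor eta_{i0} = sum_j beta_j * x_{ij}
    eta0 = [sum(b * xij for b, xij in zip(beta, row)) for row in x]
    return exposure, x, beta, eta0


gamma0 = 1.0  # log-relative risk used throughout Setting 2
exposure, x, beta, eta0 = make_design(n=40, d=8)
print(beta)
print(min(eta0), max(eta0))
```

Because the coefficients alternate in sign and are divided by $d,$ the linear predictor stays in a bounded range whatever the value of $d,$ which is the purpose of the rescaling factor.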

Table 4 reports the number of failed convergences of each method for every combination of $n$ and $p,$ where $p$ is the number of coefficients in the nuisance component (denoted by $d$ in the model specification above). With fixed $n,$ as the number of parameters increases, all the estimators tend to have a higher number of failed convergences. At the same time, it is clear that the mean and median bias-reduced estimators are remarkably more computationally stable than maximum likelihood. This is particularly evident when the sample size is $n = 40$ and $p = 16, 20.$ In this context, the percentage of failed convergences for maximum likelihood is always larger than $95\%,$ while for the bias-reduced estimators it varies between $0.98\%$ and $12.12\% .$ In addition, the algorithms to obtain the mean and median bias-reduced estimators under model (5) are again more computationally stable than the ones for model (2).

Table 4.
Number of failed convergences for each estimator in the simulation study of Section 7.2.

Estimator	$p = 4$	$p = 8$	$p = 12$	$p = 16$	$p = 20$	Estimator	$p = 4$	$p = 8$	$p = 12$	$p = 16$	$p = 20$
$n = 40$						$n = 60$
${\hat{θ}}^{A}$	1597	4475	7943	9704	9980	${\hat{θ}}^{A}$	385	1226	3104	6900	9435
$\hat{θ}$	2026	4409	8019	9617	9942	$\hat{θ}$	561	1350	3306	6991	9375
${\hat{θ}}^{* A}$	0	6	22	98	505	${\hat{θ}}^{* A}$	0	0	10	28	41
${\hat{θ}}^{*}$	9	56	81	210	1054	${\hat{θ}}^{*}$	7	33	67	116	120
${\tilde{θ}}^{A}$	0	85	197	237	693	${\tilde{θ}}^{A}$	0	9	92	213	196
$\tilde{θ}$	9	183	306	388	1212	$\tilde{θ}$	4	58	155	323	317
$n = 80$						$n = 100$
${\hat{θ}}^{A}$	107	446	893	2659	6191	${\hat{θ}}^{A}$	26	89	118	601	2282
$\hat{θ}$	162	458	971	2845	6339	$\hat{θ}$	43	75	180	757	2520
${\hat{θ}}^{* A}$	0	0	3	17	33	${\hat{θ}}^{* A}$	0	0	0	6	17
${\hat{θ}}^{*}$	9	50	70	77	115	${\hat{θ}}^{*}$	8	50	74	97	118
${\tilde{θ}}^{A}$	0	2	26	90	170	${\tilde{θ}}^{A}$	0	0	2	28	82
$\tilde{θ}$	9	39	74	144	241	$\tilde{θ}$	11	49	72	118	163
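The percentages quoted for $n = 40$ and $p = 16, 20$ follow directly from the counts in Table 4 divided by the $10^{4}$ replications. The short check below recomputes them; the dictionary labels are ours, with the superscript $A$ rows mapped to model (5) and the remaining rows to model (2).

```python
# Failed-convergence counts for n = 40 taken from Table 4 (p = 16, p = 20)
REPLICATIONS = 10_000
counts_n40 = {
    "ML, model (5)": (9704, 9980),
    "ML, model (2)": (9617, 9942),
    "mean BR, model (5)": (98, 505),
    "mean BR, model (2)": (210, 1054),
    "median BR, model (5)": (237, 693),
    "median BR, model (2)": (388, 1212),
}

# Convert each count to a percentage of the replications
rates = {name: tuple(100.0 * c / REPLICATIONS for c in pair)
         for name, pair in counts_n40.items()}

# Separate maximum likelihood (ML) from the bias-reduced (BR) estimators
ml_rates = [r for name, pair in rates.items()
            if name.startswith("ML") for r in pair]
br_rates = [r for name, pair in rates.items()
            if not name.startswith("ML") for r in pair]

print("smallest ML failure rate (%):", min(ml_rates))
print("BR failure rates range (%):", min(br_rates), "to", max(br_rates))
```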

Finally, Figure 3 reports the finite sample properties of mean and median bias-reduced estimators under model (5). Maximum likelihood is omitted due to the high number of failed convergences. The figure highlights that the bias-reduced estimator of Firth¹⁵ has, in general, the lowest bias, while the median bias-reduced estimator of Kenne Pagui et al.¹⁶ presents values of PU closest to $50\% .$ RMSE and WALD are instead comparable between the two estimators. Simulation results obtained under the model of Richardson et al.⁹ are reported in Figure S2 of the Supplemental material and lead, in practice, to the same conclusions.

Figure 3.

Finite sample properties of mean bias-reduced estimator (triangle) and median bias-reduced estimator (square) of $γ_{0}$ under model (5) specified as in Section 7.2.

8. Discussion

The aim of this work was to develop methods for accurate estimation of the log-relative risk in challenging contexts, such as in situations of small sample size or when infinite maximum likelihood estimates may occur. This goal is reached by applying the mean and median bias reduction techniques developed by Firth¹⁵ and Kenne Pagui et al.¹⁶ to model (5), proposed in Section 4 as an improvement of the model of Richardson et al.⁹ that makes the estimation method more computationally stable.

The simulation study in Section 7.1 confirms that, when both maximum likelihood and bias-reduced estimators exist with high probability, mean and median bias reduction are highly preferable to maximum likelihood in terms of finite sample properties. In particular, the median bias-reduced estimator is remarkably well centred around the true value of the parameter in both parametrizations and has a bias which is much lower than that of the maximum likelihood estimator. The method of Firth,¹⁵ in turn, is the most effective in reducing the mean bias of the maximum likelihood estimator. In general, this also results in a good centring of the estimator around the true value of the parameter of interest. Even though not yet supported by theory, a similar behaviour also seems to hold in the challenging setting with many nuisance parameters described in Section 7.2.

Moreover, both Sections 6 and 7 show that bias reduction methods are also effective in preventing non-finite estimates and, in general, they provide estimators which are more computationally stable. When the sample size increases, this advantage is partly lost if we adopt the parametrization proposed by Richardson et al.⁹ This is probably due to the fact that the estimation algorithms are quite sensitive to the correction that has to be imposed when the linear predictor associated with the nuisance parameter is close to zero. In contrast, this problem is not present in the alternative parametrization proposed in this paper, making the latter particularly attractive for practical purposes.

We conclude by highlighting aspects that may be relevant for future work. First, the methods developed in this paper concern estimation of the log-relative risk with one binary exposure. Extensions that allow continuous or categorical exposures as in Yin et al.³⁰ are feasible in principle, but not straightforward, and additional work in this direction is needed. A further point deserving consideration is possible model misspecification and its consequences, in particular as regards the nuisance component. Ideally, the behaviour of estimators of $γ$ should be stable under different nuisance specifications. Although an extensive analysis of this point is outside the scope of this paper, we may consider the results in Section 6 as a first indication. Table 2 shows that, if the choice is between model (5) and model (2), the selected estimation method has a larger impact than the nuisance specification. Of course, the simulation results of Section 7 and, more generally, the finite sample properties of bias reduction methods depend on the assumption that the model is a reasonable approximation of the data-generating mechanism. Moreover, mean and median bias reduction methods are developed for parametric models and a careful selection of the explanatory variables is needed to apply these techniques. The extension of bias reduction to semiparametric models like the one in Section 3 of Richardson et al.⁹ represents a challenging open question.

Supplemental Material

sj-zip-1-smm-10.1177_09622802231167436 - Supplemental material for Improved and computationally stable estimation of relative risk regression with one binary exposure

Supplemental material, sj-zip-1-smm-10.1177_09622802231167436 for Improved and computationally stable estimation of relative risk regression with one binary exposure by Francesco Pozza, Euloge Clovis Kenne Pagui and Alessandra Salvan in Statistical Methods in Medical Research

Supplemental Material

sj-pdf-2-smm-10.1177_09622802231167436 - Supplemental material for Improved and computationally stable estimation of relative risk regression with one binary exposure

Supplemental material, sj-pdf-2-smm-10.1177_09622802231167436 for Improved and computationally stable estimation of relative risk regression with one binary exposure by Francesco Pozza, Euloge Clovis Kenne Pagui and Alessandra Salvan in Statistical Methods in Medical Research

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

ORCID iDs

Francesco Pozza

Alessandra Salvan

Supplemental material

Supplementary material for this article is available online.

References

1. Agresti. Categorical data analysis. 3rd ed. Hoboken, NJ: John Wiley & Sons, 2013.

2. Hastie, Tibshirani, Friedman. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer, 2009.

3. Kolassa, Tanner. Small-sample confidence regions in exponential families. Biometrics 1999; 55: 1291–1294.

4. Kosmidis, Kenne Pagui, Sartori. Mean and median bias reduction in generalized linear models. Stat Comput 2020; 30: 43–59.

5. Sur, Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. Proc Natl Acad Sci USA 2019; 116: 14516–14525.

6. Lumley, Kronmal. Relative risk regression in medical research: models, contrasts, estimators, and algorithms. UW Biostatistics Working Paper Series, Working paper 293. 2006; http://biostats.bepress.com/uwbiostat/paper293/.

7. Rothman, Greenland, Lash. Modern epidemiology. 3rd ed. Philadelphia: Wolters Kluwer Health/Lippincott Williams & Wilkins, 2008.

8. McNutt, Xue, et al. Estimating the relative risk in cohort studies and clinical trials of common outcomes. Am J Epidemiol 2003; 157: 940–943.

9. Richardson, Robins, Wang. On modeling and estimation for the relative risk and risk difference. J Am Stat Assoc 2017; 112: 1121–1130.

10. Donoghoe, Marschner. Logbin: an R package for relative risk regression using the log-binomial model. J Stat Softw 2018; 86: 1–22.

11. Barros, Hirakata. Alternatives for logistic regression in cross-sectional studies: an empirical comparison of models that directly estimate the prevalence ratio. BMC Med Res Methodol 2003; 3: 1–13.

12. Zou. A modified Poisson regression approach to prospective studies with binary data. Am J Epidemiol 2004; 159: 702–706.

13. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability 1967; vol. 1, pp.221–233.

14. Albert, Anderson. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984; 71: 1–10.

15. Firth. Bias reduction of maximum likelihood estimates. Biometrika 1993; 80: 27–38.

16. Kenne Pagui, Salvan, Sartori. Median bias reduction of maximum likelihood estimates. Biometrika 2017; 104: 923–938.

17. Kosmidis, Firth. Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models. Biometrika 2021; 108: 71–82.

18. Kyriakou, Kosmidis, Sartori. Median bias reduction in random-effects meta-analysis and meta-regression. Stat Methods Med Res 2019; 28: 1622–1636.

19. Kenne Pagui, Colosimo. Adjusted score functions for monotone likelihood in the Cox regression model. Stat Med 2020; 39: 1558–1572.

20. Kenne Pagui, Salvan, Sartori. Improved estimation in negative binomial regression. Stat Med 2022; 41: 2403–2416.

21. Kenne Pagui, Pozza, Salvan. Improved maximum likelihood estimator in relative risk regression. In: Book of short papers SIS 2021. Pearson, pp. 1138–1143. https://www.pearson.com/.

22. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2022. https://www.R-project.org/.

23. Goorin, Perez-Atayde, Gebhardt, et al. Weekly high-dose methotrexate and doxorubicin for osteosarcoma: the Dana-Farber Cancer Institute/The Children’s Hospital–Study III. J Clin Oncol 1987; 5: 1178–1184.

24. Hirji, Mehta, Patel. Computing distributions for exact logistic regression. J Am Stat Assoc 1987; 82: 1110–1117.

25. Carter, Lipsitz, Tilley. Quasi-likelihood estimation for relative risk regression models. Biostatistics 2005; 6: 39–44.

26. de Andrade, de Leon Andrade. Some results for maximum likelihood estimation of adjusted relative risks. Commun Stat-Theory Method 2018; 47: 5750–5769.

27. Kosmidis, Firth. A generic algorithm for reducing bias in parametric estimation. Electron J Stat 2010; 4: 1097–1112.

28. Kenne Pagui, Salvan, Sartori. Efficient implementation of median bias reduction with applications to general regression models. 2020; http://arxiv.org/abs/2004.08630.

29. He, Meng, Zeng, et al. On the phase transition of Wilks’ phenomenon. Biometrika 2021; 108: 741–748.

30. Yin, Markes, Richardson, et al. Multiplicative effect modeling: the general case. Biometrika 2022; 109: 559–566.
