Analysis of longitudinal zero-inflated count data using overall marginalized hurdle models

Abstract

Longitudinal zero-inflated count data arise frequently in fields such as medicine and the social sciences. Standard hurdle models separate zero and positive counts but do not allow direct inference on the marginal mean, which limits their ability to assess overall covariate effects. This paper introduces overall marginalized hurdle random effects models (OMHREMs) for zero-inflated count data, which extend the traditional hurdle model by directly modeling the marginal mean while using random effects to account for heterogeneity. OMHREMs yield population-average covariate effects that are interpreted analogously to odds ratios, showing how covariates influence the overall mean of zero-inflated count data. Through simulation studies, we evaluate the performance of OMHREMs. Furthermore, we apply our approach to systemic lupus erythematosus data and compare it with existing models.

Keywords

Hurdle model, zero inflation, heterogeneous random effects, overdispersed data

1. Introduction

In the medical and social sciences, it is common to encounter zero-inflated count data, which often arise from infrequent events such as lottery wins or specific disease diagnoses. A widely adopted method for analyzing this type of data is the zero-inflated Poisson (ZIP) model.¹ A key challenge with the ZIP model is that its zero-inflation parameters can be difficult to interpret intuitively, because zero outcomes can arise from either the binary process or the count process, reflecting its mixture structure. Moreover, the ZIP model faces significant limitations when data display zero deflation (fewer zeros than expected), as the parameter of its binary component may diverge to infinity,²,³ causing estimation problems.

In contrast, the hurdle model⁴ provides a more adaptable and stable framework. It addresses count data by distinctly partitioning it into two components: a binary model that differentiates between zero and nonzero observations, and a separate truncated count model that describes only the positive outcomes. This design allows the hurdle model to effectively manage both zero-inflated and zero-deflated data, making it a more dependable choice for analysis. While many discussions on choosing between hurdle and ZIP models focus on model fit statistics, some researchers, including Gilthorpe et al.⁵ and supported by Neelon et al.⁶ and Buu et al.,⁷ suggest that an understanding of the underlying data-generating mechanism should inform this decision. For example, applications where all zeros are understood to stem from a single process point towards the use of a hurdle model, rather than a ZIP model, which posits that zeros can originate from two different processes.

Analyzing longitudinal zero-inflated count data presents unique challenges, especially when interpreting the conditional mean with models like the hurdle model, where parameters often relate to distinct subpopulations. Several methods have been proposed to address this. Min and Agresti² introduced a two-part random effects model, referred to in this paper as the hurdle random effects model (HREM), which features a binary component for zero occurrences and a truncated count model for positive values. Neelon et al.⁶ presented a Bayesian strategy for analyzing such data. Recognizing that the Poisson distribution assumes equidispersion, which limits the ability of these models to capture the mean-variance relationship, Zhu et al.⁸ extended them by including heterogeneous random effects. This extension allows the variance to be modeled as a function of covariates in longitudinal data. Additionally, Kang et al.⁹ extended the Poisson hurdle model to a longitudinal Bayesian mixed effects hurdle Conway–Maxwell–Poisson (CMP) model, utilizing a Bayesian generalized linear mixed model approach.

In the analysis of longitudinal zero-inflated data, marginal model approaches are preferred when the primary interest lies in the marginal probability of the zero component and the marginal mean of the count component. Hall and Zhang¹⁰ developed a marginal modeling approach combining generalized estimating equations (GEEs) with the expectation–maximization (EM) algorithm to account for data correlation without requiring additional model assumptions.

However, their GEE framework has notable limitations: it does not allow for separate regression parameters for the zero-inflated and positive count components, and it is inapplicable under the missing at random (MAR) assumption. To overcome these constraints, Lee et al.¹¹ proposed a marginalized HREM (MHREM) and marginalized negative binomial (NB) hurdle model for clustered data. While these models specify the marginal relationship between responses and covariates using random effects to explain clustered dependence, a key limitation remains: the marginal probability and mean are still conditional on the count being zero or positive, respectively.

This methodological gap is critical because, in clinical and health services research, marginal means are essential for providing covariate-adjusted, population-average outcomes. These outcomes are directly interpretable and serve as key metrics for quantifying disease burden and healthcare utilization, making them ideal for answering population-based research questions. Even with skewed data, the mean remains fundamental for estimating average hospitalizations, which is a vital step for resource allocation and policy planning.

When evaluating overall covariate effects, standard hurdle models—and similarly, standard zero-inflated models¹²—distinguish between zero and positive counts. However, this distinction often lacks clinical relevance and limits direct inference on the overall mean response. To address this, Long et al.¹³ proposed marginalized ZIP models, which enable direct inference by averaging over the two ZIP model processes. Although Long et al.¹⁴ extended this to accommodate longitudinal zero-inflated count data by modeling the subject-specific response mean given random effects, the covariate effects on the marginal probability of excess zeros were still not explained directly, as they were modeled using conditional logistic regression. In contrast, Lee et al.¹⁵ proposed overall marginalized ZIP random effects models (MZIPREMs) for longitudinal zero-inflated count data. These new models not only directly capture marginal covariate effects on the overall mean but also rigorously account for heterogeneity through random effects, providing a more robust framework for clinical research.

In this paper, we adapt the overall MZIPREM approach proposed by Lee et al.¹⁵ to the hurdle model framework. The primary aim is to simultaneously account for covariate effects on both the marginal mean and the marginal probability of excess zero for longitudinal zero-inflated count data. Our goal is to analyze both zero-inflated and zero-deflated longitudinal count data using overall marginalized hurdle random effects models (OMHREMs). These proposed models are designed to capture marginal covariate effects on the overall marginal mean, account for subject-specific heterogeneity via random effects’ covariance, and ultimately provide interpretable population-average effects of covariates on the count response.

The structure of this paper is as follows: Section 2 introduces the motivating example using the Lupus dataset. Section 3 provides a comprehensive review of relevant literature on hurdle models. In Section 4, we introduce overall marginalized hurdle Poisson models with random effects (OMHREM), designed for longitudinal zero-inflated count data, and incorporate a heterogeneous covariance structure to capture marginal covariate effects. The likelihood approach for the OMHREM is further detailed in this section. Section 5 presents simulation studies to evaluate the bias and efficiency of our proposed estimation method. In Section 6, the OMHREM is applied to the Lupus dataset. Finally, Section 7 presents a discussion and concluding remarks.

2. Motivating data: Lupus data

The systemic lupus erythematosus (SLE) dataset utilized in this study is a longitudinal cohort with repeated measurements on individuals from a 10-year period (2008–2018). It was sourced from the Korean National Health Insurance Service (NHIS), a public health program covering approximately 98% of the Korean population. The NHIS database serves as a valuable resource for population-based research, providing comprehensive healthcare data, including patient sociodemographic characteristics, diagnostic codes, and healthcare utilization.

SLE is a chronic autoimmune disease characterized by dysregulation of immune responses, inflammation, and the formation of immune complexes, potentially affecting any organ or tissue. Its symptoms vary widely among patients and fluctuate over time, making diagnosis particularly challenging.¹⁶ SLE flare-ups can lead to permanent organ damage, with approximately 50% of patients experiencing such outcomes within a decade of diagnosis.¹⁷ The global incidence and prevalence of SLE are increasing, and the disease is associated with a mortality rate two to three times higher than that of the general population. Despite the availability of treatments, over half of SLE patients demonstrate poor medication adherence. These challenges highlight the critical importance of early detection and timely intervention.¹⁸

The NHIS dataset has been widely applied in recent literature: Kim et al.¹⁷ estimated direct healthcare costs for newly diagnosed patients, Han et al.¹⁹ assessed cardiovascular risks, and Jang et al.²⁰ addressed outliers in longitudinal cost data using multivariate linear models. Building on these studies, our research aims to identify factors associated with hospitalization frequency in severe SLE patients by analyzing annual hospitalization counts over a 10-year period (2008–2017).

2.1 Hospitalization variable

The primary response variable is the annual number of hospitalizations. Figure 1(a) illustrates the distribution of hospitalizations aggregated across all years: 61.7% of the 17,736 observations are zero, which indicates substantial zero inflation. The non-zero subset (6814 observations) is highly skewed, with most counts concentrated at low values (Figure 1(b)).

Figure 1.

Histograms of the number of hospitalizations aggregated over time: (a) number of hospitalizations and (b) number of non-zero hospitalizations.

As shown in Figure 2, the overall mean change in hospitalization over the years shows a distinct pattern. The first year exhibits the highest overall mean at 2.87, which may be attributed to heightened attention during the initial phase. A sharp reduction is evident in the second year, where the mean plummets to 0.93. Thereafter, the mean gradually decreases with minimal fluctuations, suggesting a stabilization of hospitalization trends in the later years.

Figure 2.

Average hospitalization by period.

2.2 Explanatory variable

The key explanatory variables are gender (0 = male, 1 = female), age, and Charlson's comorbidity index (CCI). For analysis, age was log transformed to $\log (Age)$ , and CCI was scaled by dividing by 10. The CCI is a widely used measure that classifies comorbid conditions based on their potential impact on mortality risk, and is often employed to estimate the 10-year survival probability in patients with multiple comorbidities.²¹ Due to its capacity to quantify the severity of comorbid conditions, CCI is frequently applied in studies assessing health outcomes in patients with complex diseases such as SLE. Missing data due to dropout were assumed to be MAR.

The baseline characteristics of the explanatory variables are presented in Table 1. Hospitalization in severe SLE is a clinically significant and costly outcome that varies by patient demographics and comorbidities. The repeated measures, count-based nature of the outcome, and high proportion of zero counts necessitate a modeling framework capable of simultaneously accounting for longitudinal structure and zero inflation. Accordingly, this study identifies predictors of hospitalization frequency by applying a marginal means approach, as described in Section 6, to ensure that the results are directly interpretable for population-level clinical and policy decisions.

Table 1.

Baseline characteristics of the explanatory variables.

Variable	Proportion or mean	SD
Gender
Male	0.188	–
Female	0.812	–
Age	41.635	16.958
Charlson’s comorbidity index (CCI)/10	0.274	0.183

3. Literature review

In this section, we present a detailed description of statistical methods for analyzing zero-inflated count data. The discussion begins with the standard hurdle model and its extension for longitudinal studies, the HREM. We then review a marginalized hurdle model which offers a distinct approach by directly modeling the impact of covariates on the marginal mean response.

3.1 Hurdle model

The hurdle model, originally proposed by Mullahy,⁴ is a two-part mixture model designed for count data exhibiting excess zeros in cross-sectional studies. The model assumes that all zeroes are from one structural source, with the first part being a binary model that determines whether the count is zero or positive, while the second part models the positive counts using a zero-truncated Poisson or truncated NB distribution.

Let $y_{i}$ denote the observed count for subject $i (i = 1, \dots, N)$ where N is the total number of observations, and let $λ_{i}$ be the mean parameter of the Poisson distribution. The probability mass function of the Poisson hurdle model is given by:

$$
P(y_{i}) = \begin{cases} p_{i}, & \text{if } y_{i} = 0; \\ (1 - p_{i}) \dfrac{λ_{i}^{y_{i}} e^{-λ_{i}}}{y_{i}! \, (1 - e^{-λ_{i}})}, & \text{if } y_{i} = 1, 2, \dots \end{cases} \tag{1}
$$

where the model components are assumed as follows:

$$
\operatorname{logit}(p_{i}) = x_{i}^{T} γ, \qquad \log(λ_{i}) = x_{i}^{T} β,
$$

with $x_{i}$ denoting a $p \times 1$ covariate vector, and $γ$ and $β$ representing the corresponding parameter vectors for the zero and count components, respectively. Specifically, $γ$ represents the effect of covariates on the log-odds of $y_{i}$ being zero, and $β$ reflects the effect of the covariates on the Poisson mean.

The hurdle model is capable of addressing both zero inflation and zero deflation in count data. Because $p_{i}$ in (1) is the probability of a zero and $e^{-λ_{i}}$ is the zero probability of an untruncated Poisson distribution, the data exhibit zero inflation when $p_{i} > e^{-λ_{i}}$ and zero deflation when $p_{i} < e^{-λ_{i}}$. In the special case where $p_{i} = 0$, all zero counts are excluded and the model reduces to a truncated Poisson distribution.⁶
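The probability mass function in (1) and the comparison with the Poisson zero probability can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and numerical values are hypothetical, and `p_zero` is taken to be $P(y_{i} = 0)$ as in (1).

```python
import math

def hurdle_poisson_pmf(y, p_zero, lam):
    """P(Y = y) under the Poisson hurdle model (1); p_zero is P(Y = 0)."""
    if y == 0:
        return p_zero
    # Zero-truncated Poisson mass, scaled by the probability of a positive count.
    log_pois = y * math.log(lam) - lam - math.lgamma(y + 1)
    return (1.0 - p_zero) * math.exp(log_pois) / (1.0 - math.exp(-lam))

def zero_pattern(p_zero, lam):
    """Compare P(Y = 0) with the zero probability of an untruncated Poisson(lam)."""
    poisson_zero = math.exp(-lam)
    if p_zero > poisson_zero:
        return "zero-inflated"
    if p_zero < poisson_zero:
        return "zero-deflated"
    return "Poisson"

# Illustrative values only (not estimates from the Lupus data).
lam = 1.5
total = sum(hurdle_poisson_pmf(y, 0.6, lam) for y in range(200))
pattern_high = zero_pattern(0.6, lam)   # more zeros than Poisson(1.5) implies
pattern_low = zero_pattern(0.1, lam)    # fewer zeros than Poisson(1.5) implies
```

The same function covers both regimes because the zero probability is a free parameter rather than a mixture weight, which is the property that distinguishes the hurdle model from the ZIP model.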

3.2 Hurdle random effects model

Min and Agresti² proposed the HREM to accommodate longitudinal zero-inflated count data by incorporating subject-specific random effects. Let $y_{it}$ denote the count response for subject $i$ $(i = 1, \dots, N)$ at time point $t$ $(t = 1, \dots, n_{i})$, and let $x_{it}$ denote the corresponding covariates. We assume that the $y_{it}$ are conditionally independent given the subject-specific random effects $b_{i}$. The probability mass function of the HREM is specified as follows:

$$
P(y_{it}) = \begin{cases} p_{it}(b_{i1}), & \text{if } y_{it} = 0; \\ (1 - p_{it}(b_{i1})) \dfrac{λ_{it}(b_{i2})^{y_{it}} e^{-λ_{it}(b_{i2})}}{y_{it}! \, (1 - e^{-λ_{it}(b_{i2})})}, & \text{if } y_{it} = 1, 2, \dots \end{cases} \tag{2}
$$

where the model components are assumed as follows:

$$
\operatorname{logit}(p_{it}(b_{i1})) = x_{it1}^{T} γ^{c} + z_{it1}^{T} b_{i1}, \tag{3}
$$

$$
\log(λ_{it}(b_{i2})) = x_{it2}^{T} β^{c} + z_{it2}^{T} b_{i2}, \tag{4}
$$

with $x_{itk}$ and $z_{itk}$ for $k = 1, 2$ representing covariate vectors for the logistic and truncated count parts, respectively. The vectors $γ^{c}$ and $β^{c}$ are fixed effect parameters, and $b_{i} = (b_{i1}^{T}, b_{i2}^{T})^{T}$ is the random effects vector capturing the within-subject correlation. We assume that the random effects vectors $b_{i}$ are independently and identically distributed as:

$$
b_{i} \overset{iid}{\sim} N(0, Σ), \qquad Σ = \begin{bmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{bmatrix}, \tag{5}
$$

where $Σ$ is a positive-definite covariance matrix. The submatrices $Σ_{11}$ and $Σ_{22}$ represent the covariance structures of the random effects associated with the zero-inflation and count components, respectively. The off-diagonal submatrix $Σ_{12}$ captures the correlation between the random effects influencing the occurrence of zero and non-zero counts.

In many practical applications, a simple random intercept model provides sufficient flexibility for capturing subject-specific variability. Min and Agresti² considered the case where both random effects, $b_{i 1}$ and $b_{i 2}$ , are univariate. Accordingly, we set $z_{i t 1} = z_{i t 2} = 1$ , and specify the random effects covariance matrix $Σ$ as follows:

$$
Σ = \begin{pmatrix} σ_{1}^{2} & ρ σ_{1} σ_{2} \\ ρ σ_{1} σ_{2} & σ_{2}^{2} \end{pmatrix}. \tag{6}
$$

Note that Zhu et al.⁸ proposed incorporating a subject-specific positive-definite covariance matrix $Σ_{i}$, characterized by variances $σ_{i1}^{2}$ and $σ_{i2}^{2}$, in ZIP models with random effects. This flexible covariance structure allows for heterogeneous random effects within the context of ZIP and zero-inflated NB models. In contrast, Min and Agresti² considered a homogeneous random effects covariance matrix.

Estimating parameters in HREMs typically employs either likelihood-based or Bayesian approaches. Neelon et al.⁶ proposed a flexible Bayesian method for longitudinal zero-inflated count data that accommodates correlated random effects and uses prior information for effective inference. A key limitation of Bayesian methods, however, is their high computational cost, particularly when using Markov Chain Monte Carlo (MCMC) algorithms.

The likelihood-based approach presents a distinct computational challenge, as it requires integrating over the random effects. This issue is generally handled through approximation techniques such as the Laplace approximation,²² adaptive Gaussian quadrature,²³ or penalized quasi-Newton methods, which facilitate the approximation or direct maximization of the marginal likelihood. For the HREM, the marginal likelihood is

$$
\begin{aligned} L(θ; y) & = \prod_{i=1}^{N} L(θ; y_{i}) \\ & = \prod_{i=1}^{N} \int \prod_{t=1}^{n_{i}} p_{it}(b_{i1})^{I_{(y_{it}=0)}} \left\{ (1 - p_{it}(b_{i1})) \frac{g(y_{it}; λ_{it}(b_{i2}))}{1 - \exp(-λ_{it}(b_{i2}))} \right\}^{1 - I_{(y_{it}=0)}} f(b_{i}) \, d b_{i}, \end{aligned}
$$

where $g(y_{it}; λ_{it}(b_{i2}))$ is the Poisson probability mass function, $I_{A}$ is the indicator function for $A$, and $f(\cdot)$ is the multivariate normal density function for $N(0, Σ)$.

HREMs have been extended in several ways. First, to relax the assumption of normally distributed random effects, Molas and Lesaffre²⁴ proposed an h-likelihood approach that can accommodate a wider range of distributions. Second, to better account for unobserved heterogeneity, Park et al.²⁵ proposed a nonparametric Bayesian Poisson hurdle model, which uses nonparametric techniques to provide more flexibility in the random effects distribution. Third, addressing the standard Poisson distribution's assumption of equidispersion, Kang et al.⁹ extended the Poisson hurdle model with the CMP distribution. The CMP distribution is particularly useful because it includes an extra parameter to flexibly model underdispersion, equidispersion, or overdispersion, thereby improving the model's ability to capture the mean-variance relationship in real-world data.

3.3 Marginalized random effects model

While HREMs are useful for capturing subject-specific variability, they do not directly produce marginal means, which are often the primary focus for population-level inference. To address this, Lee et al.¹¹ proposed the MHREM, using a likelihood-based estimation method. These models provide a direct and interpretable way to analyze how covariates affect the marginal means of both the binary and non-zero count responses. The marginal model of the MHREM is specified as follows:

$$
P(y_{it} \mid x_{it}) = \begin{cases} p_{it}^{M}, & \text{if } y_{it} = 0; \\ (1 - p_{it}^{M}) \dfrac{(λ_{it}^{M})^{y_{it}} e^{-λ_{it}^{M}}}{y_{it}! \, (1 - e^{-λ_{it}^{M}})}, & \text{if } y_{it} = 1, 2, \dots \end{cases} \tag{7}
$$

The model components are assumed as follows:

$$
\operatorname{logit}(p_{it}^{M}) = x_{it1}^{T} γ^{M}, \tag{8}
$$

$$
\log(λ_{it}^{M}) = x_{it2}^{T} β^{M} + \log(\text{off}_{it}), \tag{9}
$$

where $γ^{M}$ and $β^{M}$ are, respectively, $p_{1} \times 1$ and $p_{2} \times 1$ vectors of marginal regression parameters, and $x_{it1}$ and $x_{it2}$ are the covariate vectors corresponding to $γ^{M}$ and $β^{M}$. The parameter vector $γ^{M}$ explains the effect of covariates on the log-odds of observing a zero count, while $β^{M}$ describes the effect of covariates on the expected count of positive values. Thus, $γ^{M}$ and $β^{M}$ provide an interpretable, subgroup-specific understanding of the data by describing the probability of a zero outcome and the parameter in the truncated Poisson distribution, respectively. The term $\text{off}_{it}$ is the offset used to account for differential exposure times.

The conditional hurdle model is extended by incorporating random effects to handle the serial correlation often present in repeated measurements. Specifically, let $b_{i} = (b_{i1}^{T}, b_{i2}^{T})^{T}$ denote the subject-specific random effects. This extension leads to the following conditional model:

$$
P(y_{it} \mid x_{it}, b_{i}) = \begin{cases} p_{it}^{c}(b_{i1}), & \text{if } y_{it} = 0; \\ (1 - p_{it}^{c}(b_{i1})) \dfrac{(λ_{it}^{c}(b_{i2}))^{y_{it}} e^{-λ_{it}^{c}(b_{i2})}}{y_{it}! \, (1 - e^{-λ_{it}^{c}(b_{i2})})}, & \text{if } y_{it} = 1, 2, \dots \end{cases} \tag{10}
$$

where $p_{it}^{c}(b_{i1})$ and $λ_{it}^{c}(b_{i2})$ are specified as follows:

$$
\operatorname{logit}(p_{it}^{c}(b_{i1})) = Δ_{it1} + z_{it1}^{T} b_{i1}, \tag{11}
$$

$$
\log(λ_{it}^{c}(b_{i2})) = Δ_{it2} + z_{it2}^{T} b_{i2}, \tag{12}
$$

where $z_{it1}$ and $z_{it2}$ are covariate vectors associated with the random effects $b_{i1}$ and $b_{i2}$, respectively. The assumption for $(b_{i1}^{T}, b_{i2}^{T})^{T}$ is the same as that in (5). The terms $Δ_{it1}$ and $Δ_{it2}$ are deterministic intercepts whose calculation is described below.

From Equations (8), (9), (11) and (12), the deterministic intercepts $Δ_{i t 1}$ and $Δ_{i t 2}$ are implicitly determined by the fixed effects $(γ^{M}, β^{M})$ and the covariance matrix $Σ$ . Consequently, $Δ_{i t 1}$ and $Δ_{i t 2}$ are derived using the following identities:

$$
p_{it}^{M} = E[p_{it}^{c}(b_{i1})], \qquad λ_{it}^{M} = E[λ_{it}^{c}(b_{i2})].
$$

Using these relationships, $Δ_{i t 1}$ and $Δ_{i t 2}$ are calculated iteratively using the Newton-Raphson algorithm.
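As an illustration of this step, the following sketch solves $p_{it}^{M} = E[p_{it}^{c}(b_{i1})]$ for $Δ_{it1}$ under the logit link (11) with a random intercept $b_{i1} \sim N(0, σ_{1}^{2})$. The use of Gauss-Hermite quadrature for the expectation, the function name, and the starting value are assumptions made for this example and are not taken from Lee et al.¹¹

```python
import numpy as np

def solve_delta_logit(p_marginal, sigma, n_nodes=40, tol=1e-10, max_iter=50):
    """Newton-Raphson for Delta with E[expit(Delta + b)] = p_marginal, b ~ N(0, sigma^2)."""
    # Gauss-Hermite rule for weight exp(-x^2), rescaled to a standard normal expectation.
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    z = np.sqrt(2.0) * x
    w = w / np.sqrt(np.pi)
    delta = np.log(p_marginal / (1.0 - p_marginal))  # start at the marginal logit
    for _ in range(max_iter):
        prob = 1.0 / (1.0 + np.exp(-(delta + sigma * z)))
        value = np.sum(w * prob) - p_marginal          # constraint residual
        slope = np.sum(w * prob * (1.0 - prob))        # derivative with respect to Delta
        step = value / slope
        delta -= step
        if abs(step) < tol:
            break
    return delta

# Illustrative values only.
delta_example = solve_delta_logit(p_marginal=0.6, sigma=1.2)
```

With `sigma = 0` the routine returns the marginal logit unchanged; for positive `sigma` the conditional intercept is larger in absolute value than the marginal logit, reflecting the usual attenuation of marginal effects relative to conditional ones.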

3.4 Classification of models

To summarize and clarify the models in this section, we present Table 2, which provides a hierarchical cross-classification of existing studies to identify the specific methodological gaps addressed by our work. We categorize the literature based on (i) the target of inference (overall mean effects versus latent class interpretations), (ii) the presence of random-effect marginalization, and (iii) the homogeneity or heterogeneity of the random-effect covariance matrix.

Table 2.

Classification of models for clustered zero-inflated count data.

Overall mean	Marginalized model	Heterogeneous random effects	Zero-inflated Poisson (ZIP) model	Hurdle model
No	No	No	Neelon et al.⁶	Min and Agresti²
No	No	Yes	Zhu et al.⁸	–
No	Yes	No	–	Lee et al.¹¹
No	Yes	Yes	–	–
Yes	No	No	–	–
Yes	No	Yes	Long et al.¹⁴	–
Yes	Yes	No	–	–
Yes	Yes	Yes	Lee et al.¹⁵	OMHREM

4. Overall marginalized hurdle random effects models

In this section, we adapt Long et al.¹³'s overall marginalized ZIP model to accommodate longitudinal zero-inflated count data within a hurdle model framework. Denote $y_{it}$ as the count response variable for subject $i$ at time point $t$ $(i = 1, \dots, N;\ t = 1, \dots, n_{i})$, and let $x_{it}$ be the covariates corresponding to $y_{it}$.

4.1 Proposed model

We extend the MHREM presented in Lee et al.¹¹ to account for the overall marginal effect of covariates, and refer to the resulting model as the OMHREM. This model captures both the Poisson mean and the probability of a zero count, as in traditional hurdle models, while enabling marginal interpretations of covariate effects.

The OMHREMs comprise three components: the marginal model, the dependence model, and the overall mean model. The marginal model adopts a standard Poisson hurdle structure and is the same as (7) in the marginalized random effects model. We also consider the model components in (8) and (9). This specification allows for a latent class interpretation of the likelihood by distinguishing between the zero-count group and the positive-count group.

To account for the serial correlation in repeated count data, we consider the dependence model (10) in the marginalized random effects model to incorporate subject-specific random effects. In our proposed model, the conditional model components are specified differently using generalized linear models:

$$
Φ^{-1}(p_{it}^{c}(b_{i})) = Δ_{it1} + z_{it1}^{T} b_{i1}, \tag{13}
$$

$$
\log(λ_{it}^{c}(b_{i})) = Δ_{it2} + z_{it2}^{T} b_{i2}, \tag{14}
$$

where $Φ^{-1}(\cdot)$ is the probit link function, that is, the inverse of the standard normal cumulative distribution function, and $z_{it1}$ and $z_{it2}$ are $q_{1} \times 1$ and $q_{2} \times 1$ covariate vectors, respectively. Note that model (13) for $p_{it}^{c}(b_{i})$ is a probit model, whereas model (11) is a logit model. The probit and logit models behave similarly; we adopt the probit link because it simplifies the calculation of $Δ_{it1}$.

The subject-specific random effects $b_{i} = (b_{i 1}^{T}, b_{i 2}^{T})^{T}$ are assumed to be independently distributed as follows:

$$
b_{i} \sim N(0, Σ_{i}), \qquad Σ_{i} = \begin{bmatrix} Σ_{i11} & Σ_{i12} \\ Σ_{i12}^{T} & Σ_{i22} \end{bmatrix}, \tag{15}
$$

where $Σ_{i}$ is a subject-specific positive-definite covariance matrix, $Σ_{i11}$ and $Σ_{i22}$ are $q_{1} \times q_{1}$ and $q_{2} \times q_{2}$ positive-definite submatrices, respectively, and $Σ_{i12}$ is a $q_{1} \times q_{2}$ matrix capturing the covariance between the random effects in the binary and Poisson mean components.

We consider a specific case with random intercepts $b_{i1}$ and $b_{i2}$ by setting $z_{it1} = z_{it2} = 1$, following the approach of Min and Agresti.² However, unlike the homogeneous covariance matrix assumed by Min and Agresti,² we specify the random effects covariance matrix $Σ_{i}$ using a variance-correlation decomposition:

$$
Σ_{i} = D_{i} R_{i} D_{i}, \tag{16}
$$

where

$$
D_{i} = \begin{pmatrix} σ_{i1} & 0 \\ 0 & σ_{i2} \end{pmatrix}, \qquad R_{i} = \begin{pmatrix} 1 & ρ_{i} \\ ρ_{i} & 1 \end{pmatrix}.
$$

This parameterization allows for heteroskedastic variances and subject-specific correlation while ensuring that $Σ_{i}$ remains positive definite. To model heterogeneity in the standard deviations (SDs) and correlation, we specify generalized linear models:

$$
\begin{aligned} \log(σ_{i1}) & = h_{i}^{T} ζ_{1}, \\ \log(σ_{i2}) & = h_{i}^{T} ζ_{2}, \\ \frac{1}{2} \log\left(\frac{1 + ρ_{i}}{1 - ρ_{i}}\right) & = w_{i}^{T} δ, \end{aligned}
$$

where $ζ_{1}$, $ζ_{2}$, and $δ$ are parameter vectors of dimensions $a \times 1$, $b \times 1$, and $c \times 1$, respectively. The covariate vectors $h_{i}$ and $w_{i}$ are subject-specific. Fisher's z-transformation is applied to the correlation parameter $ρ_{i}$ to ensure that it remains within the valid interval $(-1, 1)$.

Note that the random effects covariance matrix in our proposed model incorporates heterogeneous SDs $(σ_{i1}, σ_{i2})$ and a heterogeneous correlation $(ρ_{i})$. In contrast, the random effects covariance matrices in Min and Agresti² and Lee et al.¹¹ are homogeneous. Hence, this specific heterogeneous random effects covariance structure is unique to our approach.
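The variance-correlation decomposition in (16) can be illustrated with a short sketch that maps subject-level covariates to $Σ_{i}$. The function name and the numerical inputs are hypothetical; the sketch also assumes that the same covariate vector $h_{i}$ enters both SD models, so that $ζ_{1}$ and $ζ_{2}$ have the same length.

```python
import numpy as np

def random_effects_cov(h_i, w_i, zeta1, zeta2, delta):
    """Build Sigma_i = D_i R_i D_i from log-linear SD models and a Fisher-z correlation model."""
    sigma1 = np.exp(h_i @ zeta1)      # log(sigma_i1) = h_i' zeta_1
    sigma2 = np.exp(h_i @ zeta2)      # log(sigma_i2) = h_i' zeta_2
    rho = np.tanh(w_i @ delta)        # inverse of Fisher's z-transformation
    D = np.diag([sigma1, sigma2])
    R = np.array([[1.0, rho], [rho, 1.0]])
    return D @ R @ D

# Illustrative inputs only: an intercept plus one covariate in each model.
h_i = np.array([1.0, 0.5])
w_i = np.array([1.0, -1.0])
Sigma_i = random_effects_cov(h_i, w_i,
                             zeta1=np.array([0.1, 0.2]),
                             zeta2=np.array([-0.3, 0.4]),
                             delta=np.array([0.5, 0.2]))
eigenvalues = np.linalg.eigvalsh(Sigma_i)
```

Because the exponential keeps both SDs positive and the hyperbolic tangent keeps the correlation strictly between $-1$ and $1$, the resulting matrix is positive definite for any real-valued $ζ_{1}$, $ζ_{2}$, and $δ$, so the parameters can be optimized without constraints.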

From equations (8), (9), (13), and (14), we derive the following relationships between the marginal and conditional components:

$$
p_{it}^{M} = E(p_{it}^{c}(b_{i})), \tag{17}
$$

$$
λ_{it}^{M} = E(λ_{it}^{c}(b_{i})). \tag{18}
$$

Given that the parameters $Σ_{i}$, $β^{M}$, and $γ^{M}$ are specified, the conditional linear predictors $Δ_{it1}$ and $Δ_{it2}$ become deterministic. These can be expressed as:

$$
Δ_{it1} = \sqrt{1 + σ_{i1}^{2}} \, Φ^{-1}(p_{it}^{M}), \tag{19}
$$

$$
Δ_{it2} = x_{it2}^{T} β^{M} - \frac{1}{2} σ_{i2}^{2} + \log \text{off}_{it}, \tag{20}
$$

where $Φ^{-1}(\cdot)$ denotes the inverse cumulative distribution function of the standard normal distribution (i.e., the probit link function), and $\text{off}_{it}$ is the offset term. These expressions align the conditional model with the marginal specifications, ensuring consistent interpretation of covariate effects in both parts of the model. Note that $Δ_{it1}$ and $Δ_{it2}$ here are closed-form expressions; in contrast, $Δ_{it1}$ and $Δ_{it2}$ in the marginalized random effects model, specifically in equations (11) and (12), are not closed-form and must be calculated using the Newton-Raphson algorithm.
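The closed forms (19) and (20) can be checked numerically. The sketch below, which assumes random intercepts and uses hypothetical function names and illustrative values, computes $Δ_{it1}$ and $Δ_{it2}$ and then recovers the marginal quantities by averaging the conditional ones over the random effects with Gauss-Hermite quadrature.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()

def delta_probit(p_marginal, sigma1):
    """Equation (19): conditional probit intercept for a random intercept with SD sigma1."""
    return np.sqrt(1.0 + sigma1**2) * STD_NORMAL.inv_cdf(p_marginal)

def delta_log(lambda_marginal, sigma2):
    """Equation (20), with the linear predictor and offset folded into log(lambda_marginal)."""
    return np.log(lambda_marginal) - 0.5 * sigma2**2

def normal_expectation(func, sigma, n_nodes=60):
    """E[func(b)] for b ~ N(0, sigma^2) by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    values = [wk * func(sigma * np.sqrt(2.0) * xk) for xk, wk in zip(x, w)]
    return sum(values) / np.sqrt(np.pi)

# Illustrative values only.
p_m, lam_m, s1, s2 = 0.62, 1.8, 0.9, 0.7
d1, d2 = delta_probit(p_m, s1), delta_log(lam_m, s2)
p_back = normal_expectation(lambda b: STD_NORMAL.cdf(d1 + b), s1)   # should be close to p_m
lam_back = normal_expectation(lambda b: np.exp(d2 + b), s2)         # should be close to lam_m
```

The first identity uses the fact that a probit of a normal variate integrates to another probit, and the second uses the lognormal mean; neither requires iteration.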

Consider the overall marginal mean of $Y_{it}$, denoted as $μ_{it} = E(Y_{it})$, which is often of primary interest to researchers. From (17) and (18), the overall marginal mean $μ_{it}$ is given by:

$$
μ_{it} = \frac{1 - p_{it}^{M}}{1 - e^{-λ_{it}^{M}}} \, λ_{it}^{M}. \tag{21}
$$
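As a brief check of (21), which is a standard property of the zero-truncated Poisson distribution rather than a new result, the overall mean follows by summing over the positive counts in the marginal model (7):

$$
\begin{aligned} μ_{it} = E(Y_{it}) & = \sum_{y=1}^{\infty} y \, (1 - p_{it}^{M}) \frac{(λ_{it}^{M})^{y} e^{-λ_{it}^{M}}}{y! \, (1 - e^{-λ_{it}^{M}})} \\ & = \frac{1 - p_{it}^{M}}{1 - e^{-λ_{it}^{M}}} \sum_{y=0}^{\infty} y \, \frac{(λ_{it}^{M})^{y} e^{-λ_{it}^{M}}}{y!} = \frac{1 - p_{it}^{M}}{1 - e^{-λ_{it}^{M}}} \, λ_{it}^{M}. \end{aligned}
$$

The zero outcome contributes nothing to the sum, and the remaining series is the mean of an untruncated Poisson distribution with rate $λ_{it}^{M}$.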

Note that the overall marginal mean $μ_{it}$ is a function of the covariates and parameters associated with $p_{it}^{M}$ and $λ_{it}^{M}$. For the $j$th covariate $x_{it2j}$ in $x_{it2}$, where we assume $x_{it1} = x_{it2}$, the ratio of overall marginal means for a one-unit increase in $x_{it2j}$ is derived from (21) as:

$$
\frac{E(Y_{it} \mid x_{it2j} + 1, \tilde{x}_{it2})}{E(Y_{it} \mid x_{it2j}, \tilde{x}_{it2})} = \exp(β_{j}^{M}) \, \frac{1 + \exp(x_{it2j} γ_{j}^{M} + \tilde{x}_{it2}^{T} \tilde{γ}^{M})}{1 + \exp((x_{it2j} + 1) γ_{j}^{M} + \tilde{x}_{it2}^{T} \tilde{γ}^{M})} \, \frac{1 - e^{-\exp(x_{it2j} β_{j}^{M} + \tilde{x}_{it2}^{T} \tilde{β}^{M})}}{1 - e^{-\exp((x_{it2j} + 1) β_{j}^{M} + \tilde{x}_{it2}^{T} \tilde{β}^{M})}}, \tag{22}
$$

where $\tilde{x}_{it2}$ denotes all covariates after excluding $x_{it2j}$, and $\tilde{γ}^{M}$ and $\tilde{β}^{M}$ are the corresponding parameter vectors obtained by excluding the regression coefficients of $x_{it2j}$ from $γ^{M}$ and $β^{M}$, respectively.

From the result in (22), unless $γ_{j}^{M} = β_{j}^{M} = 0$ , the incidence density ratio varies across different levels of the covariates. Consequently, we cannot directly interpret the effect of a single covariate on the overall marginal mean $μ_{i t}$ . For this reason, it is desirable to adopt a model that explicitly characterizes the effect of covariates on the overall marginal mean. One such approach is the overall mean model, specified as follows:
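The dependence of this ratio on covariate levels is easy to see numerically. The following sketch evaluates (21) under the marginal models (8) and (9) without an offset, and compares the ratio for a one-unit increase at two different starting values; the coefficient values and function names are hypothetical.

```python
import math

def overall_mean(x, gamma, beta):
    """Overall mean (21) with logit(p) = x'gamma and log(lambda) = x'beta (no offset)."""
    eta_p = sum(a * b for a, b in zip(x, gamma))
    lam = math.exp(sum(a * b for a, b in zip(x, beta)))
    p_zero = 1.0 / (1.0 + math.exp(-eta_p))
    return (1.0 - p_zero) * lam / -math.expm1(-lam)

def mean_ratio(x, j, gamma, beta):
    """Ratio of overall means when covariate j increases by one unit."""
    x_plus = list(x)
    x_plus[j] += 1.0
    return overall_mean(x_plus, gamma, beta) / overall_mean(x, gamma, beta)

# Illustrative coefficients only: intercept plus one covariate.
gamma = [0.5, -0.8]
beta = [0.3, 0.4]
ratio_at_0 = mean_ratio([1.0, 0.0], 1, gamma, beta)
ratio_at_2 = mean_ratio([1.0, 2.0], 1, gamma, beta)
```

With these values the two ratios differ noticeably, so no single number summarizes the effect of the covariate on the overall mean under the standard marginal hurdle parameterization.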

Overall mean model : \log μ_{i t} = x_{i t 2}^{T} α,

(23)

where $α$ is the $p_{1} \times 1$ parameter vector representing the direct effect of covariates on the log of the marginal mean.

Based on models (21) and (23), we obtain the following relationship:

e^{x_{i t 2}^{T} α} = (1 - p_{i t}^{M}) \frac{λ_{i t}^{M}}{1 - e^{- λ_{i t}^{M}}} .

Hence, given $γ$ and $α$, $λ_{i t}^{M}$ is determined implicitly by this relationship.
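The relationship between the overall mean, $p_{i t}^{M}$, and $λ_{i t}^{M}$ has no closed-form solution for $λ_{i t}^{M}$. One way to solve it numerically is a scalar Newton-Raphson iteration, sketched below; the function names and the starting value are assumptions made for this illustration and are not taken from the paper.

```python
import math

def truncated_mean(lam):
    """Mean of a zero-truncated Poisson with rate lam: lam / (1 - exp(-lam))."""
    return lam / (-math.expm1(-lam))

def solve_lambda(mu, p_zero, tol=1e-12, max_iter=100):
    """Solve mu = (1 - p_zero) * lam / (1 - exp(-lam)) for lam > 0 by Newton-Raphson."""
    target = mu / (1.0 - p_zero)
    if target <= 1.0:
        # The zero-truncated Poisson mean always exceeds 1, so no root exists otherwise.
        raise ValueError("mu / (1 - p_zero) must be greater than 1")
    lam = target  # truncated_mean(lam) > lam, so this start lies to the right of the root
    for _ in range(max_iter):
        one_minus_e = -math.expm1(-lam)
        value = lam / one_minus_e
        slope = (one_minus_e - lam * math.exp(-lam)) / one_minus_e ** 2
        step = (value - target) / slope
        lam -= step
        if abs(step) < tol:
            break
    return lam

# Round trip with hypothetical values: build mu from a known rate, then recover the rate.
LAM_TRUE, P_ZERO = 2.0, 0.3
mu = (1.0 - P_ZERO) * truncated_mean(LAM_TRUE)
lam_recovered = solve_lambda(mu, P_ZERO)
print(lam_recovered)
```

Because the zero-truncated mean is increasing and convex in the rate, starting to the right of the root keeps the Newton iterates positive and monotonically decreasing.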

4.2 Estimation

Let $θ = (γ^{T}, α^{T}, ζ_{1}^{T}, ζ_{2}^{T}, δ^{T})^{T}$ . The likelihood function of the OMHREMs for $θ$ is given by:

\begin{aligned} L (θ; y) = \prod_{i = 1}^{N} L (θ; y_{i}) & = \prod_{i = 1}^{N} \int \prod_{t = 1}^{n_{i}} p_{i t}^{c} (b_{i})^{I_{(y_{i t} = 0)}} \left\{ (1 - p_{i t}^{c} (b_{i})) \frac{g (y_{i t}; λ_{i t}^{c} (b_{i}))}{1 - \exp (- λ_{i t}^{c} (b_{i}))} \right\}^{1 - I_{(y_{i t} = 0)}} f (b_{i}) d b_{i}, \end{aligned}

where $f (\cdot)$ denotes the bivariate normal density function with mean vector 0 and covariance matrix $Σ_{i}$. Specifically,

f (b_{i}) = (2 π)^{- 1} \exp \left\{ - (h_{i}^{T} ζ_{1} + h_{i}^{T} ζ_{2}) - \frac{1}{2} \log (1 - ρ_{i}^{2}) - \frac{1}{2} b_{i}^{T} Σ_{i}^{- 1} b_{i} \right\} .

The corresponding log-likelihood function is:

\begin{aligned} \log L (θ; y) = \sum_{i = 1}^{N} \log L (θ; y_{i}) & = \sum_{i = 1}^{N} \log \int \exp \Big[ \sum_{t = 1}^{n_{i}} \Big\{ I_{(y_{i t} = 0)} \log p_{i t}^{c} (b_{i}) + (1 - I_{(y_{i t} = 0)}) \Big( \log (1 - p_{i t}^{c} (b_{i})) \\ & \quad - λ_{i t}^{c} (b_{i}) + y_{i t} \log λ_{i t}^{c} (b_{i}) - \log y_{i t}! - \log (1 - e^{- λ_{i t}^{c} (b_{i})}) \Big) \Big\} \Big] f (b_{i}) d b_{i} . \end{aligned}

(24)

Because the integral in (24) does not have a closed-form solution, we employ Gauss-Hermite quadrature to numerically integrate out the random effects.
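To illustrate how Gauss-Hermite quadrature integrates out a normal random effect, the sketch below applies a 5-point rule to two univariate integrals with known closed forms: the log-normal mean and the standard probit-normal identity. This is a simplified, one-dimensional illustration under assumed values of the random-effect SD and linear predictors; the integral in (24) is bivariate and would require a product rule over both random effects.

```python
import math

SQRT_PI = math.sqrt(math.pi)
SQRT_10 = math.sqrt(10.0)

# Exact nodes and weights of the 5-point Gauss-Hermite rule for the weight exp(-x^2).
_X1 = math.sqrt((5.0 - SQRT_10) / 2.0)
_X2 = math.sqrt((5.0 + SQRT_10) / 2.0)
GH_NODES = (-_X2, -_X1, 0.0, _X1, _X2)
_W0 = 8.0 * SQRT_PI / 15.0
_W1 = SQRT_PI * (7.0 + 2.0 * SQRT_10) / 60.0
_W2 = SQRT_PI * (7.0 - 2.0 * SQRT_10) / 60.0
GH_WEIGHTS = (_W2, _W1, _W0, _W1, _W2)

def normal_expectation(g, sigma):
    """Approximate E[g(b)] for b ~ N(0, sigma^2) using the substitution b = sqrt(2) * sigma * x."""
    total = sum(w * g(math.sqrt(2.0) * sigma * x) for x, w in zip(GH_NODES, GH_WEIGHTS))
    return total / SQRT_PI

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical values used only for this check.
SIGMA, DELTA_1, DELTA_2 = 0.5, -0.4, 0.3

# Log-normal mean: E[exp(DELTA_2 + b)] = exp(DELTA_2 + sigma^2 / 2).
approx_count = normal_expectation(lambda b: math.exp(DELTA_2 + b), SIGMA)
exact_count = math.exp(DELTA_2 + SIGMA ** 2 / 2.0)

# Probit-normal identity: E[Phi(DELTA_1 + b)] = Phi(DELTA_1 / sqrt(1 + sigma^2)).
approx_zero = normal_expectation(lambda b: std_normal_cdf(DELTA_1 + b), SIGMA)
exact_zero = std_normal_cdf(DELTA_1 / math.sqrt(1.0 + SIGMA ** 2))

print(approx_count, exact_count, approx_zero, exact_zero)
```

In practice more nodes are typically used, and the number of function evaluations grows as the product of the node counts across dimensions.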

Maximizing the log-likelihood with respect to $θ$ results in the following likelihood equations:

\begin{aligned} \frac{\partial \log L (θ; y)}{\partial γ} = \sum_{i = 1}^{N} \frac{1}{L (θ; y_{i})} \int L (θ, b_{i}; y_{i}) \sum_{t = 1}^{n_{i}} \left\{ I_{(y_{i t} = 0)} - Φ (Δ_{i t 1} + b_{i 1}) \right\} \frac{ϕ (Δ_{i t 1} + b_{i 1})}{Φ (Δ_{i t 1} + b_{i 1}) (1 - Φ (Δ_{i t 1} + b_{i 1}))} \frac{\partial Δ_{i t 1}}{\partial γ} f (b_{i}) d b_{i}, \end{aligned}

(25)

\begin{aligned} \frac{\partial \log L (θ; y)}{\partial α} = \sum_{i = 1}^{N} \frac{1}{L (θ; y_{i})} \int L (θ, b_{i}; y_{i}) \sum_{t = 1}^{n_{i}} (1 - I_{(y_{i t} = 0)}) \left( y_{i t} - \frac{λ_{i t}^{c} (b_{i})}{1 - e^{- λ_{i t}^{c} (b_{i})}} \right) \frac{\partial Δ_{i t 2}}{\partial α} f (b_{i}) d b_{i}, \end{aligned}

(26)

\begin{aligned} \frac{\partial \log L (θ; y)}{\partial ζ_{1 l}} = \sum_{i = 1}^{N} \frac{1}{L (θ; y_{i})} \int L (θ, b_{i}; y_{i}) \\ \times \Big[ \sum_{t = 1}^{n_{i}} \left\{ I_{(y_{i t} = 0)} - Φ (Δ_{i t 1} + b_{i 1}) \right\} \frac{ϕ (Δ_{i t 1} + b_{i 1})}{Φ (Δ_{i t 1} + b_{i 1}) (1 - Φ (Δ_{i t 1} + b_{i 1}))} \frac{\partial Δ_{i t 1}}{\partial ζ_{1 l}} - h_{i l} - \frac{1}{2} b_{i}^{T} \frac{\partial Σ_{i}^{- 1}}{\partial ζ_{1 l}} b_{i} \Big] f (b_{i}) d b_{i}, \end{aligned}

(27)

\begin{aligned} \frac{\partial \log L (θ; y)}{\partial ζ_{2 l}} & = \sum_{i = 1}^{N} \frac{1}{L (θ; y_{i})} \int L (θ, b_{i}; y_{i}) \Big[ \sum_{t = 1}^{n_{i}} (1 - I_{(y_{i t} = 0)}) \Big( y_{i t} - \frac{λ_{i t}^{c} (b_{i})}{1 - e^{- λ_{i t}^{c} (b_{i})}} \Big) \frac{\partial Δ_{i t 2}}{\partial ζ_{2 l}} - h_{i l} - \frac{1}{2} b_{i}^{T} \frac{\partial Σ_{i}^{- 1}}{\partial ζ_{2 l}} b_{i} \Big] f (b_{i}) d b_{i}, \\ \frac{\partial \log L (θ; y)}{\partial δ_{l}} & = \sum_{i = 1}^{N} \frac{1}{L (θ; y_{i})} \int L (θ, b_{i}; y_{i}) \Big( \frac{ρ_{i}}{1 - ρ_{i}^{2}} \frac{\partial ρ_{i}}{\partial δ_{l}} - \frac{1}{2} b_{i}^{T} \frac{\partial Σ_{i}^{- 1}}{\partial δ_{l}} b_{i} \Big) f (b_{i}) d b_{i}, \end{aligned}

(28)

where

\begin{aligned} \frac{\partial Δ_{i t 1}}{\partial γ} & = \sqrt{1 + σ_{i 1}^{2}} \frac{P_{i t}^{M} (1 - P_{i t}^{M}) x_{i t 1}}{ϕ (Φ^{- 1} (P_{i t}^{M}))}, \frac{\partial Δ_{i t 2}}{\partial α} = x_{i t 2} \frac{1 - e^{- λ_{i t}^{M}}}{1 - (1 + λ_{i t}^{M}) e^{- λ_{i t}^{M}}}, \frac{\partial Δ_{i t 1}}{\partial ζ_{1 l}} = \frac{σ_{i 1}^{2}}{\sqrt{1 + σ_{i 1}^{2}}} h_{i l} Φ^{- 1} (P_{i t}^{M}), \\ \frac{\partial Δ_{i t 2}}{\partial ζ_{2 l}} & = - σ_{i 2}^{2} h_{i l}, \frac{\partial Σ_{i}^{- 1}}{\partial ζ_{1 l}} = \frac{\partial D_{i}^{- 1}}{\partial ζ_{1 l}} R_{i}^{- 1} D_{i}^{- 1} + D_{i}^{- 1} R_{i}^{- 1} \frac{\partial D_{i}^{- 1}}{\partial ζ_{1 l}}, \frac{\partial Σ_{i}^{- 1}}{\partial ζ_{2 l}} = \frac{\partial D_{i}^{- 1}}{\partial ζ_{2 l}} R_{i}^{- 1} D_{i}^{- 1} + D_{i}^{- 1} R_{i}^{- 1} \frac{\partial D_{i}^{- 1}}{\partial ζ_{2 l}}, \\ \frac{\partial ρ_{i}}{\partial δ_{l}} & = 4 w_{i l} \frac{\exp (2 w_{i}^{T} δ)}{{(1 + \exp (2 w_{i}^{T} δ))}^{2}}, \frac{\partial Σ_{i}^{- 1}}{\partial δ_{l}} = D_{i}^{- 1} \frac{\partial R_{i}^{- 1}}{\partial δ_{l}} D_{i}^{- 1}, \end{aligned}

(29)

with

\frac{\partial D_{i}^{- 1}}{\partial ζ_{1 l}} = \left( \begin{array}{cc} - \frac{h_{i l}}{σ_{i 1}} & 0 \\ 0 & 0 \end{array} \right), \frac{\partial D_{i}^{- 1}}{\partial ζ_{2 l}} = \left( \begin{array}{cc} 0 & 0 \\ 0 & - \frac{h_{i l}}{σ_{i 2}} \end{array} \right), \frac{\partial R_{i}^{- 1}}{\partial δ_{l}} = \left( \begin{array}{cc} \frac{2 ρ_{i}}{{(1 - ρ_{i}^{2})}^{2}} \frac{\partial ρ_{i}}{\partial δ_{l}} & - \frac{1 + ρ_{i}^{2}}{{(1 - ρ_{i}^{2})}^{2}} \frac{\partial ρ_{i}}{\partial δ_{l}} \\ - \frac{1 + ρ_{i}^{2}}{{(1 - ρ_{i}^{2})}^{2}} \frac{\partial ρ_{i}}{\partial δ_{l}} & \frac{2 ρ_{i}}{{(1 - ρ_{i}^{2})}^{2}} \frac{\partial ρ_{i}}{\partial δ_{l}} \end{array} \right) .

The second derivatives of the log-likelihood are analytically intractable. However, the sample covariance of the individual score functions provides a consistent estimator of the information matrix, relying only on first derivatives. Consequently, the Quasi-Newton method can be used to solve the likelihood equations. Unlike the classical Newton-Raphson method, the Quasi-Newton approach approximates the Hessian matrix using gradient information, thereby avoiding the need for exact second derivatives. Starting from an initial value $θ^{(0)}$, the estimate at iteration $g$ is updated as:

θ^{(g + 1)} = θ^{(g)} + {[H {(θ; y)}^{- 1} \frac{\partial \log L (θ; y)}{\partial θ}]}_{θ = θ^{(g)}},

where

H (θ; y) = \sum_{i = 1}^{N} \frac{\partial \log L (θ; y_{i})}{\partial θ} \frac{\partial \log L (θ; y_{i})}{\partial θ^{T}} .
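The update with $H (θ; y)$ built from the outer products of the individual scores can be sketched on a toy problem. The example below is not the OMHREM likelihood: it assumes a hypothetical two-group Poisson model with one log-rate per group, so that the per-unit scores are available in closed form. All function names and data values are illustrative assumptions.

```python
import math

def solve_2x2(a, v):
    """Solve a @ x = v for a 2x2 matrix a using Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * v[0] - a[0][1] * v[1]) / det,
            (a[0][0] * v[1] - a[1][0] * v[0]) / det]

def unit_score(theta, group, y):
    """Score contribution of one unit under y ~ Poisson(exp(theta[group]))."""
    score = [0.0, 0.0]
    score[group] = y - math.exp(theta[group])
    return score

def outer_product_update(theta, data):
    """One step theta + H^{-1} * (total score), with H the sum of score outer products."""
    scores = [unit_score(theta, g, y) for g, y in data]
    total = [sum(s[k] for s in scores) for k in range(2)]
    h = [[sum(s[j] * s[k] for s in scores) for k in range(2)] for j in range(2)]
    step = solve_2x2(h, total)
    return [theta[k] + step[k] for k in range(2)], step

def fit(data, theta0=(0.0, 0.0), tol=1e-10, max_iter=100):
    theta = list(theta0)
    for _ in range(max_iter):
        theta, step = outer_product_update(theta, data)
        if max(abs(d) for d in step) < tol:
            break
    return theta

# Hypothetical counts: group 0 has sample mean 2, group 1 has sample mean 3.
data = [(0, y) for y in (0, 0, 1, 1, 2, 2, 3, 3, 4, 4)]
data += [(1, y) for y in (0, 1, 2, 2, 3, 3, 4, 4, 5, 6)]

theta_hat = fit(data)
print(theta_hat)
```

For this toy model the maximum likelihood estimates are the logs of the group means, which gives a simple check on the iteration. A practical implementation would usually add step-size control, since the outer-product approximation can be poor far from the optimum.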

5. Simulation studies

A simulation study was conducted to evaluate the performance of the OMHREM's parameter estimation. A total of 500 datasets were generated under its framework, with varying sample sizes (300, 500, and 700) and a fixed number of four repeated measurements per individual ( $n_{i} = 4$ ). Data were generated based on the marginal and dependence structures specified in equations (7) and (10), and group and time were incorporated as covariates to evaluate the accuracy and robustness of the model.

For each time point $t = 1, \dots, n_{i}$, the linear predictors for the different components of the model are defined as follows:

\begin{aligned} x_{i t 1}^{T} γ & = γ_{0} + γ_{1} {Group}_{i} + γ_{2} {Time}_{i t}, \end{aligned}

(30)

\begin{aligned} x_{i t 2}^{T} α & = α_{0} + α_{1} {Group}_{i} + α_{2} {Time}_{i t}, \end{aligned}

(31)

\begin{aligned} h_{i}^{T} ζ_{1} & = ζ_{10}, h_{i}^{T} ζ_{2} = ζ_{20}, w_{i}^{T} δ = δ_{0}, \end{aligned}

(32)

where ${Time}_{i t} \sim N (0, 1)$ and ${Group}_{i}$ was a binary indicator (0 or 1), with approximately equal sample sizes across groups. The true parameter values employed for the simulation were given by:

γ = (0.7, - 0.5, - 0.7), α = (3, 0.1, 0.4), ζ_{10} = ζ_{20} = 0.1, and δ_{0} = 0.7 .

To evaluate how accurately the proposed model estimates parameters, we used the following criteria: the mean of the estimated parameters (MEAN), percent relative bias (PRB), average standard error (ASE), and empirical SD. These criteria are defined as:

\begin{aligned} MEAN (ξ) & = \frac{1}{M} \sum_{m = 1}^{M} {\hat{ξ}}_{m} \overset{\mathrm{def}}{=} \bar{\hat{ξ}}, \end{aligned}

\begin{aligned} PRB (ξ) & = \frac{\bar{\hat{ξ}} - ξ}{ξ} \times 100, \end{aligned}

\begin{aligned} SE (ξ) & = \frac{1}{M} \sum_{m = 1}^{M} se ({\hat{ξ}}_{m}), \end{aligned}

\begin{aligned} SD (ξ) & = Stdev ({\hat{ξ}}_{1}, \dots, {\hat{ξ}}_{M}), \end{aligned}

where $ξ$ denotes the true value of a model parameter, ${\hat{ξ}}_{m}$ is the parameter estimate obtained from the $m$th replication, $se ({\hat{ξ}}_{m})$ is its corresponding estimated SE, and $M = 500$ is the number of replications.

To provide an overall summary of the estimation performance across all parameters, we report the average of the absolute PRBs (APRBs), the ASEs, and the average of the SDs (ASD). These summary measures are calculated as follows:

APRB = \frac{1}{K} \sum_{l = 1}^{K} | PRB (ξ_{l}) |, ASE = \frac{1}{K} \sum_{l = 1}^{K} SE (ξ_{l}), ASD = \frac{1}{K} \sum_{l = 1}^{K} SD (ξ_{l}),

where $K$ is the number of model parameters ($K = 9$ under the specification in (30)–(32)).
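The summary measures are plain averages over the estimated parameters. As a minimal sketch, the snippet below applies them to the nine N = 300 entries of Table 3 (PRB, SE, and SD columns); the helper names are hypothetical, and the inputs are the rounded values printed in the table.

```python
def prb(mean_estimate, true_value):
    """Percent relative bias of a mean estimate."""
    return (mean_estimate - true_value) / true_value * 100.0

def average_abs(values):
    """Average of absolute values, used for APRB, ASE, and ASD."""
    return sum(abs(v) for v in values) / len(values)

# N = 300 columns of Table 3, in the order gamma_0..gamma_2, alpha_0..alpha_2, xi_10, xi_20, delta_0.
prb_n300 = [0.606, 1.424, 1.671, -1.071, 0.877, -0.121, -10.915, -16.714, -2.752]
se_n300 = [0.148, 0.158, 0.180, 0.123, 0.099, 0.052, 0.087, 0.058, 0.121]
sd_n300 = [0.156, 0.171, 0.180, 0.132, 0.149, 0.053, 0.087, 0.049, 0.106]

aprb = average_abs(prb_n300)
ase = average_abs(se_n300)
asd = average_abs(sd_n300)
print(round(aprb, 3), round(ase, 3), round(asd, 3))
```

Averaging over these nine parameters reproduces the APRB, ASE, and ASD reported for N = 300 in Table 3 to three decimals.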

Lastly, to evaluate the accuracy of the estimated random effects covariance matrix (here $Σ_{i} = Σ$), we first obtained the MLEs of the covariance parameters ($ξ$ and $δ$) for each simulated dataset. Then, the mean of these estimates was calculated in the same way as $\bar{\hat{ξ}}$. Finally, we calculated the Frobenius norm (FROB) using the mean of the estimated parameters ($\bar{\hat{ξ}}$ and $\bar{\hat{δ}}$), which is defined as:

\mathrm{tr} (Σ (\bar{\hat{ξ}}, \bar{\hat{δ}}) Σ^{- 1} - I)^{2},

where $Σ (\bar{\hat{ξ}}, \bar{\hat{δ}})$ is the random effects covariance matrix calculated using the mean of the estimated parameters ($\bar{\hat{ξ}}$ and $\bar{\hat{δ}}$), $Σ$ is the true covariance matrix, and $I$ is the identity matrix.
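A minimal sketch of this discrepancy measure for a 2 × 2 covariance matrix follows. It assumes that the displayed expression is read as the trace of the squared matrix, that each SD is the exponential of its linear predictor, and that the correlation is the hyperbolic tangent of its linear predictor, as in the derivative formulas of Section 4.2. The function names are hypothetical, and the inputs are the rounded N = 300 means from Table 3, so the result need not match the reported FROB exactly.

```python
import math

def covariance(log_sd1, log_sd2, atanh_rho):
    """Build Sigma = D R D from log-SDs and a tanh-scale correlation parameter."""
    s1, s2 = math.exp(log_sd1), math.exp(log_sd2)
    rho = math.tanh(atanh_rho)
    return [[s1 * s1, rho * s1 * s2], [rho * s1 * s2, s2 * s2]]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul_2x2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def discrepancy(sigma_hat, sigma_true):
    """tr{(sigma_hat @ inv(sigma_true) - I)^2}, one reading of the FROB criterion."""
    a = matmul_2x2(sigma_hat, inverse_2x2(sigma_true))
    d = [[a[0][0] - 1.0, a[0][1]], [a[1][0], a[1][1] - 1.0]]
    d2 = matmul_2x2(d, d)
    return d2[0][0] + d2[1][1]

sigma_true = covariance(0.1, 0.1, 0.7)          # true values used in the simulation
sigma_mean = covariance(0.089, 0.083, 0.681)    # rounded N = 300 means from Table 3

frob = discrepancy(sigma_mean, sigma_true)
print(1000.0 * frob)
```

The measure is zero when the two matrices coincide and grows as the estimated matrix departs from the true one.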

Table 3 summarizes the simulation results for MLE performance across sample sizes N = 300, 500, 700. The estimated parameters are generally close to their true values. As the sample size increases, the SEs, SDs, and PRBs of the estimates diminish, suggesting increased precision and lower bias. The overall performance metrics (APRB, ASE, and ASD) also show substantial decreases with larger sample sizes, indicating improved estimation accuracy. Furthermore, the Frobenius norms decrease as N increases, demonstrating that the estimated covariance matrices converge towards the true structure.

Table 3.

Simulation results for the OMHREM: mean estimate (MEAN), PRB, average SE, and SD of 500 estimates, together with the Frobenius norm (FROB), APRB, ASE, and ASD. True parameter values are shown in parentheses.

	N = 300		N = 500		N = 700
	Mean	PRB	Mean	PRB	Mean	PRB
	SE	SD	SE	SD	SE	SD
$γ_{0}$	0.704	0.606	0.692	−1.205	0.700	0.037
(0.7)	0.148	0.156	0.114	0.128	0.095	0.102
$γ_{1}$	−0.507	1.424	−0.498	−0.446	−0.503	0.678
(−0.5)	0.158	0.171	0.122	0.135	0.101	0.115
$γ_{2}$	−0.712	1.671	−0.698	−0.353	−0.708	1.125
(−0.7)	0.180	0.180	0.138	0.137	0.116	0.121
$α_{0}$	2.968	−1.071	2.973	−0.890	2.988	−0.393
(3.0)	0.123	0.132	0.094	0.102	0.077	0.093
$α_{1}$	0.101	0.877	0.098	−1.528	0.098	−1.564
(0.1)	0.099	0.149	0.076	0.126	0.060	0.119
$α_{2}$	0.400	−0.121	0.405	1.194	0.400	0.020
(0.4)	0.052	0.053	0.040	0.039	0.033	0.034
$ξ_{10}$	0.089	−10.915	0.094	−6.459	0.102	1.657
(0.1)	0.087	0.087	0.067	0.068	0.056	0.057
$ξ_{20}$	0.083	−16.714	0.093	−6.990	0.097	−2.965
(0.1)	0.058	0.049	0.045	0.038	0.038	0.032
$δ_{0}$	0.681	−2.752	0.679	−2.965	0.698	−0.280
(0.7)	0.121	0.106	0.091	0.075	0.077	0.066
1000 × FROB $(\hat{Σ})$	1.433		0.863		0.152
APRB	4.017		2.448		0.969
ASE (ASD)	0.114 (0.120)		0.087 (0.094)		0.073 (0.082)

PRB: percent relative bias; APRB: absolute percent relative bias; SE: standard error; ASE: average standard error; SD: standard deviation; OMHREM: overall marginalized hurdle random effects model; ASD: average standard deviation.

Table 4 summarizes the model convergence across sample sizes of N = 300, 500, and 700. The convergence rate is defined as the ratio of 500 successful convergences to the total number of simulations. For N = 300, a total of 930 runs were required, while 785 and 710 runs were needed for N = 500 and 700, respectively. This indicates that the convergence rate improves as the sample size increases, leading to more stable model performance.

Table 4.

The convergence rate of the overall marginalized hurdle random effects model (OMHREM), defined as the ratio of 500 converged runs to the total number of runs needed.

Sample size	300	500	700
Convergence	0.538	0.637	0.704
Number of runs	930	785	710

We conducted another simulation study to compare the proposed OMHREM with the standard hurdle random effects model (HREM).² The HREM was fitted using the glmmTMB R package, a widely used tool for such models. Model performance was evaluated based on the Akaike information criterion (AIC) and the maximized log-likelihood value (MLLV). As summarized in Table 5, the OMHREM consistently yielded higher MLLVs and lower AIC values across all sample sizes. These results indicate a better fit for the OMHREM when it is the data-generating model.

Table 5.

The average AIC and maximized log-likelihood were calculated over various sample sizes and 500 simulated datasets.

	OMHREM		glmmTMB
Sample size	AIC	Max. loglike	AIC	Max. loglike
300	5054.04	−2518.02	5106.017	−2545.01
500	8423.26	−4202.63	8484.91	−4234.46
700	11732.2	−5857.10	11836.17	−5910.09

Higher maximized loglikelihood and lower AIC scores indicate better model performance. AIC: Akaike information criterion; OMHREM: overall marginalized hurdle random effects model.
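As a quick arithmetic check on Table 5, the AIC can be recomputed from the maximized log-likelihood and the number of free parameters. The sketch below assumes nine parameters for the OMHREM, counted from (30)–(32): three in $γ$, three in $α$, and one each for the two variance components and the correlation.

```python
def aic(max_loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2.0 * n_params - 2.0 * max_loglik

# OMHREM columns of Table 5: sample size -> (reported AIC, maximized log-likelihood).
omhrem_table5 = {300: (5054.04, -2518.02), 500: (8423.26, -4202.63), 700: (11732.2, -5857.10)}

recomputed = {n: aic(loglik, 9) for n, (_, loglik) in omhrem_table5.items()}
print(recomputed)
```

The recomputed values agree with the reported OMHREM AICs up to rounding.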

Table 6.

Simulation results for the OMHREM with heteroskedastic random effects covariance: mean estimate (MEAN), PRB, average SE, and SD of 500 estimates, together with the Frobenius norm (FROB), APRB, ASE, and ASD. True parameter values are shown in parentheses.

	N = 500		N = 700
	Mean	PRB	Mean	PRB
	SE	SD	SE	SD
$γ_{0}$	0.701	0.208	0.703	0.459
(0.7)	0.113	0.120	0.094	0.098
$γ_{1}$	−0.497	−0.517	−0.506	1.115
(−0.5)	0.119	0.126	0.101	0.110
$γ_{2}$	−0.705	0.669	−0.700	0.043
(−0.7)	0.145	0.143	0.122	0.128
$α_{0}$	2.989	−0.380	3.000	0.011
(3.0)	0.085	0.101	0.071	0.086
$α_{1}$	0.114	13.681	0.098	−2.257
(0.1)	0.095	0.105	0.080	0.097
$α_{2}$	0.398	−0.592	0.401	0.152
(0.4)	0.036	0.037	0.030	0.033
$ξ_{10}$	−0.220	9.758	−0.210	5.102
(−0.2)	0.107	0.111	0.089	0.088
$ξ_{11}$	0.117	16.741	0.113	12.565
(0.1)	0.147	0.150	0.123	0.122
$ξ_{20}$	−0.113	12.517	−0.106	6.476
(−0.1)	0.060	0.059	0.050	0.048
$ξ_{21}$	−0.192	−3.939	−0.196	−1.89
(−0.2)	0.079	0.076	0.066	0.061
$δ_{0}$	0.702	0.238	0.707	0.995
(0.7)	0.102	0.106	0.085	0.092
1000 × FROB $(\hat{Σ})$	2.869		1.02
APRB	5.386		2.825
ASE (ASD)	0.099 (0.103)		0.083 (0.088)

PRB: percent relative bias; APRB: absolute percent relative bias; SE: standard error; ASE: average standard error; SD: standard deviation; ASD: average standard deviation; OMHREM: overall marginalized hurdle random effects model.

It should be noted, however, that while the OMHREM and HREM are members of the same hurdle model family, they employ different likelihood formulations. Therefore, the AIC and log-likelihood comparisons presented here should be interpreted with caution as approximate indicators of empirical fit, rather than an exact information-theoretic comparison across equivalent likelihoods.

Further simulation studies were conducted to evaluate the performance of the OMHREMs under the same marginal model specifications, (30) and (31), but with heteroskedastic random effects covariances. As in the homogeneous case, we generated 500 datasets from OMHREMs incorporating heterogeneous random effects covariance structures, specified as follows:

\begin{aligned} h_{i}^{T} ξ_{1} & = ξ_{10} + {group}_{i} ξ_{11}, \end{aligned}

\begin{aligned} h_{i}^{T} ξ_{2} & = ξ_{20} + {group}_{i} ξ_{21}, \end{aligned}

where

(ξ_{10}, ξ_{11}, ξ_{20}, ξ_{21}) = (- 0.2, 0.1, - 0.1, - 0.2) .

Table 6 summarizes the simulation results for the MLE performance of OMHREMs with heterogeneous $Σ_{i}$ across sample sizes of N = 500 and 700. The estimated parameters are generally close to their true values, with accuracy improving as the sample size increases. The SE, SD, and PRB, along with overall performance metrics (APRB, ASE, and ASD), all tend to decrease with larger sample sizes, further confirming enhanced estimation precision. Additionally, the Frobenius norms decrease as N increases, suggesting that the estimated covariance matrices converge towards the true structures.

6. Analysis of Lupus data

This study examines the number of hospitalizations as a longitudinal response variable to identify factors that influence hospitalization frequency among patients with severe SLE, using data that are heavily zero-inflated.

6.1 Model fit

The OMHREMs are compared to existing models where the covariance structure is allowed to vary based on gender, age, and CCI for each subject over time. Table 7 details the models used in this comparison. Model 0 is an OMHREM with a homogeneous random effects covariance matrix. Models 1–3 are OMHREMs with random effects covariance structures that depend on gender, gender and age, and gender and CCI, respectively. Model 4 is the overall MZIPREM with a homogeneous random effects covariance matrix, as proposed by Lee et al.¹⁵

Table 7.
The models for $h_{i}^{T} ξ_{1}$, $h_{i}^{T} ξ_{2}$, and $w_{i}^{T} δ$ are presented for the SLE data.

Model	Description	$h_{i}^{T} ξ_{1}$ and $h_{i}^{T} ξ_{2}$	$w_{i}^{T} δ$
Model 0	OMHREM	$ξ_{j 0}$	$δ_{0}$
Model 1	OMHREM	$ξ_{j 0} + ξ_{j 1} {gender}_{i}$	$δ_{0}$
Model 2	OMHREM	$ξ_{j 0} + ξ_{j 1} {gender}_{i}$	$δ_{0} + δ_{1} \log (Age)_{i}$
Model 3	OMHREM	$ξ_{j 0} + ξ_{j 1} {gender}_{i}$	$δ_{0} + \frac{δ_{1} {CCI}_{i}}{10}$
Model 4	MZIPREM	$ξ_{j 0}$	$δ_{0}$

OMHREM: overall marginalized hurdle random effects model; SLE: systemic lupus erythematosus; MZIPREM: marginalized zero-inflated Poisson random effects model.

Table 8 shows the maximized log-likelihood and AIC for each model. The OMHREMs (Models 0–3) demonstrate superior performance compared to Model 4. Among the OMHREMs, Model 3 has the lowest AIC, suggesting it provides the best fit to the data.

Table 8.

Maximized log-likelihoods and Akaike information criterion (AIC) values for Models 0–4.

Model	0	1	2	3	4
Max.loglik	−25114.068	−25110.151	−25105.445	−25105.428	−25118.504
AIC	50250.136	50248.302	50238.890	50238.856	50259.008

Table 9 presents the MLEs and their associated SEs for Models 0–4. The results indicate that most covariates across all models are statistically significant at the 5% level, suggesting strong associations between age, gender, CCI, and SLE hospitalizations. It should be noted that Model 4 is an MZIPREM, where the zero model distinguishes between structural zeros and Poisson zeros. Unlike conditional models, all models presented here directly estimate marginal population-average effects. By better accounting for complex between-subject heterogeneity and within-subject dependence, this innovative framework provides clinicians with more readily interpretable metrics.

Table 9.

Maximum likelihood estimates of the parameters of Models 0–4 for SLE data. Standard errors are displayed in parentheses.

	Model 0	Model 1	Model 2	Model 3	Model 4
Zero model
$γ_{0}$ (Int.)	−1.361^∗ (0.179)	−1.404^∗ (0.181)	−1.211^∗ (0.244)	−1.343^∗ (0.176)	−1.565^∗ (0.227)
$γ_{1}$ (log(Age))	0.520^∗ (0.049)	0.529^∗ (0.050)	0.482^∗ (0.048)	0.520^∗ (0.049)	0.423^∗ (0.063)
$γ_{2}$ (Gender)	0.200^∗ (0.050)	0.232^∗ (0.054)	0.209^∗ (0.049)	0.215^∗ (0.040)	0.213^∗ (0.064)
$γ_{3}$ (CCI/10)	−1.168^∗ (0.112)	−1.218^∗ (0.112)	−1.175^∗ (0.115)	−1.269^∗ (0.123)	−0.986^∗ (0.142)
Overall model
$α_{0}$ (Int.)	1.858^∗ (0.117)	1.897^∗ (0.121)	1.732^∗ (0.119)	1.882^∗ (0.118)	1.938^∗ (0.129)
$α_{1}$ (log(Age))	−0.304^∗ (0.031)	−0.317^∗ (0.032)	−0.272^∗ (0.032)	−0.308^∗ (0.031)	−0.516^∗ (0.033)
$α_{2}$ (Gender)	−0.095^∗ (0.033)	−0.123^∗ (0.040)	−0.128^∗ (0.030)	−0.135^∗ (0.032)	−0.167^∗ (0.031)
$α_{3}$ (CCI/10)	0.863^∗ (0.079)	0.978^∗ (0.078)	0.909^∗ (0.079)	0.943^∗ (0.086)	1.218^∗ (0.074)
Random effects covariance
$ξ_{10}$ (Int.)	−0.592^∗ (0.031)	−0.427^∗ (0.067)	−1.454^∗ (0.241)	−0.792^∗ (0.056)	−0.596^∗ (0.049)
$ξ_{11}$ (Gender)		−0.202^∗ (0.076)	0.239^∗ (0.066)	0.700^∗ (0.151)
$ξ_{20}$ (Int.)	−0.182^∗ (0.024)	−0.129^∗ (0.050)	−0.649^∗ (0.172)	−0.214^∗ (0.039)	−0.155^∗ (0.022)
$ξ_{21}$ (Gender)		−0.064 (0.055)	0.125^∗ (0.046)	0.105 (0.104)
$δ_{0}$ (Int.)	−0.704^∗ (0.046)	−0.770^∗ (0.095)	−0.781^∗ (0.101)	−0.767^∗ (0.102)	−0.086 (0.049)
$δ_{1}$ (log(Age) or CCI/10)			0.091 (0.111)	0.075 (0.112)

CCI: Charlson's comorbidity index; SLE: systemic lupus erythematosus.

^∗ Statistically significant at the 5% level.

Based on the model selection criteria, Model 3 was identified as the optimal model for our analysis. The estimated equations for the marginal probability of non-hospitalization (a zero count) and the overall marginal mean, derived from Model 3, are as follows:

\begin{aligned} \mathrm{logit} ({\hat{p}}_{i t}^{M}) & = - 1.343 + 0.520 \log ({Age}_{i t}) + 0.215 {Gender}_{i} - 1.269 {CCI}_{i t} / 10 \end{aligned}

(33)

\begin{aligned} \log ({\hat{μ}}_{i t}^{M}) & = 1.882 - 0.308 \log ({Age}_{i t}) - 0.135 {Gender}_{i} + 0.943 {CCI}_{i t} / 10 \end{aligned}

(34)

Based on (33) and (34), the estimated probability of non-hospitalization increased with age, was higher for females than males, and decreased with CCI. Conversely, the estimated log of the overall mean count decreased with age, was lower for females than males, and increased with CCI.
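Because the overall mean model (23) is log-linear, exponentiating a coefficient of the overall model gives the multiplicative change in the overall mean hospitalization count for a one-unit increase in that covariate, with the other covariates held fixed. The sketch below applies this to the Model 3 estimates of the overall model in Table 9; the dictionary names are hypothetical.

```python
import math

# Model 3 overall-model estimates from Table 9 (intercept omitted).
alpha_hat = {"log(Age)": -0.308, "Gender": -0.135, "CCI/10": 0.943}

# Ratio of overall marginal means per one-unit increase in each covariate.
mean_ratios = {name: math.exp(coef) for name, coef in alpha_hat.items()}

for name, ratio in mean_ratios.items():
    print(f"{name}: ratio of overall means = {ratio:.3f}")
```

A ratio below 1 corresponds to a lower overall mean count and a ratio above 1 to a higher one, so the signs discussed above translate directly into multiplicative effects.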

The estimated SD of the random effects in the logit model was 0.912 for women $({\hat{σ}}_{i 1} = \exp (- 0.792 + 0.7))$ and 0.453 for men $({\hat{σ}}_{i 1} = \exp (- 0.792))$ . For the log-linear model, the corresponding estimates were 0.897 for women $({\hat{σ}}_{i 2} = \exp (- 0.214 + 0.105))$ and 0.807 for men $({\hat{σ}}_{i 2} = \exp (- 0.214))$ .

For a subject with an average log-transformed age (log Age) of 3.637, the estimated correlation between the random effects was calculated as 0.135 using the following formula:

{\hat{ρ}}_{i} = \frac{\exp \{2 (- 0.767 + 0.075 \bar{\log Age})\} - 1}{\exp \{2 (- 0.767 + 0.075 \bar{\log Age})\} + 1} .

Based on equation (16), the estimated random effects covariance matrices for women and men with $\bar{\log Age}$ are:

{\hat{Σ}}_{females} = (\begin{array}{cc} 0.832 & 0.111 \\ 0.111 & 0.804 \end{array}), {\hat{Σ}}_{males} = (\begin{array}{cc} 0.205 & 0.049 \\ 0.049 & 0.652 \end{array}) .
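The reported SDs and covariance matrices can be retraced from the Model 3 estimates in Table 9. The sketch below is a hedged illustration: it takes the correlation 0.135 as given from the text rather than recomputing it, and the function names are hypothetical. Off-diagonal entries may differ from the reported ones in the third decimal because the correlation is rounded.

```python
import math

def random_effect_sd(intercept, gender_effect, female):
    """SD of a random effect: exponential of its linear predictor (female = 1, male = 0)."""
    return math.exp(intercept + gender_effect * female)

def covariance_matrix(sd1, sd2, rho):
    return [[sd1 ** 2, rho * sd1 * sd2], [rho * sd1 * sd2, sd2 ** 2]]

RHO_HAT = 0.135  # correlation as reported in the text, taken as given

sd1_f = random_effect_sd(-0.792, 0.700, 1)   # zero part, women
sd1_m = random_effect_sd(-0.792, 0.700, 0)   # zero part, men
sd2_f = random_effect_sd(-0.214, 0.105, 1)   # count part, women
sd2_m = random_effect_sd(-0.214, 0.105, 0)   # count part, men

sigma_f = covariance_matrix(sd1_f, sd2_f, RHO_HAT)
sigma_m = covariance_matrix(sd1_m, sd2_m, RHO_HAT)
print(sigma_f, sigma_m)
```

The diagonal entries are the squared SDs, so the larger female variances follow directly from the positive gender coefficients in the variance models.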

The results indicate that the random effects covariance matrices differ between genders. Females show larger variances for both the zero and positive count components, suggesting greater subject-level variation in both outcomes. Males show smaller variances in both components and a weaker covariance between them.

7. Conclusion

This paper proposes overall marginalized Poisson hurdle random effects models (OMHREMs) for the analysis of longitudinal zero-inflated count data. The OMHREM framework consists of three submodels: a marginal hurdle model capturing population-averaged responses for the binary zero outcome and the positive counts, a dependence model employing a hurdle structure with random effects to account for within-subject heterogeneity, and an overall mean model utilizing a loglinear structure to model the direct impact of covariates on the logarithm of the overall marginal mean. A key advantage of OMHREMs over conventional hurdle models is their capacity to provide direct population-level interpretations of the marginal mean, a feature absent in traditional approaches that restricts their ability to assess overall covariate effects. Furthermore, the random effects covariance matrix is allowed to be heteroskedastic, accommodating subject-specific variability.

Model parameters are estimated using a Quasi-Newton algorithm, with the marginal likelihood evaluated by Gauss-Hermite quadrature over the random effects. The resulting algorithm demonstrates both computational efficiency and robustness.

Simulation results demonstrate that as the sample size increases, the SEs, SDs, and PRBs of the estimates decrease, showing convergence toward the true parameter values. The decreasing Frobenius norms suggest improved accuracy and convergence towards the true covariance structure.

In the analysis of the SLE data, the OMHREM provided the best fit among the models considered. Within this framework, the log of the overall mean count was negatively associated with age, lower for females, and positively associated with the CCI. Conversely, the probability of non-hospitalization showed a positive association with age, was higher in females, and was negatively associated with the CCI.

Although Gauss-Hermite quadrature is frequently applied for marginal likelihood in low-dimensional settings, its computational cost grows exponentially with increasing dimensionality due to the rapid increase in quadrature points.¹¹ To overcome this limitation, alternative approaches such as Monte Carlo integration or extensions within a Bayesian framework utilizing MCMC methods for posterior inferences can be considered.

One limitation of our proposed model is the restrictive assumption of the Poisson regression model that the mean and variance of the count variable are equal. When overdispersion is present, the Poisson model may lead to underestimated SEs. To address this, alternative distributions such as the negative binomial (NB) can be used to allow for greater variability.²⁶ Addressing this limitation is ongoing work.

Footnotes

ORCID iD

Keunbaik Lee

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (KRF) funded by the Ministry of Education, Science and Technology (RS-2024-00407300 for Keunbaik Lee).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Lambert. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992; 34: 1–14.

2. Min, Agresti. Random effect models for repeated measures of zero-inflated count data. Stat Modelling 2005; 5: 1–19.

3. Feng. A comparison of zero-inflated and hurdle models for modeling zero-inflated count data. J Stat Distrib Appl 2021; 8: 8. doi:10.1186/s40488-021-00121-4

4. Mullahy. Specification and testing of some modified count data models. J Econom 1986; 33: 341–365.

5. Gilthorpe, Frydenberg, Cheng, et al. Modelling count data with excessive zeros: the need for class prediction in zero-inflated models and the issue of data generation in choosing between zero-inflated and generic mixture models for dental caries data. Stat Med 2009; 28: 3539–3553.

6. Neelon, O’Malley, Normand SLT. A Bayesian model for repeated measures zero-inflated count data with application to outpatient psychiatric service use. Stat Modelling 2010; 10: 421–439.

7. Buu, Tan, et al. Statistical models for longitudinal zero-inflated count data with applications to the substance abuse field. Stat Med 2012; 31: 257–278.

8. Zhu, Luo, DeSantis. Zero-inflated count models for longitudinal measurements with heterogeneous random effects. Stat Methods Med Res 2015; 26: 1774–1786.

9. Kang, Gaskins, Levy, et al. A longitudinal Bayesian mixed effects model with hurdle Conway-Maxwell-Poisson distribution. Stat Med 2020; 40: 1336–1356.

10. Hall, Zhang. Marginal models for zero inflated clustered data. Stat Modelling 2004; 4: 161–180.

11. Lee, Joo, Song, et al. Analysis of zero-inflated clustered count data: a marginalized model approach. Comput Stat Data Anal 2011; 55: 824–837.

12. Mwalili, Lesaffre, Declerck. The zero-inflated negative binomial regression model with correction for misclassification: an example in caries research. Stat Methods Med Res 2008; 17: 123–139.

13. Long, Preisser, Herring, et al. A marginalized zero-inflated Poisson regression model with overall exposure effects. Stat Med 2014; 33: 5151–5165.

14. Long, Preisser, Herring, et al. A marginalized zero-inflated Poisson regression model with random effects. Journal of the Royal Statistical Society: Series C 2015; 64: 815–830.

15. Lee, Jang, Dey. Overall marginalized models for longitudinal zero-inflated count data. arXiv preprint arXiv:2511.22223, 2025.

16. Fava, Petri. Systemic lupus erythematosus: diagnosis and clinical management. J Autoimmun 2019; 96: 1–13.

17. Kim, Jang, Cho, et al. Cost-of-illness changes before and after the diagnosis of systemic lupus erythematosus: a nationwide, population-based observational study in Korea. Rheumatology 2025; 64: 180–187.

18. Chen, Zhang, Xiang, et al. Post-traumatic stress disorder and risk of systemic lupus erythematosus: Meta-analysis and Mendelian randomization study. J Psychosom Res 2025; 190: 112049.

19. Han, Cho, Kang, et al. Cardiovascular disease risk in Korean patients with systemic lupus erythematosus compared to diabetes mellitus and the general population. Sci Rep 2025; 15: 3208.

20. Jang, Rhee, Cho, et al. Analysis of longitudinal lupus data using multivariate t-linear models. Stat Med 2025; 44: e10248.

21. Charlson, Pompei, Ales, et al. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis 1987; 40: 373–383.

22. Tierney, Kadane. Accurate approximations for posterior moments and marginal densities. J Am Stat Assoc 1986; 81: 82–86.

23. Pinheiro, Bates. Approximations to the log-likelihood function in the nonlinear mixed-effects model. J Comput Graph Stat 1995; 4: 12–35.

24. Molas, Lesaffre. Hurdle models for multilevel zero-inflated data via h-likelihood. Stat Med 2010; 29: 3294–3310.

25. Park, Sim, Yang, et al. Nonparametric Bayesian Poisson hurdle random effects model: an application to temperature-suicide association study. Environ Ecol Stat 2025; 32: 579–601.

26. Zhang, Kano, Tani, et al. Hurdle modeling for defect data with excess zeros in steel manufacturing process. IFAC-PapersOnLine 2018; 51: 375–380.