Developing Two Penalized Spline Estimators for Semiparametric Poisson and Zero-inflated Poisson Models: Simulation and Application

Abstract

This article proposes two estimators for two semiparametric count regression models, namely the semiparametric partially Poisson (SPPO) and semiparametric partially zero-inflated Poisson (SPZIP) models, based on penalized smoothing (Ps) spline and P-spline (Pb) estimation, to address the common issue of nonparametric relationships between the response variable and covariates. Additionally, the SPZIP model incorporates a zero-inflation component to handle excess zeros in count data. Through extensive Monte Carlo simulations, we evaluate the performance of the proposed penalized spline estimators by comparing them against traditional parametric estimators using multiple statistical criteria, including the Akaike information criterion, Bayesian information criterion, deviance statistic, mean squared error and root mean squared error (RMSE). The results indicate that the proposed estimators are more efficient than the traditional parametric estimators. Moreover, the SPZIP and SPPO models consistently outperform the parametric (Poisson and zero-inflated Poisson) regression models, particularly in scenarios with high levels of zero inflation. Our findings highlight the practical utility of these models for analysing count data with excess zeros and nonparametric covariate effects. A real-life data application further illustrates the ability of the SPPO and SPZIP models to provide more accurate and adaptable statistical analysis in challenging data settings.

AMS Subject Classification: 62G08, 62J20, 62J05

Keywords

Biochemistry; semiparametric regression models; count data; proportion data; P-spline; excess zeros

1. Introduction

A model is a simplified representation of reality. Generally, models are classified as deterministic or probabilistic. The models used to examine the functional relationship between dependent and independent variables are known as regression models and can be classified into three types: parametric, nonparametric and semiparametric. The regression function represents this functional relationship. If the form of the regression function is known, the result is a parametric regression model. If the form is unknown, the result is a nonparametric regression model. Combining parametric and nonparametric components produces a semiparametric regression model, which is used when the parametric assumptions are not met or when a fully nonparametric model is not the most efficient choice.

One of the most popular parametric regression models is the linear model, which is used when the response variable follows a normal distribution. This model is predicated on several assumptions and a finite set of parameters, β. Since the response variable does not always follow a normal distribution, the linear model is not applicable in many practical scenarios. For instance, response variables may be measured on a count, binary, ordinal or non-negative scale, or be subject to other restrictions. Nelder and Wedderburn^[1] presented the concept of generalized linear models (GLMs) as a way to get around these problems. Thus, GLMs extend the linear model to such responses, provided that the distribution of the response variable is a member of the exponential family; they remain parametric regression models. Among parametric regression models, count regression models for data with excess zeros are some of the most commonly used GLMs because of their applications in many fields, such as insurance, public health, epidemiology and psychology; see, for instance, Abonazel et al.,^[2] Algamal et al.,^[3] Akram et al.,^[4] Dikheel and Jouda,^[5] Abdelwahab et al.,^[6] Zeeshan et al.^[7] and Shahsavari et al.^[8]

In recent years, there has been considerable interest in semiparametric regression models, which have a long history in statistics. The main reason they are considered is that sometimes the relationships between the response and explanatory variables are very heterogeneous within the same model because some of the relationships are linear, while others are more complex to classify. More explicitly, some variables are parametrically related, while others are nonparametrically related. Numerous semiparametric regression models exist, including the partially linear model (PLM), a significant generalization of the linear model that permits the majority of predictors to be modelled linearly while one enters the model nonparametrically and is a special case of the additive model. The generalized partial linear model (GPLM), one of the most popular semiparametric regression models, extends the PLM when the dependent variable follows any distribution within the exponential family except the normal distribution. In another sense, GPLMs are an extension of the GLMs by blending nonparametric models for some covariates. The GPLM has been applied in several areas of knowledge, such as toxicology, biology, sciences and statistics.

Recently, some literature has reviewed estimation and inference methods for the GPLM, which can handle both linear and nonlinear relationships simultaneously. For example, Green and Yandell^[9] introduced the semiparametric GLM, in which they added a nonparametric component to the linear predictor. Carroll et al.^[10] extended semiparametric GLMs to partially linear single-index models, enabling the modelling of intricate relationships. They also developed backfitting algorithms for estimation. Härdle et al.^[11] introduced kernel-based methods for estimating semiparametric GLMs and provided statistical tests for model selection. Wood^[12] applied semiparametric Poisson (PO) regression to model daily deaths in Chicago, demonstrating the practical use of these models, and also provided a comprehensive introduction to generalized additive models, a special case of semiparametric GLMs, using R. Lam et al.^[13] developed a semiparametric zero-inflated Poisson (ZIP) regression model using sieve maximum likelihood and B-splines to address the challenges of excess zeros in count data. Liang et al.^[14] studied semiparametric GLMs for longitudinal data and proposed an empirical likelihood-based inference method for GPLMs, providing a robust and efficient approach to statistical inference. Taylan et al.^[15] applied semiparametric GLMs to analyse survival data and focused on the theoretical foundations of parameter estimation for GPLMs, specifically using B-splines and continuous optimization techniques. De Vera^[16] introduced the semiparametric PO regression model; the parametric and nonparametric components in this model were estimated by penalized maximum likelihood and the backfitting algorithm. Manalaysay and Barrios^[17] presented a similar work focused on the principal component of the PO regression model, which is estimated via the backfitting algorithm. Yousof and Gad^[18] introduced a robust Bayesian framework for estimating and conducting inference within GPLMs, effectively integrating linear and nonlinear components to capture complex relationships in data. They focused on Bayesian estimation and inference of the model parameters using multivariate conjugate prior distributions under the squared error loss function. For more details about prior and posterior distributions, see Seliem et al.^[19] Further details can be found in the studies of Müller,^[20] Luts and Wand,^[21] Lukusa et al.,^[22] Fang et al.,^[23] Wurm and Rathouz,^[24] Ye et al.,^[24] Boente et al.^[25] and Boente et al.^[26]

Rahman et al.^[27] proposed efficient inference methods for GPLMs, a class of models that flexibly accommodate linear and nonlinear relationships. Their novel estimation approach achieves semiparametric efficiency, enhancing statistical accuracy. El-sayed et al.^[28] demonstrated the effectiveness of a novel B-spline Speckman estimator for PLMs. Their results showed improved performance compared to traditional methods, highlighting the advantages of using B-splines to estimate the nonparametric component. Abonazel and Gad^[29] introduced a robust partial residual estimation approach for semiparametric PLMs. Their results showed that this method outperforms traditional techniques, particularly in settings affected by outliers. For additional discussions on outlier treatment in different models, see, for example, Abonazel and Gad,^[29] Seliem,^[30] and Soliman et al.^[31]. Meanwhile, Ibacache-Pulgar et al.^[32] proposed the semiparametric zero-inflated negative binomial (ZINB) model, which extends the standard ZINB framework. Similarly, Araújo et al.^[33] examined determinants of academic performance among undergraduate business students, focusing on the number of failed courses. To analyse their data, they applied a semiparametric ZINB regression model, incorporating factors, such as employment status, dissatisfaction with affirmative action scholarships and the challenges of combining work with study. For more details, the studies of Shao and Wang,^[34] Prataviera et al., ^[35] Millard and Kanfer,^[36] Vasconcelos et al., ^[37] Cardozo et al.^[38] and Vasconcelos et al.^[39] can also be reviewed. To overcome the difficulties of evaluating proportional data with excessive zeros and intricate correlations, Seliem et al.^[40] suggested an innovative semiparametric zero-inflated Beta (SPZIBE) regression model via penalized smoothing (Ps) spline and P-spline (Pb) estimators. 
By accounting for both linear and nonlinear effects, this model provides a versatile and interpretable method that makes it possible to identify the elements causing zero inflation. The authors demonstrated the model’s superior performance through extensive simulations and real-life applications, showcasing its potential to outperform existing parametric models in terms of model fit and predictive accuracy. This research significantly contributes to the field of statistical modelling by providing a valuable tool for analysing various types of data, including those encountered in political science, economics, and social sciences. This work aims to discuss the semiparametric partially Poisson (SPPO) and the semiparametric partially zero-inflated Poisson (SPZIP) regression models and propose two estimators using the Pb and Ps estimators to estimate the nonlinear effects of covariates.

The structure of this article is as follows. The parametric regression models, namely the PO regression model and the ZIP regression model, are briefly described in Section 2. In Section 3, we propose the Ps and Pb estimators to estimate the nonparametric part of the semiparametric regression models, that is, the SPPO and SPZIP regression models. A simulation study is given in Section 4 to illustrate the advantages of the proposed estimators. Section 5 demonstrates the applicability of the proposed estimators through a real-world data application. Finally, Section 6 offers concluding remarks and discusses potential future research directions.

2. Methodology

This section aims to present the PO and ZIP regression models and the maximum likelihood estimator (MLE) for both models.

2.1. PO Regression Model

Regression analysis is one of the most powerful techniques for studying the relationship between two or more variables in various fields and is widely used for both prediction and interpretation. Consider a regression modelling scenario in which the response variable of interest is not normally distributed but instead represents the count of some relatively rare event. The simplest probability distribution for count data is the PO distribution, which is skewed to the right and imposes the restriction that the variance equals the mean. Consequently, the PO regression model is not a safe strategy when the data show over- or under-dispersion. More broadly, the PO regression model is appropriate when the response variable is an observed count, taking non-negative integer values, that follows the PO distribution. The regression coefficients in the PO regression model are estimated using the MLE. The PO distribution is a member of the exponential family, whose density function is defined as

f (y_{i}, θ_{i}, ϕ) = \exp [\frac{y_{i} θ_{i} - b (θ_{i})}{a_{i} (ϕ)} + c (y_{i}, ϕ)]; i = 1, 2, \dots, n,

where, in each case, a(·), b(·) and c(·) are specific functions; the function a_i(ϕ) has the form a_i(ϕ) = ϕp_i, where p_i is a known prior weight, usually 1, θ represents the link function and is called the canonical parameter, which is a function of the mean (μ) of the distribution, b(θ) is a function of the canonical parameter or the cumulant, which is also a function of the mean because θ is a function of mean, ϕ is a dispersion parameter that plays a role in defining the variance of y and c(y_i, ϕ) is a function of observation y and the dispersion or scale parameter ϕ and indicates the normalization term. Assume that the counts, y_i(i = 1, 2, …, n), are produced independently using the PO distribution, where the probability mass function (p.m.f.) of the response variable is provided by

f_{p} (y_{i}; λ) = \frac{λ^{y_{i}} e^{- λ}}{y_{i}!} = \frac{μ_{i}^{y_{i}} e^{- μ_{i}}}{y_{i}!}; y_{i} = 0, 1, \dots

(2.1)

Equation (2.1) can be rewritten in exponential family form as follows:

\begin{matrix} f_{p} (y_{i}; μ_{i}) = \exp \{y_{i} \log (μ_{i}) - μ_{i} - \log (y_{i}!)\}; \\ E (y_{i}) = \frac{d b (θ_{i})}{d θ_{i}} = b^{(1)} (θ_{i}) \text{ and } V a r (y_{i}) = a (ϕ) \frac{d^{2} b (θ_{i})}{d θ_{i}^{2}} = a (ϕ) b^{(2)} (θ_{i}), \end{matrix}

where b⁽¹⁾ and b⁽²⁾ denote the first and second derivatives of b(θ_i), with $μ_{i} > 0, θ_{i} = \log (μ_{i}), b (θ_{i}) = e^{θ_{i}}, a (ϕ) = 1$ and $c (y_{i}, ϕ) = - \log (y_{i}!)$ . All GLMs consist of three components: (a) a response variable distribution, (b) a linear predictor that involves the regressor variables and (c) a link function g(·) that connects the linear predictor to the mean of the response variable as follows:

\log (μ_{i}) = g (μ_{i}) = x_{i}^{T} β = θ_{i}; E (y_{i}) = μ_{i} = g^{- 1} (θ_{i}),

(2.2)

where $x_{i}^{T} = (x_{i 1}, \dots, x_{i p})$ is the i^th row of the X matrix with dimension n × p, corresponding to the i^th subject (i = 1, …, n) in the count model, x_i1 = 1 if the model contains the intercept term and β = (β₁, β₂, …, β_p)^T is a vector of unknown regression parameters. Then, from Equation (2.1), the log-likelihood function of the PO regression model is as follows:

l (β) = \sum_{i = 1}^{n} [y_{i} \log (μ_{i}) - μ_{i} - \log (y_{i}!)]

(2.3)

To estimate the unknown regression parameters of the PO regression model, the most popular approach is the MLE. For the canonical link, we have $η_{i} = g (μ_{i}) = g [E (y_{i})] = x_{i}^{T} β = θ_{i}$ . Therefore, the score function is as follows:

S (β) = \frac{\partial l}{\partial β} = \sum_{i = 1}^{n} \frac{\partial l_{i}}{\partial θ_{i}} \frac{\partial θ_{i}}{\partial μ_{i}} \frac{\partial μ_{i}}{\partial η_{i}} \frac{\partial η_{i}}{\partial β} = \sum_{i = 1}^{n} \frac{(y_{i} - μ_{i})}{g^{'} (μ_{i}) V_{i}} x_{i},

(2.4)

where $V_{i} = E [{(y_{i} - μ_{i})}^{2}]$ is the variance of y_i. In addition, the Fisher information matrix can be expressed as follows:

E [\frac{\partial^{2} l (β)}{\partial β \partial β^{T}}] = - E [\frac{\partial l (β)}{\partial β} \frac{\partial l (β)}{\partial β^{T}}] = - \sum_{i = 1}^{n} \frac{E [{(y_{i} - μ_{i})}^{2}]}{{[g^{'} (μ_{i}) V_{i}]}^{2}} x_{i} x_{i}^{T} = - \sum_{i = 1}^{n} w_{i} x_{i} x_{i}^{T}

Iterative techniques are used to obtain the solution since Equation (2.4) is nonlinear in β. The iteratively re-weighted least squares (IRLS) method is a popular example of such a process. Let β_r denote the estimate of β at iteration r. The update can be expressed as

β_{r + 1} = β_{r} - {I^{- 1} S (β)|}_{β_{r}} = {(X^{T} W X)}^{- 1} X^{T} W ℧,

(2.5)

where

X = [\begin{matrix} x_{11} & x_{12} & \dots & x_{1 p} \\ x_{21} & x_{22} & \dots & x_{2 p} \\ ⋮ & ⋮ & ⋮ & ⋮ \\ x_{n 1} & x_{n 2} & \dots & x_{n p} \end{matrix}]

and $℧ = X β_{r} + (y - μ_{r}) g^{'} (μ_{r}) = η_{r} + (y - μ_{r}) g^{'} (μ_{r})$ is called the working vector, W is a diagonal matrix with diagonal elements $w_{i} = {\{{[g^{'} (μ_{i})]}^{2} V_{i}\}}^{- 1}$ , both evaluated at the current iterate of the Fisher scoring procedure, and $I = E [\frac{\partial^{2} l (β)}{\partial β \partial β^{T}}]$ . For the PO model with the log link, g′(μ_i) = 1/μ_i and V_i = μ_i, so that w_i = μ_i. Subsequently, the estimated coefficients are defined as

\hat{β} = {(X^{T} \hat{W} X)}^{- 1} X^{T} \hat{W} \hat{℧}
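The IRLS update in Equation (2.5) can be illustrated with a minimal Python sketch for the log link, where the weights reduce to w_i = μ_i and the working vector to η_i + (y_i − μ_i)/μ_i. This is only an illustration under stated assumptions: the function names `solve` and `poisson_irls`, the starting value μ = y + 0.5 and the toy data are hypothetical choices that do not come from the article, whose computations were carried out in R.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def poisson_irls(X, y, n_iter=50, tol=1e-10):
    """IRLS for Poisson regression with log link (illustrative sketch)."""
    n, p = len(X), len(X[0])
    mu = [yi + 0.5 for yi in y]          # assumed starting value
    eta = [math.log(m) for m in mu]
    beta = [0.0] * p
    for _ in range(n_iter):
        w = mu[:]                         # log link: w_i = mu_i
        z = [eta[i] + (y[i] - mu[i]) / mu[i] for i in range(n)]  # working vector
        XtWX = [[sum(w[i] * X[i][a] * X[i][b] for i in range(n))
                 for b in range(p)] for a in range(p)]
        XtWz = [sum(w[i] * X[i][a] * z[i] for i in range(n)) for a in range(p)]
        new_beta = solve(XtWX, XtWz)      # (X'WX)^{-1} X'W z
        change = max(abs(new_beta[j] - beta[j]) for j in range(p))
        beta = new_beta
        eta = [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        mu = [math.exp(e) for e in eta]
        if change < tol:
            break
    return beta

# Toy data (hypothetical): intercept plus one covariate.
X = [[1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5]]
y = [1, 2, 2, 4, 6, 11]
beta_hat = poisson_irls(X, y)
```

At convergence the score in Equation (2.4) should be numerically zero, which for the log link means that the residuals y_i − μ_i are orthogonal to every column of X.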

2.2. ZIP Regression Model

As discussed above, the PO regression model assumes that the conditional mean and conditional variance of the response variable are equal, and this condition is commonly violated when the data contain excess zeros. Hence, the PO regression model is not suitable for such data, and zero-inflated regression models are better suited to represent them. Zero-inflated regression models are statistical models specifically designed for count data with excess zeros. They essentially model the data in two parts: one process generates the structural (excess) zeros and another generates the counts, including sampling zeros. A common example is the ZIP model, which is a mixture of two distributions. The zeros are assumed to arise from two different states: the first is a degenerate distribution at zero (i.e., zero inflation) that occurs with probability π_i and produces structural zeros, while the second is a PO distribution that occurs with probability 1 − π_i and produces sampling zeros, which occur by chance. The parametric ZIP can then be formulated as follows:

y_{i} = \{\begin{matrix} 0, & with probability π_{i} \\ f_{P} (y_{i}; λ_{i}), & with probability 1 - π_{i}, \end{matrix}

where y_i is the response variable and f_P(y_i; λ_i) denotes the PO distribution with parameter λ_i > 0. Hence, if π_i = 0, the ZIP model reduces to the PO distribution in Equation (2.1). More explicitly, the ZIP model can be presented as follows:

\Pr (y_{i} = v_{i}; λ_{i}, π_{i}) = \{\begin{matrix} π_{i} + (1 - π_{i}) e^{- λ_{i}}, & v_{i} = 0 \\ (1 - π_{i}) \frac{λ_{i}^{v_{i}} e^{- λ_{i}}}{v_{i}!}, & v_{i} = 1, 2, \dots; 0 \leq π_{i} \leq 1, \end{matrix}

(2.6)

where λ_i is the PO mean corresponding to the susceptible population for the i^th individual (i = 1, …, n), and π_i is the probability of belonging to the nonsusceptible population. For the ZIP distribution, the mean and variance are as follows:

μ_{i} = E (y_{i}) = (1 - π_{i}) λ_{i} and V a r (y_{i}) = μ_{i} + (\frac{π_{i}}{1 - π_{i}}) μ_{i}^{2}
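As a quick numerical check of Equation (2.6) and the moment formulas above, the following Python sketch evaluates the ZIP probability mass function and compares its truncated sums with (1 − π)λ and μ + [π/(1 − π)]μ². The function names and the truncation point `upper` are illustrative assumptions, not part of the article.

```python
import math

def zip_pmf(v, lam, pi):
    """ZIP probability Pr(y = v) as in Equation (2.6)."""
    # Poisson probability computed on the log scale for numerical stability.
    pois = math.exp(-lam + v * math.log(lam) - math.lgamma(v + 1))
    if v == 0:
        return pi + (1.0 - pi) * pois
    return (1.0 - pi) * pois

def zip_moments(lam, pi, upper=200):
    """Total mass, mean and variance from a truncated sum over 0..upper-1."""
    probs = [zip_pmf(v, lam, pi) for v in range(upper)]
    total = sum(probs)
    mean = sum(v * p for v, p in enumerate(probs))
    var = sum((v - mean) ** 2 * p for v, p in enumerate(probs))
    return total, mean, var

lam, pi = 2.5, 0.3
total, mean, var = zip_moments(lam, pi)
mu = (1.0 - pi) * lam                      # closed-form mean
closed_var = mu + (pi / (1.0 - pi)) * mu ** 2  # closed-form variance
```

For λ = 2.5 and π = 0.3 the closed forms give μ = 1.75 and a variance of 3.0625, so the variance exceeds the mean whenever π > 0, which is the over-dispersion induced by zero inflation.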

To use the parametric ZIP model with covariates, Lambert^[41] suggested the following joint models for λ and π:

\log (λ_{i}) = x_{i}^{T} β and \log (\frac{π_{i}}{1 - π_{i}}) = z_{i}^{T} ω,

(2.7)

where $z_{i}^{T} = (z_{i 1}, \dots, z_{i q})$ is the i^th row of the Z matrix with dimension n × q, and ω = (ω₁, …, ω_q)^T.^[42] Now, consider an observed sample (y₁, x₁, t₁), …, (y_n, x_n, t_n) of n independent observations, where each observed response is denoted by y_i, i = 1, 2, …, n. Equation (2.6) then yields the likelihood function of the parametric ZIP regression model, given the observed sample:

L (λ, π) = \prod_{i = 1}^{n} {[π_{i} + (1 - π_{i}) e^{- λ_{i}}]}^{I [y_{i} = 0]} \times {[(1 - π_{i}) \frac{λ_{i}^{y_{i}} e^{- λ_{i}}}{y_{i}!}]}^{I [y_{i} > 0]},

(2.8)

where I[△] is the indicator function of the set △. Taking logarithms and substituting Equation (2.7) into Equation (2.8) yields the log-likelihood function with regard to the parameter vectors β and ω:

l (β, ω) = \sum_{y_{i} = 0} \log [e^{z_{i}^{⊤} ω} + e^{- e^{x_{i}^{⊤} β}}] + \sum_{y_{i} > 0} [y_{i} x_{i}^{⊤} β - e^{x_{i}^{⊤} β}] - \sum_{i = 1}^{n} \log [1 + e^{z_{i}^{⊤} ω}] - \sum_{y_{i} > 0} \log (y_{i}!)

(2.9)

Iterative techniques are used to obtain the solution because Equation (2.9) is nonlinear in β and ω. IRLS and expectation-maximization algorithms are examples of such procedures that are frequently used.^[43]
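The log-likelihood in Equation (2.9) translates directly into code. The Python sketch below is a hedged illustration: the function name `zip_loglik` and the list-of-rows layout for X and Z are assumptions made here, and no optimizer is included.

```python
import math

def zip_loglik(beta, omega, X, Z, y):
    """ZIP log-likelihood of Equation (2.9) for rows X[i], Z[i] and counts y[i]."""
    ll = 0.0
    for xi, zi, yi in zip(X, Z, y):
        xb = sum(a * b for a, b in zip(xi, beta))    # x_i' beta = log(lambda_i)
        zw = sum(a * b for a, b in zip(zi, omega))   # z_i' omega = logit(pi_i)
        if yi == 0:
            # zero counts: log[exp(z'w) + exp(-exp(x'b))]
            ll += math.log(math.exp(zw) + math.exp(-math.exp(xb)))
        else:
            # positive counts: y x'b - exp(x'b) - log(y!)
            ll += yi * xb - math.exp(xb) - math.lgamma(yi + 1)
        ll -= math.log(1.0 + math.exp(zw))           # common term for all i
    return ll
```

In practice this function would be passed (with its sign flipped) to a numerical optimizer or used inside an expectation-maximization loop; the sketch only evaluates the objective.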

3. Proposed Estimators

This article develops estimation for two semiparametric count regression models: the SPPO regression model and the SPZIP regression model. The proposed estimation framework utilizes Ps spline and Pb approaches to capture potential nonparametric covariate effects. While both models address nonparametric relationships, the SPZIP regression model incorporates a zero-inflation component to handle excess zeros in count data. The semiparametric models offer more flexibility than their parametric counterparts by relaxing strict distributional assumptions. Many real-life phenomena exhibit response variables that do not follow a normal distribution. When parametric models are insufficient to describe the relationship between dependent and independent variables, semiparametric regression models provide a more flexible alternative. For example, the GPLM assumes that the conditional expectation of the response variables, given the covariates, can be represented as

E (Y ∣ X, T) = G [X β + m (T)]; V (Y ∣ X, T) = σ^{2} V (μ),

(3.1)

where the explanatory variables are split into two vectors: X, an n × p design matrix of parametric covariates, and T, a q-variate random vector of continuous covariates that enter the model nonparametrically, β is a p × 1 vector of finite-dimensional unknown parameters, σ² is the dispersion parameter and m(·) is an unknown smooth function estimated nonparametrically. The model in Equation (3.1) assumes a linear relationship between the response variable and X and a nonlinear relationship with T. Here, G(·) : R → R is a known link function, chosen based on the range of the response variable. Several approaches can be used to estimate the nonparametric component in semiparametric regression models, including kernel smoothing, local polynomial regression and cubic spline. By combining parametric and nonparametric components, GPLMs offer more flexibility than traditional GLMs. However, this increased flexibility comes with additional complexity and computational challenges.

3.1. SPPO Regression Model

In some cases, we need to extend the parametric PO regression model with the link function in Equation (2.2) into a SPPO regression model with a partially linear link function or a broader systematic component for λ as follows:

\log (μ_{i}) = x_{i}^{T} β + m (t_{i}),

(3.2)

where m(·) is a smoothing function related to the continuous explanatory variable with the nonlinear effect that is controlled nonparametrically. The SPPO regression model can be seen as a generalization of the PO regression model presented in Equations (2.1) and (2.2) by including a nonparametric function. Then, Equations (2.1) and (3.2) define the SPPO regression model to represent linear and nonlinear effects jointly. Regarding the nonparametric component, m(·), in Equation (3.2), there are several approaches to estimate it, including cubic smoothing spline, B-spline and truncated power basis (TPB).^[44] For example, the smooth function m(·) is approximated by the TPB, m_TPB(t, B). The equation for a spline of degree d with K_ξ knots is given by

m_{T P B} (t, B) = B_{0} + \sum_{j = 1}^{d} B_{j} t^{j} + \sum_{k = 1}^{K_{ξ}} B_{d + k} {(t - ξ_{k})}_{+}^{d},

(3.3)

where $B_{0}, B_{1}, \dots, B_{d + K_{ξ}}$ are unknown regression coefficients, ${(t - ξ_{k})}_{+}^{d}$ , k = 1, …, K_ξ, are the TPB functions and u₊ = max(u, 0). Thus, the TPB function ${(t - ξ_{k})}_{+}^{d}$ is zero when $t < ξ_{k}$ and equals ${(t - ξ_{k})}^{d}$ when $t \geq ξ_{k}$ . When fitting a spline of degree d by penalized least squares, none of the polynomial coefficients is penalized. A cubic spline is the most common spline used in practice. Let $ξ_{1} < \dots < ξ_{K_{ξ}}$ be a set of ordered points (called knots) contained in some interval (Ψ, Ω). Define h₁(t) = 1, h₂(t) = t, h₃(t) = t², h₄(t) = t³ and $h_{j} (t) = {(t - ξ_{j - 4})}_{+}^{3}$ for j = 5, …, K_ξ + 4. The functions $\{h_{1}, \dots, h_{K_{ξ} + 4}\}$ form a basis for the set of cubic splines with these knots, called the TPB. Thus, any cubic spline m with these knots can be written as

m_{T P B} (t, B) = \sum_{j = 1}^{K_{ξ} + 4} B_{j} h_{j} (t)

Therefore, the PO mean log(μ) in Equation (3.2) can be expressed as

\log (μ_{i}) = x_{i}^{T} β + \sum_{j = 1}^{K_{ξ} + 4} B_{j} h_{j} (t_{i})

(3.4)
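To make the TPB construction concrete, the following Python sketch builds one row of the basis {h_1, …, h_{K_ξ+4}} and evaluates the spline as a linear combination of it. The function names and the example knots are hypothetical; the sketch assumes a degree of at least 1.

```python
def tpb_basis(t, knots, degree=3):
    """Row of truncated power basis values at t: 1, t, ..., t^d, (t - xi_k)_+^d."""
    row = [t ** j for j in range(degree + 1)]                 # polynomial part
    row += [max(t - xi, 0.0) ** degree for xi in knots]       # truncated part
    return row

def m_tpb(t, coef, knots, degree=3):
    """Spline value m_TPB(t, B) as the inner product of coefficients and basis."""
    return sum(b * h for b, h in zip(coef, tpb_basis(t, knots, degree)))

# Example with two hypothetical knots: the cubic basis has K + 4 = 6 columns.
row = tpb_basis(0.75, [0.25, 0.5], degree=3)
```

Stacking such rows for t_1, …, t_n next to the columns of X gives the design matrix implied by Equation (3.4). TPB columns can be strongly collinear, which is one common reason for preferring the B-spline basis in numerical work.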

As another example, the smooth function m(t) can be approximated by a spline function m_BS(t, κ) that is expressed as a linear combination of the B-spline basis functions as follows:

m_{B S} (t, κ) = \sum_{s = 1}^{K_{ξ} + d + 1} τ_{s} B_{s}^{d} (t, κ),

(3.5)

where τ_s, s = 1, …, K_ξ + d + 1, are unknown regression coefficients and $B_{s}^{d} (t, κ)$ are the B-spline basis functions, constructed as follows. Let ξ₀ = Ψ and $ξ_{K_{ξ} + 1} = Ω$ , where the knots ξ₁ to $ξ_{K_{ξ}}$ are called inner knots and Ψ and Ω are called boundary knots. Define an augmented knot sequence κ, with M = d + 1, such that $κ_{1} \leq κ_{2} \leq \dots \leq κ_{M} \leq ξ_{0}$ , $κ_{M + s} = ξ_{s}$ for $s = 1, \dots, K_{ξ}$ and $ξ_{K_{ξ} + 1} \leq κ_{K_{ξ} + M + 1} \leq \dots \leq κ_{K_{ξ} + 2 M}$ . The choice of extra knots is arbitrary; usually, one takes $κ_{1} = κ_{2} = \dots = κ_{M} = ξ_{0}$ and $ξ_{K_{ξ} + 1} = κ_{K_{ξ} + M + 1} = \dots = κ_{K_{ξ} + 2 M}$ . Given the knots, the B-spline functions are defined recursively, starting from degree zero:

B_{s}^{0} (t, κ) = \{\begin{matrix} 1 & for κ_{s} \leq t < κ_{s + 1} \\ 0 & otherwise \end{matrix}

(3.6)

where $B_{s}^{0} (t, κ) \equiv 0$ if $κ_{s} = κ_{s + 1}$ . The general B-spline of degree d, denoted by $B_{s}^{d} (t, κ)$ , is then defined as follows:

B_{s}^{d} (t, κ) = \frac{t - κ_{s}}{κ_{s + d} - κ_{s}} B_{s}^{d - 1} (t, κ) + \frac{κ_{s + d + 1} - t}{κ_{s + d + 1} - κ_{s + 1}} B_{s + 1}^{d - 1} (t, κ); s = 1, \dots, K_{ξ} + d + 1

(3.7)

Note that 2d + 2 additional knots are necessary to build the full B-spline basis of degree d, so the total number of knots used in the construction is K_ξ + 2d + 2. Consequently, a complete B-spline design matrix of degree d for n observations based on K_ξ inner knots has dimension n × (K_ξ + d + 1); that is, the number of B-splines in the regression equals K_ξ + d + 1. For more details, see the work of Goepp et al.^[45] Therefore, the log of the PO mean in Equation (3.2), according to the B-spline regression, can be expressed as

\log (μ_{i}) = x_{i}^{T} β + \sum_{s = 1}^{K_{ξ} + d + 1} τ_{s} B_{s}^{d} (t_{i}, κ)

(3.8)
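The recursion in Equations (3.6) and (3.7) can be sketched in Python as follows. The sketch assumes the usual choice of repeating each boundary knot d + 1 times, uses 0-based indexing instead of the 1-based index s of the text, treats terms with a zero denominator as zero and assigns the right boundary point to the last non-empty interval; the function names are hypothetical.

```python
def augmented_knots(inner, lo, hi, degree):
    """Repeat each boundary knot degree + 1 times around the inner knots."""
    return [lo] * (degree + 1) + list(inner) + [hi] * (degree + 1)

def bspline_basis(t, knots, degree):
    """Values at t of all B-splines of the given degree (Cox-de Boor recursion)."""
    m = len(knots) - 1
    # Degree zero: indicator of the half-open interval [kappa_s, kappa_{s+1}).
    B = [1.0 if knots[s] <= t < knots[s + 1] else 0.0 for s in range(m)]
    if t == knots[-1]:
        # Put the right boundary point into the last non-empty interval.
        for s in range(m - 1, -1, -1):
            if knots[s] < knots[s + 1]:
                B[s] = 1.0
                break
    for d in range(1, degree + 1):
        new = []
        for s in range(len(knots) - d - 1):
            left = right = 0.0
            den1 = knots[s + d] - knots[s]
            if den1 > 0:
                left = (t - knots[s]) / den1 * B[s]
            den2 = knots[s + d + 1] - knots[s + 1]
            if den2 > 0:
                right = (knots[s + d + 1] - t) / den2 * B[s + 1]
            new.append(left + right)
        B = new
    return B

# Three hypothetical inner knots on [0, 1] with cubic B-splines.
kappa = augmented_knots([0.25, 0.5, 0.75], 0.0, 1.0, degree=3)
values = bspline_basis(0.4, kappa, degree=3)
```

With K_ξ = 3 inner knots and d = 3, the augmented sequence has K_ξ + 2d + 2 = 11 knots and the basis has K_ξ + d + 1 = 7 functions, which sum to one at every point of the interval.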

Let (Ψ, Ω) be an interval and let $\{ξ_{1}, \dots, ξ_{K_{ξ}}\}$ be K_ξ points such that $Ψ < ξ_{1} < \dots < ξ_{K_{ξ}} < Ω$ . A continuous function m on [Ψ, Ω] is a cubic spline with knots $\{ξ_{1}, \dots, ξ_{K_{ξ}}\}$ if the following two assumptions hold. Assumption 1: the nonparametric function, m, is a cubic polynomial over each of the intervals $(ξ_{1}, ξ_{2}), \dots, (ξ_{K_{ξ} - 1}, ξ_{K_{ξ}})$ . Assumption 2: the nonparametric function, m, has continuous first and second derivatives at the knots. The natural cubic spline is obtained if, in addition, m⁽²⁾(Ψ) = m⁽²⁾(Ω) = 0; that is, a natural cubic spline is a cubic spline constrained to be linear in its tails, on the boundary intervals (Ψ, ξ₁) and $(ξ_{K_{ξ}}, Ω)$ . To estimate the semiparametric model in Equations (3.4) and (3.8), the Ps spline log-likelihood function is maximized as follows:

l (β, m) = l (β) - \frac{ℵ}{2} J (m); J (m) = \int_{Ψ}^{Ω} {[m^{(2)} (t)]}^{2} d t,

(3.9)

where $ℵ \geq 0$ is the smoothing parameter, J(m) is a penalty (or regularization) term, and Ψ and Ω are the minimum and maximum values of t, respectively (i.e., Ψ = t₁ < … < t_n = Ω).

The concept of Ps splines was originally introduced by O’Sullivan.^[46] However, Eilers and Marx^[47] pioneered the use of B-splines with difference penalties, a technique now known as P-splines (Pb). P-splines combine regression on B-splines with a discrete roughness penalty to smooth scatterplots. B-splines are constructed from polynomial pieces joined at specific points called knots. Once the knots are defined, B-splines of any desired degree can be computed recursively. The penalty term for Equation (3.9) can be written as follows:

J (m) = \int_{Ψ}^{Ω} {[m^{(2)} (t)]}^{2} d t = Υ_{s}^{T} M_{s} Υ_{s},

(3.10)

where Υ_s is the vector of spline coefficients and M_s is a square positive semidefinite penalty matrix whose dimension matches the length of Υ_s.^[28] However, Eilers and Marx^[48] showed that the integrated square of the k^th derivative of m(t) is well approximated by a penalty on finite differences of the coefficients Υ_s with much less effort, that is,

\int_{Ψ}^{Ω} {[m^{(k)} (t)]}^{2} d t = Υ_{s}^{T} P_{k} Υ_{s},

(3.11)

where $P_{k} = D_{k}^{T} D_{k}$ and D_k is the k^th-order difference matrix of dimension (n − k) × n. Finally, the logarithm of the Pb likelihood function may be written as follows using Equations (3.9) and (3.11):

l (β, m) = \sum_{i = 1}^{n} [y_{i} \log (μ_{i}) - μ_{i} - \log (y_{i}!)] - \frac{ℵ}{2} Υ_{s}^{T} P_{k} Υ_{s}

(3.12)

Thus, β and m are estimated by maximizing the logarithm of the penalized likelihood function in Equation (3.12). The banded matrix D_k can be computed recursively from the first-order difference matrix D₁, which has dimension (n − 1) × n, with entries d_{i,i} = −1 and d_{i,i+1} = 1 and all other elements equal to 0; in practice, k = 2 or k = 3 is mostly used. The penalized splines technique employs fewer knots while demonstrating greater robustness to knot placement compared to traditional smoothing splines.^{[45, 49]} The generalized additive models for location, scale and shape (GAMLSS) package enhances this approach through automatic knot selection, effectively balancing model complexity and efficiency. While Ruppert et al.^[49] recommended maintaining 4–5 observations between knots, large datasets benefit from an upper limit of 20–40 knots^[50] to ensure computational efficiency.
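The difference penalty in Equation (3.11) can be sketched in a few lines of Python. The sketch builds D_k by repeatedly differencing the rows of an identity matrix, which uses the sign convention (−1, 1) for first differences; this convention, like the function names, is an assumption made here, and the penalty value itself does not depend on the sign.

```python
def diff_matrix(n, k):
    """k-th order difference matrix of dimension (n - k) x n."""
    D = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        # Differencing consecutive rows applies one more first-order difference.
        D = [[D[i + 1][j] - D[i][j] for j in range(n)] for i in range(len(D) - 1)]
    return D

def difference_penalty(coef, k):
    """Quadratic form coef' D_k' D_k coef, i.e. the sum of squared k-th differences."""
    D = diff_matrix(len(coef), k)
    diffs = [sum(row[j] * coef[j] for j in range(len(coef))) for row in D]
    return sum(d * d for d in diffs)

D2 = diff_matrix(5, 2)
flat = difference_penalty([1.0, 2.0, 3.0, 4.0, 5.0], 2)   # linear coefficients
bump = difference_penalty([0.0, 0.0, 1.0, 0.0, 0.0], 2)   # isolated spike
```

A second-order penalty leaves linearly increasing coefficients unpenalized and charges a rough coefficient sequence heavily, which is how the penalty pulls the fitted curve towards a smooth shape as the smoothing parameter grows.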

3.2. SPZIP Regression Model

Consider a semiparametric link function for λ. One possibility is the partially linear link function, which gives the joint models

\log (λ_{i}) = x_{i}^{T} β + m (t_{i}) and \log (\frac{π_{i}}{1 - π_{i}}) = z_{i}^{T} ω

(3.13)

Table 1.

The Generated Variables Under Different Scenarios.

Scenarios	Scenario I	Scenario II
m(t)	0.6 + sin(3πt)	sin(6πt) + cos(πt)
t	U(−0.5, 0.5)	U(−0.5, 1)
x₁	N(0.25, 1)	U(−0.5, 1)
x₂	U(−0.5, 1)	U(−0.5, 1)
(β₁, β₂)	(0.6, 0.6)	(0.5, 0.5)

where t is an observable continuous covariate, and m(·) is an unknown smooth function related to continuous covariates with nonlinear effect that is modelled nonparametrically. The model specified by Equations (2.6) and (2.7) is an extension of the class of generalized linear mixed models, while the model specified by Equations (2.6) and (3.13) is an extension of the class of generalized partially linear mixed models by incorporating a nonparametric component, m(t), alongside parametric terms. Let ${\{y_{i}\}}_{i = 1}^{n}$ be an independent random sample from a ZIP distribution. Combining the model specifications from Equations (2.6), (3.11) and (3.13), we derive the semiparametric penalized log-likelihood function for the SPZIP regression model

\begin{matrix} l (β, ω, m) = \sum_{y_{i} = 0} \log [e^{z_{i}^{⊤} ω} + e^{- e^{x_{i}^{⊤} β + m (t_{i})}}] + \sum_{y_{i} > 0} [y_{i} \{x_{i}^{⊤} β + m (t_{i})\} - e^{x_{i}^{⊤} β + m (t_{i})}] \\ - \sum_{i = 1}^{n} \log [1 + e^{z_{i}^{⊤} ω}] - \sum_{y_{i} > 0} \log (y_{i}!) - \frac{ℵ}{2} Υ_{s}^{⊤} P_{d} Υ_{s} \end{matrix}

(3.14)

This semiparametric penalized log-likelihood function incorporates fixed-effects parameter vectors β ∈ ℝ^p and ω ∈ ℝ^q for the count and zero-inflation components, respectively, and a nonparametric smooth function m(·) to model nonlinear covariate effects. The smoothness of m(·) is enforced by the penalty term $\frac{ℵ}{2} Υ_{s}^{⊤} P_{d} Υ_{s}$ , where $ℵ > 0$ is the smoothing parameter that balances model fit and smoothness, P_d is a positive semidefinite penalty matrix and Υ_s represents the basis coefficients for the spline representation of m(·). This formulation provides a unified framework for modelling the zero-inflation probability via the logistic regression component $z_{i}^{⊤} ω$ , the PO count process through $\exp (x_{i}^{⊤} β)$ and nonlinear covariate effects via the penalized spline term m(·).

4. Simulation Study

To assess the performance of the parametric and semiparametric PO and ZIP models, a Monte Carlo simulation study was conducted. The nonparametric part of the SPPO and SPZIP models was estimated via the Ps and Pb estimators. The R software with the GAMLSS package was used to simulate data under various scenarios, with sample sizes of 150, 250 and 400 and zero-inflation levels of 30 per cent and 50 per cent, as in Table 2. For each scenario, 1,000 datasets were generated, with the response variable y_i following a ZIP(λ, π) distribution and the covariates (x₁, x₂) and t generated as specified in Table 1.
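As an illustration of this data-generating step, the sketch below draws one dataset under the first covariate setting of Table 1. It is written in Python rather than the R/GAMLSS code used in the study. The curve `m_placeholder` is a hypothetical sine–cosine function standing in for the m(t) defined in Table 1, and treating the zero-inflation level as a constant structural-zero probability π is likewise an assumption.

```python
import math
import random

def rpois(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def m_placeholder(t):
    # Hypothetical smooth curve; replace with the m(t) specified in Table 1.
    return math.sin(2 * math.pi * t) + math.cos(2 * math.pi * t)

def simulate_zip(n, beta=(0.6, 0.6), pi_zero=0.3, seed=1):
    """Return a list of (y, x1, x2, t) tuples from a semiparametric ZIP model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = rng.uniform(-0.5, 0.5)
        x1 = rng.gauss(0.25, 1.0)
        x2 = rng.uniform(-0.5, 1.0)
        lam = math.exp(beta[0] * x1 + beta[1] * x2 + m_placeholder(t))
        # Structural zero with probability pi_zero, otherwise a Poisson count.
        y = 0 if rng.random() < pi_zero else rpois(lam, rng)
        rows.append((y, x1, x2, t))
    return rows

data = simulate_zip(150)
zero_share = sum(1 for row in data if row[0] == 0) / len(data)
```

The observed share of zeros exceeds `pi_zero` because the Poisson component also produces zeros.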

Table 2.

The Design of the Experiment.

Name of Factors	Notations	Values
Number of explanatory variables	N	2
Sample size	n	150, 250 and 400
Zero-inflation ratios	ZI per cent	30 per cent and 50 per cent

Table 3.

Parametric and Semiparametric Link Functions for Count Data Models.

Models	Parameter	Parametric	Semiparametric
Poisson	λ	log(λ) = Cβ; C = [x₁, x₂, t]	log(λ) = Xβ + m(t)
ZIP	λ	log(λ) = Xβ	log(λ) = Xβ + m(t)
	π	$\log (\frac{π}{1 - π}) = Z ω$	$\log (\frac{π}{1 - π}) = Z ω + m (t)$

Additionally, we explored the use of sine–cosine functions for the nonparametric component, as outlined in Table 1. Following the work of Rahman et al.,^[27] alternative nonparametric functions can be considered in future research. The deviance statistic (DVS), mean squared error (MSE), Akaike information criterion (AIC) and Bayesian information criterion (BIC) were used to assess each model’s performance. Average estimates (AEs), MSE and root mean squared error (RMSE) were used to evaluate the accuracy of the estimates ${\hat{β}}_{j}$ and $\hat{m} (\cdot)$. The MSE values for ${\hat{β}}_{j}$ and $\hat{m} (\cdot)$ at each replication v = 1, 2, …, 1,000 were calculated as

\mathrm{MSE}_{v} (\hat{m} (t)) = \frac{1}{n} \sum_{i = 1}^{n} {[\hat{m} (t_{i}) - m (t_{i})]}^{2}; \quad \mathrm{MSE}_{v} (\hat{β}) = \frac{1}{N} \sum_{j = 1}^{N} {[{\hat{β}}_{j} - β_{j}]}^{2}
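The two per-replicate MSE formulas above share the same form and can be computed with one helper, as sketched below. The replicate estimates are hypothetical numbers, and averaging the per-replicate values before taking the square root is an assumption about how the reported MSE and RMSE are aggregated.

```python
import math

def mse(estimates, truth):
    """Mean squared error between two equal-length sequences."""
    if len(estimates) != len(truth):
        raise ValueError("sequences must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)

def summarize(mse_per_replicate):
    """Average the per-replicate MSE values and report the matching RMSE."""
    avg = sum(mse_per_replicate) / len(mse_per_replicate)
    return avg, math.sqrt(avg)

beta_true = [0.6, 0.6]
beta_hat_runs = [[0.58, 0.63], [0.61, 0.57], [0.55, 0.60]]  # hypothetical estimates
mse_runs = [mse(b, beta_true) for b in beta_hat_runs]
avg_mse, rmse = summarize(mse_runs)
```

The same `mse` function applies to the nonparametric part by passing the fitted values of m̂(t_i) and the true values m(t_i).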

Zero-inflated count data pose significant challenges in statistical modelling because of the excess zeros. To address this issue, we evaluated the performance of several count models: the parametric PO and ZIP models and the semiparametric SPPO and SPZIP models. Simulated data were generated under different scenarios with varying levels of zero inflation and sample size, as in Table 2. The models were fitted to these data and evaluated using multiple criteria, including AIC, BIC, DVS, MSE, mean absolute error (MAE) and RMSE. Lower values of these metrics indicate better model fit.
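The information criteria are linked to the deviance through the model degrees of freedom. The sketch below assumes the usual GAMLSS convention, AIC = DVS + 2·df and BIC = DVS + log(n)·df, and uses it to recover the (effective) degrees of freedom from two rows of Table 4 (n = 150, ZI = 30 per cent). The function names are hypothetical.

```python
import math

def effective_df(aic, deviance):
    """Degrees of freedom implied by AIC = deviance + 2 * df."""
    return (aic - deviance) / 2.0

def bic_from(deviance, df, n):
    """BIC implied by BIC = deviance + log(n) * df."""
    return deviance + math.log(n) * df

# PO with MLE (Table 4): AIC = 781.498, DVS = 775.498
df_po = effective_df(781.498, 775.498)
bic_po = bic_from(775.498, df_po, 150)

# SPZIP with Pb (Table 4): AIC = 514.231, DVS = 494.798
df_spzip = effective_df(514.231, 494.798)
bic_spzip = bic_from(494.798, df_spzip, 150)

print(round(df_po, 3), round(bic_po, 3))
print(round(df_spzip, 3), round(bic_spzip, 3))
```

Under this convention the implied BIC values agree with the tabulated ones (790.529 and 543.484) up to rounding, and the non-integer degrees of freedom of the SPZIP fit reflect the penalized smooth term.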

Our findings consistently demonstrate the superior performance of the SPZIP regression model, particularly when using the Pb estimator (ZIP.pb). This model achieved the lowest values for all evaluation metrics, including AIC, BIC and MSE, and exhibited enhanced performance with larger sample sizes and higher levels of zero inflation. The SPZIP model’s flexibility in capturing both parametric and nonparametric components of zero-inflated count data enables more precise parameter estimation and improved model fit, especially in complex data structures. As shown in Tables 4–9, the SPZIP regression model with the Pb estimator consistently outperformed traditional parametric methods, reinforcing its position as a robust and effective statistical tool for analysing zero-inflated count data.

Figures 1 and 2 provide a visual comparison of the fitted values for the estimators of m(t) under two scenarios (n = 400, with ZI = 30 per cent and 50 per cent). The SPZIP regression model, particularly with the Pb estimator, consistently shows the closest alignment with the true function in both scenarios. This visual evidence corroborates the findings from the evaluation metrics and supports the superior performance of the SPZIP regression model in capturing the complex nonparametric structure of the data. The key aspects in which the SPZIP regression model outperforms the other models can be summarized as follows:

SPZIP model’s accuracy: The SPZIP regression model’s fitted curves consistently exhibit a closer proximity to the true function compared to the other models. This is particularly evident in the regions of higher curvature and variability.

Table 4.

Simulation Results for Different Estimators When n = 150 in Scenario I.

Model	Estimator	ZI (per cent)	AIC	DVS	BIC	MSE	Est. $\hat{β}_{1}$	Est. $\hat{β}_{2}$	Parametric MAE	Parametric RMSE	Nonparametric MAE	Nonparametric RMSE
PO	MLE	30	781.498	775.498	790.529	3.094	1.090	0.810	1.858	2.683	-	-
SPPO	Pb	30	606.635	590.889	630.338	1.908	0.600	0.600	1.046	1.461	0.370	0.384
SPPO	Ps	30	618.743	606.743	636.807	1.998	0.630	0.620	1.038	1.529	0.358	0.435
ZIP	MLE	30	704.765	692.765	722.829	2.250	1.520	1.090	1.936	2.519	-	-
SPZIP	Pb	30	514.231	494.798	543.484	0.938	0.560	0.560	0.955	1.46	0.324	0.351
SPZIP	Ps	30	520.248	504.248	544.332	0.963	0.580	0.580	0.956	1.489	0.339	0.396
PO	MLE	50	701.259	695.259	710.291	2.939	0.760	0.560	1.932	2.966	-	-
SPPO	Pb	50	583.728	569.402	605.295	2.218	0.610	0.600	1.694	2.259	0.712	0.720
SPPO	Ps	50	593.432	581.432	611.496	2.295	0.630	0.630	1.685	2.289	0.713	0.751
ZIP	MLE	50	550.494	538.494	568.558	1.830	1.510	1.080	1.931	2.514	-	-
SPZIP	Pb	50	422.521	403.778	450.737	0.954	0.620	0.610	0.873	1.349	0.290	0.327
SPZIP	Ps	50	427.076	411.076	451.161	0.970	0.650	0.630	0.873	1.479	0.309	0.372

Note: AIC: Akaike information criterion; BIC: Bayesian information criterion; DVS: deviance statistic; MAE: mean absolute error; MLE: maximum likelihood estimator; MSE: mean squared error; Pb: P-spline; Ps: penalized smoothing; RMSE: root mean squared error; SPPO: semiparametric partially Poisson; SPZIP: semiparametric partially zero-inflated Poisson; ZIP: zero-inflated Poisson. The bold values are the most significant.

Table 5.

Simulation Results for Different Estimators When n = 250 in Scenario I.

Model	Estimator	ZI (per cent)	AIC	DVS	BIC	MSE	Est. $\hat{β}_{1}$	Est. $\hat{β}_{2}$	Parametric MAE	Parametric RMSE	Nonparametric MAE	Nonparametric RMSE
PO	MLE	30	1303.715	1297.715	1314.279	3.126	0.860	1.070	1.852	2.563	-	-
SPPO	Pb	30	1015.516	998.655	1045.205	1.939	0.610	0.610	0.986	1.269	0.360	0.366
SPPO	Ps	30	1033.544	1021.544	1054.673	2.013	0.600	0.630	0.979	1.391	0.336	0.403
ZIP	MLE	30	1200.615	1188.615	1221.744	2.428	1.160	1.480	1.936	2.505	-	-
SPZIP	Pb	30	862.259	841.276	899.204	0.955	0.550	0.550	0.162	0.203	0.085	0.108
SPZIP	Ps	30	872.546	856.546	900.718	0.975	0.540	0.580	0.500	0.597	0.216	0.259
PO	MLE	50	1162.46	1156.46	1173.02	2.940	0.600	0.750	1.920	2.800	-	-
SPPO	Pb	50	983.36	967.98	1010.46	2.270	0.600	0.610	1.630	2.090	0.700	0.710
SPPO	Ps	50	997.97	985.97	1019.10	2.330	0.600	0.630	1.630	2.130	0.700	0.720
ZIP	MLE	50	934.49	922.49	955.62	1.930	1.150	1.480	1.930	2.500	-	-
SPZIP	Pb	50	707.23	686.86	743.09	0.970	0.600	0.610	0.120	0.170	0.050	0.070
SPZIP	Ps	50	715.15	699.15	743.32	0.990	0.590	0.630	0.450	0.560	0.180	0.230

Note: AIC: Akaike information criterion; BIC: Bayesian information criterion; DVS: deviance statistic; MAE: mean absolute error; MLE: maximum likelihood estimator; MSE: mean squared error; Pb: P-spline; Ps: penalized smoothing; RMSE: root mean squared error; SPPO: semiparametric partially Poisson; SPZIP: semiparametric partially zero-inflated Poisson; ZIP: zero-inflated Poisson. The bold values are the most significant.

Table 6.

Simulation Results for Different Estimators When n = 400 in Scenario I.

Model	Estimator	ZI (per cent)	AIC	DVS	BIC	MSE	Est. $\hat{β}_{1}$	Est. $\hat{β}_{2}$	Parametric MAE	Parametric RMSE	Nonparametric MAE	Nonparametric RMSE
PO	MLE	30	2074.496	2068.496	2086.47	3.11982	0.92	1.08	1.83	2.615	-	-
SPPO	Pb	30	1633.956	1615.911	1669.968	2.00119	0.6	0.6	0.996	1.311	0.357	0.362
SPPO	Ps	30	1663.129	1651.129	1687.078	2.06608	0.6	0.6	0.99	1.45	0.328	0.399
ZIP	MLE	30	1899.134	1887.134	1923.082	2.4246	1.23	1.42	1.862	2.531	-	-
SPZIP	Pb	30	1364.681	1342.329	1409.29	0.95808	0.56	0.55	0.175	0.214	0.096	0.115
SPZIP	Ps	30	1383.501	1367.501	1415.433	0.97822	0.56	0.55	0.573	0.69	0.231	0.274
PO	MLE	50	1864.576	1858.576	1876.551	2.959	0.66	0.80	1.928	2.871	-	-
SPPO	Pb	50	1589.640	1573.202	1622.447	2.343	0.61	0.61	1.639	2.139	0.694	0.698
SPPO	Ps	50	1612.409	1600.409	1636.358	2.392	0.61	0.61	1.633	2.205	0.671	0.708
ZIP	MLE	50	1478.309	1466.309	1502.258	1.962	1.22	1.42	1.861	2.530	-	-
SPZIP	Pb	50	1118.607	1096.797	1162.134	0.978	0.61	0.61	0.095	0.148	0.035	0.048
SPZIP	Ps	50	1133.124	1117.124	1165.056	0.991	0.61	0.61	0.513	0.630	0.202	0.238

Table 7.

Simulation Results for Different Estimators When n = 150 in Scenario II.

Model	Estimator	ZI (per cent)	AIC	DVS	BIC	MSE	Est. $\hat{β}_{1}$	Est. $\hat{β}_{2}$	Parametric MAE	Parametric RMSE	Nonparametric MAE	Nonparametric RMSE
PO	MLE	30	713.935	707.935	722.967	2.996	0.760	0.870	1.920	3.245	-	-
SPPO	Pb	30	550.585	528.060	584.492	1.847	0.530	0.530	0.958	1.802	0.383	0.461
SPPO	Ps	30	630.915	618.915	648.979	2.376	0.600	0.630	1.442	2.535	0.608	0.729
ZIP	MLE	30	639.602	627.602	657.666	2.272	0.930	1.190	2.205	3.786	-	-
SPZIP	Pb	30	456.192	428.479	498.160	0.930	0.480	0.440	0.519	0.860	0.258	0.349
SPZIP	Ps	30	522.171	506.171	546.256	1.186	0.530	0.520	1.446	2.236	0.596	0.692
PO	MLE	50	637.188	631.188	646.220	2.808	0.600	0.620	1.860	3.357	-	-
SPPO	Pb	50	540.753	524.598	565.072	2.214	0.550	0.580	1.486	2.771	0.628	0.747
SPPO	Ps	50	583.544	571.544	601.608	2.488	0.590	0.640	1.615	2.992	0.715	0.863
ZIP	MLE	50	494.220	482.220	512.284	1.835	0.920	1.200	2.197	3.759	-	-
SPZIP	Pb	50	381.584	358.951	415.654	0.980	0.530	0.530	0.729	1.227	0.344	0.437
SPZIP	Ps	50	420.228	404.228	444.313	1.132	0.570	0.600	1.396	2.188	0.573	0.660

Table 8.

Simulation Results for Different Estimators When n = 250 in Scenario II.

Model	Estimator	ZI (per cent)	AIC	DVS	BIC	MSE	Est. $\hat{β}_{1}$	Est. $\hat{β}_{2}$	Parametric MAE	Parametric RMSE	Nonparametric MAE	Nonparametric RMSE
PO	MLE	30	1182.056	1176.056	1192.621	2.994	0.620	1.030	1.829	3.112	-	-
SPPO	Pb	30	879.976	851.051	930.906	1.781	0.490	0.510	0.847	1.861	0.320	0.392
SPPO	Ps	30	1044.537	1032.537	1065.666	2.405	0.500	0.570	1.460	2.668	0.592	0.695
ZIP	MLE	30	1056.977	1044.977	1078.106	2.308	0.740	1.420	1.961	2.958	-	-
SPZIP	Pb	30	739.920	707.722	796.612	0.940	0.480	0.460	0.283	0.411	0.203	0.265
SPZIP	Ps	30	857.005	841.005	885.176	1.186	0.470	0.510	1.667	2.342	0.707	0.877
PO	MLE	50	1053.553	1047.553	1064.117	2.799	0.52	0.71	1.868	3.593	-	-
SPPO	Pb	50	851.012	826.264	894.587	2.080	0.48	0.5	1.398	2.956	0.562	0.669
SPPO	Ps	50	969.060	957.060	990.189	2.521	0.48	0.56	1.641	3.322	0.685	0.812
ZIP	MLE	50	817.678	805.678	838.806	1.883	0.75	1.4	1.973	3.014	-	-
SPZIP	Pb	50	603.494	572.765	657.599	0.954	0.5	0.5	0.188	0.276	0.142	0.231
SPZIP	Ps	50	689.625	673.625	717.797	1.133	0.49	0.56	1.528	2.209	0.632	0.769

Table 9.

Simulation Results for Different Estimators When n = 400 in Scenario II.

Model	Estimator	ZI (per cent)	AIC	DVS	BIC	MSE	Est. $\hat{β}_{1}$	Est. $\hat{β}_{2}$	Parametric MAE	Parametric RMSE	Nonparametric MAE	Nonparametric RMSE
PO	MLE	30	1880.962	1874.962	1892.936	3.018	0.600	0.850	1.750	2.988	-	-
SPPO	Pb	30	1368.871	1337.063	1432.349	1.750	0.500	0.490	0.713	1.288	0.295	0.357
SPPO	Ps	30	1656.627	1644.627	1680.575	2.401	0.470	0.480	1.426	2.433	0.595	0.702
ZIP	MLE	30	1728.404	1716.404	1752.353	2.435	0.720	1.230	1.958	3.317	-	-
SPZIP	Pb	30	1161.520	1126.384	1231.643	0.952	0.460	0.440	0.195	0.303	0.131	0.169
SPZIP	Ps	30	1370.911	1354.911	1402.843	1.209	0.410	0.410	1.501	2.238	0.648	0.772
PO	MLE	50	1646.223	1640.223	1658.198	2.749	0.490	0.580	1.744	3.050	-	-
SPPO	Pb	50	1314.634	1285.795	1372.189	2.025	0.490	0.500	1.181	2.087	0.521	0.632
SPPO	Ps	50	1523.959	1511.959	1547.908	2.482	0.470	0.490	1.530	2.760	0.671	0.797
ZIP	MLE	50	1328.788	1316.788	1352.737	1.957	0.730	1.220	1.970	3.370	-	-
SPZIP	Pb	50	948.539	914.555	1016.362	0.969	0.500	0.500	0.155	0.248	0.098	0.164
SPZIP	Ps	50	1104.964	1088.964	1136.895	1.152	0.450	0.480	1.469	2.233	0.612	0.723

Notes: AIC: Akaike information criterion; BIC: Bayesian information criterion; DVS: deviance statistic; MAE: mean absolute error; MLE: maximum likelihood estimator; MSE: mean squared error; Pb: P-spline; Ps: penalized smoothing; RMSE: root mean squared error; SPPO: semiparametric partially Poisson; SPZIP: semiparametric partially zero-inflated Poisson; ZIP: zero-inflated Poisson. The bold values are the most significant.

Figure 1.

Note: Pb: P-spline; Ps: penalized smoothing; SPPO: semiparametric partially Poisson; SPZIP: semiparametric partially zero-inflated Poisson.

Figure 2.

Note: Pb: P-spline; Ps: penalized smoothing; SPPO: semiparametric partially Poisson; SPZIP: semiparametric partially zero-inflated Poisson.

Robustness to zero-inflation levels: The SPZIP regression model’s performance remains relatively stable across different levels of zero inflation (30 per cent and 50 per cent). This suggests its robustness in handling varying degrees of excess zeros.

Table 10.

Summary of BioChemists Dataset.

Variable	Description	Mean	SD
Y	Number of articles published during the last 3 years of the PhD	1.693	1.926
x₁	Factor indicating gender of student, with levels Men and Women	0.460	0.499
x₂	Count of articles produced by the PhD mentor during the last 3 years	8.767	9.484
t	Prestige of the PhD department	3.103	0.984

Figure 3.

Note: PO: Poisson; ZIP: zero-inflated Poisson.

5. Empirical Application

In this section, we present a real-life application to evaluate the effectiveness of the proposed estimators.

5.1. BioChemists Dataset

This section demonstrates the applicability of the SPZIP regression model through comparative analysis using real-life data. We utilize the biochemistry graduate students’ dataset (bioChemists) from R software, previously analysed by Al-Taweel and Algamal,^[51] to compare parametric regression models (PO and ZIP) with their semiparametric counterparts (SPPO and SPZIP). The semiparametric models incorporate a nonparametric component estimated using the Ps and Pb estimators. The dataset comprises 915 observations, with basic statistics reported in Table 10.

Before model fitting, variable selection was conducted. The predictor x₁ was excluded from nonparametric treatment because it is categorical. Furthermore, a comparison of the coefficients of determination for the continuous variables x₂ and t showed that t had the lower value, indicating a weaker linear relationship with the response variable. Consequently, t was deemed suitable for nonparametric treatment in the semiparametric models. Figure 3 shows the fits of the two distributions (PO and ZIP) to the count response in the ‘bioChemists’ data, in which the proportion of zeros is 30 per cent. In Figure 3, vertical lines topped with circles represent the distribution fits.

Table 11.

Test of Zero-inflation in the BioChemists Dataset.

Model	Z-score	p value	Dispersion
PO	5.67	< .0001	1.856
	Observed zeros	Predicted zeros	Ratio
	275	188	68 per cent

Note: PO: Poisson.

Table 12.

LR Tests for Fitted Regression Models.

Test		Null Hypothesis	Test Statistic	p Value
SPZIP-Pb vs.	PO	H₀: The PO model is sufficient	164.81	< .0001
	ZIP	H₀: The ZIP model is sufficient	98.56	< .0001
SPZIP-Ps vs.	PO	H₀: The PO model is sufficient	125.50	< .0001
	ZIP	H₀: The ZIP model is sufficient	59.25	< .0001
SPZIP-Pb vs.	SPZIP-Ps	H₀: The SPZIP-Ps model is sufficient	39.30	< .0001

Note: Pb: P-spline; PO: Poisson; Ps: penalized smoothing; SPZIP: semiparametric partially zero-inflated Poisson; ZIP: zero-inflated Poisson.

Figure 3 highlights the importance of using the ZIP distribution when the data contain excess zeros. Comparing how the two distributions represent the proportion of zeros shows that the PO distribution captured only 20 per cent of the observations as zeros, while the ZIP distribution captured 30 per cent. This emphasizes that the PO distribution is unsuitable for data with excess zeros and can lead to inaccurate representations. Furthermore, we applied the Z-score test to check for overdispersion, with H₀: equidispersion versus H₁: true dispersion greater than one. The results in Table 11 show a Z-score statistic of 5.67 with p < .0001, so we reject H₀ and conclude that the dataset is overdispersed. The estimated dispersion value of 1.856 further supports this conclusion.
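The zero shares quoted above can be reproduced from the counts in Table 11 and the sample size of 915, as in the short sketch below. The helper `expected_poisson_zeros` is hypothetical and shows how a predicted number of zeros would be obtained from fitted Poisson means, assuming those means are available.

```python
import math

n_obs = 915            # observations in the bioChemists data
zeros_observed = 275   # Table 11
zeros_poisson = 188    # zeros predicted by the PO model, Table 11

share_observed = zeros_observed / n_obs   # about 0.30
share_poisson = zeros_poisson / n_obs     # about 0.21
ratio = zeros_poisson / zeros_observed    # about 0.68

def expected_poisson_zeros(mu):
    """Expected number of zeros under a Poisson model with fitted means mu."""
    return sum(math.exp(-m) for m in mu)

print(f"{share_observed:.3f} {share_poisson:.3f} {ratio:.3f}")
```

A ratio well below one means that the Poisson fit predicts far fewer zeros than were observed, which is the pattern a zero-inflation component is designed to absorb.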

In addition, Table 11 shows that the number of observed zeros is larger than the number of predicted zeros. The PO model therefore under-fits the zeros, which indicates zero inflation in the data. Table 12 presents likelihood ratio (LR) test results comparing the SPZIP regression model, estimated using both the Pb and Ps estimators, with the PO and ZIP regression models. The null hypothesis (H₀) for each comparison posits that the simpler regression model (PO or ZIP) adequately explains the data, whereas the alternative hypothesis (H₁) asserts that the SPZIP regression model provides a significantly better fit. The LR tests yield consistently high chi-squared test statistics and low p values, strongly supporting the superiority of the SPZIP regression model in capturing the underlying data patterns relative to both the PO and ZIP models. Furthermore, the SPZIP regression model estimated using the Pb estimator provides a significantly better fit than the SPZIP model estimated with the Ps estimator (p < .0001). These findings suggest that the SPZIP regression model with the Pb estimator captures the underlying data patterns more accurately than the other models considered.
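An LR statistic of this kind is the difference between the deviances of the two fitted models. The sketch below reproduces the SPZIP-Ps versus PO statistic from the DVS values in Table 13. The degrees of freedom are inferred from AIC − DVS = 2·df, which is an assumption about how the criteria were computed, and the chi-squared reference distribution is only approximate for penalized fits. The function `chi2_sf` is a hypothetical stdlib-only helper for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # Q(1/2, z)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

dev_po, aic_po = 3343.02, 3349.02   # Table 13, PO
dev_ps, aic_ps = 3217.52, 3241.52   # Table 13, SPZIP with Ps

lr_stat = dev_po - dev_ps
df_diff = round((aic_ps - dev_ps) / 2.0 - (aic_po - dev_po) / 2.0)
p_value = chi2_sf(lr_stat, df_diff)
print(f"LR = {lr_stat:.2f}, df = {df_diff}, p = {p_value:.3g}")
```

The resulting statistic matches the value of 125.50 reported in Table 12 for this comparison.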

As shown in Table 13, four statistical regression models were estimated: two parametric models (PO and ZIP) and two semiparametric models (SPPO and SPZIP). The semiparametric models were employed to capture nonlinear relationships in the data. The models were compared on three selection criteria, namely AIC, DVS and MSE. The SPZIP regression model with the Pb estimator achieved the lowest values for all three criteria, indicating that it is the best-fitting model.

Table 13.

Results of Different Estimations of Parametric and Semiparametric Models for the BioChemists Dataset.

Model	Estimator	Systematic Components	DVS	AIC	BIC	MSE
Parametric models
PO	MLE	μ = exp[β₁x₁ + β₂x₂ + β₃t]	3343.02	3349.02	3363.49	1.55
ZIP	MLE	μ = exp[β₁x₁ + β₂x₂ + β₃t]	3276.78	3288.78	3317.69	1.32
Semiparametric models
SPPO	Pb	μ = exp[β₁x₁ + β₂x₂ + pb(t)]	3324.54	3330.55	3345.02	1.51
SPPO	Ps	μ = exp[β₁x₁ + β₂x₂ + ps(t)]	3317.64	3329.64	3358.56	1.51
SPZIP	Pb	μ = exp[β₁x₁ + β₂x₂ + pb(t)]	3174.07	3210.72	3298.998	1.17
SPZIP	Ps	μ = exp[β₁x₁ + β₂x₂ + ps(t)]	3217.52	3241.52	3300	1.18

Note: AIC: Akaike information criterion; BIC: Bayesian information criterion; DVS: deviance statistic; MLE: maximum likelihood estimator; MSE: mean squared error; Pb: P-spline; PO: Poisson; Ps: penalized smoothing; SPPO: semiparametric partially Poisson; SPZIP: semiparametric partially zero-inflated Poisson; ZIP: zero-inflated Poisson.

Figure 4.

Note: GAMLSS: generalized additive models for location, scale and shape; PO: Poisson; SPPO: semiparametric partially Poisson; SPZIP: semiparametric partially zero-inflated Poisson; ZIP: zero-inflated Poisson.

Accordingly, Table 14 presents the parameter estimates (Est.), standard errors (SEs) and p values for the SPZIP regression model via the Pb estimator. At the 5 per cent significance level, all explanatory variables are statistically significant in the count (μ) component, whereas in the zero-inflation component x₁ is not significant (p = .18). Figure 4 presents a radar plot comparing the performance of four statistical regression models (PO, ZIP, SPPO and SPZIP) via the Pb estimator across three evaluation metrics: RMSE, AIC weight (AIC_wt) and BIC weight (BIC_wt). Lower values generally indicate better model performance. The SPZIP regression model via the Pb estimator appears to be the most balanced performer, with relatively low values across all three metrics. Overall, the SPZIP regression model seems to offer a good compromise between predictive accuracy and model complexity, making it a strong contender for further analysis. Additionally, Figure 5 presents a plot of the quantile residuals against the index of observations for the fitted SPZIP regression model. The quantile residuals generally fall within the expected range of [–3.5, 3.5], with only two outliers. Furthermore, the residuals exhibit no discernible patterns, suggesting that the model adequately captures the underlying data structure.
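For readers unfamiliar with AIC weights, the sketch below computes them from the AIC values in Table 13. It assumes that AIC_wt denotes the usual Akaike weights, exp(−Δ_i/2) normalized to sum to one, where Δ_i is the AIC difference from the best model. How Figure 4 rescales these weights for plotting is not stated here, so the numbers are illustrative only.

```python
import math

# AIC values copied from Table 13
aic = {
    "PO": 3349.02,
    "ZIP": 3288.78,
    "SPPO-Pb": 3330.55,
    "SPPO-Ps": 3329.64,
    "SPZIP-Pb": 3210.72,
    "SPZIP-Ps": 3241.52,
}

best = min(aic.values())
relative = {name: math.exp(-0.5 * (value - best)) for name, value in aic.items()}
total = sum(relative.values())
weights = {name: r / total for name, r in relative.items()}

for name, w in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name:9s} {w:.6f}")
```

Under this definition the model with the smallest AIC receives the largest weight, and BIC weights follow by substituting the BIC values.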

Table 14.

Estimation Results of the Semiparametric Partially Zero-inflated Poisson (SPZIP) Model via P-spline (Pb) Estimator.

Variables	Est.	SE	p value
Mu coefficients (μ)
x₁	−0.1601	0.0465	< .0001
x₂	0.0159	0.0020	< .0001
t	0.2260	0.0220	< .0001
Zero-inflation coefficients (ω)
x₁	0.2800	0.2090	.1800
x₂	−0.1740	0.0490	< .0001
t	−0.2610	0.1100	.0180

Note: Est.: parameter estimates; SE: standard errors.

Figure 5.

Quantile Residuals for the Fitted Semiparametric Partially Zero-inflated Poisson (SPZIP) Regression Model.
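Quantile residuals for a discrete response are usually randomized: a uniform value is drawn between the fitted cumulative probabilities at y − 1 and y and then mapped through the standard normal quantile function. The sketch below shows this construction for a ZIP model with given λ and π. It assumes this standard randomized definition, which may differ in detail from the software output plotted in Figure 5, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def zip_cdf(y, lam, pi_zero):
    """P(Y <= y) for a zero-inflated Poisson with rate lam and zero mass pi_zero."""
    if y < 0:
        return 0.0
    pois = sum(math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
               for k in range(y + 1))
    return pi_zero + (1.0 - pi_zero) * min(pois, 1.0)

def quantile_residual(y, lam, pi_zero, rng):
    """Randomized quantile residual for one observed count y."""
    lower = zip_cdf(y - 1, lam, pi_zero)
    upper = zip_cdf(y, lam, pi_zero)
    u = lower + (upper - lower) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep u strictly inside (0, 1)
    return NormalDist().inv_cdf(u)

rng = random.Random(0)
example = quantile_residual(3, lam=2.0, pi_zero=0.3, rng=rng)
```

If the fitted model is adequate, such residuals behave approximately like independent standard normal draws, which is why values outside roughly ±3.5 are flagged as outliers.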

6. Conclusions

This article has introduced two extensions of count data models (SPPO and SPZIP), which build upon the PO and ZIP regression models, respectively. The key innovation lies in incorporating a nonparametric component estimated via the Pb and Ps estimators, which allows more flexible modelling of the relationship between covariates and the response variable. A Monte Carlo simulation study under different scenarios and a real-data application were conducted to evaluate the robustness of the introduced estimators. We relied on five criteria (AIC, BIC, DVS, MAE and RMSE) to identify the best-performing estimator. In both the simulation study and the real-data application, the proposed estimators consistently achieved lower values across all metrics, suggesting their superiority. Notably, the SPZIP model with the Pb estimator (ZIP.pb) proved particularly effective for analysing complex count data.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

ORCID iDs

Mohamed R. Abonazel

Muhammad M. Seliem

References

1. Nelder, Wedderburn RWM. Generalized linear models. J R Stat Soc 1972; 135(3): 370–384. doi: 10.2307/2344614

2. Abonazel, El-sayed, Saber OM. Performance of robust count regression estimators in the case of overdispersion, zero-inflated, and outliers: simulation study and application to German health data. Commun Math Biol Neurosci 2021; 2021: 55. doi: 10.28919/cmbn/5658

3. Algamal, Lukman, Abonazel, et al. Performance of the ridge and Liu estimators in the zero-inflated Bell regression model. J Math 2022; 2022: 15. doi: 10.1155/2022/9503460

4. Akram, Abonazel, Amin, et al. A new Stein estimator for the zero-inflated negative binomial regression model. Concurr Comput Pract Exp 2022; 34(19): e7045. doi: 10.1002/cpe.7045

5. Dikheel, Jouda HA. Zero inflation poisson regression estimation by using genetic algorithm (GA). J Surv Fish Sci 2023; 10(3S): 5257–5268.

6. Abdelwahab, Abonazel, Hammad, et al. Modified two-parameter Liu estimator for addressing multicollinearity in the Poisson regression model. Axioms 2024; 13(1): 46. doi: 10.3390/axioms13010046

7. Zeeshan, Khan, Amanullah, et al. A new modified biased estimator for zero inflated poisson regression model. Heliyon 2024; 10(3): e25341. doi: 10.1016/j.heliyon.2024.e25341

8. Shahsavari, Moghimbeigi, Kalhor, et al. Zero-inflated count regression models in solving challenges posed by outlier-prone data; an application to length of hospital stay. Arch Acad Emerg Med 2024; 12(1): e13. doi: 10.22037/aaem.v12i1.2074

9. Green, Yandell BS. Semi-parametric generalized linear models. In: Proceedings of the GLIM85 Conference, pages 44–55. Springer, 1985. ISBN 978-0-387-96277-1. doi: 10.1007/978-1-4615-7070-7_6

10. Carroll, Fan, Gijbels, et al. Generalized partially linear single-index models. J Am Stat Assoc 1997; 92(438): 477–489. doi: 10.1080/01621459.1997.10474001

11. Härdle, Mammen, Müller. Testing parametric versus semiparametric modeling in generalized linear models. J Am Stat Assoc 1998; 93(444): 1461–1474. doi: 10.1080/01621459.1998.10473806

12. Wood SN. Generalized additive models: an introduction with R. Chapman and Hall/CRC, 2006. ISBN 978-1-584-88474-3. doi: 10.1201/9781420010404

13. Lam, Xue, Cheung YB. Semiparametric analysis of zero-inflated count data. Biometrics 2006; 62(4): 996–1003. doi: 10.1111/j.1541-0420.2006.00575.x

14. Liang, Qin, Zhang, et al. Empirical likelihood-based inferences for generalized partially linear models. Scand J Stat 2009; 36(3): 433–443.

15. Taylan, Weber, Liu, et al. On the foundations of parameter estimation for generalized partial linear models with B-splines and continuous optimization. Comp Math Appl 2010; 60(1): 134–143. doi: 10.1016/j.camwa.2010.04.040

16. De Vera. Semiparametric Poisson regression model for clustered data. Philipp Stat 2014; 63(1): 33–42.

17. Manalaysay KCM, Barrios EB. Semiparametric principal component Poisson regression on clustered data. Commun Stat-Simul Comput 2017; 46(2): 1546–1556. doi: 10.1080/03610918.2014.1002850

18. Yousof, Gad. Bayesian estimation and inference for the generalized partial linear model. Int J Stat Probab 2015; 4(2): 51–64. doi: 10.5923/j.ijps.20150402.03

19. Seliem, Kamel, Taha, et al. A note on prior selection in Bayesian estimation. Stat Optim Inf Comput 2025; 13(2): 795–806. doi: 10.19139/soic-2310-5070-1752

20. Müller. An introduction to the estimation of GPLMs and data examples for the R gplm package, 2014. R package vignette.

21. Luts, Wand MP. Variational inference for count response semiparametric regression. Bayesian Anal 2015; 10(4): 991–1023. doi: 10.48550/arXiv.1309.4199

22. Lukusa, Lee, Li CS. Semiparametric estimation of a zero-inflated Poisson regression model with missing covariates. Metrika 2016; 79: 457–483. doi: 10.1007/s00184-015-0563-7

23. Fang, Kim, Jung. Semiparametric kernel-based regression for evaluating interaction between pathway effect and covariate. J Agric Biol Environ Stat 2018; 23(1): 129–152. doi: 10.1007/s13253-017-0317-2

24. Wurm, Rathouz. Semiparametric generalized linear models with the gldrm package. R J 2018; 10(1): 288–307. doi: 10.32614/RJ-2018-027

25. …, Wang, Zou, et al. A semi-nonparametric Poisson regression model for analyzing motor vehicle crash data. PLOS One 2018; 13(5): e0197338. doi: 10.1371/journal.pone.0197338

26. Boente, Rodriguez, Vena. Robust estimators in a generalized partly linear regression model under monotony constraints. Test 2020; 29(1): 50–89. doi: 10.1007/s11749-019-00629-7

27. Rahman, Luo, Fan, et al. Semiparametric efficient inferences for generalised partially linear models. J Nonparam Stat 2020; 32(3): 704–724. doi: 10.1080/10485252.2020.1790557

28. El-sayed, Abonazel, Seliem MM. B-spline Speckman estimator of partially linear model. Int J Sci Sci Arch 2019; 4: 53–63. doi: 10.11648/j.ijssam.20190404.12

29. Abonazel, Gad AA. Robust partial residuals estimation in semiparametric partially linear model. Commun Stat-Simul Comput 2020; 49(5): 1223–1236. doi: 10.1080/03610918.2018.1494279

30. Seliem MM. Handling outlier data as missing values by imputation methods: application of machine learning algorithms. Turkish J Comput Math Educ 2022; 13(1): 273–286. doi: 10.17762/turcomat.v13i1.12054

31. Soliman FMA, Mohamed, Abonazel MR. New robust estimators for the nonparametric regression model: application and simulation study. Int J Anal Appl 2025; 23: 163–163.

32. Ibacache-Pulgar, Figueroa-Zuniga, Marchant. Semiparametric additive beta regression models: inference and local influence diagnostics. REVSTAT-Stat J 2021; 19(2): 255–274. doi: 10.57805/revstat.v19i2.342

33. Araújo, Vasconcelos JCS, dos Santos, et al. The zero-inflated negative binomial semiparametric regression model: application to number of failing grades data. Ann Data Sci 2023; 10: 991–1006. doi: 10.1007/s40745-021-00350-z

34. Shao, Wang. Generalized partial linear models with nonignorable dropouts. Metrika 2022; 85(2): 223–252. doi: 10.1007/s00184-021-00828-z

35. Prataviera, Ortega EMM, Cordeiro, et al. The exponentiated power exponential semiparametric regression model. Commun Stat-Simul Comput 2022; 51(10): 5933–5953. doi: 10.1080/03610918.2020.1788585

36. Millard, Kanfer FH. Mixtures of semi-parametric generalized linear models. Symmetry 2022; 14(2): 409. doi: 10.3390/sym14020409

37. Vasconcelos JCS, Cordeiro, Ortega EMM. The semiparametric regression model for bimodal data with different penalized smoothers applied to climatology, ethanol and air quality data. J Appl Stat 2022; 49(1): 248–267. doi: 10.1080/02664763.2020.1803812

38. Cardozo, Paula, Vanegas LH. Generalized log-gamma additive partial linear models with P-spline smoothing. Stat Pap 2022; 63: 1953–1978. doi: 10.1007/s00362-022-01300-4

39. Vasconcelos JCS, Cordeiro, Ortega EMM, et al. A useful semiparametric regression for climatology. Commun Stat-Simul Comput 2024; 53(7): 3514–3530. doi: 10.1080/03610918.2022.2107220

40. Seliem, El-Sayed, Abonazel. Developing a semi-parametric zero-inflated beta regression model using p-splines: simulation and application. Stat Optim Inf Comput 2025; 13(3): 1103–1119.

41. Lambert. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992; 34(1): 1–14. doi: 10.2307/1269547

42. …, Fung, Zhu. Robust estimation in generalized partial linear models for clustered data. J Am Stat Assoc 2005; 100(472): 1176–1184. doi: 10.1198/016214505000000277

43. Mouatassim, Ezzahid. Poisson regression and Zero-inflated Poisson regression: application to private health insurance data. Eur Actuar J 2012; 2(2): 187–204. doi: 10.1007/s13385-012-0056-2

44. Seliem. A Comparative Study of Some Estimation Methods for Partially Linear Model. Master’s thesis. Cairo University, Giza, Egypt; 2020.

45. Goepp, Bouaziz, Nuel. Spline regression with automatic knot selection. Comput Stat Data Anal 2025; 202. doi: 10.1016/j.csda.2024.108043

46. O’Sullivan. A statistical perspective on ill-posed inverse problems. Stat Sci 1986; 1(4): 502–518. doi: 10.1214/ss/1177013525

47. Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Stat Sci 1996; 11(2): 89–121. doi: 10.1214/ss/1038425655

48. Eilers PHC, Marx. Practical smoothing: the joys of P-splines. Cambridge University Press, 2021. ISBN 978-1108793155. doi: 10.1017/9781108610247

49. Ruppert, Wand, Carroll. Semiparametric regression. Cambridge University Press, 2003. ISBN 978-0521785167. doi: 10.1017/CBO9780511755453

50. Feng, Zhu. Semiparametric analysis of longitudinal zero-inflated count data. J Multivar Anal 2011; 102(1): 61–72. doi: 10.1016/j.jmva.2010.08.001

51. Al-Taweel, Algamal. Almost unbiased ridge estimator in the zero-inflated Poisson regression model. TWMS J Appl Eng Math 2022; 12(1): 235–246. doi: belgelik.isikun.edu.tr/xmlui/handle/iubelgelik/3403