A multi-state Markov model using notification data to estimate HIV incidence, number of undiagnosed individuals living with HIV, and delay between infection and diagnosis: Illustration in France, 2008

Abstract

Thirty-five years since the discovery of the human immunodeficiency virus (HIV), the epidemic is still ongoing in France. To guide HIV prevention strategies and monitor their impact, it is essential to understand the dynamics of the HIV epidemic. The indicator for reporting the progress of new infections is the HIV incidence. Given that HIV is mainly transmitted by undiagnosed individuals and that earlier treatment leads to less HIV transmission, it is essential to know the number of infected people unaware of their HIV-positive status as well as the time between infection and diagnosis. Our approach is based on a non-homogeneous multi-state Markov model describing the progression of the HIV disease. We propose a penalized likelihood approach to estimate the HIV incidence curve as well as the diagnosis rates. The HIV incidence curve was approximated using cubic M-splines, while an approximation of the cross-validation criterion was used to estimate the smoothing parameter. In a simulation study, we evaluate the performance of the model for reconstructing the HIV incidence curve and diagnosis rates. The method is illustrated in the population of men who have sex with men using HIV surveillance data collected by the French Institute for Public Health Surveillance since 2004.

Keywords

HIV incidence, multi-state Markov model, penalized likelihood, notification data, surveillance

1 Introduction

In epidemiology, incidence is a major epidemiological indicator to assess the dynamics of a disease. For the past few years, several methods have been used to estimate human immunodeficiency virus (HIV) incidence in both developed and developing countries based on different data sources and different approaches.¹ The preferred way to estimate incidence is to conduct a prospective cohort study where individuals are followed over time and tested for seroconversion. However, such a cohort would need to follow a large number of individuals over a long period of time, which is difficult, expensive, and time-consuming. Alternative approaches exist such as repeated cross-sectional surveys using mathematical models fed by measures of prevalence^2,3 or a unique cross-sectional survey using a biomarker assay to identify recent infections.⁴ Finally, HIV or AIDS case-based surveillance is a last alternative to estimate HIV incidence.¹ If the surveillance system does not collect biological data, numerous methods can be developed such as back-calculation^5–7 and its derivatives (extended, Bayesian hierarchical based, CD4-based) or multi-state models.^8,9 If biological data (CD4 counts, biomarkers from assays for recent infection, etc.) are collected for each case, then several methods that incorporate such biological information are available.^10,11

In France, routine incidence testing with an enzyme immunoassay for recent HIV infections has been implemented.^10,11 Four models are used to estimate HIV incidence based on the French surveillance data.^6,9–11 The first two methods are based on a biological marker but include drawbacks such as uncertainties regarding the marker measurement, the time evolution of this marker since infection and the determination of a window period. The third method based on a back-calculation model also has some limitations, as it does not use the primary infection stage. For the last method, the model does not take into account changes in testing behavior over time.

To avoid these issues, we propose a multi-state model that mimics the natural history of HIV infection and distinguishes undiagnosed from newly diagnosed individuals. We adapted the approach proposed by Sommen et al.⁹ to the current French data by taking into account the primary infection stage. A penalized likelihood approach was used to obtain a smooth curve for HIV incidence. In this article, we use the term incidence to denote the number of incident cases, that is, the number of newly infected individuals. The penalized likelihood was calculated in a non-homogeneous Markov chain framework, based on a Poisson process. In other words, the diagnostic probabilities of individuals may change over time to take into account any changes in screening policies. The HIV infection curve was approximated using cubic M-splines, while an approximated cross-validation was used to estimate the penalty factor. First, this new model takes into account all the information about the clinical stage provided by a medical doctor at the time of the first positive HIV diagnosis. For the remainder of the manuscript, when we speak of diagnosis, we mean the first positive HIV diagnosis. Once individuals were diagnosed with HIV, they were considered to be captured by the surveillance system, and their subsequent states were not useful for estimating the incidence. The non-homogeneous framework allows for potential changes in HIV testing behavior over time. Second, this model estimates the time between infection and diagnosis as well as the number of people not aware of their HIV-positive status. We decided not to include biomarkers, even though they are collected in France, in order to provide a model that can be used by other countries that do not routinely collect virological data in their surveillance systems.

Finally, the method is illustrated on the population of the men who have sex with men (MSM) based on data from the French HIV surveillance system. Furthermore, a simulated data set as close as possible to the HIV mandatory notification data set was created to evaluate the performance of the model for reconstructing the HIV incidence rate. Section 2 outlines the materials taken from the HIV mandatory notification system and the simulated data set. The Markov model is presented in Section 3, and the results are described in Section 4. Finally, we discuss the improvements to be made to this model and the potential perspectives, especially for the inclusion of virological data.

2 Materials

2.1 French HIV mandatory notification system

In addition to AIDS surveillance data, the collection of data on new HIV diagnoses (new positive HIV tests) allows better follow-up of the dynamics of the epidemic. The French HIV mandatory notification system was implemented in March 2003 by the French Institute for Public Health Surveillance in collaboration with health professionals, Health Ministry, and patient associations, while the French data protection authority was called on to design a comprehensive surveillance system respectful of patient rights.¹²

HIV mandatory notifications were initiated by microbiologists until 2016 and have been initiated by medical doctors since 2017. They create a unique anonymous code for each individual. Some epidemiological and clinical information such as occupation, nationality, reason for testing, previous negatives for positive serology, clinical stage or mode of exposure is then supplied by the medical doctor who prescribes the test. At the time of diagnosis, a clinical stage for HIV is determined by a medical doctor: primary-infection, asymptomatic, symptomatic, or AIDS. Different terms are used in the literature to describe the primary-infection that defines the early stage of infection, although no universal definition is recognized. In the HIV mandatory notification system, the primary-infection stage is defined as the period of intense viral replication, during which a person may have clinical manifestations (flu symptoms, pharyngitis, rash, superficial lymphadenopathy, etc.) that begin two to six weeks after infection. The asymptomatic stage is defined as the absence of clinical symptoms or signs related to HIV infection or the presence of generalized lymphadenopathy. The symptomatic non-AIDS stage is defined as the presence of severe clinical symptoms other than those defining AIDS. The AIDS stage is the last stage of HIV infection when severe immunosuppression is associated with opportunistic diseases.

Since 2003, virological surveillance has been conducted to determine the virus type among the different HIV infection diagnoses and whether the infection is recent (less than six months). Recent infection is identified with a test that quantifies the antibodies to two markers known as TM and V3.¹³

Weekly data on new HIV case reports were extracted, along with information on whether the case had an HIV diagnosis, the week and the year of the new HIV diagnosis, and the clinical stage at diagnosis. The data cover the period from the first week of 2004 to the last week of 2018. For the years before 2015, only the month and year of diagnosis were documented. In the HIV mandatory notification data, we did not observe weekly seasonality within each month. Thus, to assign a week in the month to these diagnoses, a uniform distribution of weeks was chosen for the period before 2015. The week was then made equal to a random number from 1 to 5 or 1 to 4 according to the month. In this article, we consider all cases included in the HIV mandatory notification up to the last week of 2018. Figure 2 at the end of Section 3.3 presents a summary of the dates and time units mentioned in the model and data. The annual numbers of new HIV diagnoses were estimated from the reported cases to take into account completeness of reporting.¹⁴ Even though HIV has been a mandatory notifiable disease in France since 2004, HIV cases are not exhaustively reported. In addition, non-negligible reporting delays must be considered when estimating the number of not-yet-reported cases. So in the first step, the number of diagnoses that are or will be reported to the surveillance system is estimated by taking into account the reporting delays using the Brookmeyer approach.¹⁵ In the second step, a laboratory survey is performed. Each participating laboratory declares the number of confirmed positive tests over a time period. The number of confirmed positive tests is estimated using survey techniques. Finally, the completeness of reporting is obtained by dividing these two estimates (i.e., estimated number of reported and confirmed cases). 
Missing data were treated through multiple imputations by chained equations.¹⁶ In the HIV mandatory notification, all variables with missing data are imputed for different analyses. Some imputed variables are not used for our analysis. For some variables, the proportion of missing data can reach 75%. Bodner¹⁷ and White et al.¹⁶ proposed the rule of thumb that the number of imputed data sets should be at least equal to the percentage of incomplete cases. This is why 75 complete databases are imputed and used. The 75 complete databases are constructed in two stages. The first step of the imputation creates 5 complete databases, while the second step creates 15 complete databases for each of the 5 complete databases from the first step. Thus, 75 complete databases were created, and the estimates and their variances were obtained through a combined analysis of these databases according to Rubin’s rules.¹⁸

2.2 Simulated data

The simulation study aimed to create a realistic data set as close as possible to the HIV mandatory notification data set in order to evaluate the performance of the model (described in Section 3) for estimating the HIV incidence and the other epidemiological indicators of interest (number of diagnosed/undiagnosed individuals). We wanted to create a simulation study that was not dependent on the model so as to be able to compare different models in future research. The simulation study involved several steps. First, we wanted to simulate an incidence between 1981, which we consider to be the start of the epidemic, and 2018 using different sources of information (graphs, estimates, literature). This justified our choice to simulate the incidence (hereafter, the simulated incidence) in the following periods:

1981–2003: We considered that the HIV epidemic started in 1981 in France. We do not have year-to-year HIV incidence values for this period, although we do have an idea of the shape of the curve and the peak value in 1987. First, we generated the number of new HIV infection cases from a theoretical incidence beginning at 0 in 1981 and rising to 15,000 in 1987, the plausible date of the peak, using a linear trend. We then generated the annual numbers of new HIV infection cases until 2004 using another linear relationship, decreasing from 15,000 to 6038, which was the number of new HIV infection cases in 2004 estimated from the last published estimation based on the French HIV mandatory notification data.⁶ Although we do not have the incidence estimates before 2004, it is essential to simulate the incidence since the beginning of the epidemic because infected cases are not diagnosed until several years later and possibly after 2004, when the HIV mandatory notification began. Indeed, a person infected in a given year may be diagnosed and reported several years later. If we started the simulation of incident cases in 2004, we would have insufficient new diagnoses in 2004 and excessive new diagnoses for the following years. The simulation of the number of incident cases before 2004 is approximate, but it provides sufficient information to obtain a number of new diagnoses close to those observed in the HIV mandatory notification database for the years after 2004.

2004–2015: We used the annual numbers of new HIV infection cases estimated from a previous study.⁶ We noted λ_i as the number of new HIV infection cases for the year i ( $i = 2004, \dots, 2015$ ). Then for each year i, we generated the number of new HIV infection cases, noted y_i, according to a Poisson distribution such that $y_{i} \sim Poisson (λ_{i})$ . We hereafter call λ_i the theoretical incidence for year i.

2016–2018: The numbers of new HIV infection cases were simulated according to annual variation rates of −5% over the period.
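The three periods above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the function names are hypothetical, the 2005–2015 values (which come from a previous study and are not reproduced here) are replaced by an arbitrary placeholder, the −5% rate is read as a year-on-year change, and years outside 2004–2015 simply use the rounded theoretical value, which is an assumption.

```python
import numpy as np

def theoretical_incidence(lam_2005_2015):
    """Return {year: theoretical number of new infections} for 1981-2018."""
    lam = {}
    # 1981-1987: linear rise from 0 to the assumed peak of 15,000.
    for year, value in zip(range(1981, 1988), np.linspace(0.0, 15000.0, 7)):
        lam[year] = float(value)
    # 1987-2004: linear decline from 15,000 to 6038.
    for year, value in zip(range(1987, 2005), np.linspace(15000.0, 6038.0, 18)):
        lam[year] = float(value)
    # 2005-2015: externally supplied estimates (placeholder in this sketch).
    lam.update(lam_2005_2015)
    # 2016-2018: assumed year-on-year variation of -5%.
    for year in range(2016, 2019):
        lam[year] = 0.95 * lam[year - 1]
    return lam

def simulate_cases(lam, rng):
    """Draw y_i ~ Poisson(lambda_i) for 2004-2015; round lambda_i elsewhere (assumption)."""
    cases = {}
    for year, value in lam.items():
        if 2004 <= year <= 2015:
            cases[year] = int(rng.poisson(value))
        else:
            cases[year] = int(round(value))
    return cases

# Hypothetical placeholder values, NOT the published estimates.
placeholder = {year: 6000.0 for year in range(2005, 2016)}
lam = theoretical_incidence(placeholder)
cases = simulate_cases(lam, np.random.default_rng(2018))
```

Fixing the seed of the random generator makes the simulated series reproducible from one run to the next.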

We then simulated the time between infection and the first HIV-positive diagnosis (simply known as diagnosis for the rest of the article) which depended on both the clinical stage at the time of diagnosis and the testing behavior. As the distribution of the four HIV stages at diagnosis used in the HIV mandatory notification did not greatly vary over time, we chose the mean distribution over the period: 8.3% primary infection, 61.6% asymptomatic, 13% symptomatic, and 17.1% AIDS. For each individual, a clinical stage at HIV diagnosis was randomly assigned under the constraint of respecting this distribution, while the diagnosis date was simulated from the date of HIV infection according to a distribution specific to the clinical stage at diagnosis. Because the primary infection, symptomatic, and AIDS stages are associated with symptoms, the reason for diagnosis at these stages is mainly based on the presence of symptoms rather than test behavior. For the asymptomatic stage, simulated HIV test dates depend on the frequency of testing among diagnosed individuals because this stage is not associated with symptoms. For the individuals diagnosed at the asymptomatic stage, some are tested occasionally for various reasons including occasional risk-taking, whereas others are tested regularly because they engage in risky practices. Indeed, individuals have different HIV testing behaviors, and we distinguished between regular and non-regular testers. As the proportion of regular testers in the HIV mandatory notification has been stable since 2004, we chose a mean proportion of 23% of regular testers and used the following definition:¹⁹

A regular tester had his last negative HIV test in the two years prior to his positive test date.

A non-regular tester had his last negative HIV test more than two years prior to his positive test date or did not have a previous negative HIV test.

The two-year threshold is compatible with the observed delay between the last negative test and the positive diagnosis times in the HIV mandatory surveillance with a median of 18 months (1.5 years) and an average of 28 months (2.3 years).

Times from infection to diagnosis were simulated according to a distribution that depended on the stage and test behavior: for example, uniform for primary infection and Weibull for AIDS (see Supplemental Material for more details).
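As a rough illustration of this step, the sketch below assigns a clinical stage and a testing behavior to each simulated individual and then draws a delay. The stage proportions and the 23% share of regular testers are taken from the text; everything else is an assumption made for the example: the function name, the choice to apply the regular-tester share to asymptomatic diagnoses only, independent random draws instead of exact proportions, and all distribution parameters (the actual ones are given in the Supplemental Material and are not reproduced here).

```python
import numpy as np

STAGES = ["primary", "asymptomatic", "symptomatic", "AIDS"]
STAGE_PROBS = [0.083, 0.616, 0.130, 0.171]  # mean distribution reported in the text
P_REGULAR = 0.23                            # share of regular testers reported in the text

def simulate_diagnosis(infection_date, rng):
    """Simulate stage, testing behavior and diagnosis date (in years) for one person."""
    stage = str(rng.choice(STAGES, p=STAGE_PROBS))
    regular = None
    if stage == "primary":
        # Uniform delay; the bounds (2 to 6 weeks) are hypothetical.
        delay = rng.uniform(2 / 52, 6 / 52)
    elif stage == "asymptomatic":
        # Delay driven by testing behavior; exponential means are hypothetical.
        regular = bool(rng.random() < P_REGULAR)
        delay = rng.exponential(1.0 if regular else 4.0)
    elif stage == "symptomatic":
        # Placeholder gamma distribution (shape 2, scale 3), hypothetical.
        delay = rng.gamma(2.0, 3.0)
    else:
        # Weibull delay for AIDS; shape 2 and scale 10 years are hypothetical.
        delay = 10.0 * rng.weibull(2.0)
    return {"stage": stage, "regular_tester": regular,
            "infection": infection_date, "diagnosis": infection_date + delay}

rng = np.random.default_rng(0)
sample = [simulate_diagnosis(2010.0, rng) for _ in range(20000)]
share = {s: sum(r["stage"] == s for r in sample) / len(sample) for s in STAGES}
```

With a large sample, the empirical shares in `share` should be close to the target distribution, which is a quick sanity check on the simulated database.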

In this simulated database, we have a theoretical incidence, and for each individual, his/her date of diagnosis, clinical stage at diagnosis, date of infection, and testing behavior. The simulation study is described in Supplemental Material in greater detail.

3 Model

3.1 Description of the model

The multi-state model used to describe the progression of HIV infection, diagnosis, and pre-AIDS mortality is illustrated in Figure 1. We assumed that newly infected individuals enter state 1 in continuous time according to a Poisson process with intensity $ν (t)$ . As the transition intensities may vary over time, we considered a non-homogeneous Markov process. Infected individuals progress successively through four clinical stages: (1) the primary infection stage defined as the period of intense viral replication starting two to six weeks after infection; (2) the asymptomatic stage defined as the absence of clinical symptoms or signs related to HIV infection or the presence of generalized lymphadenopathy; (3) the symptomatic non-AIDS stage defined as the presence of severe clinical symptoms other than those defining AIDS;²⁰ and (4) AIDS. The first level corresponds to the natural history of HIV, without medical care or diagnosis, apart from entry in the AIDS stage which inevitably leads to an HIV diagnosis. Transitions from the first to the second level represent patients’ access to diagnosis. In theory, all individuals have identical access to diagnosis, but in practice, this is not the case. Indeed, symptoms related to HIV are not specific, and their intensity differs from one individual to another; they may even go away without the doctor’s intervention in some cases. Once infected, an individual can be tested at any stage of the disease depending on symptoms or test behavior. At the HIV diagnosis time, the doctor completes information on the individual’s condition and determines the stage of the infection: primary infection, asymptomatic, symptomatic without AIDS, or AIDS. Furthermore, some individuals do not wish to be tested after taking a risk, as they fear knowing the result of the test. These reasons, among others, explain why the diagnosis is performed at different stages of the disease.
Individuals at the primary infection stage (state 1) can stay at this stage, be diagnosed (with transition probability d₁) and reach state 5, progress to the next stage of the disease (with transition probability ρ₁₂) and reach state 2, or die (with transition probability m₁) and reach state 8. Individuals at the asymptomatic stage (state 2) can stay in this stage, be diagnosed (with transition probability d₂) and reach state 6, progress to the next stage of the disease (with transition probability ρ₂₃) and reach state 3, or die (with transition probability m₂) and reach state 8. Individuals at the symptomatic stage (state 3) can stay in state 3, be diagnosed (with transition probability d₃) and reach state 7, progress to the AIDS stage (with transition probability ρ₃₄) and reach state 4, or die (with transition probability m₃) and reach state 8. We do not distinguish between undiagnosed and diagnosed AIDS stage, because the AIDS stage is the last stage of HIV infection, defined as severe HIV infection and an opportunistic disease. The diagnosis at the AIDS stage is given at the time of discovering the opportunistic disease, and as the opportunistic diseases associated with the AIDS stage present very severe symptoms, it is considered that the AIDS stage and the HIV diagnosis occur simultaneously.

Figure 1.

Multi-state Markov model describing the progression of HIV infection. The first level corresponds to the natural history of HIV with $ν (t)$ representing the number of new HIV infections at time t. The second level represents HIV diagnoses. The circled states correspond to the states for which data are available.

The HIV mandatory notification system provides information on the number of individuals entering states 4, 5, 6, and 7. The individuals observed in state 4 are individuals with an initial HIV-positive diagnosis at the AIDS stage.

The main objective is to estimate the transition intensity $ν (t)$ of the non-homogeneous Poisson process based on the HIV mandatory notification data in France. The transition intensities at the first level are known from previous studies.^7,21 However, pre-AIDS mortality is very difficult to estimate since individuals die before a diagnosis is made. For this reason, we considered the pre-AIDS mortality rate to be the same as in the general population. We therefore considered that all three HIV mortality rates were equal in the model.

To take into account the temporal evolution of the HIV epidemic, the Markov process is assumed to be non-homogeneous, meaning that the transition probabilities may vary over time. Given that new infections occur according to a Poisson process, the new entries in states 4, 5, 6, and 7 were also Poisson processes. Once individuals reached these states (4, 5, 6, and 7), they were considered to be captured by the surveillance system, and their subsequent states were not useful for estimating the incidence. For this reason, we considered here that states 4, 5, 6, and 7 were absorbing states. In this model, we do not take into account the age of the subjects. According to the literature,^22–24 introducing age can improve the estimates, but not taking it into account does not induce biased estimates if the study population and the distribution of incubation times are representative of the population. We chose our strata according to their screening heterogeneity and the indicators useful for monitoring prevention policies in France. Thus, we stratified our analyses by risk groups: MSM, heterosexuals born in France, and heterosexuals born abroad by sex (only MSM are shown in this article). Furthermore, adding age at this point would add more complexity to the stratification of the model. However, our model could be extended to take into account age in a similar way to the work of Brizzi et al.²⁵ although this represents a new study in itself.

3.2 Likelihood of the model

We adapted the approach proposed by Sommen et al.^7,9 to the French HIV surveillance data by taking into account the primary infection stage to improve the HIV estimates in the recent period. We proposed a likelihood function of the model described in the previous section in Figure 1. As we worked in discrete time, transition intensities can be replaced by their corresponding transition probabilities for a step of the discrete-time Markov chain. We consider a discrete-time partition $T_{i} = (t_{i - 1}, t_{i}], i = 0, 1, \dots, K$ .

Let P_i be the matrix of the discrete Markov chain transition probabilities for the time interval T_i, with t₁ being the first week of 1994, t_S being the first week of 2004, and t_K being the last week of 2018. The incidence is estimated in the period 1994–2018 because we assume that the HIV diagnostic data from 2004 potentially contain information to estimate the number of infected subjects up to 10 years earlier. We can note $P_{i} = (α_{k, l}^{i})$ , where $α_{k, l}^{i}$ is the transition probability from state k to state l between $t_{i - 1}$ and t_i with $k, l = 1, 2, \dots, 8$ . In each latent HIV state, one of the three following events could happen: disease progression, diagnosis, or death.

P_{i} = \begin{pmatrix} α_{1, 1}^{i} & α_{1, 2}^{i} & 0 & 0 & α_{1, 5}^{i} & 0 & 0 & α_{1, 8}^{i} \\ 0 & α_{2, 2}^{i} & α_{2, 3}^{i} & 0 & 0 & α_{2, 6}^{i} & 0 & α_{2, 8}^{i} \\ 0 & 0 & α_{3, 3}^{i} & α_{3, 4}^{i} & 0 & 0 & α_{3, 7}^{i} & α_{3, 8}^{i} \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}

The elements of the matrix P can be defined as follows⁵

α_{k, l}^{i} = \begin{cases} (1 - d_{k}^{i}) (1 - ρ_{k, k + 1}) (1 - m_{k}) & \text{if } k = l \\ (1 - d_{k}^{i}) ρ_{k, k + 1} & \text{if } k = l - 1 \\ d_{k}^{i} & \text{if } k = l - 4 \\ (1 - d_{k}^{i}) (1 - ρ_{k, k + 1}) m_{k} & \text{if } l = 8 \text{ and } k = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases}
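To make the structure of the transition matrix concrete, the following sketch builds it for one time interval from the diagnosis, progression, and mortality probabilities of the three latent states, and the rows can then be checked to sum to one. The function name and the numerical values are hypothetical and chosen only for illustration; states 1 to 8 are stored at indices 0 to 7.

```python
import numpy as np

def transition_matrix(d, rho, m):
    """Build the 8x8 matrix for one interval.

    d, rho, m: length-3 sequences giving, for latent states 1, 2, 3, the
    diagnosis probability, the progression probability to the next stage,
    and the death probability.
    """
    P = np.zeros((8, 8))
    for k in range(3):                       # latent states 1, 2, 3
        undiagnosed = 1.0 - d[k]
        P[k, k] = undiagnosed * (1.0 - rho[k]) * (1.0 - m[k])  # stay
        P[k, k + 1] = undiagnosed * rho[k]                     # progress
        P[k, k + 4] = d[k]                                     # diagnosis
        P[k, 7] = undiagnosed * (1.0 - rho[k]) * m[k]          # death
    for k in range(3, 8):                    # states 4 to 8 are absorbing
        P[k, k] = 1.0
    return P

# Hypothetical weekly probabilities, for illustration only.
d = [0.02, 0.01, 0.03]
rho = [0.1, 0.002, 0.005]
m = [1e-4, 1e-4, 1e-4]
P = transition_matrix(d, rho, m)
```

Each of the first three rows sums to one because the four outcomes (stay, progress, diagnosis, death) partition the event space once diagnosis is considered first, then progression, then death.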

For the discrete-time approximation, we assume that new infections enter at state 1 at each week T_i. The cumulative HIV incidence in $T_{i} = (t_{i - 1}, t_{i}], i = 1, \dots, K$ is expressed as $h_{i} = \int_{t_{i - 1}}^{t_{i}} ν (x) d x$ . Let H_i be an eight-dimensional column vector with h_i as the first component, and all other components equal to 0. We can express the expected number of individuals in states 1 to 8 at time t_i by introducing the vector $E_{i} = (E_{i, l}), l = 1, \dots, 8$ defined by

\begin{cases} E_{0} = H_{0} \\ E_{i} = P_{i}^{T} E_{i - 1} + H_{i}, & i = 1, 2, \dots, K \end{cases}

The HIV mandatory notification provides information on the observed number of HIV diagnoses. We denote by $n_{i}^{l}$ the observed number of individuals diagnosed at stage l, $l = 4, 5, 6, 7$ in week $T_{i}, i = 1, \dots, K$ . We denote by $e_{i}^{4} = E_{i - 1, 3} α_{3, 4}^{i}$ the expected number of new AIDS diagnoses (entries in state 4), and by $e_{i}^{l} = E_{i - 1, l - 4} α_{l - 4, l}^{i}$ the expected number of new diagnoses at stage l, $l = 5, 6, 7$ (entries in state 5, state 6, and state 7), in week $T_{i}, i = 1, \dots, K$ .

Using the Poisson assumption, the likelihood of the model, for the estimation of the HIV incidence curve $ν (t)$ can be expressed as follows

\begin{array}{l} L = \prod_{i = S}^{K} {{(e_{i}^{4})}^{n_{i}^{4}} \exp (- e_{i}^{4})} \times {{(e_{i}^{5})}^{n_{i}^{5}} \exp (- e_{i}^{5})} \times {{(e_{i}^{6})}^{n_{i}^{6}} \exp (- e_{i}^{6})} \\ \times {{(e_{i}^{7})}^{n_{i}^{7}} \exp (- e_{i}^{7})} = \prod_{i = S}^{K} \prod_{j = 4}^{7} {(e_{i}^{j})}^{n_{i}^{j}} \exp (- e_{i}^{j}) \end{array}

The product starts at time T_S, which corresponds to the first week of the year 2004 (start of data collection), and ends at time T_K, which corresponds to the last week of 2018. The log-likelihood function l, up to an additive constant that does not depend on the parameters, is expressed as

l = \sum_{i = S}^{K} \sum_{j = 4}^{7} \left[ n_{i}^{j} \log (e_{i}^{j}) - e_{i}^{j} \right] = \sum_{i = S}^{K} l_{i}
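The recursion for the expected state counts and the Poisson log-likelihood can be sketched as follows. This is an illustrative implementation under simplifying assumptions, not the estimation code of the article: the function names are hypothetical, the transition matrix is the same for every week, the weekly incidence h_i is constant, and all numerical values are arbitrary.

```python
import numpy as np

def transition_matrix(d, rho, m):
    """8x8 matrix for one interval (states 1..8 stored at indices 0..7)."""
    P = np.zeros((8, 8))
    for k in range(3):
        P[k, k] = (1 - d[k]) * (1 - rho[k]) * (1 - m[k])
        P[k, k + 1] = (1 - d[k]) * rho[k]
        P[k, k + 4] = d[k]
        P[k, 7] = (1 - d[k]) * (1 - rho[k]) * m[k]
    P[3:, 3:] = np.eye(5)
    return P

def expected_new_diagnoses(P_list, h):
    """Return e with e[i-1] = (e_i^4, e_i^5, e_i^6, e_i^7) for weeks i = 1..K.

    P_list[i-1] is P_i and h[i] is the cumulative incidence h_i, i = 0..K.
    """
    K = len(P_list)
    E = np.zeros(8)
    E[0] = h[0]                               # E_0 = H_0
    e = np.zeros((K, 4))
    for i in range(1, K + 1):
        P = P_list[i - 1]
        e[i - 1, 0] = E[2] * P[2, 3]          # entries in state 4 (AIDS)
        for k in range(3):                    # entries in states 5, 6, 7
            e[i - 1, k + 1] = E[k] * P[k, k + 4]
        E = P.T @ E                           # E_i = P_i^T E_{i-1} ...
        E[0] += h[i]                          # ... + H_i
    return e

def log_likelihood(n, e, S):
    """Sum over weeks S..K and states 4..7 of n*log(e) - e."""
    n_obs, e_obs = n[S - 1:], e[S - 1:]
    return float(np.sum(n_obs * np.log(e_obs) - e_obs))

# Hypothetical setting: 30 weeks, observations from week 10 onward.
K, S = 30, 10
P_list = [transition_matrix([0.02, 0.01, 0.03], [0.1, 0.002, 0.005], [1e-4] * 3)] * K
h = np.full(K + 1, 100.0)
e = expected_new_diagnoses(P_list, h)
n = np.random.default_rng(1).poisson(e)       # simulated observed counts
ll = log_likelihood(n, e, S)
```

Starting the sum at a week S well after the first interval mirrors the article's setup and also guarantees here that every expected count is strictly positive before its logarithm is taken.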

3.3 Penalized likelihood

It is desirable to produce a smooth estimate of the curve representing the incidence $ν (t)$ so that the function $ν (t)$ has no negative values, is continuous, and in the case of the HIV epidemic, has small local variations. A widely used method is to penalize the likelihood with a penalty term consisting of a penalty factor and a penalty function, which favors estimates with the desired smoothness. The penalty factor λ is used as a smoothing parameter that determines the degree of smoothing of the infection curve $ν (t)$ . The larger the value of λ, the smoother the infection curve. We used the penalization function $\int ν ″ {(u)}^{2} d u$ which is based on the second derivative of $ν (t)$ because this penalization controls the curvature of the function.^26,27 The penalty term has the following form

λ \int ν ″ {(u)}^{2} d u

We consider pl to be the penalized log-likelihood of the form

p l = l - λ \int ν ″ {(u)}^{2} d u

Estimated parameters are the HIV diagnostic probabilities vectors $d_{1} = (d_{1}^{1}, d_{1}^{2}, \dots, d_{1}^{K}), d_{2} = (d_{2}^{1}, d_{2}^{2}, \dots, d_{2}^{K})$ and $d_{3} = (d_{3}^{1}, d_{3}^{2}, \dots, d_{3}^{K})$ at different intervals as well as the HIV incidence curve $ν (t)$ . For a fixed value of λ, the maximization of the penalized log-likelihood pl in $Θ_{λ} = (ν (.), d_{1}, d_{2}, d_{3})$ provides the penalized maximum likelihood estimators $\hat{ν} (.), {\hat{d}}_{1}, {\hat{d}}_{2}$ , and ${\hat{d}}_{3}$ for the model illustrated in Figure 1. We consider that the other parameters $ρ_{1, 2}, ρ_{2, 3}, ρ_{3, 4}$ , m₁, m₂, and m₃ are known and do not vary over time.

Since the estimator of the incidence curve $\hat{ν} (.)$ is only known implicitly as the maximizer of the penalized likelihood, we propose using a cubic M-spline function of order 4 to obtain an approximation of the estimator $\hat{ν} (.)$ of the $ν (.)$ infection curve, which we denote $\tilde{ν} (.)$ . The choice of the order is dictated by the penalty, since we differentiate the M-splines twice. To obtain a continuous and non-zero function for the second derivative, it is necessary to choose polynomials of degree 3, which corresponds to splines of order 4.

These splines are easy to manipulate because, like polynomials, they are easily differentiable. A spline function is entirely defined in a given interval [L;U] by a sequence of a limited number (Q) of increasing knots $(L = τ_{1}, τ_{2}, \dots, τ_{Q} = U)$ and by the vector of the corresponding coefficients $θ = (θ_{1}, θ_{2}, \dots, θ_{Q + 2})$ . For the cubic splines, we consider that Q + 2 parameters are needed to approach $\hat{ν} (.)$ . In the case of cubic M-splines, the sequence of knots is defined such that $τ_{1} \leq τ_{2} \leq \dots \leq τ_{Q}, τ_{1} = τ_{2} = τ_{3} = τ_{4} = L$ , and $U = τ_{Q} = τ_{Q - 1} = τ_{Q - 2} = τ_{Q - 3}$ (see²⁸ for more details). An M-spline of order k is defined by recurrence according to the following function

M_{i} (x | k, τ) = \frac{k [(x - τ_{i}) M_{i} (x | k - 1, τ) + (τ_{i + k} - x) M_{i + 1} (x | k - 1, τ)]}{(k - 1) (τ_{i + k} - τ_{i})}, τ_{i} \leq x \leq τ_{i + 1}

with

M_{i} (x | 1, τ) = \begin{cases} \frac{1}{τ_{i + 1} - τ_{i}} & \text{if } τ_{i} \leq x \leq τ_{i + 1} \\ 0 & \text{otherwise} \end{cases}

Each M_i is zero outside the interval $[τ_{i}, τ_{i + k})$ and non-zero over k intervals. In addition, for each interval, there are k non-zero M-splines. The penalized maximum likelihood estimator $\hat{ν} (.)$ is approximated by a linear combination of (Q + 2) M-splines

\tilde{ν} (.) = \sum_{j = 1}^{Q + 2} θ_{j} M_{j} (.)
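The recurrence and the linear combination can be illustrated with a short sketch. It is only one possible implementation, written under explicit assumptions: the function names, the knot positions, and the coefficients θ are hypothetical, the Q distinct knots are augmented by repeating each boundary knot so that it appears four times, and the order-1 splines use half-open intervals (a common convention that avoids counting a knot twice, so the basis is evaluated on [L, U) only).

```python
import numpy as np

def mspline(x, i, k, t):
    """M_i(x | k, t) by recurrence on the order k (0-based index i)."""
    if k == 1:
        # Half-open interval; repeated knots give an empty interval and 0.
        return 1.0 / (t[i + 1] - t[i]) if t[i] <= x < t[i + 1] else 0.0
    span = t[i + k] - t[i]
    if span == 0.0:
        return 0.0
    left = (x - t[i]) * mspline(x, i, k - 1, t)
    right = (t[i + k] - x) * mspline(x, i + 1, k - 1, t)
    return k * (left + right) / ((k - 1) * span)

def mspline_basis(x, knots, order=4):
    """Values at x of the Q + 2 cubic M-splines built on Q distinct knots."""
    knots = np.asarray(knots, dtype=float)
    # Repeat each boundary knot so that it appears `order` times.
    t = np.concatenate([np.repeat(knots[0], order - 1), knots,
                        np.repeat(knots[-1], order - 1)])
    return np.array([mspline(x, i, order, t) for i in range(len(t) - order)])

def nu_tilde(x, theta, knots):
    """Spline approximation of the incidence: sum_j theta_j * M_j(x)."""
    return float(np.dot(theta, mspline_basis(x, knots)))

# Hypothetical example: Q = 5 knots on [0, 10], hence 7 coefficients.
knots = [0.0, 2.5, 5.0, 7.5, 10.0]
theta = np.array([50.0, 80.0, 120.0, 100.0, 90.0, 70.0, 60.0])
value = nu_tilde(3.0, theta, knots)
```

Because each M-spline is non-negative and integrates to one, non-negative coefficients θ_j guarantee a non-negative curve, and the sum of the θ_j equals the integral of the approximated incidence over [L, U].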

We want to estimate the parameter vector ${\hat{Θ}}_{λ} = (\hat{θ}, {\hat{d}}_{1}, {\hat{d}}_{2}, {\hat{d}}_{3})$ , which maximizes the penalized log-likelihood pl.

It is not possible a priori to estimate the magnitude of the value taken by the smoothing parameter λ. An automatic method can therefore be used to obtain a value of the smoothing parameter more rapidly. One technique to choose the smoothing parameter is the cross-validation method. This involves minimizing the following function^29–31

V (λ) = - \frac{1}{K} \sum_{i = 1}^{K} l_{i} ({\hat{Θ}}_{λ}^{- i})

where K represents the number of intervals, ${\hat{Θ}}_{λ}^{- i}$ is the penalized maximum likelihood estimator of $Θ_{λ}$ for the sample deprived of the ith interval, and $l_{i} (.)$ is the log-likelihood of the interval i. Minimizing V allows us to obtain an optimal approximation of λ, although the required computation time is very long since it is necessary to estimate a function for each observation. To bypass this problem, we use an approximation of $V (λ)$ denoted $\bar{V} (λ)$ , defined by^29,31

\bar{V} (λ) = - \frac{1}{K} \left( l ({\hat{Θ}}_{λ}) - \operatorname{tr} ({\hat{G}}^{- 1} \hat{H}) \right)

with l as the log-likelihood,

\hat{H} = - \frac{\partial^{2} l}{\partial Θ_{λ}^{2}} ({\hat{Θ}}_{λ})

and

\hat{G} = - \frac{\partial^{2} p l}{\partial Θ_{λ}^{2}} ({\hat{Θ}}_{λ})
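The role of the trace term can be illustrated numerically. The sketch below is a toy example and not the article's implementation: the function names are hypothetical, the log-likelihood value and the matrix H are arbitrary, only spline coefficients are considered, and the penalty is assumed to be a quadratic form θᵀΩθ, so that G = H + 2λΩ. In a real fit, both the log-likelihood and H would be re-evaluated at the estimate obtained for each λ.

```python
import numpy as np

def approx_cv_score(loglik, H, G, K):
    """Return (V_bar, trace term) with V_bar = -(l - tr(G^{-1} H)) / K."""
    # Solve G X = H instead of forming the inverse of G explicitly.
    trace_term = float(np.trace(np.linalg.solve(G, H)))
    return -(loglik - trace_term) / K, trace_term

# Toy setting with 5 parameters (all values hypothetical).
p = 5
H = np.diag([4.0, 6.0, 8.0, 6.0, 4.0])   # minus Hessian of the log-likelihood
D = np.diff(np.eye(p), n=2, axis=0)      # second-difference operator
Omega = D.T @ D                          # discrete stand-in for the curvature penalty

def trace_for(lam):
    """Trace term when pl = l - lam * theta' Omega theta (assumed form)."""
    G = H + 2.0 * lam * Omega            # minus Hessian of the penalized log-likelihood
    return approx_cv_score(-120.0, H, G, K=50)[1]

traces = [trace_for(lam) for lam in (0.0, 1.0, 10.0)]
```

Without penalty the trace equals the number of parameters, and it shrinks as λ grows, so the trace term behaves like an effective number of parameters that is subtracted from the log-likelihood.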

A binary search algorithm is used for the minimization of the approximate cross-validation score. Once the value of the parameter λ is estimated, this value is integrated into the penalized likelihood pl which is then maximized using the Marquardt algorithm.³² Bayesian methods for obtaining pointwise confidence bands when the approximation of the function to be estimated is obtained by combinations of splines were introduced by Wahba.³³ An approximate 95% Bayesian confidence interval for $\hat{ν} (t)$ is given by: $\tilde{ν} (t) \pm 1.96 \sqrt{M {(t)}^{T} {\hat{I}}_{θ}^{- 1} M (t)}$ where $M (t) = {(M_{1} (t), \dots, M_{Q + 2} (t))}^{T}$ is the M-splines vector and ${\hat{I}}_{θ} = - \frac{\partial^{2} p l}{\partial θ^{2}} (\hat{θ})$ . Figure 2 presents a summary of the dates and time units mentioned in the model and in the data used in Section 4.

Figure 2.

Summary of the dates and time units mentioned in the model and the data. The dotted square corresponds to the period where the results are presented. WW-MM-YY: Week-Month-Year.

3.4 Delay from infection to diagnosis

From the model presented in Section 3.1, we can estimate the HIV incidence, the number $E_{K, 1}, E_{K, 2}, E_{K, 3}$ of undiagnosed individuals at the primary infection, asymptomatic, and symptomatic stages at time T_K, as well as the cumulative distribution function (CDF) of the delay from infection to diagnosis. In the non-homogeneous model considered here, this distribution depends on the infection time. To calculate these delays, we only consider individuals who are diagnosed at the primary infection, asymptomatic, or symptomatic stage. Let D_s be the delay from infection to diagnosis for individuals infected at time s. For simplicity, we first consider a continuous-time process and then return to the discrete-time process. Under these assumptions, the cumulative distribution function of D_s can be defined as follows

P (D_{s} \leq d) = \int_{0}^{d} d P_{15} (s, s + t) + d P_{16} (s, s + t) + d P_{17} (s, s + t)

We assume that the transition intensities are constant in a reduced number M of time periods W_j defined by $W_{j} = (τ_{j - 1}, τ_{j}], j = 1, 2, \dots, M$ but can vary between periods. The matrix $P (s, s + t)$ can be calculated as follows

P (s, s + t) = P^{(k_{s} + 1)} (s, τ_{k_{s} + 1}) P^{(k_{s} + 2)} (τ_{k_{s} + 1}, τ_{k_{s} + 2}) \times \dots \times P^{(k_{s + t} + 1)} (τ_{k_{s + t}}, s + t)

where k_s and $k_{s + t}$ are defined such that $τ_{k_{s}} < s \leq τ_{k_{s} + 1}$ and $τ_{k_{s + t}} < s + t \leq τ_{k_{s + t} + 1}$ .
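The product above can be sketched in discrete (weekly) time as follows. This is a minimal illustration under stated assumptions: the names `period_index` and `transition_matrix`, the breakpoints, and the two-state toy matrices are hypothetical, and each period is represented by a single weekly one-step matrix that is applied once per week falling in that period.

```python
import numpy as np

def period_index(week, breakpoints):
    """Index j such that breakpoints[j] < week <= breakpoints[j + 1]."""
    return int(np.searchsorted(breakpoints, week, side="left")) - 1

def transition_matrix(s, t, breakpoints, one_step):
    """P(s, s + t) as an ordered product of weekly one-step matrices."""
    P = np.eye(one_step[0].shape[0])
    for week in range(s + 1, s + t + 1):
        # Each week uses the one-step matrix of the period it belongs to.
        P = P @ one_step[period_index(week, breakpoints)]
    return P

# Toy example (illustrative values): two periods, (0, 10] and (10, 20].
breakpoints = np.array([0, 10, 20])
A = np.array([[0.9, 0.1], [0.0, 1.0]])
B = np.array([[0.8, 0.2], [0.0, 1.0]])
P = transition_matrix(8, 4, breakpoints, [A, B])  # weeks 9-10 use A, 11-12 use B
```

When the interval (s, s + t] lies within a single period, the product reduces to a matrix power, which is the homogeneous case.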

Each matrix in the product can be calculated as a function of a transition intensity matrix which is constant in the corresponding period W_j. The quantities $P_{15} (s, s + t), P_{16} (s, s + t)$ , and $P_{17} (s, s + t)$ are components of the matrix $P (s, s + t)$ . In a homogeneous model, $P (s, s + t) = P (t)$ , so the distribution of the delay D from infection to diagnosis does not depend on the time of infection, and its CDF can be simply written as follows

P (D \leq d) = \int_{0}^{d} d P_{15} (t) + d P_{16} (t) + d P_{17} (t)

The calculation of the cumulative distribution function $P (D \leq d)$ of the delay D from infection to diagnosis is described in further detail in Supplemental Material.
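To make the homogeneous case concrete, the following Python sketch computes a delay CDF in weekly time from a single one-step transition matrix. It is only an illustration under several assumptions that are not confirmed by the article: the state layout (three undiagnosed stages, AIDS, three diagnosed states, and an absorbing pre-AIDS death state), the normalization by the probability of being diagnosed before AIDS, and the function names are all hypothetical. The numerical inputs are the W₅ values of Table 3. The sketch is not intended to reproduce the reported medians, whose computation is described in the Supplemental Material.

```python
import numpy as np

def one_step_matrix(d, rho, m):
    """Weekly transition matrix (assumed layout): states 0-2 are the
    undiagnosed stages, 3 is AIDS, 4-6 are diagnosis at stages 1-3,
    and 7 is pre-AIDS death."""
    P = np.zeros((8, 8))
    for k in range(3):
        P[k, k + 1] = rho[k]   # progression to the next stage (stage 3 -> AIDS)
        P[k, 4 + k] = d[k]     # diagnosis at stage k + 1
        P[k, 7] = m[k]         # pre-AIDS death
        P[k, k] = 1.0 - rho[k] - d[k] - m[k]
    for a in range(3, 8):
        P[a, a] = 1.0          # absorbing states
    return P

def delay_cdf(P, horizon):
    """P(D <= d) for d = 1..horizon weeks, conditional on pre-AIDS diagnosis.
    The normalizing constant is approximated by the mass at the horizon."""
    row = np.zeros(8)
    row[0] = 1.0               # everyone starts in the primary infection stage
    mass = []
    for _ in range(horizon):
        row = row @ P
        mass.append(row[4:7].sum())
    mass = np.array(mass)
    return mass / mass[-1]

P = one_step_matrix(d=(5.64e-2, 6.58e-3, 2.30e-2),
                    rho=(0.25, 1.93e-3, 1.92e-2),
                    m=(5.557e-5, 5.557e-5, 5.557e-5))
cdf = delay_cdf(P, horizon=5000)
median_weeks = int(np.searchsorted(cdf, 0.5)) + 1  # first week with CDF >= 0.5
```

The median delay in days can then be approximated as `7 * median_weeks`; the long horizon is chosen so that almost no probability mass remains in the undiagnosed states.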

4 Results

4.1 Practical assumptions

Our HIV notification data span the period from the first week of 2004 to the last week of 2018, and all duration intervals T_i are in weeks. We considered five calendar periods in which transition probabilities were assumed to be constant but could vary between periods: $W_{1} = [2004 - 2006], W_{2} = [2007 - 2009], W_{3} = [2010 - 2012], W_{4} = [2013 - 2015]$ , and $W_{5} = [2016 - 2018]$ . This partition was chosen arbitrarily so as to have periods of the same length and a reasonable number of parameters to estimate. In addition, it is reasonable to assume that screening behaviors do not change every one or two years. More specifically, we assumed that $α_{k, l}^{i} = α_{k, l}^{W_{j}}$ for all i such that $i \subset W_{j}, i = 1, \dots, K$ , and $j = 1, 2, \dots, 5$ . With this assumption, d₁, d₂, and d₃ no longer need to be estimated for every time T_i but only for the five calendar periods W₁ to W₅. We therefore estimated five values of $d_{1} = (d_{1}^{W_{1}}, d_{1}^{W_{2}}, d_{1}^{W_{3}}, d_{1}^{W_{4}}, d_{1}^{W_{5}})$ , five values of $d_{2} = (d_{2}^{W_{1}}, d_{2}^{W_{2}}, d_{2}^{W_{3}}, d_{2}^{W_{4}}, d_{2}^{W_{5}})$ , and five values of $d_{3} = (d_{3}^{W_{1}}, d_{3}^{W_{2}}, d_{3}^{W_{3}}, d_{3}^{W_{4}}, d_{3}^{W_{5}})$ . The estimates of d₁, d₂, and d₃ for the five calendar periods are presented in Table 3 (Section 4.3).

In this model, it is necessary to specify the rates of disease progression and pre-AIDS mortality. The specific probabilities were chosen so as to achieve mean occupancy times of four weeks in state 1, 563 weeks (10.8 years) in state 2, and 52 weeks (1 year) in state 3. Hence, for all weeks, this corresponds to $ρ_{1, 2} = 0.25, ρ_{2, 3} = 0.00193$ , and $ρ_{3, 4} = 0.0192$ , with a median AIDS incubation time of 8.6 years and a mean of 11.9 years in the absence of treatment.⁷ These estimates of median and mean incubation are taken as the benchmark and are used in many models.^8,9,21 Being diagnosed with HIV does not affect the incubation period until effective treatment has been initiated, so we can consider that these transition probabilities are constant over time. Finally, since antiretroviral treatment is administered immediately after diagnosis, it is no longer possible to calculate this incubation period. Available information from the entire French surveillance system does not allow us to estimate the pre-AIDS mortality because, in the absence of post-mortem HIV screening, it is very difficult to know whether a person who was unaware of their status was HIV-positive when they died. According to the French Institute for Demographic Studies, the median age at first sexual intercourse is 17 years. Moreover, after 70 years, the risk of health problems increases. We therefore assumed that mortality in the primary infection, asymptomatic, and symptomatic stages was the same as in the general population between 17 and 70 years, as provided by the French Epidemiological Center on the Medical Causes of Death. We obtained pre-AIDS mortality probabilities $m_{1} = m_{2} = m_{3} = 5.577 \times 10^{- 5}$ .
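The link between mean occupancy times and the incubation time distribution can be sketched numerically. The following Python example is a rough illustration, not the computation used in this article: it assumes geometric (weekly) sojourn times whose exit probabilities are the reciprocals of the stated mean occupancy times, which need not coincide exactly with the probabilities quoted above, and it ignores diagnosis and pre-AIDS mortality. The helper name `geometric_pmf` is hypothetical.

```python
import numpy as np

def geometric_pmf(mean_weeks, horizon):
    """PMF on 0..horizon weeks of a geometric sojourn time (support 1, 2, ...)
    with the given mean, i.e. weekly exit probability 1 / mean_weeks."""
    p = 1.0 / mean_weeks
    k = np.arange(1, horizon + 1)
    pmf = np.zeros(horizon + 1)
    pmf[1:] = p * (1.0 - p) ** (k - 1)
    return pmf

horizon = 8000                      # weeks; long enough for a negligible tail
total = np.zeros(horizon + 1)
total[0] = 1.0                      # point mass at zero before any stage
for mean_weeks in (4, 563, 52):     # states 1, 2, and 3
    # The incubation time is the sum of the three independent sojourn times.
    total = np.convolve(total, geometric_pmf(mean_weeks, horizon))[: horizon + 1]

weeks = np.arange(horizon + 1)
mean_years = float((weeks * total).sum()) / 52
median_years = int(np.searchsorted(np.cumsum(total), 0.5)) / 52
```

Under these assumptions the mean is the sum of the occupancy times, 4 + 563 + 52 = 619 weeks, or about 11.9 years, and the median can be read from `median_years`.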

To approximate the maximum of the penalized likelihood estimators ${\hat{Θ}}_{λ}$ , we used Q = 18 knots equidistantly distributed between 1994 and 2018. Because the degree of smoothing in the penalized likelihood method is determined by the smoothing parameter λ, increasing the number of knots would not provide a very different estimation but would rather lead to the estimation of an increasing number of parameters. The smoothing parameter was obtained by minimizing the cross-validation score, while the Marquardt algorithm³² was used to maximize the penalized likelihood.

We were less interested in the earlier years of incidence, instead focusing on the last 10 years. The figures are therefore shown for the years 2008–2018, which are our years of interest.

4.2 Simulation results

To evaluate the performance of the model for estimating both the HIV incidence and the transition probabilities for diagnosis, we performed 200 simulations. Minimizing the approximate cross-validation score for each of the 200 simulations was time-consuming, so we minimized it for only the first 10 simulations and used the mean of the 10 resulting smoothing parameters. We obtained a mean of $\bar{λ} = 20$ (minimum = 10 and maximum = 31). For each of these 10 simulations, we compared the incidence obtained with the optimal value $\hat{λ}$ and with the mean value $\bar{λ}$ to verify that the use of $\bar{λ}$ gave consistent results. We then fixed the smoothing parameter at $\bar{λ}$ for all simulations. All 200 simulations converged. We used different initial parameter values for some simulations, and each time the algorithm converged to the same maximum. We present the results obtained from the 200 simulations with a decreasing trend of −5% in 2016–2018 (see Section 2.2). Values of the fixed transition probabilities (ρ₁₂, ρ₂₃, ρ₃₄, m₁, m₂, and m₃) and the estimated transition probabilities (d₁, d₂, and d₃) in the simulation for the five calendar periods are presented in Table 1. Overall, the estimated transition probabilities d₁, d₂, and d₃ are higher in the last period W₅ than in the first period W₁.

Table 1.

Values of the fixed and estimated transition probabilities in the simulation for the five calendar periods.

Transition probabilities	W₁ [2004–2006]	W₂ [2007–2009]	W₃ [2010–2012]	W₄ [2013–2015]	W₅ [2016–2018]
Fixed
ρ₁₂	0.25				0.25
ρ₂₃	1.93 × 10^–3				1.93 × 10^–3
ρ₃₄	1.92 × 10^–2				1.92 × 10^–2
m₁	5.557 × 10^–5				5.557 × 10^–5
m₂	5.557 × 10^–5				5.557 × 10^–5
m₃	5.557 × 10^–5				5.557 × 10^–5
Estimated
d₁	1.89 × 10^–2	2.15 × 10^–2	2.26 × 10^–2	2.25 × 10^–2	2.27 × 10^–2
d₂	3.11 × 10^–3	3.11 × 10^–3	3.28 × 10^–3	3.51 × 10^–3	3.62 × 10^–3
d₃	1.15 × 10^–2	1.12 × 10^–2	1.15 × 10^–2	1.24 × 10^–2	1.31 × 10^–2

Figure 3 represents the theoretical HIV incidence curve, the simulated HIV incidence, the estimated HIV incidence, and the confidence intervals of the estimated annual HIV incidence between 2008 and 2018. The theoretical HIV incidence lies within the confidence interval (pointwise Bayesian confidence limits) of the estimated HIV incidence for 9 of the 11 years; only 2013 and 2015 fall outside it. The estimated curve tends to underestimate the theoretical HIV incidence. The values of the theoretical HIV incidence, the estimated HIV incidence, and the 95% confidence limits of the estimated HIV incidence for the years 2008–2018 are presented in Table 2.

Figure 3.

Theoretical (thick dashed line), simulated (gray strip), and estimated (solid curve) annual number of new HIV infections and the 95% pointwise Bayesian confidence limits (thin dashed line) for the period 2008–2018.

Table 2.

Values for the theoretical and estimated HIV incidence as well as the 95% confidence limits of the estimated HIV incidence for the years 2008–2018.

Years	Estimated HIV incidence	Confidence limits of the estimated HIV incidence	Theoretical HIV incidence
2008	5973	[5561;6407]	5963
2009	5917	[5612;6158]	6090
2010	5842	[5520;6203]	6201
2011	5965	[5652;6319]	6222
2012	5965	[5678;6236]	6223
2013	5760	[5474;6057]	6134
2014	5831	[5420;6215]	6038
2015	5588	[5301;5965]	5975
2016	5376	[5011;5741]	5676
2017	5138	[4679;5481]	5392
2018	4749	[4158;5223]	5122
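The coverage statement made about Table 2 can be checked directly from the tabulated values. The short Python sketch below simply re-reads the table; the variable names are illustrative.

```python
# Table 2: year -> (estimated, lower limit, upper limit, theoretical)
table2 = {
    2008: (5973, 5561, 6407, 5963),
    2009: (5917, 5612, 6158, 6090),
    2010: (5842, 5520, 6203, 6201),
    2011: (5965, 5652, 6319, 6222),
    2012: (5965, 5678, 6236, 6223),
    2013: (5760, 5474, 6057, 6134),
    2014: (5831, 5420, 6215, 6038),
    2015: (5588, 5301, 5965, 5975),
    2016: (5376, 5011, 5741, 5676),
    2017: (5138, 4679, 5481, 5392),
    2018: (4749, 4158, 5223, 5122),
}

# Years in which the theoretical incidence falls outside the confidence limits.
outside = [year for year, (est, lo, hi, theo) in table2.items()
           if not lo <= theo <= hi]
covered = len(table2) - len(outside)

# Years in which the estimate is below the theoretical value.
below = [year for year, (est, lo, hi, theo) in table2.items() if est < theo]

print(covered, outside, len(below))  # prints: 9 [2013, 2015] 10
```

The theoretical value is covered in 9 of the 11 years, and the estimate lies below the theoretical value in every year except 2008.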

Table 3.

Values of the fixed and estimated transition probabilities in the French MSM population for the five calendar periods.

Transition probabilities	W₁ [2004–2006]	W₂ [2007–2009]	W₃ [2010–2012]	W₄ [2013–2015]	W₅ [2016–2018]
Fixed
ρ₁₂	0.25				0.25
ρ₂₃	1.93 × 10^–3				1.93 × 10^–3
ρ₃₄	1.92 × 10^–2				1.92 × 10^–2
m₁	5.557 × 10^–5				5.557 × 10^–5
m₂	5.557 × 10^–5				5.557 × 10^–5
m₃	5.557 × 10^–5				5.557 × 10^–5
Estimated
d₁	5.17 × 10^–2	4.92 × 10^–2	5.38 × 10^–2	5.27 × 10^–2	5.64 × 10^–2
d₂	5.07 × 10^–3	5.08 × 10^–3	5.66 × 10^–3	6.27 × 10^–3	6.58 × 10^–3
d₃	1.72 × 10^–2	1.59 × 10^–2	1.79 × 10^–2	2.04 × 10^–2	2.30 × 10^–2

The incidence is not our only criterion of judgment. We also have at our disposal the theoretical annual number of individuals in each state ( $n_{i}^{j}, j = 4, 5, 6, 7$ and $i = 1, \dots, K$ ), the estimated annual number of individuals in each state ( $E_{i, j}, j = 4, 5, 6, 7$ and $i = 1, \dots, K$ ), and the confidence intervals of the estimated number of individuals in each state for the period 2008–2018 presented in Figure 4. The theoretical number of HIV diagnoses belongs to the confidence interval of the estimated number of diagnoses for each year.

Figure 4.

Theoretical (thick dashed line) and estimated (solid curve) annual number of individuals in each state of the model and the 95% pointwise Bayesian confidence limits (thin dashed line) for the period 2008–2018.

4.3 HIV mandatory notification results

The multi-state Markov model described in Section 3 is applied here to data for the global population of MSM taken from the French HIV mandatory notification system because this group has the highest risk of infection.

In the raw database of the HIV mandatory notification system, before taking into account reporting delays, completeness, and missing data, 83,283 new HIV diagnoses were reported between 2004 and 2018. Of these 83,283 new HIV declarations, 21,635 (26%) were in the MSM group, 32,654 (39.2%) in the heterosexual group, 1832 (2.2%) in the injection drug user group, and 982 (1.2%) in the other group (individuals without knowledge of their mode of contamination); 26,180 (31.4%) diagnoses had missing values. Of the 21,635 new declarations in the MSM group, 12,178 (56.3%) were at the asymptomatic stage, 3438 (15.9%) at the primary infection stage, 2048 (9.5%) at the symptomatic stage, and 2354 (10.9%) at the AIDS stage; 1617 (7.4%) diagnoses had missing values.

For the estimates, we use imputed data. Starting from the raw database, multiple imputation was carried out to produce 75 complete databases with no missing data. The model is estimated on each of the 75 databases, and the results are then combined using Rubin’s rules.¹⁸ The resulting estimates account for both within- and between-imputation variance, thereby taking into account the uncertainty due to missing data. Multiple imputation is performed under the assumption of a missing at random mechanism; under this assumption, unbiased parameter estimates are obtained when the proportion of missing data is less than 50%.³⁴
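For readers unfamiliar with Rubin’s rules, the pooling step can be sketched as follows for a scalar parameter. The function name `rubin_pool` and the toy inputs are illustrative and are not taken from the 75 imputed databases.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool per-imputation estimates and variances with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()            # pooled point estimate
    within = variances.mean()            # within-imputation variance
    between = estimates.var(ddof=1)      # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total

# Toy example with three imputations (illustrative values only).
pooled, total = rubin_pool([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
```

With these toy inputs the between-imputation variance is 4, so the total variance is 1 + (4/3) × 4, which shows how disagreement between imputed databases inflates the uncertainty of the pooled estimate.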

Values of the fixed transition probabilities (ρ₁₂, ρ₂₃, ρ₃₄, m₁, m₂, and m₃) and the estimated transition probabilities (d₁, d₂, and d₃) in the French MSM population for the five calendar periods are presented in Table 3. Overall, the estimated transition probabilities d₁, d₂, and d₃ are higher in the last period W₅ than in the first period W₁.

Figure 5 represents the estimated HIV incidence curve in the French MSM population, and the confidence intervals of the estimated annual HIV incidence between 2008 and 2018. We observe that the HIV incidence in the French MSM population slightly increases from 2008, has a first peak in 2011, a second peak in 2014, a drop from 2014 to 2015, a stabilization from 2015 to 2017, and then slightly increases to 2018.

Figure 5.

Estimated (solid curve) annual number of new HIV infections in the French MSM population and the 95% pointwise Bayesian confidence limits (thin dashed line) for the period 2008–2018.

To investigate the model’s goodness-of-fit, we show the observed and expected annual number of individuals in each state of the model in the French MSM population as well as the 95% pointwise Bayesian confidence limits for the period 2008–2018 in Figure 6. Note that the trend of the expected number of individuals in each state is correctly captured.

Figure 6.

Theoretical (thick dashed line) and estimated (solid curve) annual number of individuals in each state of the model in the French MSM population and the 95% pointwise Bayesian confidence limits (thin dashed line) for the period 2008–2018.

To compare the time between infection and diagnosis, we consider individuals infected in period W₁ and W₅. For the sake of simplicity, we assume that the transition probabilities for infected individuals in W₁ remain constant until W₅. With this assumption, we estimated the probability distribution of the delay between infection and diagnosis for the two groups of individuals. The evolution of the two cumulative distribution functions in the French MSM population is shown in Figure 7.

Figure 7.

Cumulative distribution function of the delay between infection and diagnosis for an individual infected between 2004 and 2006 (W₁) and between 2016 and 2018 (W₅) in the MSM population.

The median delay from infection to diagnosis is 636 [593;685] days for individuals infected in the first period W₁ and 492 [452;533] days for those infected in the last period W₅, a decrease of about five months. The proportion of people diagnosed within one year of infection was 37% [35.2;38.6] among those infected in W₁ compared to 42.6% [40.5;44.8] among those infected in W₅, an increase of 5.6 percentage points. The proportion of people diagnosed within two years of infection was 54% [51.9;56] among those infected in W₁ compared to 61.8% [59.5;64.1] among those infected in W₅, an increase of 7.8 percentage points.

5 Discussion

The multi-state Markov model presented in this article is an alternative approach for estimating the HIV incidence and the delay between infection and diagnosis. The model also allows us to estimate an important indicator, namely, the number of individuals living with HIV infection who are unaware of their status. A non-homogeneous Markov model is used to take into account information on the clinical stages at the time of HIV diagnosis and any potential changes in screening behavior over time. In France, different strategies have been implemented in recent years to promote the use of screening. In 2003, the use of condoms was promoted; in 2006, a campaign was launched to change the way in which people look at those with HIV; in 2010, screening was promoted; in 2015, rapid self-testing was implemented; and in 2017, the Haute Autorité de Santé published new recommendations.³⁵ These different strategies may have influenced the screening rates over time.

Splines are used to model the HIV incidence curve. They have the advantage of providing a smooth estimation of the infection curve without strong parametric assumptions.^36,37

In this study, the method was first illustrated on simulated data and then on data collected through the HIV mandatory notification system for the MSM population in France.

We created a realistic simulation data set that was as close as possible to the HIV mandatory notification data set to evaluate the ability of the model not only to reconstruct the HIV incidence curve, which is the main indicator, but also to estimate the number of individuals in each state and the number of undiagnosed individuals. The classical simulation approach would have been to fix the annual incidence as well as the diagnostic probabilities and then to let the infected individuals propagate in the different states according to the fixed probabilities. However, as we were unable to use the observed incidence data, we had to make an important assumption about the incidence in the simulation. Only the diagnosis data observed in each state at the time of HIV diagnosis could be used in the model to verify whether the model correctly reconstructed the incidence curve and whether it estimated the same probabilities of diagnosis. However, we did not want to use a classical simulation approach. Our idea was to test the model on data as close to reality as possible, while knowing that this choice of simulation involves heavy constraints that are not specific to our model.

The advantages of this method are that it estimates the incidence, the delay between infection and diagnosis, and the number of individuals unaware of their HIV status, while taking into account changes in testing behavior over time, which is not the case for the other methods currently used in France.^6,10 A secondary advantage of this method is its reasonable calculation time (48 h for the cross-validation and approximately 3 h for the maximization of the penalized likelihood). The limitation of the model is that the incidence fit is not perfect, probably due to the model complexity and the simulation constraints. One hypothesis regarding the underestimation of HIV incidence is that it derives from the constraint that we imposed to generate the diagnosis dates. Indeed, the diagnosis dates are simulated from the infection date according to a distribution that depends on the clinical stage at the time of diagnosis. In our model, we consider that an individual will pass into the different states according to the transition probabilities after diagnosis, which can deprive us of the tail of the distribution and thus lead to an underestimated incidence. However, it is important to observe that the trend of the HIV infection curve is well captured by the model. In addition, the confidence interval is narrower in the last three years compared to the model of Sommen et al.⁹ due to the use of the primary infection stage, which provides more information for recent periods.

The results obtained for the HIV mandatory notification data are consistent with previous estimates of HIV incidence in France using other approaches. Marty et al.⁶ estimated an HIV incidence of 2302 [2041;2628] in 2014 in the MSM group compared to our estimation of 2726 [2539;2913] in 2014 in the MSM group.

This method makes it possible to routinely estimate HIV indicators on an annual basis, or more frequently if necessary. This provides a more responsive measure of the dynamics of the HIV epidemic. This approach is not based on strong hypotheses specific to the French HIV data. It can be applied to any country with an HIV notification-based surveillance system without virological data. This work could potentially provide a framework to apply this method to other infectious diseases by changing the natural history of the disease. Information available in the HIV surveillance database that was not used for this study could be used to either improve the estimates or stratify the results according to variables known to be correlated with HIV incidence. In terms of improved estimates, especially for the most recent years, some biological markers are a sign of recent infection. Integrating these markers into the model could correct the underestimation observed for the most recent years. This integration of biological markers could thus improve the likelihood of the model by correcting the observed numbers at the time of diagnosis. In terms of stratified analyses, it is well known that the HIV incidence is significantly associated with sex, mode of transmission, geographic origin, and region of residence.¹⁰ A first perspective for future research would therefore be to provide the epidemiological indicators stratified by these variables. A second perspective would be to use the simulated database described in this article to compare our method with other methods using clinical stages at diagnosis.

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802211032697 for A multi-state Markov model using notification data to estimate HIV incidence, number of undiagnosed individuals living with HIV, and delay between infection and diagnosis: Illustration in France, 2008–2018 by Charlotte Castel, Cecile Sommen, Yann Le Strat and Ahmadou Alioum in Statistical Methods in Medical Research

Footnotes

Acknowledgements

We thank our colleagues from the HIV/STI unit who provided insight and expertise, which greatly improved the research. We particularly acknowledge Françoise Cazein for providing us with the data set and for sharing her expertise in interpreting it.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was fully supported by the French Institute for Public Health Surveillance.

Supplemental Material

Supplemental material is available online.

References

1. UNAIDS/WHO working group on global HIV/AIDS and STI surveillance. Estimating HIV incidence using HIV case surveillance. Glion: UNAIDS, 2015.

2. Walker, Garcia-Calleja, Heaton, et al. Epidemiological analysis of the quality of HIV sero-surveillance in the world: how well do we track the epidemic? AIDS 2001; 15: 1545–1554.

3. Bärnighausen, Wallrauch, Welte, et al. HIV incidence in rural South Africa: comparison of estimates from longitudinal surveillance and cross-sectional cBED assay testing. PLoS One 2008; 3.

4. Barin, Nardone. Monitoring HIV epidemiology using assays for recent infection: where are we? Eurosurveillance 2008; 13.

5. Birrell, Chadborn, Noel Gill, et al. Estimating trends in incidence, time-to-diagnosis and undiagnosed prevalence using a CD4-based Bayesian back-calculation. Stat Commun Inf Dis 2012; 4: 757–780.

6. Marty, Cazein, Panjo, et al. Revealing geographical and population heterogeneity in HIV incidence, undiagnosed HIV prevalence and time to diagnosis to improve prevention and care: estimates for France. J Int AIDS Soc 2018; 21.

7. Aalen, Farewell, De Angelis, et al. A Markov model for HIV disease progression including the effect of HIV diagnosis and treatment: application to AIDS prediction in England and Wales. Stat Med 1997; 16: 2191–2210.

8. Sweeting, De Angelis, Aalen. Bayesian back-calculation using a multi-state model with application to HIV. Stat Med 2005; 24: 3991–4007.

9. Sommen, Alioum, Commenges. A multistate approach for estimating the incidence of human immunodeficiency virus by using HIV and AIDS French surveillance data. Stat Med 2009; 28: 1554–1568.

10. Le Vu, Le Strat, Barin, et al. Population-based HIV-1 incidence in France, 2003–08: a modelling analysis. Lancet Inf Dis 2010; 10: 682–687.

11. Sommen, Commenges, Le Vu, et al. Estimation of the distribution of infection times using longitudinal serological markers of HIV: implications for the estimation of HIV incidence. Biometrics 2011; 67: 467–475.

12. Lot, Cazein, Pillonel, et al. Premiers résultats du nouveau dispositif de surveillance de l’infection à VIH et situation du sida au 30 septembre 2003. Bulletin Épidémiologique Hebdomadaire 2004; 24: 102–110.

13. Barin, Meyer, Lancar, et al. Development and validation of an immunoassay for identification of recent human immunodeficiency virus type 1 infections and its use on dried serum spots. J Clin Microbiol 2005; 43: 4441–4447.

14. Cazein, Sommen, Pillonel, et al. HIV screening activity and circumstances of new HIV diagnoses, France 2018. Bulletin Épidémiologique Hebdomadaire 2019; 31: 615–624.

15. Brookmeyer, Liao. The analysis of delays in disease reporting: methods and results for the acquired immunodeficiency syndrome. Am J Epidemiol 1990; 132: 355–365.

16. White, Royston, Wood. Multiple imputation using chained equations: issues and guidance for practice. Stat Med 2011; 30: 377–399.

17. Bodner. What improves with increased missing data imputations? Struct Equat Model 2008; 15: 651–675.

18. Rubin. Multiple imputation after 18+ years. J Am Stat Assoc 1996; 91: 473–489.

19. Jamil, Prestage, Fairley, et al. Effect of availability of HIV self-testing on HIV testing frequency in gay and bisexual men at high risk of infection (FORTH): a waiting-list randomised controlled trial. Lancet HIV 2017; 4: 241–250.

20. Centers for Disease Control and Prevention. Revised classification system for HIV infection and expanded surveillance case definition for AIDS among adolescents and adults. Morbid Mortal Wkly Rep 1992; 41: 1–19.

21. Alioum, Commenges, Thibault. A multistate approach for estimating the incidence of human immunodeficiency virus by using data from a prevalent cohort study. J R Stat Soc Ser C (Appl Stat) 2005; 54: 739–752.

22. Becker, Marschner. A method for estimating the age-specific relative risk of HIV infection from AIDS incidence data. Biometrika 1993; 80: 165–178.

23. Becker, Lewis, et al. Age-specific back-projection of HIV diagnosis data. Stat Med 2003; 22: 2177–2190.

24. Rosenberg. Backcalculation models of age-specific HIV incidence rates. Stat Med 1994; 13: 1975–1990.

25. Brizzi, Birrell, Plummer, et al. Extending Bayesian back-calculation to estimate age and time specific HIV incidence. Lifetime Data Anal 2019; 25: 757–780.

26. Good, Gaskins. Nonparametric roughness penalties for probability densities. Biometrika 1971; 58: 255–277.

27. Good, Gaskins. Density estimation and bump-hunting by the penalized likelihood method exemplified by scattering and meteorite data. J Am Stat Assoc 1980; 75: 42–56.

28. Joly, Commenges. A penalized likelihood approach for arbitrarily censored and truncated data: application to age-specific incidence of dementia. Biometrics 1998; 54: 203–212.

29. O’Sullivan, Yandell, Raynor. Automatic smoothing of regression functions in generalized linear models. J Am Stat Assoc 1986; 81: 96–103.

30. Wahba. Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics, 1990.

31. Commenges, Joly, Gégout-Petit, et al. Choice between semi-parametric estimators of Markov and non-Markov multi-state models from coarsened observations. Scand J Stat 2007; 34: 33–52.

32. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. SIAM J Appl Math 1963; 11: 431–441.

33. Wahba. Bayesian confidence intervals for the cross-validated smoothing splines. J R Stat Soc B 1983; 45: 133–150.

34. Ayilara, Zhang, Sajobi, et al. Impact of missing data on bias and precision when estimating change in patient-reported outcomes from a clinical registry. Health Qual Life Outcomes 2019; 17.

35. Haute Autorité de Santé. Réévaluation de la stratégie de dépistage de l’infection à VIH en France - synthèse, conclusions et recommandations. Saint-Denis: Haute Autorité de Santé, 2017.

36. Bellocco, Pagano. Multinomial analysis of smoothed HIV back-calculation models incorporating uncertainty in the AIDS incidence. Stat Med 2001; 20: 2017–2033.

37. Rosenberg, Goedert. Estimating the cumulative incidence of HIV infection among persons with haemophilia in the United States of America. Stat Med 1998; 17: 155–168.
