Abstract
Computer-based disease models are only as accurate as the parameters that underlie them. Model calibration is an important technique for ensuring that input parameters, which may not be directly observable, are consistent with observed disease data. Calibration can be especially important in contexts in which there is substantial uncertainty about model parameters, for example, in many low- and middle-income countries that lack active disease surveillance systems. However, calibration is also particularly challenging in these settings, because there is usually a greater degree of prior uncertainty around model parameter values and fewer observed data points with which to inform these values. While there are a range of calibration techniques available,1,2 it is not clear which technique to prefer in a particular situation.
Modeling cholera provides an important, policy-relevant example through which to better understand these challenges. Cholera is an acute diarrheal disease that accounted for about 100,000 deaths worldwide in 2016, mostly in low- and middle-income countries. 3 Bangladesh is one of 51 countries where cholera is considered endemic. 4 The combination of rapid urbanization and climate change in Bangladesh make cholera an increasingly important public health priority.5,6 Given the thousands of deaths and more than 100,000 cases observed in Bangladesh each year, the government has pledged to eliminate the disease by 2030.7,8 Policy makers must consider a range of interventions, including treatment, vaccines, and water and sanitation programs. Modeling the effects of cholera control interventions requires an accurate model of cholera epidemiology in Bangladesh; a model based on biased data will result in biased predictions. However, to date, publicly available data on the number of cholera cases in Bangladesh have been generated through passive, hospital-based surveillance systems. Underestimates of incidence and overestimates of severity can result because milder cholera cases can be successfully treated at home. 9 Treatment-seeking behavior for diarrheal diseases tends to vary with age,10,11 which adds additional potential bias.
To address these burden estimations and prioritization challenges, we developed a model of cholera natural history in Bangladesh. We used this model to assess the strengths and weaknesses of 3 calibration techniques in addressing biases in the extant data and reflecting appropriate uncertainty given the indirect nature of the available evidence. To gauge the magnitudes of potential bias in outcomes, we compared predictions from calibrated models to those from models with naïve uncalibrated parameter estimates that do not account for possible biases. While we operationalized calibration in the context of cholera in Bangladesh, our comparisons and methods are broadly applicable to other situations in which data on a disease may be biased, including passive surveillance systems for other diseases in low- and middle-income countries.
Methods
Model Structure and Assumptions
We constructed an age-stratified, state-transition cohort Markov model in Python (version 3.6). Our model follows an urban Bangladeshi population, tracking cholera cases, cholera deaths, and deaths from unrelated causes at 6-month intervals (Figure 1). For each cycle, a proportion of each age-stratified cohort contracts cholera, remains uninfected, or dies of unrelated causes. Cholera can be severe or mild, based on the presence of dehydration and other clinical criteria, 12 and both severe and mild cases can be treated either at home with oral rehydration salts or in a hospital; severe cases only can result in death, a risk greatly reduced by hospital-based treatment. The probabilities that an individual contracts cholera (underlying incidence), that a cholera case is severe (underlying severity), that hospital-level treatment is sought (hospitalization), and that cholera results in death (case fatality) depend on an individual’s age group (<1, 1–4, 5–14, or 15+ years) and are constant (we do not use a dynamic transmission model). Because cholera cases are reported only when individuals seek treatment at a hospital, by tracking hospitalization we are able to track model predictions corresponding to reported outcomes.

Illustrative cholera natural history model schematic. Note: in this model, only severe cholera results in excess mortality risk.
Given potential biases from passive hospital-based surveillance, we can only directly inform the model’s all-cause age-specific mortality rates from World Health Organization (WHO) life tables. 13 For the rest of the parameters, we calibrate the model to generate sets of parameters that produce outputs consistent with data on reported cases and deaths in Bangladesh. Per standard practice, 1 all of our calibration methods involve sampling from broad feasible ranges for the 24 unknown parameters, running the model with each sampled set of inputs, comparing model-simulated results for each parameter set to observed outcomes (targets), and retaining parameter sets whose results are consistent with these targets.
Calibration Targets
Targets included the estimated percentage of deaths in Bangladesh caused by cholera 3 and hospital-based total and severe cholera cases per 1000 population in Dhaka, Bangladesh, all stratified by age group (Table 1). We explicitly account for the hospital-based passive nature of the cholera surveillance system by modeling hospitalization (and thus the reporting of cases). We combined multiple studies of cholera incidence based on hospital-based surveillance methods14–18 using meta-regression (see the Supplementary Appendix). Estimates of cholera deaths come from the Institute for Health Metrics and Evaluation’s Global Burden of Disease (GBD) study, a comprehensive epidemiological study that estimates age-stratified mortality and morbidity by cause (including cholera) and country. 3 For their mortality estimates, the GBD studies combine observational data and key covariates (such as income and education) from several sources (surveys, censuses, sample registration systems, disease surveillance, vital registration systems) using spatio-temporal and Gaussian process regressions. 19 Although we have no reason to believe the GBD estimates are biased in one direction or another (unlike the cholera incidence estimates), bias in point estimates are possible; however, results are presented with confidence intervals that propagate uncertainty from the underlying data and the modeling approaches used. To reflect additional uncertainty in the incidence and deaths data, we fit distributions to the 99% confidence intervals that were implied by the means and standard errors from the incidence meta-regression and the published 99% confidence intervals for the Bangladesh cholera mortality estimates, treating both sets of 99% confidence intervals as though they were 95% confidence intervals, which resulted in wider target distributions.
Calibration Targets
The 99% confidence intervals are shown in brackets. The observed incidence data are based on passive, hospital-based surveillance, whereas mortality data are more representative of overall cholera cases in Bangladesh. See Appendix Table S1 for more details. IMHE, Institute for Health Metrics and Evaluation.
Initial Distributions
We first established feasible initial distributions for the 24 parameters directly consistent with the available literature (hereafter referred to as “naïve distributions”) and then widened them substantially to reflect additional uncertainty due to features of their collection methods and differences between what is measured in the literature and what the model parameters attempt to measure and to account for risks of bias due to underreporting (Table 2). In some cases, where the published literature was especially weak or no relevant data were available, we used the entire feasible region as our initial (ie, prior) distribution (eg, 0%–100% for hospitalization). In other instances, the widened bounds were informed by consideration of epidemiological methods and subject matter expertise. 20
Overview of Cholera Model Parameters
Beta distributions are shown as (a, b).
Uniform distributions are shown as (a, b), where a is the minimum value and b is the maximum value.
We define a severe case as a cholera case in which severe dehydration occurred, where severe dehydration is defined according to WHO criteria as at least 2 of the following symptoms: sunken eyes, dry tongue, thirst, irritable condition, less active than usual, skin-pinch goes back slowly, low volume of radial pulse along with inability to drink, uncountable or absent radial pulse. Studies varied slightly in the definition of severe cholera, with some defining severity only based on a subset of these symptoms.
Each mild case hospitalization sample is drawn from a uniform distribution, with 0 as the lower bound and severe case hospitalization as the upper bound. Because severe case hospitalization also varies across samples, the resulting distribution for these parameters ends up being nonuniform and correlated. We also explicitly induce correlation between case severity across age groups. We relax the assumption used in most other cholera models that this parameter is identical across all age groups and instead impose correlations of 0.5 to account for the fact that the proportion of cholera cases that are severe may vary somewhat across age groups in endemic settings such as Bangladesh. This correlation structure was achieved while preserving the marginal distributions of the parameters. 32
For example, we did not identify any published studies that measured the proportion of all underlying cholera cases that are severe, although 20% is widely assumed in the literature. 9 Thus, for this parameter, we established an upper bound of 50% after discussion with our coauthor, who is an expert on diarrheal disease epidemiology and public health measures to address diarrheal diseases in low- and middle-income countries, using 0% (the lowest possible bound) for the other end of the prior. For incidence, all studies were based on passive surveillance at hospitals and other health facilities. We established ranges for incidence that included estimates from the studies and their confidence intervals (essentially assuming 100% detection) and widened the upper bounds of these ranges to account for much lower possible case reporting. Reporting fractions are largely unknown; one study of hospitals in Bangladesh estimated that they range from 33% to 92%, 14 although this study was based on health care utilization surveys for severe diarrhea, not cholera or severe cholera specifically, and the estimates are limited by assumptions around catchment areas and recall. Thus, we established bounds that included estimates from the incidence studies after adjusting for these and lower reporting fractions (given that reporting for mild cases is likely to be lower than severe) and confirmed these bounds for face validity with our coauthor.
The prior distributions were all uniform. Starting calibration with wide, uniform prior distributions avoids excluding potential plausible parameter sets or favoring certain parameter values over others in initial sampling. Our goal in using wide prior distributions was to allow the data represented by the calibration targets to strongly inform our posteriors and avoid propagating potential biases from the literature.
Calibration Methods and Search Strategies
We compared 3 calibration methods: random search with rejection sampling, sampling importance resampling (SIR), and incremental mixture importance sampling (IMIS).1,2,33–36
We considered random search because it is simple to implement and widely used.37–39 We randomly sampled repeatedly from the initial joint uncertainty distribution of the model’s parameters (see the Supplementary Appendix for details on efficient large-scale sampling and processing). We retained all parameter sets, which generated outputs that fit simultaneously within the confidence intervals of the targets, discarding all others. 40 The retained parameter sets formed the posterior distribution. We chose a simple rejection sampling algorithm because of the uncertainty of the available evidence, as we did not necessarily want to penalize parameter sets that fit within the confidence intervals but were farther away from the means of the targets. Similar approaches have been described in the literature.1,33
We used 2 other explicitly Bayesian calibration methods: SIR and IMIS.2,34–36 We chose methods that were natural extensions of random search and had been used in the health literature previously.36,41 For the Bayesian approaches, we needed to translate our calibration targets into a likelihood function. While each target for each age group could have been treated as independent, leading to a multiplicative likelihood, such an approach would almost certainly imply too much independent information in each target in a way that is inconsistent with the hypothesized data-generating process and likewise would imply that zig-zagging patterns of age-specific outcomes could be as likely as smoothly varying patterns across age groups. Hence, we assumed correlations between the uncertainty intervals across age groups within each target type. For both SIR and IMIS, we used the same multivariate normal likelihood function to evaluate the fit of model output to the targets (see the Supplementary Appendix).
Our implementation of SIR was similar to random search. However, instead of simply accepting or rejecting each parameter set, we retained all parameter sets but resampled them with a replacement based on weights equal to their likelihood given the calibration targets divided by the sum of likelihoods across all parameter sets (see the Supplementary Appendix). The weighted sampled parameter sets form the joint posterior uncertainty distribution.
IMIS starts out similarly to SIR but then refines the SIR estimate of the joint posterior through additional steps. After initial random sampling, additional focused sampling of the parameter space is done more densely near the current best-fitting (highest-likelihood) parameter set from random sampling (see the Supplementary Appendix). One can think of IMIS as incorporating the more targeted search capabilities found in directed search approaches within a Bayesian framework. The resulting parameter sets (from both initial random sampling and additional focused sampling), with likelihood-based weights, are sampled with replacement to form the joint posterior uncertainty distribution.
IMIS Adaptations
We made several adaptations to the original IMIS algorithm to tailor it to our situation of wide prior uncertainty and many degrees of freedom. It took many random samples to achieve reasonable coverage of the 24-dimensional prior parameter space. Such large-scale sampling and the resulting simulation data required substantial time and memory. In addition to the steps taken in SIR to save time and memory (see the Supplementary Appendix), we also decreased the number of times files needed to be opened and sorted as part of the subsequent focused sampling steps (see the Supplementary Appendix). We tested the IMIS algorithm with and without the time-saving steps and found that results were not affected.
IMIS performs best when the initial sampling stage can identify the area of the parameter space that has a single-peaked maximum likelihood. It is then very efficient in covering that part of the space densely while still ensuring that all feasible parameter sets have nonzero likelihood of being in the posterior. However, not all posteriors have a single maximum likelihood. Furthermore, when the size of the initial parameter space is large, it may be difficult to get close to the globally best-fitting parameter set via random sampling, even when the distributions are single peaked. Like other calibration procedures that include focused sampling to gain efficiency, such as directed search or Markov Chain Monte Carlo (MCMC), IMIS may also focus sampling in areas of early peaks in likelihood. We found that the best-fitting parameter set from the initial random sampling stage was different each time we reran IMIS because of the low sampling density relative to the high dimensionality or the parameter space; 12 targets to inform 24 parameters yields underidentifiability and the possibility of many parameter sets fitting the targets equally well. 42
Similar to the other techniques, the solution is to run multiple IMIS chains.43,44 We ran IMIS 20 separate times and combined the results, weighting parameter sets based on their likelihood divided by the sum of likelihoods across all 20 sets of parameters. This extension of IMIS tended to expand the posterior uncertainty interval, even as it increased the average likelihood of parameters sets retained.
Assessment of Methods
An ideal calibration procedure is computationally feasible, achieves good concordance with and coverage of the calibration targets (high likelihoods for posterior parameter sets), and produces smooth, stable estimates of the posterior marginal distributions and correlation structure. We compared the 3 calibration methods across a set of metrics that captured these properties. Some of these metrics relied on calculating the likelihood; we used the same single multivariate normal likelihood function for all 3 calibration methods (Supplementary Appendix).
To assess how well each calibration method performed in terms of reducing potential bias in predictions and their uncertainty, such as underreporting, we compared predicted summary measures of burdens using posterior parameters from each method to each other (to establish consistency across calibration methods and stability of predicted burden estimates) and also to burden predictions generated from uncalibrated naïve parameter distributions that were extracted from existing literature without adjustment for bias or heterogeneity (Table 2), to assess the extent to which methods may overcome bias. Naïve estimates of incidence came from passive surveillance studies without adjusting for underreporting. For other parameters, because of data scarcity, we used data from different populations (e.g., other countries or other diarrheal diseases) or, when that was not available, estimates frequently used in other cholera models (Table 2).
Role of the Funding Source
Funding sources had no role in the study.
Results
Comparisons across Calibration Methods
All 3 calibration methods succeeded in generating multiple parameter sets with face validity, high likelihoods, and good fits to the targets (Table 3; Figure 2). IMIS performed best, yielding many more parameter sets that fit within the targets, with higher overall likelihood across these parameter sets and a much higher likelihood for the best parameter set it identified (Table 3). In addition, model outputs using IMIS’s posterior parameter sets were on average closer to the target means (Figure 2). One implied advantage of having many unique posterior parameters whose results are centered on the target means is that when the posterior is sampled and propagated to produce estimates of cholera burden or project the impact of cholera control interventions, the estimates and confidence intervals produced will likely comport with the amount of information currently available. Furthermore, by producing many more unique posterior parameters sets, IMIS provides a more stable characterization of posterior uncertainty. Graphically, this is manifest in Figure 3, which provides examples of the marginal posterior distributions from IMIS being smoother than those of the other methods, especially random search (see also Supplementary Appendix Figure S2).
Calibration Performance Metrics of Posterior Parameters
CI, confidence interval; IMIS, incremental mixture importance sampling; SIR, sampling importance resampling.
The average Euclidian distance between each parameter set in the posterior is reduced somewhat compared with the average distance in the prior (1.22)—0.92 for IMIS, 1.09 for SIR, and 0.94 for random search—showing that the posterior distributions still have reasonable coverage of the parameter space and that each draw from the posterior distributions represents a distinct parameter set. This is shown graphically in Supplementary Appendix Figure S2.
The time estimates for IMIS and SIR reflect limits on the volume of parallel jobs that can be run at the same time for the particular high-performance computer cluster that we used. If these limits were not in place, the time to complete calibration would be closer to 5.0 days for IMIS and 4.3 days for SIR.

Fit of prior and posterior parameters to targets. Black lines represent the target means and confidence intervals. Colored/gray lines show results from individual parameter sets (top row, incremental mixture importance sampling; second row, sampling importance resampling; third row, random search; fourth row, priors). Note the different y-axis scale on the prior graphs, indicating that most draws from prior distributions generated output that did not fit within the target confidence intervals.

Comparison of posterior parameter distributions.
Even though it was by far the fastest method, in absolute terms, and the rejection sampling used in random search calibration yields posterior parameter sets that all fit within all of the target confidence intervals by design, its inability to generate many such sets means that it can have trouble characterizing a stable posterior. Furthermore, because of sparse sampling, many of the sets that it did identify were further toward the edges of the target confidence intervals, and hence, likelihoods were on average lower for random search. Notably, restricting IMIS sampling to those parameter sets that fit within all the target confidence intervals did not substantially alter its posterior distribution (Supplementary Appendix Figure S3). Unlike IMIS, only 1% of SIR posteriors fit within all target confidence intervals, largely because of the targets relating to deaths (Figure 2).
Although IMIS took slightly longer to run, it generated the most unique posterior parameters per unit time. IMIS was also able to use the targets more efficiently to better inform the posterior distribution compared with SIR. Figure 3 shows that the IMIS posterior distributions are often more peaked and differ more from the prior distributions than the SIR posteriors do (see also Supplementary Appendix Figure S2). Because samples from the prior distributions are frequently inconsistent with the calibration targets (Figure 2), parameter distributions that generate outputs consistent with the targets and have different shape and mean than the prior distributions reflect improvement.
Detailed IMIS Calibration Results
Given its advantages, the remainder of the results focus on outcomes from IMIS calibration (see the Supplementary Appendix for more outcomes from SIR and random search).
Consistent with the available literature, posterior cholera incidence for all ages was at the lower end of the prior ranges and was higher for children than adults: 6-month incidence of 16.9 [2.8–56.6] cases per 1000 among infants and 7.0 [3.1–17.9] among children aged 1 to 4 years versus 3 to 4 cases per 1000 among older ages (Figure 3; Supplementary Appendix Figure S2). Even so, as anticipated, these estimates are larger than literature estimates based on passive surveillance14,17 (Supplementary Appendix Figure S4). For all ages greater than 1 year, hospital-based treatment for severe cholera cases is greater than 90% but is substantially lower for mild cases (e.g., closer to 25% for older age groups), which is consistent with expected treatment-seeking behavior. Our results confirm that hospital-based treatment for severe cholera cases is effective at preventing mortality (e.g., case fatality is as low as 0.1% to 3% for adults but somewhat higher and more uncertain for children). This finding is consistent with data from hospitals in Bangladesh, which has also tended to find a higher rate of fatalities among children than adults hospitalized for cholera. 31 In contrast, case fatality for severe cholera patients who are not hospitalized is high although difficult to characterize precisely. Sources including those based on active surveillance from other countries report case fatality of 50%,37,45 consistent with our posterior estimates of case fatality for severe cases absent hospitalization.
The marginal posterior distributions of parameters that indirectly relate to target outcomes (e.g., the fraction of cholera cases that are severe) were less informed by calibration (Supplementary Appendix Figure S2). As there are less target data for infants, this is also the case for most parameters relating to this age group. However, even for parameters whose marginal posterior distributions do not differ substantially from the priors, calibration informs their correlations with other parameters (see Supplementary Appendix Figures S5 and S6).
Cholera Burden: Comparison with Uncalibrated Estimates
The calibrated parameters have important implications for predicted cholera burden. Calibrated parameter distributions differed not only from the uninformed priors but also from parameter estimates one might derive directly from the literature, without adjusting for biases and differences between what is measured in the literature and what our parameters represent (Table 2; Supplementary Appendix Figure S3). We compared the overall cholera disease burden estimated based on posterior parameter distributions from IMIS, SIR, and random search with disease burden estimates using naïve uncalibrated parameters (Figure 4; Supplementary Appendix Tables S4 and S5; Supplementary Appendix Figure S7).

Cholera burden estimates using naïve versus calibrated parameters.
Overall estimates of cholera cases and severe cholera cases generated using naïve parameters are substantially lower than estimates using parameters from all 3 calibration methods, whereas naïve estimates of cholera deaths are substantially higher than calibrated estimates. Overreliance on the literature at risk of underreporting bias can lead to underestimates of both the expected burden but also of uncertainty. IMIS-based estimates of burden are 499 thousand (196 thousand to 1 million) cholera cases every year, whereas estimates based on naïve parameters are 278 (190 to 388) thousand cases. Naive parameters also underestimate severe cases compared with calibrated parameters, although to a lesser degree. These differences are primarily driven by overall lower underlying incidence estimates, because the naïve parameters do not account for underreporting (Supplementary Appendix Figure S4).
The opposite holds for predicted deaths; we would substantially overestimate deaths using naïve parameters. Using calibrated parameters from IMIS yields an overall estimate of 2.7 (1.2 to 5.6) thousand deaths annually, compared with 13.9 (5.6 to 27.5) thousand annually using naïve parameters. These differences are primarily due to lower naïve estimated hospitalization rates for severe cases (especially for older age groups; see Appendix Figures S4 and S7), as well as higher nonhospitalized case fatality. The naïve estimates of these parameters, especially for older age groups, come from studies of different diarrheal diseases (Table 2) and produce estimates of case mix (severe v. mild) in hospitals that differ substantially from observed data in urban Bangladesh. Models with naïve parameters estimate far more nonhospitalized severe cases, translating into more nonhospitalized fatalities. Hospitalized fatalities are actually lower using the naïve parameters than the calibrated parameters. Calibration also reduces uncertainty around deaths—especially those that occur outside of hospitals—compared with naïve parameters.
Draws from the naïve parameter distribution are far more likely to produce output that is incompatible with 1 or more calibration targets (Figure 5) and has lower likelihood, indicating that calibration using IMIS was a substantial improvement over estimates based on the literature without adjusting for possible biases; its outputs are credible compared with observed data, and hence, its projections are more credible as well.

Fit of naïve parameter outputs to targets. Black lines represent the target means and confidence intervals. Gray lines show results from individual parameter sets drawn from the naïve parameter distributions.
Cholera Burden: Comparison with Calibration Methods
Comparisons of cholera burden estimated using the 3 calibration techniques mirror comparisons of the posterior parameters (Supplementary Appendix Table S4, Table S5, Figure S7). The IMIS posterior yields smoother estimates of cases, severe cases, and deaths than random search posteriors, based on many more unique posterior parameter sets, which have higher face validity and stability. SIR estimates are characterized by greater uncertainty and thicker right tails, which results in higher estimates of burden and wider confidence intervals that vastly exceed the IMIS mean and uncertainty. As SIR’s posterior parameter sets have lower likelihoods on average, this is reflected in more uncertain and potentially upward biased estimates of cholera burden.
Discussion
We find that our adaptation of IMIS was in general a better-performing method across a range of metrics, including number of unique good-fitting parameter sets generated, goodness of fit to target data, and the degree to which posterior parameters and burden estimated using these parameters is informed compared with priors. However, we also identified some advantages and disadvantages of each method. Random search is flexible and relatively simple to understand and code. It does not require that the user choose an explicit likelihood or even goodness-of-fit function, because accept/reject or other criteria can be used. Both random search and SIR allow the user to generate parameters with any types of constraints (e.g., correlations), and because the resulting parameter sets are a subset of those samples, these constraints will be maintained. Iterations of SIR, and especially random search, can be run quickly and in parallel. Furthermore, with rejection sampling (for random search), the results can be contained in one relatively small file. However, in our case, this single file was too small, with only 42 parameter sets after sampling 50 billion. Because they are completely undirected, random search and SIR perform less well given large search spaces (either because of a higher-dimensional parameter space or wide feasible ranges for each parameter). Notably, IMIS can also perform less well in large search spaces for other reasons, discussed below, but we were able to address these weaknesses through our adaptations of the original IMIS algorithm. Our implementation of random search may also be less rigorous than Bayesian calibration methods in the interpretation of the posterior distribution. Although the random search parameters can be thought of as a posterior if we assume that observed incidence and deaths follow uniform distributions, these probabilities are more likely to follow beta or normal distributions.
SIR has some advantages over random search. Because of its weighting methodology, the average likelihood was higher (since good-fitting parameters appear more frequently when sampling from the posterior distribution), and resulting distributions were smoother because more posterior parameters were sampled. However, SIR was more difficult to run. It required more time and space than random search because each parameter set file had to be retained and then reopened multiple times to compute the sum of likelihood and generate files sorted by likelihood. We also wanted to explore calibration methods that could target the space near the good-fitting parameter sets generated by random sampling, since we knew that both random search and SIR were missing potentially good-fitting parameter sets by nature of their somewhat sparse sampling strategy.
It was these drawbacks of SIR and random search that led us to explore IMIS, which adds an iterative directed search component to SIR. IMIS produced posterior parameter sets with higher likelihood and equal or better fit to the target distributions. However, it was the most complex algorithm, allowed for the least parallelization, and took the most time. IMIS required that initially billions and then hundreds of thousands of parameter sets be retained and sorted. IMIS on the scale we ran it would be virtually impossible without parallel computing and a high-performance server or similar (however, the same is true of our implementations of random search and SIR). On a high-performance computer cluster with parallel computing but limits on the volume of jobs that can be run in parallel, random search took about 1½ days, SIR took about 2½ weeks, and IMIS took almost 3 weeks (Table 3). These calibration techniques may not be feasible with more complex models that take longer to run, such as dynamic compartmental models governed by systems of differential equations, which are often used to model infectious diseases such as cholera and require the model to be independently burned in for every parameter set tried in calibration.
These are not the only potential drawbacks to using IMIS. As we found, without adjustments, IMIS may underestimate uncertainty because the algorithm stops searching for parameters in areas further away from the best-fitting parameter set. Similarly, IMIS performs less well when there are multiple optima rather than one global optimum, since it tends to settle in one area. We solved this problem by running multiple IMIS chains but found that the performance of IMIS was determined by how good the best parameter set from the initial sample was. This was especially true in our situation, in which one parameter set tended to have a much higher likelihood than the rest. Sampling more densely in the initial random sampling step addresses some of these issues. Finally, while the relative performance of different calibration methods found in this article may generalize to similar contexts with many degrees of freedom and wide search ranges, in other situations, random search or SIR may perform better. All 3 algorithms may perform well in identifiable models with lower-dimensional search spaces, and we do not intend to imply that IMIS is an all-around better calibration method than random search or SIR.
While this article explores several calibration techniques, there are other methods that we did not cover, such as MCMC and directed search techniques, that may also perform well under high uncertainty. While directed search may be quicker than random search, SIR, and IMIS in the context of large feasible ranges, the interpretation of uncertainty is not straightforward. Because directed search does not use a Bayesian framework, variation in parameter sets (e.g., generated by initiating directed search at different starting values) cannot be interpreted as parameter uncertainty and cannot be propagated in a model to estimate uncertainty around model results. The techniques we explored were also chosen because they did not require special software (as is often the case for MCMC) and all used a similar underlying model and calibration framework, which makes it feasible for a user to start from one technique and explore additional techniques as needed.
Importantly, none of these calibration techniques completely solves the problem of uncertainty and poor data. Calibration output will always be only as precise as the underlying data used as calibration targets. Decisions around prior or initial distributions, targets, goodness-of-fit functions, and tuning parameters, although based on available evidence and established best practices, are context specific, can feel arbitrary, and require both creativity and patience to see what works best in a particular situation. Underidentifiability remained a challenge for us across all 3 calibration methods because of our model structure and lack of additional data to use as calibration targets. Generally, underidentifiability would be addressed through the inclusion of additional targets (if available), narrower prior distribution ranges (if these are known), or fewer parameters (e.g., less age stratification), not calibration itself. 42
Approaches other than calibration have also been used to adjust for the biases associated with passive surveillance of infectious diseases. For instance, a hybrid surveillance approach that supplements passive surveillance estimates with data on health care utilization has been used for typhoid fever.46,47 However, in our case, data on health-seeking behavior and hospital admittance for cholera were unavailable, and even sources of information on care-seeking among adults with diarrheal diseases were lacking. Calibration appeared to be the best approach given the limitations of available data.
Ours is not the first model of cholera in Bangladesh. In a review of the recent literature, we found several studies28,48–56 that used models to simulate cholera cases and deaths in Bangladesh. Of those that reported comparable results or parameters, 4 of the models28,49,50,53 were calibrated to data from rural Bangladesh and may be less applicable to urban areas because of differences in treatment and access to clean water and sanitation and also because of variation in population density, social networks, and migration patterns. Underreporting of symptomatic cases is not accounted for in Longini et al., 50 and Dimitrov et al. 28 make the simplifying assumption that the case reporting rate is 10%. Khan et al. 56 used the reporting rate as an adjustment factor to apply the model from Dimitrov et al. to Dhaka, but the model estimated only reported cases and deaths associated with those cases, and converting a model of cholera in rural Bangladesh to an urban setting may require more adjustment than a simple multiplier. Most models are not age stratified or use simple adjustment factors to account for incidence differences across age groups. Our model is the first to explicitly model case severity and the impact of treatment on both case reporting and mortality.
It is also valuable to compare our results with those obtained by these other models. Although direct comparison between models is challenging because of their varying structure and outputs, we find that our model estimates slightly lower average underlying incidence than Longini et al. 50 (mean annual incidence was 11.2 cases/1000 compared to 8.9 cases/1000 in our model) but that the confidence intervals around our model’s outputs and theirs generally overlap. Dimitrov et al. 28 assumed that 10% of symptomatic cases were reported (25% in sensitivity analysis); through calibration, we found higher hospitalization rates for both mild and severe cases and thus estimate that a larger share of symptomatic cases is reported than assumed by Dimitrov. While models assessing the cost-effectiveness of cholera interventions often use parameters that are directly based on passive surveillance data or are widely used in the literature, we show through our comparisons that this approach should raise concerns about biased estimates.
Conclusion
We demonstrate that calibration is important and valuable in accurately modeling cholera in Bangladesh, since we found a greater underlying burden of cholera cases and fewer deaths than would be implied by extrapolating directly from epidemiological reports based on passive surveillance. This finding confirms prior suspicions that passive surveillance substantially underestimates the overall cholera incidence7,14 and signifies the value of modeling the underlying disease process and surveillance system. Choosing a calibration method that can account for biases due to passive surveillance is important in such settings; of those we evaluated, IMIS yielded the best results, balancing between covering the entire parameter space and using likelihood to focus in on parameter sets that are most consistent with the observed data. Our adaptations of the IMIS algorithm, which we make freely available, and our general approach to sampling from broad prior distributions in an attempt to account for possible biases, should be considered in many contexts where calibration to scarce or poor-quality data is a concern.
Supplemental Material
Ryckman_Cholera_Calibration_appendix.rjf_online_supp – Supplemental material for Methods for Model Calibration under High Uncertainty: Modeling Cholera in Bangladesh
Supplemental material, Ryckman_Cholera_Calibration_appendix.rjf_online_supp for Methods for Model Calibration under High Uncertainty: Modeling Cholera in Bangladesh by Theresa Ryckman, Stephen Luby, Douglas K. Owens, Eran Bendavid and Jeremy D. Goldhaber-Fiebert in Medical Decision Making
Footnotes
Acknowledgements
We thank Dr. Fernando Alarid-Escudero and Dr. Hawre Jalal for sharing their expertise and engaging in useful discussions around Bayesian calibration methodologies. Some of the computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work described in the article was completed at Stanford University and has previously been presented at the 2018 and 2019 annual Society for Medical Decision Making meetings. Theresa Ryckman received funding from the National Science Foundation’s Graduate Research Fellowship and the Stanford Graduate Fellowship in Science and Engineering. Dr. Owens was supported by the Department of Veterans Affairs and by grant 2R37DA01561216 from the National Institute on Drug Abuse. Financial support for this study was provided in part by grants and fellowships from the National Science Foundation, the Stanford Graduate Fellowship in Science and Engineering, the Department of Veterans Affairs, and the National Institute on Drug Abuse. The funders played no role in the study, and the funding agreements ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the report.
This work was supported by the National Science Foundation (grant DGE-1656518), the Stanford Graduate Fellowship in Science and Engineering, the Department of Veterans Affairs, and the National Institute on Drug Abuse (grant 2R37DA01561216). This material is based on work supported by the National Science Foundation Graduate Research Fellowship Program under grant DGE-1656518. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
