Estimating hidden populations by transferring knowledge from geographically misaligned levels

Abstract

The estimation of hidden sub-populations is a hard task that appears in many fields. For example, public health planning in Brazil depends crucially on the number of people who hold a private health insurance plan and hence rarely use the public services. Different sources of information about these sub-populations may be available at different geographical levels. The available information can be transferred between these different geographic levels to improve the estimation of the hidden population size. In this study, we propose a model that uses individual-level information to learn about the dependence between the response variable and explanatory variables by proposing a family of link functions with asymptotes that are flexible enough to represent the real aspects of the data and robust to departures from the model. We use the fitted model to estimate the size of the sub-population at any desired level. We illustrate our methodology by estimating the sub-population that uses the public health system in each neighborhood of large cities in Brazil.

Keywords

Bayesian statistics, generalized linear model, hidden population, link function

1 Introduction

How to sample from hidden and elusive populations has been investigated in the statistical literature for some time.^1–5 Special sampling techniques are needed to collect information about the population of interest in this kind of problem. Similar sampling techniques are being used by computer scientists to sample social networks where the size and population structure are not easy to pre-determine.^6–8

Estimating the size of sub-populations is an important and non-trivial task in many areas. In ecology, many studies try to estimate the population sizes of rare or endangered species.^9–11 The precise estimation of these species populations can help ecologists to better understand their behavior and characteristics. From a health-care perspective, it is also important to identify hidden populations and determine their size. Studies of rare populations can provide a better understanding of them and insights for improving or creating specific public policies.^12–19

Although direct information on an elusive population can be hard to obtain, other sources of information may be available and can be used to infer the hidden population size. Data from misaligned geographical units can be used to transfer knowledge from an available data source and obtain more precise estimates of the population size of interest.

The aim of this study is to develop a statistical methodology to estimate an unknown, hard-to-reach sub-population in small areas. The motivation for our work comes from our involvement in InfoSAS, a large statistical project aimed at automatically detecting anomalies in the Brazilian public system of payments to service providers. The detected cases should be audited subsequently. The InfoSAS procedure applies a set of statistical data analysis algorithms to monthly collected data covering about 5000 groups of medical procedures in 5570 municipalities and more than 6000 health-care providers. Each combination of medical procedure, municipality, and health-care provider produces a different time series of the monthly number of procedures performed. The total number of time series exceeds 30 million, a huge number that requires automatic and efficient time series procedures.²⁰ This large set of time series is screened for anomalies in the number of provided services by taking into account the estimated user population in the coverage area of the service provider.

InfoSAS requires an estimate of the sub-population that does not have access to private health insurance at different geographical levels. To understand the relevance of this number, one needs to know the size of both the private and the public Brazilian health systems. Brazil manages the largest public health system in the world, the Unified Health System (SUS), with more than 3.5 billion outpatient procedures and more than 12 million hospitalizations per year, spending nearly 26 billion reais (around 5.83 billion dollars) on these services. More than 220,000 establishments provide services to the SUS, including hospitals, clinics, and laboratories. The need for an effective regulation and control system to take care of such a system is obvious. This system exists and works in the various administrative spheres in which SUS is organized.

The private health insurance market in Brazil is also very large due to the inefficiencies of the public health sector, covering more than 50 million people and ranking as the second biggest in the world after the United States, according to the Economist Intelligence Unit.²¹ The correct number of people with no private insurance plan is crucial for creating health-care policies and managing resources at different political scales, for example, municipalities, neighborhoods, census tracts, and so on.

In this study, we present a methodology that, although developed with this particular application in mind, is general enough to be applied to other situations. Features from a source of information in one geographic level are borrowed and projected to a different one. Thus, by using similar features available on different geographical levels, one can estimate the sub-population of interest over these levels.

For example, confidentiality requirements may impose geographical limitations on the public release of data on HIV cases or political voting statistics. Data are usually available only at a minimum spatial aggregation level due to privacy protection concerns. Covariates at the same geographical level, as well as at smaller levels, can be used to explain the HIV or voting numbers at the coarser spatial level and then be transferred to the smaller level. One might be interested in knowing HIV numbers at a neighborhood level to guide the establishment of a new treatment center. Likewise, election results at a more disaggregated level could help political scientists to understand which areas of a city are more likely to vote for a specific candidate.

Thus, we propose a method under the generalized linear regression framework^22,23 combined with a new family of link functions. In this link function, we introduce two new parameters, c and d, that reflect relevant aspects of our problem. The first one can be interpreted as the proportion of individuals who will experience the event, independently of the covariates. Regarding the second one, the value of $1 - d$ represents the proportion of individuals who will not experience the event, independently of the covariates. Different classes of link functions for binomial data have been proposed in the literature by many authors.^24–29 The literature of item response theory (IRT) takes advantage of different link functions to allow for more flexible models that are able to better reflect the characteristics of the real data.^30,31 The key advantage of this type of modeling is that it realistically captures more complex relationships between the features and the response data. For example, heavy-tailed links allow for more robust inference in the presence of outliers, while asymptote terms allow for models that prevent the probability from going to zero or one at the extremes. With the proposed methodology, we transfer knowledge from individual-level data from the National Household Sample Survey (PNAD) to estimate the SUS population at the neighborhood level in large cities in Brazil.

This paper is organized as follows: In Section 2, the methodology is proposed. In Section 3, Bayesian inference is discussed and the FLAMES package introduced. A simulation study is presented in Section 4. In Section 5, an application is presented to estimate the SUS population size for all neighborhoods of Belo Horizonte, Brazil. Finally, Section 6 ends with a discussion of the model and the obtained results.

2 Methodology

2.1 Geographically misaligned information

Demographic information is an essential ingredient in epidemiological studies since it provides estimates of the population at risk. However, population data for small geographical regions are often scarce or only occasionally available, as when demographic censuses are carried out. Hence, research studies using such small geographical areas as the unit of analysis may not be feasible.

From a public policy perspective, knowing the characteristics of small geographical regions is necessary, since it allows decision making to be specific and precise for the region of interest. One may use data from a larger region where the small unit is located, but the quality of the approximation depends on the within-region variability. This quality was anticipated to be very low in our specific application.

To work around this problem, it is possible to perform a knowledge transfer between the information available at different geographical levels. Figure 1 shows how this can be done by collecting individual-level information from a sample spread over a larger region, the state of Minas Gerais (MG) in Brazil, a region of about the same extent as France. Each sampled individual is shown at his or her map location in Figure 1(a) as a dot or a cross according to Y, a binary outcome of interest, such as having a private insurance plan or not, respectively. At this level, we have covariates X that can be used to make a connection between the outcome Y and the covariates as in usual statistical modeling. Occasionally, we use the term features as a synonym for covariates. The Y symbols are shown with sizes proportional to the value of one such covariate in Figure 1(a).

Figure 1. Knowledge transfer for the misaligned geographical level scheme. (a) Individual level, (b) outcome in the areal level, (c) covariate at the misaligned level, and (d) estimation at the misaligned level.

We use the knowledge about the relationship between covariates and outcome to estimate the smoothed outcome at all spatial levels. In Figure 1(b), we apply the acquired knowledge to fit our outcome given the same or similar covariates used in the first step. In this step, it is important to verify whether the fit is satisfactory by comparing the smoothed outcome with real values by means of a validation set, for example. With an accurate fit, we can proceed to use the model to transfer knowledge to different geographical levels.

First, it is necessary to find similar covariates at the desired smaller geographical level. This is shown in Figure 1(c), the map of Belo Horizonte, a large city with more than 2 million inhabitants but covering a tiny region, less than 0.06% of the territory of MG state. We apply the model again to estimate the hard-to-reach population at the misaligned small geographical level (Figure 1(d)). This process can be repeated for many different geographical levels as long as they share similar covariates and there is at least one level where the data of interest are available and one where the model goodness of fit can be verified (with a validation set).

The choice of the inference model can vary according to the application. For this work, we propose a flexible and robust link model for binomial data that can easily be generalized for any hierarchical modeling structure.

2.2 Flexible link function with asymptotes

The generalized linear model (GLM)^22,23 is a broad modeling class. The idea is to explain the average response of a variable of interest using some features. First, an adequate probability distribution is defined for the variable of interest (the response variable); then, a link function is defined to map the features to the conditional mean of the selected distribution.

To model the size of the sub-populations, we focus on a robust GLM approach for binomial data with a link function that is capable of modeling asymptotic behavior as well as accommodating the presence of outliers or asymmetry. For each region i in the study, there is a total population n_i and we need to determine how many of its members belong to the population of interest. A special case of the binomial regression is the binary regression, where n_i = 1. In general, the model can be represented as

\begin{array}{l} Y_{i} | p_{i} \sim Binomial (Y_{i}; n_{i}, p_{i}) \\ g (p_{i} | β) = β_{0} + β_{1} \times X_{1 i} + \dots + β_{k} \times X_{k i} \end{array}

The parameter β_j measures the importance of each of the k covariates in explaining the behavior of the average response. Y_i is the response variable and can take values in $\{0, 1, \dots, n_{i}\}$. The function $g (\cdot)$ is known as the link function and plays a central role in modeling. This function maps the probability of success in the binomial distribution to the real line, in other words, $g : p_{i} \to ℝ$. Since for the binomial regression $p_{i} \in (0, 1)$, any cumulative distribution function (c.d.f.) that has its support defined on the real line $ℝ$ can be used to make this map. If F is a c.d.f. with support on the real line, then $F^{- 1} : (0, 1) \to ℝ$ can be a link function.
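As a minimal sketch of this construction, the Python snippet below pairs three familiar c.d.f.s with their inverses and checks that the link undoes the c.d.f. The dictionary `LINKS` and the helper `round_trip` are hypothetical names introduced only for illustration; they are not part of any package mentioned in this paper.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

# Each entry pairs a c.d.f. F (real line -> (0, 1)) with its inverse,
# the link g = F^{-1} ((0, 1) -> real line).
LINKS = {
    "logit": (
        lambda x: 1.0 / (1.0 + math.exp(-x)),
        lambda p: math.log(p / (1.0 - p)),
    ),
    "probit": (_STD_NORMAL.cdf, _STD_NORMAL.inv_cdf),
    "cloglog": (
        lambda x: 1.0 - math.exp(-math.exp(x)),
        lambda p: math.log(-math.log(1.0 - p)),
    ),
}


def round_trip(name, x):
    """Map x to a probability with F, then back to the real line with g."""
    cdf, link = LINKS[name]
    return link(cdf(x))
```

For moderate values of x, `round_trip(name, x)` returns x up to floating-point error, which is exactly the statement that g is the inverse of F.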

The most common link functions are the logit and probit, obtained by selecting the c.d.f. to be the standard logistic and standard normal, respectively. These link functions are symmetric around their central value due to the nature of their distribution functions and are not adequate when the data approach the extreme values 0 and n_i of the response variable at different rates. Other c.d.f.s can be used as link functions, for example, the Cauchy, Student-t, Gumbel, and so on. These distributions have other characteristics that make them more robust and attractive to model data. The Cauchy (cauchit link) and Student-t (robit link) are also symmetric but have heavier tails, and thus are more robust to the presence of outliers. The Gumbel generates an asymmetric link function called cloglog. Since it is asymmetric, it can approach the extremes at different rates, which is commonly observed in real data. Another important change is borrowed from the literature of IRT models. Barton and Lord³¹ include the use of asymptotes in the link function. In our case, we define the new link function, for binomial regression, as

g^{⋆} (p_{i} | β) = c + (d - c) g (p_{i} | β)

(1)

In this model, c represents the lower asymptote value and d is the upper asymptote value, with $0 \leq c < d \leq 1$.

In Figure 2(a), we can see different link functions and their forms. Notice that only the cloglog presents asymmetry; all other links are symmetric around their central point, with different growth rates. Figure 2(b) shows how the asymptote parameters affect the link function at its extremes. When c > 0 (dash-dotted line), the link function does not converge to 0 when $x \to - \infty$, but to c. If d < 1 (dashed line), the function converges to d when $x \to + \infty$ and, as demonstrated by the dotted line, the method allows for the existence of both asymptotes.

Figure 2. Link functions and types of asymptotes. (a) Examples of link functions and (b) examples of link functions with asymptotes.
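The limiting behavior described above can be checked numerically. The sketch below assumes that the asymptotes act on the probability scale, that is, the success probability is c + (d − c) F(x) with F the c.d.f. behind the link, which is the form used later for the state-level model in Section 5.2. The function names are hypothetical and chosen for illustration only.

```python
import math


def cloglog_cdf(x):
    """Inverse of the cloglog link: F(x) = 1 - exp(-exp(x))."""
    return 1.0 - math.exp(-math.exp(x))


def logistic_cdf(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def response(x, c=0.0, d=1.0, cdf=cloglog_cdf):
    """Success probability c + (d - c) * F(x), bounded between c and d."""
    if not 0.0 <= c < d <= 1.0:
        raise ValueError("asymptotes must satisfy 0 <= c < d <= 1")
    return c + (d - c) * cdf(x)


# Far to the left the curve flattens at c, far to the right at d.
# (Keep |x| moderate: math.exp overflows for arguments above roughly 709.)
low = response(-30.0, c=0.20, d=0.95)
high = response(30.0, c=0.20, d=0.95)

# Symmetry check: for a symmetric link F(-x) + F(x) = 1; cloglog violates it.
logit_sum = logistic_cdf(-1.0) + logistic_cdf(1.0)
cloglog_sum = cloglog_cdf(-1.0) + cloglog_cdf(1.0)
```

Here `low` and `high` sit at the two asymptotes, `logit_sum` equals 1 up to rounding, and `cloglog_sum` differs from 1, reflecting the asymmetry of the cloglog.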

3 Bayesian inference

To fit the model proposed in Section 2.2, a hierarchical Bayesian structure is introduced. To allow reproducibility and provide access for a wide range of practitioners, an R package has been created that can be installed following the instructions in the FLAMES: Flexible Link function with AsyMptotES repository https://github.com/douglasmesquita/FLAMES.

In a Bayesian setting, the posterior distribution is proportional to the product of the likelihood and the prior distribution of the parameters, as presented in equation (2)

π (β, c, d, ν, λ | y) \propto π (y | β, c, d, ν) π (β) π (c, d) π (ν | λ) π (λ)

(2)

in which $π (β, c, d, ν, λ | y)$ is the posterior distribution of our set of unknown parameters, $π (y | β, c, d, ν)$ represents the likelihood, $π (c, d)$ is a joint prior distribution for c and d respecting the constraint $0 \leq c < d \leq 1$, $π (ν | λ)$ is the prior distribution for ν, and $π (λ)$ is a hyperprior for λ.

The hierarchical Bayesian representation of the model is given by

\begin{array}{l} Y_{i} | p_{i} \sim Bernoulli (Y_{i}; p_{i}) \\ g (p_{i} | β, c, d, ν) = β_{0} + β_{1} \times X_{1 i} + \dots + β_{k} \times X_{k i} \end{array}

where the likelihood is Bernoulli and $g (p_{i} | \cdot)$ represents any suitable link function. A third level is necessary to specify the priors of β, the asymptotes (c, d), and, in the case of the robit link function, the prior for the degrees of freedom (ν). For the degrees of freedom, we still have a fourth level to assign a hyperprior for the λ parameter.^32–35

The implementation of FLAMES uses the following prior and hyperprior distributions

\begin{array}{l} β \sim N_{p} (0, σ_{β}^{2} I) \\ c \sim Beta (a_{c}, b_{c}); d | c \sim Beta (a_{d}, b_{d}) I (d > c) \\ ν \sim Exp (λ) \\ λ \sim Uniform (a_{λ}, b_{λ}) \end{array}

where $σ_{β}^{2}$ is the prior variance for β and I is the identity matrix with appropriate dimension. The $π (c, d)$ distribution is given by $π (c, d) = π (d | c) π (c)$, where the indicator function in $π (d | c)$ is needed such that $g^{- 1} (\cdot)$ is a non-decreasing function and a cumulative distribution function. The uniform distribution assigned to λ must cover reasonable values such that the mean of ν has a range to allow many different important degrees of freedom.
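To make the structure of the posterior concrete, the following Python sketch evaluates an unnormalized log posterior for the Bernoulli model with a cloglog response curve and asymptotes. It is an illustration under stated assumptions, not the FLAMES implementation: the hyperparameter defaults are arbitrary, the joint prior of (c, d) is taken proportional to Beta(c) × Beta(d) × I(c < d) (ignoring the c-dependent normalizing constant of the truncated conditional), the open interval 0 < c < d < 1 is required so that the Beta log densities are finite, and the ν and λ levels are omitted because they only apply to the robit link.

```python
import math


def log_beta_pdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)


def log_posterior(beta, c, d, X, y, sigma2=100.0,
                  a_c=1.0, b_c=1.0, a_d=1.0, b_d=1.0):
    """Unnormalized log posterior of (beta, c, d).

    X is a list of covariate rows without the intercept, y a list of 0/1
    responses. All hyperparameter defaults are illustrative assumptions.
    """
    # The indicator I(c < d): zero prior mass outside 0 < c < d < 1.
    if not 0.0 < c < d < 1.0:
        return -math.inf
    # Independent N(0, sigma2) priors on the coefficients (up to a constant).
    lp = -0.5 * sum(b * b for b in beta) / sigma2
    # Beta priors on the two asymptotes.
    lp += log_beta_pdf(c, a_c, b_c) + log_beta_pdf(d, a_d, b_d)
    # Bernoulli log-likelihood with p_i = c + (d - c) * F(eta_i).
    for row, y_i in zip(X, y):
        eta = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        eta = max(-700.0, min(700.0, eta))  # guard math.exp against overflow
        p = c + (d - c) * (1.0 - math.exp(-math.exp(eta)))
        lp += math.log(p) if y_i == 1 else math.log(1.0 - p)
    return lp
```

Because p always lies between c and d, both logarithms in the likelihood are well defined whenever the constraint on (c, d) holds.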

Several link functions are available in the FLAMES package. The complete list is detailed in Table 1, together with their expressions and some desired characteristics.

Table 1. Link functions and their characteristics.

Name	Expression	Asymmetry	Heavy tail
logit	$g (x) = \log (\frac{x}{1 - x})$	No	No
probit	$g (x) = Φ^{- 1} (x)$	No	No
cloglog	$g (x) = \log (- \log (1 - x))$	Yes	No
loglog	$g (x) = \log (- \log (x))$	Yes	No
cauchit	$g (x) = F_{t_{1}}^{- 1} (x)$	No	Yes
robit	$g (x) = F_{t_{ν}}^{- 1} (x)$	No	Yes

Selecting the link function is tricky, and the best way to choose one is by testing. However, the problem and the associated data set can suggest a good candidate. If there is no evidence of asymmetry, we would prefer a symmetric link function such as the logit or probit. If the logit and probit do not work well in this symmetric case, a link function that deals with heavy tails, such as the robit, is recommended. In the asymmetric case, the cloglog link function should be considered.

In Table 1, $Φ^{- 1} (\cdot)$, $F_{t_{1}}^{- 1} (\cdot)$, and $F_{t_{ν}}^{- 1} (\cdot)$ are the inverse c.d.f.s of the standard normal distribution, the Student-t distribution with 1 degree of freedom, and the Student-t distribution with ν degrees of freedom, respectively.

Since the posterior distribution has no closed-form expression, a Gibbs sampler cannot be directly used and another Bayesian sampling method is needed to obtain the posterior estimates. Several approaches are available in the literature, and two of them are available in the FLAMES package: (1) adaptive rejection Metropolis sampling (ARMS)³⁶ and (2) the Metropolis–Hastings algorithm.^37,38 The ARMS method is slower but simpler to tune. Also, in general, a shorter chain is needed to achieve convergence. The Metropolis–Hastings algorithm is faster, although correct tuning of the proposal distribution is necessary to achieve convergence.
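For readers unfamiliar with the second option, the sketch below shows a generic random-walk Metropolis–Hastings sampler in Python. It is not the FLAMES implementation, whose proposal distribution and tuning are not described here; the function name, the Gaussian proposal, and the step size are assumptions made for illustration. The sampler is run on a toy standard normal target rather than on the model posterior.

```python
import math
import random


def metropolis_hastings(log_target, start, n_iter, step, seed=0):
    """Random-walk Metropolis-Hastings with independent Gaussian proposals.

    log_target takes a list of floats and returns an unnormalized log
    density; start must have a finite log density.
    """
    rng = random.Random(seed)
    current = list(start)
    current_lp = log_target(current)
    draws, accepted = [], 0
    for _ in range(n_iter):
        proposal = [v + rng.gauss(0.0, step) for v in current]
        proposal_lp = log_target(proposal)
        # The proposal is symmetric, so the acceptance probability is
        # min(1, target(proposal) / target(current)).
        log_ratio = proposal_lp - current_lp
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        draws.append(list(current))
    return draws, accepted / n_iter


# Toy usage: sample from a standard normal, log density -x^2 / 2 + const.
draws, rate = metropolis_hastings(lambda v: -0.5 * v[0] ** 2,
                                  start=[0.0], n_iter=20000, step=1.0)
```

The `step` argument plays the role of the proposal tuning mentioned above: too small a step gives a high acceptance rate but slow exploration, while too large a step leads to frequent rejections.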

4 Simulation study

This study aims at showing the flexibility and robustness of the proposed methodology. We make the data-generating model more complex at each step, and we show that our proposed model is capable of modeling the simpler scenarios as well as the simpler link functions do. When moving to the more complex ones, similar to what is usually observed in real data sets, our model has clear advantages over the alternatives.

We use the cloglog link function since it is asymmetric and is the one selected in our application. To perform the study, we considered four scenarios: (1) the conventional model, with c = 0 and d = 1; (2) a model with a fixed minimum proportion of cases, implying c = 0.20 and d = 1; (3) a model with a maximum proportion of non-cases, implying c = 0 and d = 0.95; and (4) a model with a minimum and maximum proportion of cases and non-cases, respectively, implying c = 0.20 and d = 0.95. In all scenarios, we take $β_{0} = 0$, $β_{1} = 0.5$, and $β_{2} = - 1$.

For each model, we generated 1000 data sets and, for each data set, the four data-generating models were considered. Our goal is to show that our proposed model, the one with $c \geq 0$ and $d \leq 1$, always recovers the true characteristics of the generating one, thus providing a good fit for all scenarios, from the simplest to the most complex.
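One way to generate a single data set for any of the four scenarios is sketched below in Python. The sample size, the distribution of the two covariates (independent standard normals), and the function and dictionary names are assumptions made for this illustration, since they are not specified in the text; the coefficient vector is passed in by the caller.

```python
import math
import random


def simulate_dataset(n, beta, c, d, seed=0):
    """Draw n Bernoulli responses from the cloglog model with asymptotes.

    Returns a list of (x1, x2, y) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Assumption: two independent standard normal covariates.
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        eta = beta[0] + beta[1] * x1 + beta[2] * x2
        # Success probability bounded below by c and above by d.
        p = c + (d - c) * (1.0 - math.exp(-math.exp(eta)))
        y = 1 if rng.random() < p else 0
        rows.append((x1, x2, y))
    return rows


# Asymptotes (c, d) of the four data-generating scenarios.
SCENARIOS = {1: (0.00, 1.00), 2: (0.20, 1.00), 3: (0.00, 0.95), 4: (0.20, 0.95)}
```

Calling `simulate_dataset(n, beta, *SCENARIOS[k])` with different seeds yields replicated data sets for scenario k, each of which can then be fitted with the four competing models.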

Table 2 shows that, under the simplest model, the estimates of the coefficients are reasonable, independently of the model. The coefficients c and d were estimated around 0 and 1 as expected, showing that there are no asymptotes for this set of data sets.

Table 2. Posterior means (standard deviations) based on 100 data sets considering c = 0.00 and d = 1.00 as generating model.

Model	β₀	β₁	β₂	c	d
cloglog	−0.01 (0.05)	0.49 (0.05)	−1.02 (0.07)	–	–
cloglog (c)	−0.01 (0.06)	0.50 (0.05)	−1.03 (0.07)	0.00 (0.01)	–
cloglog (d)	−0.00 (0.05)	0.49 (0.05)	−1.02 (0.07)	–	1.00 (0.00)
cloglog (c, d)	−0.01 (0.06)	0.50 (0.05)	−1.04 (0.08)	0.01 (0.01)	1.00 (0.00)

The last row, cloglog (c, d), is the proposed model.

Table 3 shows that, under the model with a minimum proportion of cases, the models without the lower asymptote (cloglog and cloglog (d)) fail to estimate the coefficients correctly. With our proposed model, c and d were estimated around 0.20 and 1 as expected, showing that the use of the asymptote for this scenario was successful.

Table 3. Posterior means (standard deviations) based on 100 data sets considering c = 0.20 and d = 1.00 as generating model.

Model	β₀	β₁	β₂	c	d
cloglog	0.27 (0.05)	0.33 (0.04)	−0.65 (0.05)	–	–
cloglog (c)	−0.04 (0.13)	0.51 (0.08)	−1.03 (0.15)	0.21 (0.05)	–
cloglog (d)	0.27 (0.05)	0.33 (0.04)	−0.65 (0.05)	–	1.00 (0.00)
cloglog (c, d)	−0.03 (0.12)	0.52 (0.08)	−1.05 (0.15)	0.22 (0.05)	1.00 (0.00)

The last row, cloglog (c, d), is the proposed model.

Table 4 shows that, under the model with a maximum proportion of non-cases, the models without the upper asymptote (cloglog and cloglog (c)) fail to estimate the coefficients. Our proposal remains robust, with c and d estimated around 0 and 0.95 as expected.

Table 4. Posterior means (standard deviations) based on 100 data sets considering c = 0.00 and d = 0.95 as generating model.

Model	β₀	β₁	β₂	c	d
cloglog	−0.18 (0.05)	0.37 (0.04)	−0.70 (0.04)	–	–
cloglog (c)	−0.19 (0.05)	0.37 (0.04)	−0.70 (0.04)	0.00 (0.00)	–
cloglog (d)	−0.00 (0.08)	0.52 (0.06)	−1.01 (0.09)	–	0.95 (0.02)
cloglog (c, d)	−0.01 (0.08)	0.52 (0.06)	−1.02 (0.09)	0.00 (0.01)	0.95 (0.02)

The last row, cloglog (c, d), is the proposed model.

Finally, Table 5 shows that, under the most complex model, our proposal is able to recover all parameters in a satisfactory way and that c and d were estimated around 0.20 and 0.95 as expected.

Table 5. Posterior means (standard deviations) based on 100 data sets considering c = 0.20 and d = 0.95 as generating model.

Model	β₀	β₁	β₂	c	d
cloglog	0.08 (0.05)	0.24 (0.03)	−0.44 (0.04)	–	–
cloglog (c)	0.04 (0.07)	0.26 (0.04)	−0.48 (0.05)	0.02 (0.04)	–
cloglog (d)	0.21 (0.07)	0.31 (0.04)	−0.58 (0.06)	–	0.97 (0.02)
cloglog (c, d)	−0.06 (0.15)	0.56 (0.13)	−1.06 (0.23)	0.22 (0.06)	0.95 (0.02)

The last row, cloglog (c, d), is the proposed model.

As a general remark, we can see that, in all scenarios, the proposed model with two asymptotes provides good estimates for both the regression coefficients and the asymptotes, showing that it adapts to the characteristics of the true generating model without any inferential loss. Recovering the fixed effects correctly is important when interpreting the model and understanding the relationship between the response and the features. Therefore, because of this flexibility without inferential loss, we can conclude that the model with two asymptotes is preferable to any simpler version. We now proceed with our analysis using only the flexible link $g^{⋆} (\cdot)$ in equation (1).

5 Estimating the SUS population

The SUS aims to offer all Brazilians a free health-care system. Due to the inefficiencies of the SUS system and other reasons, many Brazilians prefer to use private health insurance. Because of that, not all Brazilians are SUS users. Because of the high cost of the system, around 5.83 billion dollars annually, it is of great interest to have a good estimate of the SUS population. We want to estimate the proportion of people in a given small area who rely on the SUS system for their health needs. It is speculated that around 90% of the Brazilian population takes advantage of the system, but this is extremely variable, reaching 100% in poor areas and less than 50% in the more affluent areas.

Given the socioeconomic characteristics of the health-care system, it is natural to believe that individuals with higher income are capable of paying for private health insurance and therefore use the government health system less. Also, men and women may behave differently when dealing with health, and therefore use the SUS system differently. Based on these premises, we expect the average income and the percentage of males to be important features for predicting the probability that a person has health insurance.

Yearly, the National Agency for Supplementary Health Services (ANS), the agency that regulates health insurance in Brazil, provides the number of users of health plans in each Brazilian municipality. The main objective of this application is to estimate the SUS population at geographical levels smaller than municipalities for the main cities in Brazil. In our analysis, we focus on the estimation of the SUS population at the neighborhood level. This information is essential to regulate and create local public policies for the SUS system.

5.1 Available data

As the ANS provides the number of individuals with health insurance for all Brazilian municipalities, it is possible to estimate the SUS population as the difference between the total population and the health insurance users for each municipality. However, for smaller geographical levels, the number of individuals with health insurance is unknown.

To learn how income and sex are related to health insurance status, the 2008 PNAD survey was used. In this survey, a sample of around 40,000 Brazilian households was taken, and information about income and health insurance status was recorded. To estimate the SUS population for each neighborhood, we use the average neighborhood income and the percentage of males obtained from the Brazilian Census,³⁹ and the GLM accuracy was verified using the ANS information at the municipality level.

5.2 SUS population modeling

As described in Section 2, we present a model to estimate the SUS population. First, we model the relationship between not having health insurance and the individual's income and gender. With this information, the proposed binomial model with flexible link function is used to estimate the SUS population in each municipality. The flexibility and robustness of the proposed link function allow a good fit and prediction even with a very small number of covariates, as will be seen in the application. Since the SUS population is available at the municipality level from the ANS, it is used to validate whether the estimation of the proposed methodology is adequate. Because of the very large demographic, economic, and cultural differences in Brazil, a model was fitted for each one of the 27 states, indexed by $j = 1, \dots, 27$. Using the 2008 PNAD survey, let Y_ji = 1 if the ith person living in state j does not have private health insurance, and 0 otherwise. The following model was fitted

\begin{array}{l} Y_{j i} | p_{j i} \sim Bernoulli (Y_{j i}; p_{j i}) \quad \text{for } i = 1, \dots, n_{j} \\ p_{j i} | β = c_{j} + (d_{j} - c_{j}) g (β_{0 j} + β_{1 j} \times \text{Income}_{j i} + β_{2 j} \times \text{Sex}_{j i}) \end{array}

where g represents an appropriate link function, Income_ji is the income of that household (scaled), and Sex_ji is the gender of the household respondent. This step corresponds to Figure 1(a).

With the estimates of ${\hat{c}}_{j}, {\hat{d}}_{j}, {\hat{β}}_{0 j}, {\hat{β}}_{1 j}$ and ${\hat{β}}_{2 j}$ at hand, one can proceed to estimate the SUS population at any level $k \in j$ as

\widehat{\text{Pop}}_{j k} = [{\hat{c}}_{j} + ({\hat{d}}_{j} - {\hat{c}}_{j}) g ({\hat{β}}_{0 j} + {\hat{β}}_{1 j} \times \text{Income}_{j k} + {\hat{β}}_{2 j} \times \text{Sex}_{j k})] \times n_{j k}

where Income_jk is the average income (scaled), and Sex_jk is the proportion of males of region jk. This step corresponds to Figure 1(b).
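The plug-in estimate above amounts to a few lines of arithmetic, sketched here in Python with the cloglog response curve. The function name is hypothetical. The parameter values in the example are the MG estimates reported in Table 6, whereas the neighborhood itself (10,000 inhabitants, scaled average income 0.5, 48% males) is invented for illustration, since the scaling of income is not detailed in the text.

```python
import math


def estimate_sus_population(n, income_scaled, prop_male, c, d, b0, b1, b2):
    """Estimated number of SUS users in an area with n inhabitants."""
    eta = b0 + b1 * income_scaled + b2 * prop_male
    # cloglog response curve with lower asymptote c and upper asymptote d.
    p = c + (d - c) * (1.0 - math.exp(-math.exp(eta)))
    return p * n


# Hypothetical neighborhood evaluated with the MG estimates of Table 6.
pop = estimate_sus_population(10_000, income_scaled=0.5, prop_male=0.48,
                              c=0.18, d=1.00, b0=-0.33, b1=-2.42, b2=0.06)
```

Whatever the covariate values, the estimate stays between c × n and d × n, which is how the asymptotes keep a share of SUS users even in the most affluent areas.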

Figure 3 shows the fit against the income variable for the two most populated states of Brazil: MG and São Paulo (SP). We also show an empirical estimate based on the simple proportion of people with no private insurance in each income level for each state. As can be seen, the proposed methodology provides an adequate link function to fit the data. The empirical curve (light gray) shows that, even for people with high income, the probability that an individual uses the SUS system is higher than zero, while for low income this probability is very high. This emphasizes the need for asymptotes in the modeling. Also, as will be seen from Table 6, there is asymmetry in the data and the necessity to use an asymmetric link function.

Figure 3. Model fitting for Minas Gerais and São Paulo. Dashed lines represent the estimate of c for the two best models. SUS: Unified Health System.

Table 6. Estimates of the model parameters for the 10 most populous states in Brazil.

UF	Model	c	d	β₀	Income (scaled)	Sex (male)	WAIC
BA	cloglog	0.18	1.00	0.24	−2.13	0.04	23,993.07
CE	cloglog	0.18	1.00	0.30	−1.93	0.05	16,758.43
MG	cloglog	0.18	1.00	−0.33	−2.42	0.06	34,982.28
PA	cloglog	0.25	1.00	0.22	−1.63	0.05	15,387.96
PE	cloglog	0.16	1.00	0.22	−3.50	0.05	17,086.31
PR	cloglog	0.21	0.99	−0.29	−1.96	0.08	18,176.87
RJ	cloglog	0.12	0.98	−0.32	−2.10	0.02	24,801.03
RS	cloglog	0.18	1.00	−0.56	−2.35	0.03	29,875.04
SC	cloglog	0.26	0.98	−0.26	−1.89	0.04	9,237.52
SP	cloglog	0.14	1.00	−0.63	−2.22	0.07	46,184.71

CE: Ceará; PE: Pernambuco; RJ: Rio de Janeiro; PR: Paraná; PA: Pará; RS: Rio Grande do Sul; BA: Bahia; UF: Unidade da Federação; WAIC: Widely Applicable Information Criterion; SP: São Paulo; MG: Minas Gerais; SC: Santa Catarina.

Table 6 provides the parameter estimates for the 10 most populated states in Brazil. For all states, the cloglog link provided the best fit in comparison with the logit, probit, cauchit, and robit links. This is a clear indication that the data are asymmetric. Also, in all states, parameter c is significantly different from 0, while d is smaller than 1 for Rio de Janeiro and Santa Catarina. This is further evidence, in addition to that presented in Section 4, that the proposal is flexible and able to accommodate the different characteristics of the data. The regression parameter $β_{1} < 0$ indicates that higher income is associated with a lower probability that the individual uses the SUS system, while $β_{2} > 0$ indicates that men have a higher probability of using the system. Although the income and sex parameters are stable across states, the asymptote parameters vary more, showing the need for a separate fit for each state instead of a global model, in order to respect the local cultural and social diversity.

The c and d parameters have a direct and interesting interpretation: c represents the percentage of the richer population that uses the SUS system, while $1 - d$ represents the percentage of poorer people that does not use it. As expected, almost all poor Brazilians rely on the system for their health care. Out of the richer population, 15% to 25% seem to rely on the system. This finding can be explained mainly by three factors: (1) although the probability of using the public health system decreases with income, a certain proportion of higher-income people do not have private health insurance and make use of the SUS service; (2) as some private health insurance plans do not cover highly complex procedures, their users seek SUS services for care; and (3) some people are wealthy enough to pay for private medical care expenses without any health insurance.

Another important aspect is the comparison of models under different link functions using goodness-of-fit measures. The widely applicable information criterion (WAIC)⁴⁰ allows such a comparison and is returned by FLAMES together with other Bayesian comparison criteria. According to the WAIC, the cloglog link was preferable for the different states.

Table 7.

Fit measures for MG under several models.

UF	Model	c	d	df	$β_{0}$	Income (scaled)	Sex (male)	WAIC
MG	cloglog	0.18	1.00		−0.33	−2.42	0.06	34,982.28
MG	probit	0.22	1.00		−0.02	−2.47	0.08	35,083.28
MG	robit	0.22	1.00	72.75	−0.02	−2.51	0.08	35,085.57
MG	logit	0.22	1.00		−0.05	−4.13	0.14	35,102.08
MG	cauchit	0.21	1.00		−0.08	−4.14	0.17	35,303.60

UF: Unidade da Federação; WAIC: Widely Applicable Information Criterion; MG: Minas Gerais.

Table 7 presents, as an example, the fit of the different link functions for the state of MG and the respective WAIC values. The cloglog link has the smallest WAIC and is thus the preferred model. Notice that the estimate of c is very similar across the symmetric links and differs under the cloglog link, whereas d behaves the same for all link functions. This difference in c can be explained by the fact that the cloglog is an asymmetric function and adapts more realistically to the characteristics of this data set. As can be seen on the left-hand side of Figure 3, $c \approx 0.22$ overestimates the real proportion of SUS users among high-income individuals. The same is observed for the state of SP, where the symmetric links overestimate c. The robit link has its degrees of freedom (df) estimated as 72.75, which indicates that the data do not present outliers: for large values of df the robit link behaves very similarly to the probit link, while for smaller values it has heavier tails.
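For readers unfamiliar with the criterion, the following is a minimal sketch of how WAIC can be computed from pointwise log-likelihood values evaluated at posterior draws, using the common definition WAIC = −2 (lppd − p_WAIC). It is not the FLAMES code: the function names and the input layout (one row per posterior draw, one column per observation) are assumptions for illustration. The last part only ranks the WAIC values already reported in Table 7.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(loglik):
    """WAIC from loglik[s][i]: posterior draw s (rows), observation i (columns)."""
    n_obs = len(loglik[0])
    columns = [[row[i] for row in loglik] for i in range(n_obs)]
    # Log pointwise predictive density, summed over observations.
    lppd = sum(log_mean_exp(col) for col in columns)
    # Effective number of parameters: posterior variance of each log-likelihood term.
    p_waic = sum(variance(col) for col in columns)
    return -2.0 * (lppd - p_waic)


# WAIC values for MG as reported in Table 7 (smaller is better).
table7_waic = {
    "cloglog": 34982.28,
    "probit": 35083.28,
    "robit": 35085.57,
    "logit": 35102.08,
    "cauchit": 35303.60,
}
best = min(table7_waic, key=table7_waic.get)
for link, value in sorted(table7_waic.items(), key=lambda kv: kv[1]):
    delta = value - table7_waic[best]
    print(f"{link:8s} WAIC = {value:10.2f}  delta = {delta:7.2f}")
```

The ranking reproduces the ordering in Table 7, with the cloglog link about 101 WAIC units below the closest symmetric link (probit).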

Figure 4.

Estimated SUS population and ANS information (log scale). ANS: National Agency for Supplementary Health Services; GLM: generalized linear model; SUS: Unified Health System.

Figure 4 presents the estimated SUS population in comparison with the ANS data on the logarithmic scale. Even with the possibility of missing important features, the proposed approach is flexible and robust, providing an unbiased and accurate estimation of the SUS population at the municipality level.

With the model parameters estimated and the verification shown in Figure 4, it is possible to transfer knowledge to estimate the SUS population at any smaller geographical level of interest. Specifically, the knowledge transfer between levels, such as from the municipality level to the neighborhood level, is of extreme importance for health-care policy makers. This step corresponds to Figure 1(c) and (d).

Table 8.

Neighborhoods in Belo Horizonte with their respective average income, percentage of males, population, estimated SUS population, and estimated proportion of SUS users.

Neighborhood	Average income	Male (%)	Population	SUS population	SUS proportion (%)
Mangabeiras	R$8.797,77	47	1947	354	18.18
Savassi	R$6.326,31	44	11,772	2136	18.14
Ouro Preto	R$2.335,97	48	17,255	3258	18.88
Betânia	R$1.443,07	46	12,054	3914	32.47
Rio Branco	R$1.225,36	47	12,768	5802	45.44
Milionário	R$1.154,41	49	12,175	6236	51.22
Céu Azul	R$1.051,86	48	23,817	14,468	60.75
Lindéia	R$915,24	48	24,146	18,023	74.64
Mantiqueira	R$828,54	48	20,282	16,841	83.03
Jardim Felicidade	R$683,08	48	15,486	14,513	93.72
Vila Batik	R$540,20	48	192	190	98.96

SUS: Unified Health System.

Table 8 shows the SUS population estimates for some neighborhoods in the city of Belo Horizonte, the capital of MG. As expected, neighborhoods with higher income and a lower percentage of men have proportionally smaller estimated SUS populations. This result agrees with specialists' expectations, and the model is applied to all neighborhoods of the main cities in Brazil for use in the InfoSAS project. The gender effect is also relevant. For example, Savassi is an upper-class neighborhood, but its male percentage is lower than that of Mangabeiras. Therefore, its estimated proportion of SUS users is smaller than that of Mangabeiras, which has the highest income but also a higher percentage of men. Another important point is that the naive approach of dividing the SUS population of each municipality based only on neighborhood population size will not provide a good estimate of the true sub-population, since it completely ignores the demographic characteristics of each neighborhood.
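The contrast with the naive allocation can be illustrated with the figures in Table 8. The sketch below is an illustration only: it treats the 11 listed neighborhoods as if they formed the whole municipality (an assumption, since Table 8 lists only some neighborhoods of Belo Horizonte) and splits their combined estimated SUS population in proportion to population size alone.

```python
# (population, model-based SUS population) per neighborhood, copied from Table 8.
table8 = {
    "Mangabeiras": (1947, 354),
    "Savassi": (11772, 2136),
    "Ouro Preto": (17255, 3258),
    "Betânia": (12054, 3914),
    "Rio Branco": (12768, 5802),
    "Milionário": (12175, 6236),
    "Céu Azul": (23817, 14468),
    "Lindéia": (24146, 18023),
    "Mantiqueira": (20282, 16841),
    "Jardim Felicidade": (15486, 14513),
    "Vila Batik": (192, 190),
}

total_pop = sum(pop for pop, _ in table8.values())
total_sus = sum(sus for _, sus in table8.values())

# Naive allocation: every neighborhood receives the same overall SUS share,
# regardless of its income or sex composition.
naive = {name: total_sus * pop / total_pop for name, (pop, _) in table8.items()}

for name, (pop, sus) in table8.items():
    print(f"{name:18s} model {sus:6d}  naive {naive[name]:9.1f}  "
          f"model share {100 * sus / pop:5.1f}%")
```

Under this naive split, Mangabeiras would be assigned roughly three times its model-based estimate, while Vila Batik would be assigned far fewer users than the model suggests.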

6 Conclusion

Demographic information is useful to guide public health decision making. Often, this information is available only at some geographical levels, which makes it hard to carry out analyses at other levels of interest. However, the information available at one geographic level can be used to construct a prediction model that relies on features common to all levels to transfer information from the available data to the geographical level of interest.

The prediction model can be constructed with a methodology that relates the features to the response variable. In this work, we propose a flexible GLM that combines flexible link functions with an asymptote correction. The FLAMES R package, available at https://github.com/douglasmesquita/FLAMES, fits the proposed class of flexible link functions using Bayesian inference. This family of link functions is capable of accommodating asymmetry, heavy tails, and asymptotes, characteristics commonly observed in real data sets.

In this study, the proposed methodology is used to determine the Brazilian SUS population for the neighborhoods of the main cities in Brazil. This local information is vital for the InfoSAS project and important for public agencies because it allows local policies to be aimed at the population that actually uses the system, for example, when deciding where to build a new hospital to improve quality of life and reduce costs. Another important characteristic of the proposed methodology is its interpretability: c is the proportion of higher-income people who use the SUS system, and $1 - d$ is the proportion of poorer people who do not.

To estimate the SUS population at the neighborhood level, the PNAD survey was used to link the average income and percentage of males with the health insurance status of individuals. The relationship found confirms the belief that individuals with higher income tend to have private health insurance and that males use the public system more often than females. Given the cultural and socioeconomic diversity of Brazil, the model was fitted separately for each Brazilian state. The ANS annually releases the number of people with health insurance in Brazil at the municipality level. With this information available, it was possible to verify that the model estimates the underlying SUS population precisely and without bias.

Finally, the link function with asymptotes was capable of adapting to the asymmetry and characteristics observed in the data. With this flexible model, we were able to transfer the acquired knowledge from the individual and municipality levels to estimate the SUS population of each Brazilian neighborhood in the main cities. This result can assist public agencies in making optimum decisions about public health policies.

Although the proposed family of link functions is very flexible, generalized linear mixed models can be used to improve fit and prediction under different scenarios, for example, spatial or multilevel modeling. However, we believe that such improvements are beyond the scope of this study, and we leave the inclusion of other features as future work.

Footnotes

Acknowledgments

The authors would like to thank the InfoSAS project members for the discussions that helped create the proposed methodology. The authors would also like to thank Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Conselho Nacional de Desenvolvimento Científico e Tecnológico, and Fundação de Amparo à Pesquisa do Estado de Minas Gerais for partial financial support.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Kish. A taxonomy of elusive populations. J Off Stat 1991; 7: 340–347.
2. Heckathorn. Respondent-driven sampling: a new approach to the study of hidden populations. Soc Prob 1997; 44: 174–199.
3. Sudman, Sirken, Cowan. Sampling rare and elusive populations. Science 1988; 240: 991–996.
4. Scharping. Hide-and-seek: China’s elusive population data. China Econ Rev 2001; 12: 323–332.
5. Pratesi, Rocco. Centre sampling for estimating elusive population size. Statistica 2002; 62: 745–757.
6. Berg. Snowball sampling – I. In: Encyclopedia of statistical sciences, vol. 12. Hoboken: John Wiley & Sons, 2004.
7. Gjoka M, Kurant M, Butts CT, et al. Walking in Facebook: a case study of unbiased sampling of OSNs. In: INFOCOM, Proceedings IEEE, pp. 1–9. Piscataway: IEEE, 2010.
8. Baltar, Brunet. Social research 2.0: virtual snowball sampling method using Facebook. Internet Res 2012; 22: 57–74.
9. Banks, Piggott, Hansen, et al. Wombat coprogenetics: enumerating a common wombat population by microsatellite analysis of faecal DNA. Aust J Zool 2002; 50: 193–204.
10. Eggert, Woodruff. Estimating population sizes for elusive animals: the forest elephants of Kakum National Park, Ghana. Mol Ecol 2003; 12: 1389–1402.
11. Miller, Joyce, Waits. A new method for estimating the size of small populations from genetic mark–recapture data. Mol Ecol 2005; 14: 1991–2005.
12. Brugal, Domingo-Salvany, Maguire, et al. A small area analysis estimating the prevalence of addiction to opioids in Barcelona, 1993. J Epidemiol Commun Health 1999; 53: 488–494.
13. Hsieh, Chen CWS, Lee. Empirical Bayes approach to estimating the number of HIV-infected individuals in hidden and elusive populations. Stat Med 2000; 19: 3095–3108.
14. Aaron, Chang, Markovic, et al. Estimating the lesbian population: a capture-recapture approach. J Epidemiol Commun Health 2003; 57: 207–209.
15. Salganik, Fazito, Bertoni, et al. Assessing network scale-up estimates for groups most at risk of HIV/AIDS: evidence from a multiple-method study of heavy drug users in Curitiba, Brazil. Am J Epidemiol 2011; 174: 1190–1196.
16. Toledo, Codeço, Bertoni, et al. Putting respondent-driven sampling on the map: insights from Rio de Janeiro, Brazil. J Acquir Immune Defic Syndr 2011; 57: S136–S143.
17. Feehan, Salganik. Generalizing the network scale-up method: a new estimator for the size of hidden populations. Sociol Methodol 2016; 46: 153–186.
18. Johnston, McLaughlin, Rouhani, et al. Measuring a hidden population: a novel technique to estimate the population size of women with sexual violence-related pregnancies in South Kivu Province, Democratic Republic of Congo. J Epidemiol Global Health 2017; 7: 45–53.
19. Crawford, Heimer. Hidden population size estimation from respondent-driven sampling: a network approach. J Am Stat Assoc 2018; 113: 755–766.
20. Carvalho, Meira Jr, Prates, et al. InfoSAS: um sistema de mineração de dados para controle da produção do SUS. Revista do TCU 2016; 137: 52–59.
21. Massuda, Hone, Leles FAG, et al. The Brazilian health system at crossroads: progress, crisis and resilience. BMJ Global Health 2018; 3: e000829.
22. Nelder, Wedderburn. Generalized linear models. J R Stat Soc Ser A 1972; 135: 370–384.
23. Murphy. Machine learning: a probabilistic perspective. Cambridge: The MIT Press, 2012.
24. Nagler. Scobit: an alternative estimator to logit and probit. Am J Polit Sci 1994; 38: 230–255.
25. Chen, Dey, Shao. A new skewed link model for dichotomous quantal response data. J Am Stat Assoc 1999; 94: 1172–1186.
26. Bazán, Bolfarine, Branco. A framework for skew-probit links in binary regression. Commun Stat Theory Methods 2010; 39: 678–697.
27. Jiang, Dey, Prunier, et al. A new class of flexible link functions with application to species co-occurrence in Cape Floristic Region. Ann Appl Stat 2013; 7: 2180–2204.
28. Bazán, Romeo, Rodrigues. Bayesian skew-probit regression for binary response data. Braz J Prob Stat 2014; 28: 467–482.
29. Wang, Lin, et al. Flexible link functions in nonparametric binary regression with Gaussian process priors. Biometrics 2016; 72: 707–719.
30. Birnbaum A. Some latent trait models and their use in inferring an examinee's ability. In: Lord FM, Novick MR (eds) Statistical theories of mental test scores. Reading, MA: Addison-Wesley, 1968, pp. 17–20.
31. Barton, Lord. An upper asymptote for the three-parameter logistic item-response model. ETS Res Rep Ser 1981; 1981: i–8.
32. Fernández, Steel. On Bayesian modeling of fat tails and skewness. J Am Stat Assoc 1998; 93: 359–371.
33. Congdon. Bayesian models for categorical data. Hoboken: John Wiley & Sons, 2005.
34. Cabral CRB, Lachos, Madruga. Bayesian analysis of skew-normal independent linear mixed models with heterogeneity in the random-effects population. J Stat Plan Inference 2012; 142: 181–200.
35. Garay, Bolfarine, Lachos, et al. Bayesian analysis of censored linear regression models with scale mixtures of normal distributions. J Appl Stat 2015; 42: 2694–2714.
36. Gilks, Best, Tan. Adaptive rejection Metropolis sampling within Gibbs sampling. J R Stat Soc Ser C (Appl Stat) 1995; 44: 455–472.
37. Metropolis, Rosenbluth, et al. Equation of state calculations by fast computing machines. J Chem Phys 1953; 21: 1087–1092.
38. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970; 57: 97–109.
39. Brazilian Institute of Geography and Statistics (IBGE). Brazil Demographic Census 2010. Rio de Janeiro, Brazil: Brazilian Institute of Geography and Statistics (IBGE), 2012.
40. Watanabe. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res 2010; 11: 3571–3594.