Refusal bias in HIV data from the Demographic and Health Surveys: Evaluation, critique and recommendations

Abstract

Non-response is a commonly encountered problem in many population-based surveys. Broadly speaking, non-response can be due to refusal or failure to contact the sample units. Although both types of non-response may lead to bias, there is much evidence to indicate that it is much easier to reduce the proportion of non-contacts than to do the same with refusals. In this article, we use data collected from a nationally representative survey under the Demographic and Health Surveys program to study non-response due to refusals to HIV testing in Malawi. We review existing estimation methods and propose novel approaches to the estimation of HIV prevalence that adjust for refusal behaviour. We then explain the data requirement and practical implications of the conventional and proposed approaches. Finally, we provide some general recommendations for handling non-response due to refusals and we highlight the challenges in working with Demographic and Health Surveys and explore different approaches to statistical estimation in the presence of refusals. Our results show that variation in the estimated HIV prevalence across different estimators is due largely to those who already know their HIV test results. In the case of Malawi, variations in the prevalence estimates due to refusals for women are larger than those for men.

Keywords

Bias, Demographic and Health Surveys, missing data, non-response, refusals, Malawi

1 Introduction

In sub-Saharan Africa, home to around 23 million people living with HIV,¹ accurate measurement of the trends of important diseases such as HIV is essential for governments to design policies and aid programs. In the past two decades, national population-based surveys have become an important source for such measurement.^2,3 A major challenge in using these survey data is the potential bias from missing data created by non-response. There is much evidence that non-respondents may have patterns of outcome and/or behaviour that are very different from those of the rest of the population.⁴

The problem of non-response has always been a concern for those who work with survey data. One reason why non-response has captured so much attention from researchers is because the nature of the problem is complex. It is widely acknowledged that non-response does not arise from a unitary source under a well-defined situation. Rather, the causes and processes that lead to non-response are varied and often a function of multiple factors, such as the population under study, the nature of the outcome, and the way the survey is designed and conducted. A most challenging issue is that information about the non-respondents is usually scant, making it very difficult for surveyors to determine the nature of non-response.

Non-response arises when sample units in a survey refuse to respond or when the surveyors fail to contact a sample unit.⁵ Many researchers distinguish between non-contacts and refusals because the processes leading to these two types of non-response are believed to be distinct. There are good reasons for espousing this belief. For example, in the context of an HIV survey in rural Africa where the sample units are asked to participate in an HIV test, a non-contact is often the result of migration of the household or absence for work. However, a refusal may be the result of the sample unit's knowledge of his/her HIV status.⁶ Furthermore, we can argue that a non-contact is the result of a passive behaviour since a move of address is a family-based decision that is less likely to be related to the sample unit's HIV status whereas a refusal is an active decision by the sample unit not to provide information about his/her HIV status.^a

Therefore, different approaches are required to address non-contact and refusal. For example, as the study of six national surveys in the UK indicates, repeated efforts to contact the subject may be able to reduce non-contacts but the same cannot be said about refusals.⁷

While there has been a lot of attention paid to issues related to non-response, most of the attention has been directed towards surveys carried out in the developed world.^7–11 We argue that there is a need to consider the problem separately for surveys carried out in developing countries. Our argument rests on three observations. First, in some parts of the developed world, many non-response problems can be, at least partially, resolved by linking survey data to administrative records,^12–15 which are often rich in content and well documented. The same cannot be done easily in many parts of Africa and elsewhere in the developing world as such records often do not exist, are poorly archived, or are outdated. Second, many researchers have advocated using callbacks to reduce the non-response rate.^16–18 While the developing world has witnessed a massive expansion of mobile phone and broadband networks, such means of contacting sampled units remain practically infeasible in impoverished areas where telephones and computers are not affordable or in sparsely populated areas without easy access to such networks. Third, it is often difficult to rule out informative non-response. In that situation, unbiased inferences are still possible by combining the survey data with information from longitudinal data in a comparable population.^19,20,9 In many parts of the developing world, however, the organisation of a nationally representative longitudinal study is difficult due to the mobility of individuals and the lack of reliable demographic records (especially in rural areas), statistical capacity and the necessary financial resources. Hence, such a strategy needs to be adapted to the conditions in the developing world.

In this paper, we study non-response due to refusals to HIV testing using data collected from a nationally representative survey under the Demographic and Health Surveys (DHS) program. Relevant earlier work includes Garcia-Calleja et al.,³ who carried out a scenario study for 20 sub-Saharan countries using HIV relative risks between the non-respondents and respondents. However, they did not treat non-contacts and refusals separately. Marston et al.⁴ examined non-response bias in a nine-country study. They assumed non-response is non-informative and estimated the prevalence among the non-respondents by multiple imputation.²¹ Similarly, Mishra et al.²² used a logistic regression to predict the HIV prevalence among the non-respondents under a non-informative non-response assumption in a 12-country study. Hogan et al.²³ adjusted for non-response bias using a selection model,²⁴ which allows non-response to be informative but requires the existence of a valid instrumental variable that explains non-response but not the outcome. Reniers and Eaton²⁵ and Floyd et al.²⁶ corrected refusal bias in population surveys by using auxiliary longitudinal data. Their methods rely on the assumption that refusal behaviour in different populations is comparable. In some of the methods discussed below, we adopt a similar assumption.

The main contribution of this paper is threefold. First, we put together existing methods and re-examine their underlying assumptions; we discuss the possible merits and demerits of each of these assumptions. Second, we introduce a few alternative novel approaches to HIV prevalence estimates that adjust for refusal behaviour and compare them to existing methods on a common platform. This comparison allows us to determine how important refusal bias may be. Third, based on thorough robustness checks against potential refusal bias, we draw lessons that could be applied elsewhere.

2 Study design and survey data

In health and population studies in Africa, the following three types of survey data are often available: national population-based surveys, sentinel surveillance surveys and longitudinal surveys. National population-based surveys are usually large scale, cross-sectional studies with the intent of drawing nationally representative samples. They collect detailed demographic characteristics and various outcomes of interest, such as health, nutrition and land use. Sentinel surveillance surveys are useful for capturing cross-sectional data over time, such as outbreak of disease, nutritional trends and changes in land use, at sentinel sites. The sentinel sites are typically located in the more densely populated urban areas and hence may not be representative of the general population in many developing countries, since most of them have a sizable proportion of rural population. Longitudinal surveys collect data on vital events and migration for individuals and households over time. When linked with appropriate data, such as individual demographic and behaviour information, longitudinal survey data make it possible to evaluate cause-specific impacts on outcome of interest. However, since longitudinal surveys are often carried out in smaller communities at specific locations, inferences drawn from them are unlikely to be directly applicable to the general population. We use all of these three types of surveys in Malawi for empirical illustration. We examine the relevance and implications of different approaches to the estimation of HIV prevalence.

The primary data source for this study is the 2004 Malawi Demographic and Health Survey (MDHS), which is a nationally representative survey. All women aged 15–49 years in a selected household are eligible for interview. In about one in three selected households, male members of the household aged 15–54 years are also surveyed and HIV testing is offered to both male and female members. We focus on those aged 49 years or below to keep the same age group for both women and men, and also to make our study comparable to the MDHS report.²⁷ In addition, we exclude those who refused to answer the individual questionnaire, those who consented but whose HIV test results are not available (e.g., because of a technical problem), and those whose previous HIV testing history (i.e., whether the individual has previously taken an HIV test) is not known. We note that Lilongwe district has an unusually high refusal rate (54%) and low observed prevalence (Figure 1). In an earlier report,²⁷ separate analyses were carried out with and without Lilongwe. To facilitate comparison with earlier studies,^25,27 we elect to include Lilongwe in the main article; the results of a parallel analysis, excluding Lilongwe, are given separately as online supplemental materials.

Figure 1.

Malawi district level HIV testing refusal patterns in 2004 MDHS. (a) MDHS: urban, (b) MDHS: rural, (c) ANC: urban and (d) ANC: rural.

In addition to the MDHS data, we also use the 2003 Malawi antenatal clinics (ANC) survey data.²⁸ The collection of HIV data in the Malawi ANC started in 1990 and by 2003, there were 19 ANC sites in Malawi. In the 2003 ANC, HIV data were collected on nearly 8000 pregnant women, of which 20%, 49% and 31% are in rural, semi-urban and urban areas, respectively.²⁸

Lastly, we use a dataset collected under the Malawi Diffusion and Ideational Change Project (MDICP), which consists of a series of longitudinal surveys conducted in the rural areas in three districts of Malawi, one from each of the Southern, Central, and Northern regions of Malawi. As such, it is not representative of the general population of Malawi. The sample is made up of married women and their husbands in the selected households. We only use the 2004 (MDICP-3) and 2006 (MDICP-4) phases as the HIV test component is available only for these phases.

3 Assumptions and methods for estimating HIV prevalence

The goal of our research is to estimate HIV prevalence in a population of interest using sample surveys (such as DHS) drawn randomly from the population. However, such surveys might suffer from non-responses due to refusals which might lead to bias. In this section, we discuss various methods for estimating HIV prevalence, including those previously used in the literature and some newly introduced in this study. We begin our analysis by first ignoring selection bias and estimate HIV prevalence by simply taking the sample proportion of HIV status based on only those who accept an HIV test.

Let D_i be an indicator variable that takes one if individual i is HIV positive and zero otherwise. The goal of our research is to identify $π \equiv E [D_{i}]$ , where i is drawn randomly from the population of interest. Sometimes, we are also interested in HIV prevalence of certain sub-populations. In that case, the parameter of interest is $E [D_{i} | Z_{i}]$ , where Z_i is a variable that characterises the sub-populations, which may include the location of residence, gender, occupation and education level. However, we drop Z_i hereafter, because the same method can be used for estimating HIV prevalence in each sub-population of interest by restricting the sample used for estimation accordingly.

We typically estimate $E [D_{i}]$ from sample surveys such as DHS, because it is prohibitively expensive and practically infeasible to measure D_i for all individuals in the population. Let N be the total number of individuals in our MDHS sample and let R_i be an indicator variable for refusal such that R_i = 0 indicates that individual i accepts an HIV test. Then, if we ignore the selection on non-refusals, $E [D_{i}]$ can be estimated by the complete case estimator

{\overset{\land}{π}}_{CC} = \frac{\sum_{i = 1}^{N} (1 - R_{i}) D_{i}}{\sum_{i = 1}^{N} (1 - R_{i})}

(1)

An advantage of the estimator ${\overset{\land}{π}}_{CC}$ is that it is easy to calculate and requires no additional models. However, even if the sample is random, ${\overset{\land}{π}}_{CC}$ is only an unbiased estimator for $E [D_{i} | R_{i} = 0]$ and not for $E [D_{i}]$ in general. Hence, unless we have $E [D_{i}] = E [D_{i} | R_{i} = 0]$, ${\overset{\land}{π}}_{CC}$ is good only as an estimator of HIV prevalence of those who would agree to take an HIV test when such a test is offered.
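As a minimal sketch of equation (1), the following Python function computes the complete case estimate from two lists. The function name and the toy sample are hypothetical and serve only to illustrate the arithmetic; sampling weights are ignored.

```python
def complete_case_prevalence(d, r):
    """Equation (1): share of HIV positives among those who accepted the test.

    d[i] is the 0/1 HIV status; it is ignored when r[i] == 1, so None is
    acceptable there. r[i] is 1 for a refusal and 0 for an acceptance.
    """
    tested = [di for di, ri in zip(d, r) if ri == 0]
    if not tested:
        raise ValueError("no accepted tests in the sample")
    return sum(tested) / len(tested)


# Hypothetical toy sample: six people, two refusals (status unobserved).
d = [1, 0, 0, None, 0, None]
r = [0, 0, 0, 1, 0, 1]
pi_cc = complete_case_prevalence(d, r)  # one positive among four tested
```

The two refusers contribute nothing to either the numerator or the denominator, which is exactly why the result describes only those who accept a test.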

In practice, we have no strong reason to believe a priori that $E [D_{i}] = E [D_{i} | R_{i} = 0]$ holds. To address this issue, certain additional assumptions and/or data are required. For example, assume that HIV status can be explained by a set of covariates X_i observable on every individual in the MDHS data and that there is no refusal bias once we condition on X_i.

In the current context, this assumption requires

P (D_{i} = 1 | X_{i}, R_{i} = 0) = P (D_{i} = 1 | X_{i}, R_{i} = 1) = P (D_{i} = 1 | X_{i})

(2)

If an unbiased estimator ${\overset{\land}{D}}_{i}$ of $P (D_{i} = 1 | X_{i})$ can be obtained from those with observed HIV status, then we can estimate the prevalence by a method equivalent to the mean score imputation (MSI) method, e.g., Pepe et al.,²⁹ in the missing data literature

{\overset{\land}{π}}_{MSI} = \frac{\sum_{i = 1}^{N} [(1 - R_{i}) D_{i} + R_{i} {\overset{\land}{D}}_{i}]}{N}

(3)

As we pointed out earlier, the estimator ${\overset{\land}{π}}_{CC}$ is generally a biased estimator of $E [D_{i}]$ . Another possibility is to model the probability of refusal using covariates X_i and assume that D_i is conditionally independent of refusal, given X_i (equation (2)). To keep the presentation simple, we temporarily assume that X_i is discrete but this assumption can be relaxed. With these assumptions, we have

E [D_{i}] = \sum_{x} E [D_{i} | X_{i} = x] \cdot P [X_{i} = x] = \sum_{x} E [D_{i} | R_{i} = 0, X_{i} = x] \cdot P [X_{i} = x]

Letting $I (\cdot)$ be an indicator function (which takes the value one if its argument is true and zero otherwise), we can estimate $E [D_{i}]$ by

\sum_{x} [\frac{\sum_{i = 1}^{N} (1 - R_{i}) D_{i} I (X_{i} = x)}{\sum_{i = 1}^{N} (1 - R_{i}) I (X_{i} = x)}] [\sum_{i = 1}^{N} \frac{I (X_{i} = x)}{N}]

(4)
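The stratified estimator in equation (4) can be sketched in a few lines of Python. The sketch also computes the MSI estimator of equation (3) under the assumption, made here purely for illustration, that ${\overset{\land}{D}}_{i}$ is the observed prevalence among testers in individual i's covariate cell; with that choice the two estimators coincide. All function names and the toy data are hypothetical, and sampling weights are ignored.

```python
from collections import defaultdict


def cell_means(d, r, x):
    """Observed prevalence among testers (r == 0) within each covariate cell."""
    pos, n = defaultdict(int), defaultdict(int)
    for di, ri, xi in zip(d, r, x):
        if ri == 0:
            pos[xi] += di
            n[xi] += 1
    return {cell: pos[cell] / n[cell] for cell in n}


def stratified_prevalence(d, r, x):
    """Equation (4): cell prevalence among testers, weighted by cell share."""
    means = cell_means(d, r, x)
    counts = defaultdict(int)
    for xi in x:
        counts[xi] += 1
    # A cell that contains only refusers has no estimate and raises KeyError.
    return sum(means[cell] * k / len(x) for cell, k in counts.items())


def msi_prevalence(d, r, x):
    """Equation (3) with D-hat_i set to the tester mean of individual i's cell."""
    means = cell_means(d, r, x)
    filled = [di if ri == 0 else means[xi] for di, ri, xi in zip(d, r, x)]
    return sum(filled) / len(filled)


# Hypothetical toy sample: refusal is more common in the high-prevalence cell.
x = ["urban"] * 4 + ["rural"] * 6
r = [0, 0, 1, 1] + [0, 0, 0, 0, 0, 1]
d = [1, 0, None, None] + [0, 0, 0, 1, 0, None]
pi_strat = stratified_prevalence(d, r, x)
pi_msi = msi_prevalence(d, r, x)
```

In this toy sample the complete case estimate would be 2/7, whereas both adjusted estimates are higher because the urban cell, where half the testers are positive, is under-represented among those tested.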

The estimator above is unbiased if a suitable discrete covariate X_i can be found. In practice, a discrete covariate is often not sufficient to completely explain selection due to refusal. A more general estimator

{\overset{\land}{π}}_{if} = \sum_{i = 1}^{N} \frac{(1 - R_{i}) D_{i}}{P (R_{i} = 0)} / \sum_{i = 1}^{N} \frac{(1 - R_{i})}{P (R_{i} = 0)}

(5)

is unbiased for $E [D_{i}]$. We use ‘IF’, which stands for infeasible, to qualify this estimator because $P (R_{i} = 0)$ is generally unknown. If we replace $P (R_{i} = 0)$ in ${\overset{\land}{π}}_{if}$ by an estimator $\overset{\land}{P} (R_{i} = 0) \equiv \overset{\land}{P} (R_{i} = 0 | X_{i})$ and call the resulting estimator ${\overset{\land}{π}}_{1}$, then it becomes the well-known inverse probability or inverse propensity score estimator.³⁰ The estimator ${\overset{\land}{π}}_{1}$ can be viewed as a continuous version of equation (4). Unbiasedness of ${\overset{\land}{π}}_{1}$ requires $P (R_{i} = 0) = P (R_{i} = 0 | X_{i}) = P (R_{i} = 0 | X_{i}, D_{i})$, which is the conditional independence assumption for equation (4).
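A minimal Python sketch of the inverse probability weighted form in equation (5) follows. It takes the estimated acceptance probabilities as given, so the propensity model itself is outside the sketch; the function name and the toy data are hypothetical. In the toy data the probabilities are the cell-level acceptance rates of a single discrete covariate, an assumption chosen so that the weighted estimate can be compared with the stratified form in equation (4).

```python
def ipw_prevalence(d, r, p_accept):
    """Equation (5) with estimated acceptance probabilities plugged in.

    p_accept[i] is an estimate of P(R_i = 0 | X_i). How it is obtained
    (logistic regression, cell frequencies, ...) is left to the caller.
    """
    num = den = 0.0
    for di, ri, pa in zip(d, r, p_accept):
        if ri == 0:
            if not 0.0 < pa <= 1.0:
                raise ValueError("acceptance probabilities must lie in (0, 1]")
            num += di / pa   # weighted count of positives among testers
            den += 1.0 / pa  # weighted count of testers
    return num / den


# Hypothetical toy sample: 4 urban (2 accept) and 6 rural (5 accept).
r = [0, 0, 1, 1] + [0, 0, 0, 0, 0, 1]
d = [1, 0, None, None] + [0, 0, 0, 1, 0, None]
p_accept = [2 / 4] * 4 + [5 / 6] * 6
pi_1 = ipw_prevalence(d, r, p_accept)
```

Each tester is weighted up by the inverse of the chance of accepting, so testers from cells with many refusals stand in for the refusers in those cells.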

A common strategy to come up with $\overset{\land}{P} (R_{i} = 0 | X_{i})$ is to use a parametric model, usually a logistic regression using variables that are thought to predict acceptance of an HIV test (see, for example, National Statistical Office and ORC Macro, 2005 Appendix G²⁷). However, this strategy works only if the model of acceptance is known and covariates in the model are observable.

To address refusal due to the prior knowledge of HIV status, Reniers and Eaton²⁵ suggested a method to estimate $E [D_{i}]$ under the following two assumptions

\begin{matrix} P (R_{i} = 1 | D_{i} = 1, T_{i} = 0) = P (R_{i} = 1 | D_{i} = 0, T_{i} = 0) \\ = P (R_{i} = 1 | T_{i} = 0) \end{matrix}

(6)

P (D_{i} = 1 | T_{i} = 1) = P (D_{i} = 1)

(7)

where T_i = 0 means that a subject does not know his/her HIV status and T_i = 1 means that a subject has had an HIV test and knows the test result. The first assumption given in equation (6) states that refusal is independent of HIV status given that the subject has never taken an HIV test before. The second assumption in equation (7) states that being tested previously does not depend on one's HIV status. Under these assumptions, the following quadratic equation in $P (D_{i} = 1) \equiv E [D_{i}]$ can be shown to hold

\begin{matrix} 0 = [\{P (R_{i} = 0 | T_{i} = 0) P (T_{i} = 0) + P (T_{i} = 1)\} (Δ - 1)] P (D_{i} = 1)^{2} \\ + [- P (D_{i} = 1 | R_{i} = 0) P (R_{i} = 0) (Δ - 1) + P (R_{i} = 0 | T_{i} = 0) P (T_{i} = 0) \\ + \{1 - Δ P (R_{i} = 1 | T_{i} = 1)\} P (T_{i} = 1)] P (D_{i} = 1) \\ - P (D_{i} = 1 | R_{i} = 0) P (R_{i} = 0) \end{matrix}

(8)

where the relative risk of refusal Δ is defined as follows

Δ \equiv \frac{P (R_{i} = 1 | D_{i} = 1, T_{i} = 1)}{P (R_{i} = 1 | D_{i} = 0, T_{i} = 1)}

Reniers and Eaton²⁵ used MDICP data to estimate Δ and MDHS data to estimate the remaining quantities in equation (8). Their estimator ${\overset{\land}{π}}_{RE}$ of $E [D_{i}]$ is the unique root of the quadratic equation on the unit interval.
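The root-finding step can be sketched as follows. The coefficients are written out from assumptions (6) and (7) by expanding $P (D_{i} = 1, R_{i} = 0)$ over T_i; in particular, the linear coefficient weights the term $1 - Δ P (R_{i} = 1 | T_{i} = 1)$ by $P (T_{i} = 1)$, which is what that expansion gives. The function name, argument names and toy inputs are hypothetical. The toy inputs are constructed from a known prevalence of 0.12 so that the recovered root can be checked.

```python
import math


def reniers_eaton_prevalence(delta, q, a0, r1, t1):
    """Root in [0, 1] of the quadratic in p = P(D_i = 1).

    delta : relative risk of refusal among those who know their status
    q     : P(D_i = 1 | R_i = 0) * P(R_i = 0)
    a0    : P(R_i = 0 | T_i = 0)
    r1    : P(R_i = 1 | T_i = 1)
    t1    : P(T_i = 1)
    """
    t0 = 1.0 - t1
    a = (a0 * t0 + t1) * (delta - 1.0)
    b = -q * (delta - 1.0) + a0 * t0 + (1.0 - delta * r1) * t1
    c = -q
    if abs(a) < 1e-12:  # delta == 1: the equation is linear in p
        return -c / b
    disc = math.sqrt(b * b - 4.0 * a * c)
    roots = [(-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)]
    inside = [p for p in roots if 0.0 <= p <= 1.0]
    if len(inside) != 1:
        raise ValueError("expected exactly one root in the unit interval")
    return inside[0]


# Hypothetical population: prevalence 0.12, 20% know their status, refusal
# rate 0.2 when status is unknown, and 0.1 (negatives) versus 0.3 (positives)
# when status is known, so delta = 3.
p_true, t1, a0, delta = 0.12, 0.2, 0.8, 3.0
r1 = p_true * 0.3 + (1 - p_true) * 0.1
q = p_true * ((1 - t1) * a0 + t1 * (1 - 0.3))
p_hat = reniers_eaton_prevalence(delta, q, a0, r1, t1)
```

When Δ = 1 the quadratic collapses to a linear equation whose solution is $P (D_{i} = 1 | R_{i} = 0)$, that is, the complete case estimand.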

There are a few issues with the assumptions above. First, notice that equations (6) and (7) imply

P (D_{i} = 1) = P (D_{i} = 1 | T_{i} = 0) = P (D_{i} = 1 | T_{i} = 0, R_{i} = 0)

(9)

This suggests we can estimate the prevalence of HIV by

{\overset{\land}{π}}_{2} = \frac{\sum_{i = 1}^{N} (1 - R_{i}) D_{i} (1 - T_{i})}{\sum_{i = 1}^{N} (1 - R_{i}) (1 - T_{i})}

(10)

Therefore, once equations (6) and (7) are assumed, we do not need MDICP data to estimate the HIV prevalence. Second, both of these assumptions may be problematic in practice. Equation (6) is not compelling because individuals may know the risk of HIV infection even without HIV testing. Equation (7) may also be called into question, because those who have taken HIV tests before may be systematically different from others.
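For completeness, equation (10) can be sketched in the same style. It needs only the MDHS-type variables D_i, R_i and T_i; the function name and toy data are hypothetical and sampling weights are ignored.

```python
def never_tested_prevalence(d, r, t):
    """Equation (10): prevalence among testers who do not know their status.

    d[i] is the 0/1 HIV status (unused when r[i] == 1), r[i] is 1 for a
    refusal, and t[i] is 1 if the subject already knows a prior test result.
    """
    kept = [di for di, ri, ti in zip(d, r, t) if ri == 0 and ti == 0]
    if not kept:
        raise ValueError("no accepted tests among those with T_i = 0")
    return sum(kept) / len(kept)


# Hypothetical toy sample of six people.
d = [1, 0, 0, 1, None, 0]
r = [0, 0, 0, 0, 1, 0]
t = [0, 0, 0, 1, 1, 0]
pi_2 = never_tested_prevalence(d, r, t)  # one positive among four kept
```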

Given these issues, we propose to estimate lower and upper bounds of $P (D_{i} = 1) \equiv E [D_{i}]$ under the following assumptions

P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 0) \leq P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 1) \leq P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1)

(11)

where ${\tilde{T}}_{i}$ differs slightly from the definition of T_i used by Reniers and Eaton²⁵ in that ${\tilde{T}}_{i} = 0$ means a subject has not taken a prior HIV test and ${\tilde{T}}_{i} = 1$ means a subject has had an HIV test but may or may not know the result of the test. The first inequality in equation (11) captures the idea that those who refuse to take an HIV test are no less likely to be HIV positive than those who participate, given that they have never taken an HIV test before. Note that the first inequality becomes an equality when equation (6) is satisfied. The second inequality captures the idea that, among those who refuse to participate in the HIV testing, those who have previously taken an HIV test are no less likely to be HIV positive than those who have never taken a test.

In addition to these assumptions, we explicitly account for the fact that MDICP is not representative of the general population of Malawi, because the data are taken only from a few rural districts. We use M_i = 1 to denote individual i belongs to the population that the MDICP sample represents, and zero otherwise. We assume that the relative risk of HIV between MDICP population and non-MDICP population is independent of refusal given that an individual has had a previous HIV test. Mathematically, our assumption implies

\begin{matrix} Z \equiv \frac{P (D_{i} = 1 | {\tilde{T}}_{i} = 1, M_{i} = 0)}{P (D_{i} = 1 | {\tilde{T}}_{i} = 1, M_{i} = 1)} \\ = \frac{P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 0)}{P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 1)} \\ = \frac{P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 0, M_{i} = 0)}{P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 0, M_{i} = 1)} \end{matrix}

(12)

Under this assumption, the numerator and denominator of the last line of equation (12) can be estimated with the MDHS and MDICP data, respectively. Letting $W \equiv P (M_{i} = 1) + ZP (M_{i} = 0)$ , we can write

\begin{matrix} P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1) = P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 1) P (M_{i} = 1) \\ + P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 0) P (M_{i} = 0) \\ = P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 1) W \end{matrix}

(13)

where we additionally made the assumption that $P (M_{i} | {\tilde{T}}_{i}, R_{i}) = P (M_{i})$. In the MDICP data, we observe the HIV status of those who participate in the first HIV test but refuse the second HIV test. Therefore, we can estimate $P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 1)$ by the proportion of HIV positives in the first test among those who refuse the second test. To use equation (13), we also need to estimate W, which in turn requires estimates of $P (M_{i} = 0)$, $P (M_{i} = 1)$, and Z. Since the MDICP sample was taken to match closely the rural sample of the 1996 MDHS, we may take $P (M_{i} = 1)$ to represent the proportion of rural population in Malawi and $P (M_{i} = 0)$ the urban population, both of which can be estimated using population census data. For Z, we can use the MDHS and MDICP data to estimate the numerator and denominator, respectively.

We also define

Z' = \frac{P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 0, M_{i} = 0)}{P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 0, M_{i} = 1)}

(14)

and, letting $W' \equiv P (M_{i} = 1) + Z' P (M_{i} = 0)$, we can write

\begin{matrix} P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 0) = P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 0, M_{i} = 1) P (M_{i} = 1) \\ + P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 0, M_{i} = 0) P (M_{i} = 0) \\ = P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 0, M_{i} = 1) W' \end{matrix}

(15)

Estimation of equation (15) follows easily since the numerator and denominator of $Z'$ can be estimated using data from MDHS and MDICP, respectively.

Using equation (13), we have the following relationship

\begin{matrix} P (D_{i} = 1) = P (D_{i} = 1, R_{i} = 0) + P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1) P ({\tilde{T}}_{i} = 1, R_{i} = 1) \\ + P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 1) P ({\tilde{T}}_{i} = 0, R_{i} = 1) \\ = P (D_{i} = 1, R_{i} = 0) + WP (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 1) P ({\tilde{T}}_{i} = 1, R_{i} = 1) \\ + P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 1) P ({\tilde{T}}_{i} = 0, R_{i} = 1) \end{matrix}

(16)

Notice that in equation (16), $P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 1)$ cannot be estimated because test results are not available for those individuals who have had no prior HIV test and decline the current test. Hence, the estimation of equation (16) is not feasible. However, by equations (11), (15) and (16), we can form bounds

P_{-} \leq P (D_{i} = 1) \leq P_{+}

where

\begin{matrix} P_{-} = P (D_{i} = 1, R_{i} = 0) + WP (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 1) P ({\tilde{T}}_{i} = 1, R_{i} = 1) \\ + P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 0) P ({\tilde{T}}_{i} = 0, R_{i} = 1) \\ = P (D_{i} = 1, R_{i} = 0) + WP (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 1) P ({\tilde{T}}_{i} = 1, R_{i} = 1) \\ + W' P (D_{i} = 1 | {\tilde{T}}_{i} = 0, R_{i} = 0, M_{i} = 1) P ({\tilde{T}}_{i} = 0, R_{i} = 1) \end{matrix}

(17)

\begin{matrix} P_{+} = P (D_{i} = 1, R_{i} = 0) + WP (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 1) P ({\tilde{T}}_{i} = 1, R_{i} = 1) \\ + P (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1) P ({\tilde{T}}_{i} = 0, R_{i} = 1) \\ = P (D_{i} = 1, R_{i} = 0) + WP (D_{i} = 1 | {\tilde{T}}_{i} = 1, R_{i} = 1, M_{i} = 1) P (R_{i} = 1) \end{matrix}

(18)

We can estimate $P (D_{i} = 1, R_{i} = 0), P ({\tilde{T}}_{i} = 0, R_{i} = 1)$ and $P (R_{i} = 1)$ with the MDHS data. Other terms can be estimated by equations (12), (13), (14) and (15) with the MDICP data. For the computation of the estimators, the following definitions in the MDICP data are used: ${\tilde{T}}_{i} = 1$ if an individual has a test in MDICP-3, D_i = 1 if an individual tests positive in MDICP-3 or MDICP-4, R_i = 1 if an individual tests in MDICP-3 but refuses a test in MDICP-4. Using these estimates, we obtain the estimates ${\overset{\land}{π}}_{3 -}$ and ${\overset{\land}{π}}_{3 +}$ of $P_{-}$ and $P_{+}$ , respectively.
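Once the component probabilities have been estimated, the bounds in equations (17) and (18) reduce to a few multiplications. The Python sketch below assumes the components are already available as plain numbers; the function names, argument names and all toy values are hypothetical and are not estimates from the MDHS or MDICP data.

```python
def mixing_weight(p_m1, z):
    """W = P(M = 1) + Z * P(M = 0), with P(M = 0) = 1 - P(M = 1)."""
    return p_m1 + z * (1.0 - p_m1)


def prevalence_bounds(p_d1_r0, p_t1_r1, p_t0_r1,
                      prev_t1_r1_m1, prev_t0_r0_m1, w, w_prime):
    """Lower and upper bounds of equations (17) and (18).

    p_d1_r0       : P(D = 1, R = 0)               (MDHS)
    p_t1_r1       : P(T~ = 1, R = 1)              (MDHS)
    p_t0_r1       : P(T~ = 0, R = 1)              (MDHS)
    prev_t1_r1_m1 : P(D = 1 | T~ = 1, R = 1, M = 1)  (MDICP)
    prev_t0_r0_m1 : P(D = 1 | T~ = 0, R = 0, M = 1)  (MDICP)
    w, w_prime    : W and W' as defined in the text
    """
    # Part shared by both bounds: testers plus previously tested refusers.
    shared = p_d1_r0 + w * prev_t1_r1_m1 * p_t1_r1
    lower = shared + w_prime * prev_t0_r0_m1 * p_t0_r1
    upper = shared + w * prev_t1_r1_m1 * p_t0_r1
    return lower, upper


# Hypothetical inputs, chosen only to exercise the arithmetic.
w = mixing_weight(p_m1=0.85, z=2.0)
w_prime = mixing_weight(p_m1=0.85, z=2.0)
lower, upper = prevalence_bounds(0.09, 0.04, 0.20, 0.10, 0.06, w, w_prime)
```

The two bounds differ only in the prevalence assigned to refusers with no prior test: the lower bound uses the prevalence among never-tested participants and the upper bound uses the prevalence among previously tested refusers, as in equation (11).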

A third source of data that allows estimation of $E [D_{i}]$ is the ANC surveys.^31,28 To produce national prevalence estimates, the district-area prevalence estimates obtained using ANC data are combined with census data. For each district-area c captured in ANC surveys, let w_c be a weight that gives the proportion of individuals living in district-area c from the census. (We use 1998 census figures for all district-areas except Likoma and Mzuzu. For Likoma and Mzuzu, separate figures were not given in the 1998 census, so we use figures from the 2008 census.) Then an estimator of the population HIV prevalence is

{\overset{\land}{π}}_{4} = \sum_{c} ({\overset{\land}{π}}_{ANC}^{c} \frac{w_{c}}{\sum_{c'} w_{c'}})

(19)

where ${\overset{\land}{π}}_{ANC}^{c}$ is the prevalence estimator in district-area c using ANC data. This method has also been used in cross-national studies comparing ANC-based to population-based survey estimates.³²
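Equation (19) is a census-weighted average with weights renormalised over the district-areas that have an ANC estimate. A minimal Python sketch, with hypothetical area labels, prevalences and census shares, is:

```python
def anc_weighted_prevalence(prev_by_area, weight_by_area):
    """Equation (19): census-weighted average of district-area ANC prevalence.

    Only areas with an ANC estimate enter the sum, and the census weights
    are renormalised over those areas.
    """
    areas = list(prev_by_area)
    total_w = sum(weight_by_area[c] for c in areas)
    return sum(prev_by_area[c] * weight_by_area[c] for c in areas) / total_w


# Hypothetical inputs: three areas with ANC sites, one area without.
prev_by_area = {"A-urban": 0.20, "A-rural": 0.10, "B-rural": 0.15}
weight_by_area = {"A-urban": 0.1, "A-rural": 0.5, "B-rural": 0.2, "C-rural": 0.2}
pi_4 = anc_weighted_prevalence(prev_by_area, weight_by_area)
```

The renormalisation means that areas without an ANC site are implicitly assigned the weighted average prevalence of the covered areas.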

If we let ${\tilde{M}}_{i} = 1$ be an indicator for an individual who has been tested at an ANC site, then ${\overset{\land}{π}}_{4}$ makes the following assumption

P (D_{i} = 1 | {\tilde{M}}_{i} = 1, C_{i} = c) = P (D_{i} = 1 | {\tilde{M}}_{i} = 0, C_{i} = c) = P (D_{i} = 1 | C_{i} = c)

(20)

where C_i is defined as the index of the district-area in which the i-th individual resides, such that C_i = c means that an individual comes from district-area c. In other words, given that individuals are matched by district-area, HIV prevalence of the ANC attendees is the same as that in the general population.

When refusal of an HIV test may be due to the (unobservable) HIV status of a sampled unit,^25,26 the use of observed data alone to estimate $P (R_{i} = 0)$ will not yield the desired results. This is the classical problem of non-ignorable missingness in the missing data literature.³³

We propose a method that mitigates the problem of non-ignorable missingness by using information routinely recorded in ANC surveys. We assume

P (R_{i} = 0) = g (D_{i}, X_{i}) \equiv P (R_{i} = 0 | D_{i}, X_{i})

(21)

for some known function g that depends on the HIV status D_i and some observable covariates X_i. Of course, equation (21) cannot be used because D_i is unknown for those who refuse an HIV test. Therefore, we make the following assumption

P (R_{i} = 0 | X_{i} = x, D_{i}, {\overset{\land}{π}}_{ANC}^{c}, C_{i} = c) = P (R_{i} = 0 | X_{i} = x, {\overset{\land}{π}}_{ANC}^{c}, C_{i} = c)

(22)

which says that for an individual in a particular district-area, acceptance of an HIV test is independent of the individual's HIV status, given the covariates and the HIV prevalence in that district-area estimated from the ANC data.

The conditional independence assumption equation (22) allows us to have a workable solution since ${\overset{\land}{π}}_{ANC}^{c}$ can be obtained using data in every HIV sentinel surveillance report.^28,34 Let $\overset{\land}{P} (R_{i} = 0 | X_{i} = x, {\overset{\land}{π}}_{ANC}^{c}, C_{i} = c)$ be an estimator of $P (R_{i} = 0 | X_{i} = x, {\overset{\land}{π}}_{ANC}^{c}, C_{i} = c)$ which may be based on a logistic regression model. Then, we estimate $E [D_{i}]$ by

{\overset{\land}{π}}_{5} = \sum_{i = 1}^{N} \frac{(1 - R_{i}) D_{i}}{\overset{\land}{P} (R_{i} = 0 | X_{i} = x, {\overset{\land}{π}}_{ANC}^{c}, C_{i} = c)} / \sum_{i = 1}^{N} \frac{(1 - R_{i})}{\overset{\land}{P} (R_{i} = 0 | X_{i} = x, {\overset{\land}{π}}_{ANC}^{c}, C_{i} = c)}

(23)

We consider two estimators based on ${\overset{\land}{π}}_{5}$ . The first one uses ${\overset{\land}{π}}_{ANC}^{c}$ and a stepwise regression procedure to select from the same list of covariates X_i used in $\overset{\land}{P} (R_{i} = 0 | X_{i})$ for the estimation of ${\overset{\land}{π}}_{1}$ . The second one uses only ${\overset{\land}{π}}_{ANC}^{c}$ for modelling the propensity score. These propensity scores are then used in equation (23) to give different prevalence estimators, ${\overset{\land}{π}}_{5 A}$ and ${\overset{\land}{π}}_{5 B}$ , respectively.

A summary of these and other estimators considered in this paper, with their key estimation equations, identifying assumptions and data requirements, is given in Table 1.

Table 1.

Summary of estimators considered in this study.

Estimator	Key equation(s)	Identifying assumption(s)	Data source
${\overset{\land}{π}}_{CC}$	Equation (1)	No refusal bias	MDHS
${\overset{\land}{π}}_{MSI}$	Equation (3)	Equation (2) and no refusal bias conditional on X_i	MDHS
${\overset{\land}{π}}_{if}$	Equation (5)	Infeasible
${\overset{\land}{π}}_{1}$	Equations (4) and (5)	Use $\overset{\land}{P} (R_{i} = 0) \equiv \overset{\land}{P} (R_{i} = 0 | X_{i})$ in ${\overset{\land}{π}}_{if}$	MDHS
${\overset{\land}{π}}_{2}$	Equation (10)	Equations (6) and (7)	MDHS
${\overset{\land}{π}}_{RE}$	Equation (8)	Equations (6) and (7); see also Reniers and Eaton²⁵	MDHS, MDICP
${\overset{\land}{π}}_{3 +}, {\overset{\land}{π}}_{3 -}$	Equations (17) and (18)	Equations (11) and (12)	MDHS, MDICP, Census
${\overset{\land}{π}}_{4}$	Equation (19)	Equation (20)	ANC, Census
${\overset{\land}{π}}_{5 A}$	Equation (23)	Equations (21) and (22): stepwise regression using X_i and ${\overset{\land}{π}}_{ANC}^{c}$	MDHS, ANC
${\overset{\land}{π}}_{5 B}$	Equation (23)	Equations (21) and (22): fixed regression using ${\overset{\land}{π}}_{ANC}^{c}$ only	ANC

ANC: antenatal clinics; MDHS: Malawi Demographic and Health Survey; MDICP: Malawi Diffusion and Ideational Change Project.

4 Results

4.1 Refusal patterns

We first study the possible bias in the prevalence estimates due to refusals. We begin by summarising the refusal patterns in the data in Table 2. It is clear from the table that the refusal rate of around 23.9% in MDHS is far higher than those in the other two surveys. There are no refusals in the ANC survey because the HIV test was carried out on blood samples left over from syphilis testing and no consent was sought. For MDICP-3, the refusal rate is about 9.5% and for MDICP-4, we obtain a refusal rate of 5.4% among those who tested in MDICP-3. The refusal rates among men are similar to those among women in all surveys. For MDHS, the refusal rate for men is $715 / 2984 \approx 0.240$ and for women is $886 / 3712 \approx 0.239$; the corresponding figures for MDICP-3 are $141 / 1490 \approx 0.094$ and $163 / 1723 \approx 0.094$, respectively, and for MDICP-4, $55 / 948 \approx 0.058$ and $60 / 1163 \approx 0.052$, respectively. Similar patterns of refusal rates are reported elsewhere.^25,35 The slight differences between our figures and those reported in Reniers and Eaton²⁵ and Obare³⁵ can be attributed to the different baseline samples used (for example, Reniers and Eaton²⁵ included males aged 15–54 whereas we only used those aged 15–49, in line with the 2004 MDHS report). The district-level HIV refusal map for MDHS shown in Figure 1 indicates higher rates in the central and southern parts of Malawi. There is high variation in the refusal rates across the districts.

Table 2.

Refusal patterns in MDHS, ANC and MDICP.

Source	No. eligible	No. refused	Percent
MDHS	6696	1601	23.9
ANC^a	7977	0	0.0
MDICP-3^b	3213	304	9.5
MDICP-4^c	2111	115	5.4

ANC: antenatal clinics; MDHS: Malawi Demographic and Health Survey; MDICP: Malawi Diffusion and Ideational Change Project.

^a Consent not required.

^b Among those contacted in MDICP-3.

^c Among those tested in MDICP-3 and contacted in MDICP-4.
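The sex-specific rates quoted in the text follow directly from the reported counts. As a minimal sketch (Python is used purely for illustration; the dictionary layout and function names are hypothetical and not part of the original analysis), the unweighted refusal rates by survey and sex can be recomputed as follows:

```python
# (refused, eligible) counts by survey and sex, as quoted in the text.
counts = {
    "MDHS": {"men": (715, 2984), "women": (886, 3712)},
    "MDICP-3": {"men": (141, 1490), "women": (163, 1723)},
    "MDICP-4": {"men": (55, 948), "women": (60, 1163)},
}


def refusal_rate(refused, eligible):
    """Unweighted refusal rate: refusals divided by eligible individuals."""
    return refused / eligible


def pooled_rate(by_sex):
    """Pool men and women by summing refusals and eligible individuals."""
    refused = sum(r for r, _ in by_sex.values())
    eligible = sum(n for _, n in by_sex.values())
    return refused / eligible


for survey, by_sex in counts.items():
    by_sex_rates = {sex: round(refusal_rate(*c), 3) for sex, c in by_sex.items()}
    print(survey, by_sex_rates, "pooled:", round(pooled_rate(by_sex), 3))
```

These are raw proportions; no sampling weights are applied, so they are only a consistency check on the counts rather than survey-weighted estimates.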

4.2 Adjustment of HIV prevalence estimates

We apply the various estimators considered in the previous section to the MDHS, ANC and MDICP data. A summary of the results is given in Table 3. For each estimator, we obtain separate HIV prevalence estimates for women and men. The estimates are then combined to derive overall estimates. In deriving these estimates, sampling weights need to be considered. The 2004 MDHS report²⁷ (Tables 12.5, Appendix G.1, and p. 452) uses sampling weights for calculating HIV prevalence and adjusted rates. These sampling weights are of three types: (1) HIV sampling weights for those who are tested; (2) individual sampling weights for those interviewed but not tested; and (3) household sampling weights for those who are neither interviewed nor tested. In Reniers and Eaton,²⁵ the sampling weighting scheme of the 2004 MDHS report was applied to the MDHS data but no weights (except by the subgroup proportion of the population) were applied to the MDICP data. Sampling weights do not apply to the ANC data since they come from women who visited ANC sites. To facilitate comparison with earlier results, we follow the same strategy as earlier studies in handling sampling weights for the MDHS and MDICP data. The ANC data are weighted by their proportional representation in the census. We return to the discussion of sampling weights and their relationship to refusal bias subsequently.

Table 3.

HIV prevalence estimates using MDHS, ANC and MDICP data.

Estimator	Men	Women	Overall
${\overset{\land}{π}}_{CC}$	0.1029	0.1347	0.1194
${\overset{\land}{π}}_{MSI}$	0.1154	0.1385	0.1274
${\overset{\land}{π}}_{1}$	0.1118	0.1368	0.1247
${\overset{\land}{π}}_{2}$	0.0992	0.1319	0.1165
${\overset{\land}{π}}_{RE}$	0.1130	0.1470	0.1306
${\overset{\land}{π}}_{3 -}$	0.0935	0.1174	0.1059
${\overset{\land}{π}}_{3 +}$	0.1183	0.1556	0.1376
${\overset{\land}{π}}_{4}$	–	0.1550^a	–
${\overset{\land}{π}}_{5 A}$ ^b	0.1144	0.1377	0.1265
${\overset{\land}{π}}_{5 B}$ ^c	0.1150	0.1397	0.1278

ANC: antenatal clinics; MDHS: Malawi Demographic and Health Survey; MDICP: Malawi Diffusion and Ideational Change Project.

^a Based only on pregnant females in the ANC survey.

^b Stepwise regression using covariates, X_i and ${\overset{\land}{π}}_{ANC}^{c}$ .

^c Fixed regression using ${\overset{\land}{π}}_{ANC}^{c}$ only.

There are 6696 individuals eligible for HIV testing in our MDHS sample. Out of these individuals, 1601 individuals (886 women and 715 men) expressly refuse to take an HIV test. Among the remaining 5095 individuals, 647 individuals (418 women and 229 men) are found to be HIV positive and 4448 individuals (2408 women and 2040 men) are HIV negative, giving an overall unweighted HIV prevalence of $647 / 5095 \approx 0.1270$ . All subsequent analyses are, however, based on weighted cases, as described earlier. The complete case estimate of HIV prevalence ${\overset{\land}{π}}_{CC}$ in women is 0.1347. Similarly, the complete case prevalence estimate for men is approximately 0.1029. The overall estimate combining the women and men estimates is about 0.1194. Compared to ${\overset{\land}{π}}_{CC}$ , the estimator ${\overset{\land}{π}}_{MSI}$ uses additional information from those who do not take an HIV test. For the prediction of HIV status, we use the same set of covariates as those in the MDHS 2004 report, Appendix G,²⁷ that includes both demographic and behavioural variables: age, wealth index, education, geographical region, rural/urban residence, age at first sex, work status, marital status, smoking/tobacco use, media exposure, religion, STI or STI symptoms, condom use, higher-risk sex in the last year (sex with a non-marital, non-cohabiting partner), test for AIDS, number of sexual partners in the last 12 months, sexually transmitted disease in the last year, and willingness to care for a relative with AIDS. Separate logistic regressions are carried out for women and men. The model is then applied to impute HIV status for those who refuse an HIV test. Using this procedure, the prevalence estimates for women and men are 0.1385 and 0.1154, respectively.
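To make the two calculations in this paragraph concrete, the following is a minimal sketch rather than the authors' implementation. It recomputes the unweighted complete case prevalence from the counts given above and then illustrates the general idea of imputation, in which predicted probabilities stand in for the unobserved outcomes of refusers. The function names are hypothetical, the paper's Equation (3) is not reproduced here, and the predicted probability of 0.16 is an invented illustrative value, not a fitted one.

```python
def weighted_prevalence(status, weights=None):
    """Weighted mean of 0/1 HIV status values (1 = positive)."""
    if weights is None:
        weights = [1.0] * len(status)
    return sum(w * y for w, y in zip(weights, status)) / sum(weights)


def imputed_prevalence(observed, predicted_for_refusers):
    """Unweighted average of observed 0/1 outcomes and of predicted
    probabilities that replace the refusers' unobserved outcomes."""
    values = list(observed) + list(predicted_for_refusers)
    return sum(values) / len(values)


# Unweighted counts among tested individuals, taken from the text.
women = [1] * 418 + [0] * 2408
men = [1] * 229 + [0] * 2040

cc_women = weighted_prevalence(women)
cc_men = weighted_prevalence(men)
cc_all = weighted_prevalence(women + men)  # 647 / 5095

# Hypothetical: a fitted model assigns each of the 886 women who refused
# a predicted probability of 0.16 of being HIV positive (illustration only).
msi_women = imputed_prevalence(women, [0.16] * 886)

print(round(cc_women, 4), round(cc_men, 4), round(cc_all, 4), round(msi_women, 4))
```

The unweighted sex-specific figures differ from the weighted estimates reported in Table 3, which is why the sampling weights described earlier matter for the published values.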

The inverse probability estimator ${\overset{\land}{π}}_{1}$ assumes that acceptance of HIV testing may be non-random and that the probability of acceptance can be captured by observable covariates. We use the same list of covariates from the MDHS 2004 report to estimate the propensity score for acceptance of HIV testing. Because some individuals have no information on some of the covariates, the model for men includes only 2304 observations from MDHS, as opposed to the entire sample of 2984 men. Out of the 2304 men, 1759 accepted an HIV test, with a weighted average acceptance rate of 0.835, but the interquartile range of the estimated propensity score is from 0.840 to 0.962. Similarly, the model for women is based on 2623 women instead of the entire sample of 3712 women. Out of these 2623 women, 2019 accepted an HIV test, with a weighted average acceptance rate of 0.747, but the interquartile range of the estimated propensity score is from 0.813 to 0.932. So for both men and women, the estimated propensity scores differ somewhat from their respective means, and ${\overset{\land}{π}}_{1}$ accounts for such differences by adjusting the complete case estimates. Indeed, for women and men, the values of ${\overset{\land}{π}}_{1}$ are 0.1368 and 0.1118, respectively, slightly higher than their complete case counterparts.
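The mechanics of inverse probability weighting can be illustrated with a small sketch. It assumes a normalised (ratio-type) form of the estimator, which may differ in detail from Equations (4) and (5) of the paper, and it uses an invented two-group population rather than MDHS data; the function name and all numbers are hypothetical.

```python
def ipw_prevalence(status, propensity, weights=None):
    """Normalised inverse probability weighted prevalence.

    status:     0/1 HIV outcomes for tested individuals only.
    propensity: estimated probability of accepting the test, same order.
    weights:    optional sampling weights, same order.
    """
    if weights is None:
        weights = [1.0] * len(status)
    numerator = sum(w * y / p for w, y, p in zip(weights, status, propensity))
    denominator = sum(w / p for w, p in zip(weights, propensity))
    return numerator / denominator


# Hypothetical population of two groups with 1000 people each.
# Group A: 90% accept testing, 10% prevalence among the 900 tested.
# Group B: 60% accept testing, 20% prevalence among the 600 tested.
status = [1] * 90 + [0] * 810 + [1] * 120 + [0] * 480
propensity = [0.9] * 900 + [0.6] * 600

complete_case = sum(status) / len(status)     # 210 / 1500 = 0.14
adjusted = ipw_prevalence(status, propensity)  # up-weights group B

print(round(complete_case, 4), round(adjusted, 4))
```

In this toy example the group that accepts testing less often also has higher prevalence, so the weighted estimate (0.15) lies above the complete case figure (0.14). The adjustment only removes bias if acceptance depends on the covariates used to estimate the propensity and not on the unobserved HIV status itself.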

Out of the 6696 individuals in our MDHS sample, 5816 report that they do not have a prior HIV test. These individuals form the basis for calculating ${\overset{\land}{π}}_{2}$ . Among women who do not have a prior HIV test, 359 have a positive HIV test result while 2138 are HIV negative, giving a weighted HIV prevalence estimate of 0.1319, and the corresponding estimate for men is 0.0992.

A total of 2874 individuals (1539 females and 1335 males) consent to an HIV test and provide complete information for analysis in MDICP-3. Of these individuals, 1996 consent to an HIV test in MDICP-4 and 115 refuse, while the HIV status for the rest is missing for other reasons. Among those individuals who are tested in MDICP-3, 185 (111 females and 74 males) are HIV positive and 2689 (1428 females and 1261 males) are HIV negative.

We repeat the analysis of Reniers and Eaton²⁵ using our data. Since we exclude males aged 50–54 years from the MDHS data whereas Reniers and Eaton included them, we do not expect the two sets of estimates to be identical. To compute the estimate using ${\overset{\land}{π}}_{RE}$ , we need to know whether an individual has taken the first-round HIV test (MDICP-3), whether the individual knows the test result, the actual test result, and the refusal of the second-round HIV test conducted in MDICP-4. The ${\overset{\land}{π}}_{RE}$ estimates for males and females are 0.1130 and 0.1470, respectively, and the combined overall estimate is 0.1306, which is quite similar to the figure of 0.132 in Reniers and Eaton (Table 2).²⁵ The same set of data is also used to find ${\overset{\land}{π}}_{3 -}$ and ${\overset{\land}{π}}_{3 +}$ . The bounds for men are 0.0935 and 0.1183, and for women, they are somewhat wider at 0.1174 and 0.1556, respectively.

To implement the estimator ${\overset{\land}{π}}_{4}$ , we first extract the number of ANC attendees and HIV positive cases from the 19 sentinel sites in the 2003 ANC data.²⁸ The site-specific numbers are then used to represent the HIV prevalence in the rural and urban areas in each of the 28 districts defined in the 2003 ANC Technical Report (Table 2).³⁶ The resulting rural HIV rates in the 28 districts range from 0.0969 to 0.2315 with a mean of 0.1349, while the urban rates range from 0.0993 to 0.3288 with a mean of 0.2010. Finally, the district-area numbers are weighted by the population size from the 1998 Census data (IPUMS, University of Minnesota and Malawi National Statistical Office, 1998 Population and Housing Census) to give an overall HIV prevalence estimate of 0.1550. Since the ANC data are based on pregnant women only, only one HIV prevalence estimate is obtained. Estimates using ANC survey data have been used as indicators for national HIV trends.^37–40
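The final weighting step described above amounts to a population-weighted average of district-area prevalence rates. The sketch below shows that calculation under simplifying assumptions: the function name is hypothetical, and the three district-area cells use invented prevalence and population figures, not the 2003 ANC or 1998 Census values.

```python
def population_weighted_prevalence(cells):
    """Population-weighted average prevalence.

    cells: list of (prevalence, population) pairs, one per district-area.
    """
    total_population = sum(population for _, population in cells)
    weighted_sum = sum(prevalence * population for prevalence, population in cells)
    return weighted_sum / total_population


# Hypothetical district-area cells: (prevalence, census population).
cells = [
    (0.10, 500_000),  # rural area of a hypothetical district
    (0.15, 300_000),  # rural area of another hypothetical district
    (0.25, 200_000),  # urban area of a hypothetical district
]

print(round(population_weighted_prevalence(cells), 4))
```

Because rural areas carry most of the population weight, the weighted figure sits much closer to the rural rates than a simple average of the cell rates would.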

The estimator ${\overset{\land}{π}}_{5}$ allows refusal to be dependent on the (unobservable) HIV status (for those who refuse testing). To model the propensity score function for (non)-refusal, we impute the unobservable HIV status with HIV prevalence estimates from the ANC data. The ANC prevalence estimates are obtained for different district-areas; for each individual who resides in a particular district-area, his/her HIV status is imputed by ${\overset{\land}{π}}_{ANC}^{c}$ .

We consider two estimates based on ${\overset{\land}{π}}_{5}$ . The first one, ${\overset{\land}{π}}_{5 A}$ , uses ${\overset{\land}{π}}_{ANC}^{c}$ and a stepwise regression procedure to select from the same list of covariates used in ${\overset{\land}{π}}_{1}$ to model the propensity score. The second one, ${\overset{\land}{π}}_{5 B}$ , uses only ${\overset{\land}{π}}_{ANC}^{c}$ for modelling the propensity score. These estimated propensity scores are then used in ${\overset{\land}{π}}_{5}$ to give different prevalence estimates.

Using ${\overset{\land}{π}}_{ANC}^{c}$ and a selection of other covariates to model the propensity score, the corresponding HIV prevalence estimates, ${\overset{\land}{π}}_{5 A}$ , for women and men are 0.1377 and 0.1144, respectively. When the propensity score is modelled only with ${\overset{\land}{π}}_{ANC}^{c}$ , the corresponding HIV prevalence estimates, ${\overset{\land}{π}}_{5 B}$ , for women and men are 0.1397 and 0.1150, respectively.

Table 4 gives the district-level HIV prevalence estimates obtained using the various methods discussed in this paper. There is high variation in HIV prevalence estimates across districts of Malawi, with values ranging from around 5% in Kasungu to as much as 25% in Blantyre. The HIV prevalence estimates from ${\overset{\land}{π}}_{1}, {\overset{\land}{π}}_{5 A}$ and ${\overset{\land}{π}}_{5 B}$ are very similar; in most districts, these estimators give higher values than ${\overset{\land}{π}}_{CC}$ . On the other hand, ${\overset{\land}{π}}_{2}$ is similar to ${\overset{\land}{π}}_{CC}$ in most districts. District-level HIV prevalence rates for urban and rural areas directly calculated from MDHS and ANC data are presented in Figure 2. In both data sources, HIV prevalence rates are higher in the urban areas than in the rural areas.

Figure 2.

Estimated HIV prevalence rates. (a) Complete case estimates using urban MDHS data. (b) Complete case estimates using rural MDHS data. (c) District-area estimates using urban ANC data. (d) District-area estimates using rural ANC data.

Table 4.

District-level HIV prevalence estimates using various methods.

District	${\overset{\land}{π}}_{CC}$	${\overset{\land}{π}}_{1}$	${\overset{\land}{π}}_{2}$	${\overset{\land}{π}}_{5 A}$ ^a	${\overset{\land}{π}}_{5 B}$ ^b
Blantyre	0.2234	0.2538	0.2140	0.2561	0.2561
Kasungu	0.0418	0.0478	0.0442	0.0481	0.0482
Machinga	0.1159	0.1108	0.1037	0.1093	0.1108
Mangochi	0.2118	0.2275	0.2024	0.2350	0.2349
Mzimba	0.0523	0.0603	0.0497	0.0585	0.0592
Salima	0.0876	0.0706	0.0844	0.0737	0.0737
Thyolo	0.2150	0.2301	0.2203	0.2343	0.2346
Zomba	0.1780	0.1820	0.1683	0.1817	0.1817
Mulanje	0.1969	0.1986	0.1946	0.2003	0.1993
Lilongwe	0.0375	0.0255	0.0362	0.0349	0.0350
Other districts	0.1093	0.1093	0.1096	0.1106	0.1106

^a Stepwise regression using X_i and ${\overset{\land}{π}}_{ANC}^{c}$ .

^b Fixed regression using ${\overset{\land}{π}}_{ANC}^{c}$ only.

5 Discussion

This study explored several methods for adjusting for refusal bias in HIV prevalence estimates from population-based surveys. It also conducted a thorough investigation of robustness against refusal bias. Compared to the naïve complete case estimator ${\overset{\land}{π}}_{CC}$ , all point estimators except ${\overset{\land}{π}}_{2}$ give higher adjusted estimates for both men and women (and overall). These results are consistent with those observed in earlier studies.^27,22,25,6

Recall that for ${\overset{\land}{π}}_{2}$ , the key assumptions are equations (6) and (7), which essentially mean that ${\overset{\land}{π}}_{2}$ is a type of complete case estimator applied to those who had never been tested before the 2004 MDHS survey. Hence, it is not surprising that the ${\overset{\land}{π}}_{2}$ estimates are not too different from the naïve ${\overset{\land}{π}}_{CC}$ estimates. Both estimators implicitly assume that data are missing completely at random. In the case of ${\overset{\land}{π}}_{CC}$ , the observed data are considered a random sample of the population. In the case of ${\overset{\land}{π}}_{2}$ , the subsample of those who had no prior HIV test and who accepted an HIV test is considered a random sample.

Using the remaining methods, the prevalence for men is consistently adjusted upwards (from the complete case estimate) by about one percentage point, irrespective of the method used.

The case for women is somewhat different: the adjustment is method dependent. The results can be broadly classified into three groups, based on the methods used. The first group of methods, which includes ${\overset{\land}{π}}_{MSI}, {\overset{\land}{π}}_{1}$ and ${\overset{\land}{π}}_{5 A}, {\overset{\land}{π}}_{5 B}$ , uses covariates to model the missing HIV test results (or the propensity that HIV test results are observed). Their results are all quite similar: all give an upward adjustment of HIV prevalence of between 0.2 and 0.5 percentage points from the complete case estimate. These methods are related in the sense that they are premised on the assumption that HIV status (and hence the propensity to accept an HIV test) can be modelled using observable demographic and behavioural covariates. Therefore, the methods would not be effective if these covariates have low predictive power. A multi-country study of bias in HIV estimates from DHS²² found that HIV prevalence is not strongly related to observable covariates.

The second group of methods, which combine the MDHS data and MDICP data ( ${\overset{\land}{π}}_{RE}, {\overset{\land}{π}}_{3}$ ), suggests upward adjustments of about one percentage point. Compared to the complete case estimator, the estimator ${\overset{\land}{π}}_{RE}$ adjusts the prevalence of women upwards by about 1.2 percentage points (from 0.1347 to 0.1470). Reniers and Eaton²⁵ found that, compared to those who accept an HIV test, individuals who refuse an HIV test are more than 4.5 times as likely to be HIV positive; the upward adjustment is therefore reasonable. On the other hand, ${\overset{\land}{π}}_{2}$ , while using the same assumptions as ${\overset{\land}{π}}_{RE}$ , does not give an upward adjustment of the complete case rates (for men, women or overall). This raises the question of why they are different. Comparing equation (8) to equation (10), we notice that the latter ignores those who refused to be tested (see above for the complete case interpretation of ${\overset{\land}{π}}_{2}$ ) while the former explicitly estimates the missing HIV status using MDICP data. Hence, ${\overset{\land}{π}}_{RE}$ is more similar to an MSI or imputation approach.^29,41 Naturally, if we assume that equations (6) and (7) hold and that the MDICP data can be used to replace the missing MDHS data, ${\overset{\land}{π}}_{RE}$ uses additional covariate information from observations with missing HIV status and is hence more accurate than ${\overset{\land}{π}}_{2}$ .

Another method that also uses the MDICP data is ${\overset{\land}{π}}_{3}$ . We observe that the lower and upper bounds for the HIV prevalence are fairly tight around the complete case estimates. Since these bounds are created with very mild assumptions, the fact that they are very close to the complete case estimates suggests that the refusal bias in the MDHS estimates may be quite small. Between men and women, the bounds for women are much wider. In particular, the upper bound for women is over 2 percentage points above the complete case estimate for women. This result is consistent with the behaviour of ${\overset{\land}{π}}_{RE}$ , which adjusts the estimate for women upwards.

The third group consists of the method that uses the ANC data. The ANC survey provides a single prevalence estimate ( ${\overset{\land}{π}}_{4}$ ) for women, which is considerably higher than most of the prevalence rates from the other methods. This result is not surprising since ANC surveys only capture data from pregnant women in more urbanised areas who choose to go to an ANC during their pregnancy and who have rates higher than the national average. There is indeed some evidence that applying ANC prevalence directly to give population prevalence estimates leads to biases.^42–44 Nevertheless, ANC prevalence does reflect the actual but unknown prevalence within each district-area and is free of refusal (or other kinds of non-response) bias.
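The adjustments discussed in this section can be checked against the point estimates in Table 3. The following sketch, in which the dictionary and function names are hypothetical, computes the difference between each point estimator and the complete case estimate in percentage points for men and women.

```python
# Point estimates copied from Table 3 as (men, women).
table3 = {
    "CC": (0.1029, 0.1347),
    "MSI": (0.1154, 0.1385),
    "1": (0.1118, 0.1368),
    "2": (0.0992, 0.1319),
    "RE": (0.1130, 0.1470),
    "5A": (0.1144, 0.1377),
    "5B": (0.1150, 0.1397),
}


def adjustments(table, baseline="CC"):
    """Difference from the baseline estimator, in percentage points."""
    base_men, base_women = table[baseline]
    return {
        name: {
            "men": round(100 * (men - base_men), 2),
            "women": round(100 * (women - base_women), 2),
        }
        for name, (men, women) in table.items()
        if name != baseline
    }


adj = adjustments(table3)
for name, diff in adj.items():
    print(f"pi_{name}: men {diff['men']:+.2f}, women {diff['women']:+.2f}")
```

Only the estimator labelled 2 comes out below the complete case figure for both sexes, and the covariate-based estimators move the estimate for women by less than the estimator labelled RE does.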

6 Conclusion and implication for future research

The motivation for our paper is to provide a coherent and comprehensive conceptual framework for studying survey data with non-response due to refusals. We revisited some existing methods and also introduced new ones. Our paper offers a novel approach to the challenges that refusals create and proposes possible solutions for them. We compared various methods, clarifying their underlying assumptions, implications and data requirements. The approach offered in this paper is especially useful for practitioners in charge of planning and analysis. The primary application of our approach is the estimation of HIV prevalence particularly in Africa, where HIV/AIDS remains epidemic or endemic. Our approach is also applicable to other issues and areas with similar challenges.

Longitudinal surveys are still uncommon in many parts of the developing world, since they are difficult to implement and the quality of data from such surveys is often poor because of the difficulty of tracking mobile populations. While longitudinal studies are still relatively rare, the availability of nationally representative longitudinal studies is on the rise in developing countries. One of our contributions lies in proposing ways to meaningfully bring together three very different types of data: MDHS, ANC and MDICP. We show how these data can be combined when none of them alone allows us to reliably estimate HIV prevalence in Malawi.

A common approach to adjusting for (refusal) bias in surveys is weighting. Methods such as ${\overset{\land}{π}}_{1}$ in this paper, whether using sampling weights or weights based on fitting a propensity function, use this approach. This approach works only if refusal is independent of the outcome, given the covariates that are used to model the propensity function. In the missing data literature, this condition is called missing at random. However, it can never be confirmed whether the missing at random assumption actually holds. We considered alternative methods to solve this problem by exploiting information from auxiliary surveys. Using the assumptions of Reniers and Eaton,²⁵ we identified a new method ( ${\overset{\land}{π}}_{2}$ ) using only MDHS data. The method uses data from those who have never been tested and do not know their HIV status, and hence, their decision to accept an HIV test is arguably less susceptible to bias.

Further, we introduced a ‘bound’ approach using data from MDICP, by which we estimated the plausible lower and upper bounds ( ${\overset{\land}{π}}_{3 -}, {\overset{\land}{π}}_{3 +}$ ) of the prevalence based on a set of weak and reasonable assumptions. This approach is potentially useful because it is often difficult to validate or falsify an underlying assumption. Furthermore, it shows that a carefully designed and implemented localised study may also be helpful for understanding the magnitude of non-response bias.

We also proposed two different methods using the ANC data. The first method ( ${\overset{\land}{π}}_{4}$ ) uses summary statistics from antenatal care units and combines them with census data to obtain prevalence estimates. An advantage of this approach is that no micro-data is needed and therefore the method can be implemented easily. The second method ( ${\overset{\land}{π}}_{5}$ ) combines the MDHS data with the ANC data to produce prevalence estimates. The novel feature of this method is the use of weights based on ANC data that adjust for non-ignorable missingness. Since ANC surveys are relatively free from refusal bias and are carried out at more frequent intervals than DHS, these two methods offer the possibility of obtaining prevalence estimates on a more contemporaneous basis.

In the presence of non-response, all analytic methods require some assumptions and it is hard to determine which method is best. However, when alternative methods are available, one way of addressing the refusal bias problem is to use all methods and compare their results. In the current study, the prevalence estimates range from 0.0935 to 0.1183 for men, from 0.1174 to 0.1556 for women, and from 0.1059 to 0.1376 overall (see Table 3). The relatively narrow range for men tells us that the refusal bias, if it exists at all, is practically not a major issue. The refusal bias for women may be larger but it is still small in absolute value and would be no larger than 3 percentage points. As these results indicate, the width of the range reflects the limits of the confidence that we can place in our results.
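The ranges quoted above are the smallest and largest entries in each column of Table 3, including the bounds. A short sketch of this comparison across estimators is given below; the names `table3` and `column_range` are hypothetical, and `None` marks entries that are not available in the table.

```python
# Estimates copied from Table 3 as (men, women, overall).
table3 = {
    "CC": (0.1029, 0.1347, 0.1194),
    "MSI": (0.1154, 0.1385, 0.1274),
    "1": (0.1118, 0.1368, 0.1247),
    "2": (0.0992, 0.1319, 0.1165),
    "RE": (0.1130, 0.1470, 0.1306),
    "3-": (0.0935, 0.1174, 0.1059),
    "3+": (0.1183, 0.1556, 0.1376),
    "4": (None, 0.1550, None),  # ANC-based estimate, women only
    "5A": (0.1144, 0.1377, 0.1265),
    "5B": (0.1150, 0.1397, 0.1278),
}


def column_range(table, column):
    """Smallest and largest available estimate in one column of the table."""
    values = [row[column] for row in table.values() if row[column] is not None]
    return min(values), max(values)


ranges = {
    name: column_range(table3, index)
    for index, name in enumerate(("men", "women", "overall"))
}
for name, (low, high) in ranges.items():
    print(f"{name}: {low:.4f} to {high:.4f}")
```

In every column the extremes are set by the lower and upper bounds, so the quoted ranges are effectively the bound intervals from the MDICP-based approach.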

Our finding of an acceptable level of refusal bias in the Malawi prevalence estimates can be contrasted with that reported in Obare,³⁵ where substantial potential bias is attributed to refusal/absence using the MDICP data. In that report, the percentage of HIV positive is 4.4 among those who accept an HIV test in both MDICP-3 and MDICP-4, compared to 15.5 and 13.0, respectively, for those who refuse or are absent for the test in MDICP-4. However, in our own analysis, we found that this difference is due largely to those who already know their HIV test results from MDICP-3. Among those who do not know the results of the first-round HIV test, the proportion of people who refuse is similar between HIV-positive and HIV-negative individuals. In contrast, among those who know the results of the first-round HIV test, the proportion of people who refuse is substantially higher for HIV-positives than HIV-negatives. We may argue that a person who knows his/her HIV positive status is more likely to decline a second test because HIV positive status cannot be changed and the person may feel another test is meaningless. In our paper, the estimates ${\overset{\land}{π}}_{RE}$ and ${\overset{\land}{π}}_{2}$ are calculated using those who do not know their HIV status, whereas the bounds ${\overset{\land}{π}}_{3 -}$ and ${\overset{\land}{π}}_{3 +}$ explicitly allow for differences in refusal rates between those who know and those who do not know their HIV status under a set of weak assumptions. The ANC surveys can be assumed to be free from refusal bias, and ${\overset{\land}{π}}_{4}$ uses this assumption to come up with refusal bias-free prevalence estimates; for ${\overset{\land}{π}}_{5}$ , the ANC data are used indirectly to create weights that adjust for refusals. None of the methods considered in this paper shows a large upward adjustment from the weighted estimate ${\overset{\land}{π}}_{1}$ and the unadjusted estimate ${\overset{\land}{π}}_{CC}$ .

Supplemental Material

Supplemental Material for Refusal bias in HIV data from the Demographic and Health Surveys: Evaluation, critique and recommendations by Oyelola A Adegboye, Tomoki Fujii and Denis HY Leung in Statistical Methods in Medical Research

Footnotes

Acknowledgements

We acknowledge ORC Macro for granting us access to the MDHS data. We thank the Population Studies Center, University of Pennsylvania for providing us with the MDICP data. In particular, we gratefully acknowledge the help of Dr Philip Anglewicz for sending us the data and documentation for MDICP-3 and MDICP-4. The ANC data were obtained from the 2003 Malawi National AIDS Commission report. The census data were part of the 1998 and 2008 Population and Housing Census carried out by the National Statistical Office, Government of Malawi and made available by the Minnesota Population Center (Integrated Public Use Microdata Series, International: Version 6.1 [Machine-readable database]. Minneapolis: University of Minnesota, 2011).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Note

Supplemental material

Supplemental material for this article is available online.

References

1. Global HIV/AIDS response: epidemic update and health sector progress towards universal access: progress report 2011. Geneva: WHO, 2011.

2. Boerma, Ghys, Walker. Estimates of HIV-1 prevalence from national population-based surveys as a new gold standard. Lancet 2003; 362: 1929–1931.

3. Garcia-Calleja, Gouws, Ghys. National population-based HIV prevalence surveys in sub-Saharan Africa: results and implications for HIV and AIDS estimates. Sex Transm Infect 2006; 82: iii64–iii70.

4. Marston, Harriss, Slaymaker. Non-response bias in estimates of HIV prevalence due to the mobility of absentees in national population-based surveys: a study of nine national surveys. Sex Transm Infect 2008; 84: i71–i77.

5. Groves, Dillman, Eltinge, et al. Survey nonresponse, Chichester: Wiley, 2002.

6. Obare, Fleming, Anglewicz, et al. Acceptance of repeat population-based voluntary counseling and testing for HIV in rural Malawi. Sex Transm Infect 2009; 85: 139–144.

7. Lynn, Clarke. Separating refusal bias and non-contact bias: evidence from UK national surveys. The Statistician 2002; 51: 319–333.

8. Hawkes, Plewis. Modelling non-response in the National Child Development Study. J R Stat Soc A 2006; 169: 479–491.

9. Billet, Philippens, Fitzgerald, et al. Estimation of nonresponse bias in the European Social Survey: using information from reluctant respondents. J Official Stat 2007; 23: 135–162.

10. Durrant, Steele. Multilevel modelling of refusal and non-contact in household surveys: evidence from six UK Government surveys. J R Stat Soc A 2009; 172: 361–381.

11. Lynn. Non-response biases in surveys of schoolchildren: the case of the English Programme for International Student Assessment (PISA) samples. J R Stat Soc A 2012; 175: 915–938.

12. Thomsen, Holmøy AMK. Combining data from surveys and administrative record systems: The Norwegian experience. Int Stat Rev 1998; 66: 201–221.

13. Zanutto, Zaslavsky. Using administrative records to impute for nonresponse. In: Groves, Dillman, Eltinge, et al. (eds). Survey non-response, Chichester: Wiley, 2002.

14. Yucel, Zaslavsky. Imputation of binary treatment variables with measurement error in administrative data. J Am Stat Assoc 2005; 100: 1123–1132.

15. van den Berg, Lindeboom, Dolton. Survey non-response and the duration of unemployment. J R Stat Soc A 2006; 169: 585–604.

16. Stoop. Survey nonrespondents. Field Methods 2004; 16: 23–54.

17. Kreuter, Müller, Trappmann. Nonresponse and measurement error in employment research: making use of administrative data. Public Opin Q 2010; 74: 880–906.

18. Olson. Do non-response follow-ups improve or reduce data quality?: a review of the existing literature. J R Stat Soc A 2013; 176: 129–145.

19. Alho. Adjusting for nonresponse bias using logistic regression. Biometrika 1990; 77: 617–624.

20. Burton, Laurie, Lynn. The long-term effectiveness of refusal conversion procedure on longitudinal surveys. J R Stat Soc A 2006; 169: 459–478.

21. Rubin. Multiple imputation for nonresponse in surveys, New York, NY: Wiley, 1987.

22. Mishra, Barrere, Hong, et al. Evaluation of bias in HIV seroprevalence estimates from national household surveys. Sex Transm Infect 2008; 84: i63–i70.

23. Hogan, Salomon, Canning, et al. National HIV prevalence estimates for sub-Saharan Africa: controlling selection bias with Heckman-type selection models. Sex Transm Infect 2012; 88: i17–i23.

24. Heckman. Sample selection bias as a specification error. Econometrica 1979; 47: 153–161.

25. Reniers, Eaton. Refusal bias in HIV prevalence estimates from nationally representative seroprevalence surveys. AIDS 2009; 23: 1–9.

26. Floyd, Molesworth, Dube, et al. Underestimation of HIV prevalence in surveys when some people already know their status, and ways to reduce the bias. AIDS 2013; 27: 233–242.

27. National Statistical Office and ORC Macro. Malawi demographic and health survey 2004. National Statistical Office and ORC Macro, 2005.

28. National AIDS Commission. HIV sentinel surveillance report. Ministry of Health and Population, Malawi, 2003.

29. Pepe, Reilly, Fleming. Auxiliary outcome data and the mean-score method. J Stat Plann Inference 1994; 42: 137–160.

30. Horvitz, Thompson. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 1952; 47: 663–685.

31. The POLICY Project. Estimating national HIV prevalence in Malawi from sentinel surveillance data. The National AIDS Control Programme, Lilongwe, Malawi, 2001.

32. Montana, Mishra, Hong. Comparison of HIV prevalence estimates from antenatal care surveillance and population-based surveys in sub-Saharan Africa. Sex Transm Infect 2008; 84: i78–i84.

33. Little RJA and Rubin DB. Statistical analysis with missing data. New York, NY: Wiley, 2002.

34. National AIDS Commission. HIV and syphilis sero-survey and national HIV prevalence and AIDS estimates report for 2007. Ministry of Health, Malawi, 2008.

35. Obare. Nonresponse in repeat population-based voluntary counseling and testing for HIV in rural Malawi. Demography 2010; 47: 651–665.

36. National AIDS Commission. Estimating national HIV prevalence in Malawi from Sentinel surveillance data. Technical Report, Ministry of Health and Population, Malawi, 2003.

37. Kigadye, Klokke, Nicoll, et al. Sentinel surveillance for HIV-1 among pregnant women in a developing country: 3 years' experience and comparison with a population serosurvey. AIDS 1993; 7: 849–855.

38. Fylkesnes, Ndhlovu, Kasumba, et al. Studying dynamics of the HIV epidemic: population-based data compared with sentinel surveillance in Zambia. AIDS 1998; 12: 1227–1242.

39. Glynn, BACM Jr, Musonda, Kahindo, et al. Factors influencing the difference in HIV prevalence between antenatal clinic and general population in sub-Saharan Africa. AIDS 2001; 15: 1717–1725.

40. Asamoah-Odei, Garcia Calleja, Boerma. HIV prevalence and trends in sub-Saharan Africa: no decline and large subregional differences. Lancet 2004; 364: 35–40.

41. Chen. A robust imputation method for surrogate outcome data. Biometrika 2000; 87: 711–716.

42. Zaba, Carpenter, Boerma, et al. Adjusting ante-natal clinic data for improved estimates of HIV prevalence among women in sub-Saharan Africa. AIDS 2000; 14: 2741–2750.

43. Gregson, Terceira, Kakowa, et al. Study of bias in antenatal clinic HIV-1 surveillance data in a high contraceptive prevalence population in sub-Saharan Africa. AIDS 2002; 16: 643–652.

44. Gouws, Mishra, Fowler. Comparison of adult HIV prevalence from national population-based surveys and antenatal clinic surveillance in countries with generalised epidemics: implications for calibrating surveillance data. Sex Transm Infect 2008; 84: i17–i23.
