Screening for diabetes mellitus in the US population using neural network-based modeling and complex survey designs

Abstract

Complex survey designs are widely used in medical cohort studies. Developing risk score models that adequately account for the sampling design is essential to minimize selection bias and obtain representative population estimates. This work addresses three complementary objectives. First, we propose a general predictive framework for regression and classification tasks that utilizes neural networks to incorporate survey weights into the model estimation process. Second, we introduce a procedure for quantifying prediction uncertainty based on conformal inference, adapted to the characteristics of complex survey data. Third, we demonstrate the application of the proposed methodology in a case study assessing the risk of diabetes mellitus in the US population, using the NHANES 2011–2014 cohort. The empirical results show that models of varying complexity, each using different sets of predictors, achieve different trade-offs between predictive performance and economic cost while maintaining generalizability at the population level. Although the case study focuses on diabetes, the proposed framework is directly applicable to the development of clinical prediction models for other diseases and complex survey datasets. All software and data used in this study are publicly available on GitHub.

Keywords

Survey data, neural networks, diabetes, disease scores

1. Introduction

Scientific experimental design has been identified as a key factor in the biomedical reproducibility crisis, as highlighted by several studies.^1–3 Although some practitioners emphasize the importance of study design, its influence is often underestimated. The significance of robust experimental design dates back to Ronald Fisher’s pioneering work in statistics⁴ and remains a central topic in modern statistical science, particularly in the context of dynamic and adaptive designs. In medical research, a sound experimental design is essential for developing and validating new drugs and clinical treatments, especially within randomized clinical trials.⁵

In addition, the survey sampling methods used in nationally representative studies, such as the National Health and Nutrition Examination Survey (NHANES), greatly influence the generalizability of clinical predictive models.^6–8 NHANES, renowned for its reliability and comprehensive data collection, employs a multistage complex sampling design to select a representative sample of the US civilian non-institutionalized population. This approach incorporates hierarchical selection stages—from states to cities—and applies post-stratification techniques to mitigate non-response bias and improve population representativeness, thereby enhancing the efficiency of statistical estimators.

In contrast, other clinical datasets, such as the UK Biobank, often suffer from selection bias due to the voluntary nature of participant recruitment.^9,10 The sophisticated experimental protocol of the NHANES, by comparison, supports more reliable conclusions, serves as a benchmark for monitoring health behaviors in the US population, and is routinely used in policy-making and clinical surveillance. The success of statistical analyses in such studies depends heavily on adopting methods that adequately account for the underlying sampling design.

Developing risk scores is critical for healthcare planning, particularly in public health, as these scores estimate the likelihood of disease onset and help identify individuals at elevated risk.^11–14 Based on disease score assessments, healthcare strategies can be implemented, such as customized follow-up, routine checks, and non-invasive interventions, to reduce healthcare costs and improve population health outcomes.^15–17

However, constructing disease-specific risk scores from observational data is inherently vulnerable to selection bias, limiting the generalizability of the findings.⁹ This issue is especially prominent in diabetes research, where the predictive performance of risk scores can vary substantially across studies due to differences in genetic background, demographic factors, and study design. Practical limitations, including high costs and technical constraints, often hinder the implementation of efficient random sampling designs.

Despite the widespread use of complex survey designs, their integration with modern machine learning techniques and survey-weighted inference remains limited. Furthermore, the inferential properties of these models continue to pose significant challenges. To bridge this gap and potentially improve clinical conclusions in biomedical research (see Table 1 for the NHANES case), we propose a neural network-based prediction framework that incorporates uncertainty quantification via conformal prediction techniques. We apply this methodology to develop reliable risk scores for detecting diabetes in the US population.

Table 1.
Impacts of not using survey weights in NHANES data analysis.

Aspect	Impact of not using correct NHANES survey weights
Bias in estimates	Biased statistical results, misrepresenting certain groups
Loss of representativeness	Results not reflective of the US civilian non-institutionalized population
Policy decisions	Potential misinformed public health policies and resource allocation
Statistical inaccuracy	Incorrect standard errors, confidence intervals, and $p$-values
Ethical considerations	Ethical concerns due to disproportionate exclusion of minority groups

NHANES: National Health and Nutrition Examination Survey.

1.1. Diabetes study case

Diabetes mellitus represents a significant public health challenge, currently affecting approximately 12% of the US population. A notable concern, particularly in type-2 diabetes, is the high rate of undiagnosed cases. The Centers for Disease Control and Prevention reported in 2020 that approximately 21% of diabetes cases in the United States remained undetected.¹⁸ Recent research suggests that this percentage could be even higher.¹⁹

The prevalence of sedentary lifestyles, especially in developed countries, contributes to a worrying projected increase in diabetes cases. Currently affecting approximately 9.3% of the global population, projections suggest an increase to 10.2% by 2030 and 10.9% by 2045. This growing trend highlights the pressing need for effective public health strategies.

The economic impact of diabetes is substantial. For example, in 2022, the total cost in the United States was estimated at $412.9 billion, including $306.6 billion in direct medical expenses and $106.3 billion in indirect costs.¹⁶ The burden of diabetes extends beyond financial costs, profoundly affecting life expectancy and quality of life, largely due to delayed diagnoses and inadequate glycemic control, which frequently result in severe complications such as cardiovascular diseases. Addressing these challenges requires advancing precision medicine by implementing data-intensive frameworks that enable the systematic development of diagnostic tools and therapeutic strategies, informed by continuous, high-resolution data from wearable biosensors and electronic health records.

In the case of diabetes mellitus, the diagnosis typically involves biomarkers such as glycosylated hemoglobin (HbA1c) and fasting plasma glucose (FPG). While FPG tests are cost-effective, HbA1c testing, a primary biomarker for diabetes, is more expensive and complex. Despite its widespread use, particularly in non-risk groups, HbA1c testing presents significant challenges. Medical guidelines, including those from the American Diabetes Association (ADA), recommend using both HbA1c and FPG, along with oral glucose tolerance tests, especially in cases of gestational diabetes,²⁰ for a comprehensive assessment.

In disease diagnosis and screening, statistical and machine learning models that leverage clinical variables show promise for identifying high-risk individuals. These predictive models can effectively stratify patients, allowing personalized clinical approaches in accordance with precision medicine. However, their predictive power can vary across patient demographics and in real-world applications with limited sample sizes, posing challenges for accurately determining a patient’s glucose status.

In this paper, we address these challenges using data from the NHANES from 2011 to 2014. Given the analytical challenges of properly analyzing complex survey designs and the relative scarcity of non-parametric models in this domain, we propose a neural network predictive framework. The related estimators are designed to ensure universal approximation capabilities in predictive tasks. Furthermore, we introduce an uncertainty quantification framework based on conformal inference for survey data, thereby enhancing our ability to quantify the predictive limits of these models. This holistic approach aims to improve the precision and applicability of predictive models to detect diabetes mellitus.

1.2. Machine learning models for survey data

The application of non-parametric regression models in survey data analysis remains a relatively unexplored area.²¹ Traditional approaches, such as the Nadaraya–Watson estimator²² and local polynomial regression, have been proposed. However, their statistical performance tends to deteriorate rapidly as the number of predictors increases, even when considering as few as three covariates.²³

Recently, machine learning algorithms—such as neural networks and random forests—have been recommended in the literature as non-parametric alternatives to classical regression models. However, their generalization to inferential and predictive tasks in complex survey designs has not yet been adequately addressed. To the best of our knowledge, the closest related work is the kernel ridge regression models proposed in Matabuena and Petersen⁷ for functional and distributional data analysis applications.⁶

The goal of this paper is to address this gap in the literature by proposing a neural network framework for classification and regression models, equipped with automated uncertainty quantification for model predictions.

Given the common prevalence of non-linear biological data, the universal approximation properties of neural networks, combined with their competitive performance in high-dimensional settings, make them excellent candidates for modeling tasks across various domains. In particular, for biomedical applications, the proposed predictive framework offers a novel opportunity to develop accurate and reliable risk scores for specific diseases.

1.3. Summary of contributions

Building on the motivation outlined in the Introduction, this study seeks to address the methodological and practical challenges of applying modern machine learning techniques to complex survey data, ensuring both statistical efficiency and population-level generalizability. The purpose of this paper is to construct disease risk scores for predictive modeling using nationally representative health datasets, with the NHANES serving as a case study to estimate disease risk. The specific objectives are as follows.

Develop a neural network modeling strategy for complex survey data. This includes the explicit incorporation of survey weights in model training and evaluation, tailored to the construction of disease risk scores for nationally representative cohorts such as the NHANES.

Predictive uncertainty quantification for survey designs. We adapt conformal inference techniques to complex survey settings, acknowledging that assumptions such as exchangeability and finite-sample properties may be violated. However, algorithms can retain asymptotic consistency, provided that the underlying deep learning models exhibit this property for each estimator.

Extend the framework to both regression and classification tasks. In particular, we introduce a novel neural network-based quantile regression approach that enables survey-weighted conformal prediction, addressing a gap in the literature on non-linear quantile regression for survey data.²⁴

Apply the framework to type-2 diabetes risk prediction. We compare models of varying complexity and implementation cost and analyze the trade-offs between predictive precision and resource requirements across population subgroups.

Ensure reproducibility and accessibility. The complete Python implementation of all proposed methods is openly available at https://github.com/juancarlosvidal/survey_data_diabetes.

1.4. Outline of the paper

The structure of the paper is as follows. Section 2 introduces the proposed neural network methodology designed specifically for survey data, together with the proposed approach for quantifying uncertainty in model predictions. In Section 3, we present a simulation study in a binary classification setting. Section 4 applies the models to a case study on diabetes using the NHANES data. Section 5 discusses the methodological contributions, broader applications in biomedical research, and the practical implications of our findings in the case study.

2. Methodology

2.1. Model estimation

Suppose that we observe a random sample $D_{n} = \{(X_{i}, Y_{i}) \sim P\}_{i = 1}^{n}$ drawn from a finite population $M$ of size $N$, according to a complex survey design. For each unit of observation $i$ $(i = 1, \dots, n)$, we associate a weight $w_{i}$ that reflects the number of population units represented by the $i$th sampled unit. We assume that $w_{i} = 1 / π_{i}$, where $π_{i}$ denotes the probability of selecting unit $i$ under the sampling design. In our scientific application, the final weights $w_{i}$ incorporate post-stratification corrections to account for non-response, in accordance with the NHANES survey design guidelines (https://www.cdc.gov/nchs/nhanes/tutorials/weighting.aspx).
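To make the role of the weights $w_{i} = 1 / π_{i}$ concrete, the following minimal sketch computes a Horvitz–Thompson estimate of a population total and the corresponding weighted mean. The function names and the toy inclusion probabilities are illustrative assumptions and are not taken from the released code.

```python
import numpy as np

def horvitz_thompson_total(y, pi):
    """Estimate a population total from sampled values y and inclusion probabilities pi."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)  # design weights w_i = 1 / pi_i
    return float(np.sum(w * y))

def weighted_mean(y, pi):
    """Weighted mean: the weighted total divided by the sum of the weights."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)
    return float(np.sum(w * y) / np.sum(w))

# Toy example (hypothetical values): three sampled units
y = [1.0, 0.0, 1.0]
pi = [0.5, 0.25, 0.1]
total = horvitz_thompson_total(y, pi)  # 2*1 + 4*0 + 10*1 = 12
mean = weighted_mean(y, pi)            # 12 / (2 + 4 + 10) = 0.75
```

A unit sampled with probability 0.1 stands for ten population units, which is why it dominates both estimates in this toy example.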

We assume that each predictor $X_{i} \in \mathbb{R}^{p}$ represents either originally continuous covariates or categorical variables that have been appropriately encoded (e.g. by one-hot or effect coding). The response variable $Y_{i}$ belongs to a set of categories $\{1, \dots, K\}$ in a multiclass classification setting or is a continuous scalar variable $Y_{i} \in \mathbb{R}$ in the case of regression. For practical purposes, the modeling framework remains unchanged, with the only difference being the choice of loss function $ℓ$ appropriate to the type of outcome.

The deep neural network model²⁵ is a parametric function $f : \mathbb{R}^{p} \to \mathbb{R}^{m}$, defined as a composition of $L$ successive transformations (or layers) such that

Y_{i} = f (X_{i}; θ) = f^{(L)} \circ f^{(L - 1)} \circ \dots \circ f^{(1)} (X_{i}),

where $θ = \{ω^{(l)}, b^{(l)}\}_{l = 1}^{L}$ denotes the set of learnable parameters, with $ω^{(l)} \in \mathbb{R}^{d_{l} \times d_{l - 1}}$ and $b^{(l)} \in \mathbb{R}^{d_{l}}$. The input and output dimensions are given by $d_{0} = p$ and $d_{L} = m$, respectively, while the intermediate layer widths $d_{1}, \dots, d_{L - 1}$ are treated as hyperparameters of the network architecture. Each transformation $f^{(l)}$ corresponds to the operation performed by the $l$th layer:

f^{(l)} (z) = σ^{(l)} (ω^{(l)} z + b^{(l)}),

where $z \in \mathbb{R}^{d_{l - 1}}$ is the input vector for layer $l$, and $σ^{(l)} : \mathbb{R} \to \mathbb{R}$ denotes the activation function applied element-wise (or row-wise in the case of softmax). In this work, we use the ReLU function for all hidden layers ($l = 1, \dots, L - 1$):

σ^{(l)} (z_{j}) = \max (z_{j}, 0),

where $z_{j}$ denotes the $j$th coordinate of the vector $z$. For the output layer ($l = L$), we use the softmax function to produce a probability distribution on the output classes:

σ^{(L)} (z_{j}) = \frac{\exp (z_{j})}{\sum_{k = 1}^{d_{L}} \exp (z_{k})} .

We denote the intermediate activations (also called hidden representations) as

\begin{aligned} h^{(0)} & = X_{i}, \\ h^{(l)} & = f^{(l)} (h^{(l - 1)}), \quad \text{for } l = 1, \dots, L, \\ Y_{i} & = h^{(L)} . \end{aligned}
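The layer recursion above can be sketched in a few lines of NumPy. This is an illustrative forward pass only, not the authors' PyTorch implementation; the helper names (`init_params`, `forward`), the initialization scheme, and the layer widths are assumptions chosen for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row-wise maximum for numerical stability
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(layer_dims, seed=0):
    """Create (omega, b) for each layer; layer_dims = [d_0, d_1, ..., d_L]."""
    rng = np.random.default_rng(seed)
    params = []
    for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:]):
        omega = rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_out, d_in))
        params.append((omega, np.zeros(d_out)))
    return params

def forward(X, params):
    """Compute h^(L) for a batch X of shape (n, p): ReLU hidden layers, softmax output."""
    h = np.asarray(X, dtype=float)          # h^(0) = X
    n_layers = len(params)
    for l, (omega, b) in enumerate(params, start=1):
        z = h @ omega.T + b                 # omega^(l) h^(l-1) + b^(l), applied row by row
        h = softmax(z) if l == n_layers else relu(z)
    return h

# Hypothetical architecture: p = 4 inputs, two hidden layers of width 8, m = 2 classes
params = init_params([4, 8, 8, 2])
X = np.random.default_rng(1).normal(size=(5, 4))
probs = forward(X, params)                  # each row is a probability vector over the classes
```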

Given the network output $h^{(L)}$ and the observed label $Y_{i}$, we estimate the parameters of the model by minimizing a weighted loss function. Specifically, we adopt a weighted cross-entropy loss to account for survey weights. The set of parameters $θ$ is obtained by solving the following optimization problem:

\begin{aligned} \hat{θ} & = \arg \min_{θ} \sum_{i = 1}^{n} w_{i} \cdot ℓ (Y_{i}, \log f (X_{i}; θ)) \\ & = \arg \min_{θ} - \sum_{i = 1}^{n} \sum_{k = 1}^{K} \mathbb{I} \{Y_{i} = k\} \cdot w_{i} \cdot \log f_{k} (X_{i}; θ) . \end{aligned}

In the context of complex survey data, estimators that incorporate sampling weights—such as the Horvitz–Thompson estimator—are commonly used to produce design-consistent and unbiased estimates. By accounting for unequal selection probabilities, these methods improve both the efficiency and the representativeness of the resulting estimates. Consequently, our approach integrates the survey weights directly into the neural network training process by embedding them within the loss function.
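As an illustration of how the survey weights enter the loss, the sketch below evaluates the weighted cross-entropy for a small batch of predicted class probabilities. The function name and the toy values are assumptions. Labels are coded $0, \dots, K - 1$ here rather than $1, \dots, K$, and a small clipping constant is added for numerical safety.

```python
import numpy as np

def weighted_cross_entropy(probs, y, w, eps=1e-12):
    """Survey-weighted cross-entropy: -sum_i w_i * log f_{y_i}(x_i).

    probs : (n, K) array of predicted class probabilities
    y     : (n,) integer labels coded 0..K-1
    w     : (n,) survey weights
    """
    probs = np.asarray(probs, dtype=float)
    y = np.asarray(y, dtype=int)
    w = np.asarray(w, dtype=float)
    p_true = probs[np.arange(len(y)), y]        # probability assigned to the observed class
    return float(-np.sum(w * np.log(np.clip(p_true, eps, 1.0))))

# Toy batch (hypothetical values)
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
y = np.array([0, 1, 1])
w = np.array([2.0, 4.0, 10.0])
loss = weighted_cross_entropy(probs, y, w)
```

With all weights equal to one, the function reduces to the usual unweighted cross-entropy sum, so the weighting only changes how much each sampled unit contributes to the gradient.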

2.2. Computational details

For the $i$th unit of observation, we denote the predicted label as $\tilde{Y}_{i}$ and the estimated conditional probability of having diabetes as $\tilde{p}_{i} \approx P (Y_{i} = 1 \mid X_{i})$. Let $D$ denote the set of indices corresponding to subjects with diabetes, and $\bar{D}$ the set of indices for non-diabetic subjects. The performance metrics used to evaluate the model are summarized in Table 2. All metrics were calculated using survey weights to ensure unbiased and design-consistent population-level estimates.

Table 2.
Performance metrics for binary classification with survey weights.

Metric	Formula	Description
AUC	$\sum_{i \in D} \sum_{j \in \bar{D}} \frac{w_{i} w_{j}}{\sum_{i' \in D} \sum_{j' \in \bar{D}} w_{i'} w_{j'}} k (\tilde{p}_{i}, \tilde{p}_{j})$	Weighted area under the ROC curve. The kernel $k (\cdot, \cdot)$ compares predicted probabilities of positive and negative instances.
Accuracy	$\frac{TP + TN}{TP + TN + FP + FN}$	Proportion of correctly classified instances among all samples.
Recall	$\frac{TP}{TP + FN}$	Also known as sensitivity or true-positive rate. Measures the proportion of actual positives correctly identified.
Precision	$\frac{TP}{TP + FP}$	Also called positive predictive value. Measures the proportion of predicted positives that are true positives.
F1-score	$\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$	Harmonic mean of precision and recall. Balances both metrics in a single score.
Cross-entropy	$- \sum_{i} w_{i} [y_{i} \log (\tilde{p}_{i}) + (1 - y_{i}) \log (1 - \tilde{p}_{i})]$	Weighted log-loss function measuring the discrepancy between predicted probabilities and true labels.

AUC: area under the curve; ROC: receiver operating characteristic.
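The weighted metrics in Table 2 can be sketched as follows. Two points are assumptions made for this example rather than statements from the text: the kernel is taken to be $k (a, b) = 1$ if $a > b$, $0.5$ if $a = b$, and $0$ otherwise, and the confusion-matrix entries (TP, TN, FP, FN) are taken to be sums of survey weights rather than raw counts. The function names are hypothetical.

```python
import numpy as np

def weighted_auc(p, y, w):
    """Weighted AUC over all (positive, negative) pairs, with pair weight w_i * w_j."""
    p, y, w = (np.asarray(a, dtype=float) for a in (p, y, w))
    pos, neg = y == 1, y == 0
    diff = p[pos][:, None] - p[neg][None, :]
    k = (diff > 0) + 0.5 * (diff == 0)          # assumed kernel: 1, 0.5 (tie), or 0
    ww = w[pos][:, None] * w[neg][None, :]
    return float(np.sum(ww * k) / np.sum(ww))

def weighted_confusion_metrics(p, y, w, threshold=0.5):
    """Accuracy, recall, precision, and F1 with weighted confusion-matrix entries."""
    p, y, w = (np.asarray(a, dtype=float) for a in (p, y, w))
    pred = p >= threshold
    tp = np.sum(w[pred & (y == 1)])
    tn = np.sum(w[~pred & (y == 0)])
    fp = np.sum(w[pred & (y == 0)])
    fn = np.sum(w[~pred & (y == 1)])
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": float((tp + tn) / (tp + tn + fp + fn)),
        "recall": float(recall),
        "precision": float(precision),
        "f1": float(2 * precision * recall / (precision + recall)),
    }

# Toy data (hypothetical values)
p = [0.9, 0.4, 0.6, 0.2]
y = [1, 1, 0, 0]
w = [1.0, 2.0, 1.0, 3.0]
auc = weighted_auc(p, y, w)                     # (1 + 3 + 6) / 12
metrics = weighted_confusion_metrics(p, y, w)
```

The sketch omits guards against empty classes or empty predicted-positive sets, which a production implementation would need.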

Model parameters are optimized with the Adam optimizer,²⁶ a stochastic gradient-based method that provides adaptive learning rates for efficient convergence. Gradients are computed using PyTorch’s automatic differentiation capabilities.

As a parametric benchmark, we employed survey-weighted logistic regression models. All baseline metrics were computed using survey weights that matched those used in the neural-network evaluation.
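For readers who want a concrete picture of the parametric benchmark, the sketch below fits a survey-weighted logistic regression by maximizing the weighted log-likelihood with Newton–Raphson iterations. The text does not specify how the baseline was fitted, so the solver, the function names, and the simulated data are all assumptions. The sketch yields point estimates only; design-based standard errors would additionally require the strata and PSU information.

```python
import numpy as np

def predict_proba(X, beta):
    """Predicted P(Y = 1 | X) for coefficients beta = (intercept, slopes...)."""
    Xd = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

def fit_weighted_logistic(X, y, w, n_iter=50, ridge=1e-8):
    """Maximize sum_i w_i * loglik_i by Newton-Raphson (assumed solver)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (w * (p - y))                          # weighted score (negated)
        hess = Xd.T @ (Xd * (w * p * (1.0 - p))[:, None])    # weighted information matrix
        step = np.linalg.solve(hess + ridge * np.eye(len(beta)), grad)
        beta = beta - step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

# Simulated data with hypothetical survey weights
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=(n, 1))
prob = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x[:, 0])))
y = rng.binomial(1, prob)
w = rng.uniform(1.0, 5.0, size=n)
beta = fit_weighted_logistic(x, y, w)
p_hat = predict_proba(x, beta)
```

At the fitted coefficients, the weighted mean of the predicted probabilities matches the weighted prevalence of the outcome, which is the intercept's estimating equation.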

2.3. Uncertainty quantification for survey data via conformal prediction

2.3.1. Background on conformal prediction and uncertainty quantification

In predictive medical models, considerable uncertainty is common. Patient responses often vary over time and can show individualized patterns.^15,27 Although uncertainty is often perceived negatively, it can provide valuable information for clinical decision-making.¹⁵

First, it helps clarify the limitations of predictive models. Second, high uncertainty in patient outcomes can stimulate the development of novel pharmacological treatments or interventions. Third, uncertainty supports healthcare planning: patients with unpredictable clinical courses may require more frequent follow-up. Ultimately, estimating individual-level uncertainty is crucial to identifying atypical cases and to effectively allocating healthcare resources. Each of these factors underscores the value of uncertainty quantification in clinical practice. We now introduce conformal prediction models. To keep the explanation simple and clear, we begin with the case of continuous scalar outcomes.

Let $\{(X_{i}, Y_{i})\}_{i = 1}^{n}$ denote a random sample used to fit the statistical models (not necessarily independent and identically distributed), from which we construct a prediction region for $Y$ conditional on $X$. Following Barber et al.,²⁸ our goal is to develop a general conformal prediction strategy that goes beyond the assumption of exchangeability between observations and can be applied to survey data.

In modern prediction inference theory, one often seeks a finite-sample guarantee of the form

P (Y_{n + 1} \in \tilde{C}^{α} (X_{n + 1})) \geq 1 - α,

where $\tilde{C}^{α} (X_{n + 1})$ is a prediction interval constructed from the training data $D_{n}$, $(X_{n + 1}, Y_{n + 1})$ is a new random point, $P$ is the probability over $D_{n} \cup \{(X_{n + 1}, Y_{n + 1})\}$, and $1 - α$ is the desired coverage level (e.g. $α = 0.05$). Classical conformal prediction methods are notable because they achieve this marginal property in any finite-sample regime when the data are independent and identically distributed (i.i.d. setting), that is, without requiring asymptotic arguments or large sample sizes.

In complex survey data, however, this finite-sample validity guarantee typically does not hold, even if survey weights are directly incorporated into the conformal construction, and the marginal coverage may deviate substantially from the nominal level. Nevertheless, the predictive algorithm we propose for split data retains statistical consistency: as $n \to \infty$, the marginal coverage of $\tilde{C}^{α} (X_{n + 1})$ converges to the target level $1 - α$.

Let

D_{n} = \{(X_{i}, Y_{i})\}_{i = 1}^{n} \subset \mathbb{R}^{p} \times \mathbb{R}

denote a random sample, where the observations $(X_{i}, Y_{i})$ are assumed to be i.i.d. for simplicity. Given a new i.i.d. observation $(X_{n + 1}, Y_{n + 1})$, conformal prediction, introduced by Vovk et al.,²⁹ offers a general set of procedures for constructing prediction intervals that are valid in finite samples and do not depend on the choice of regression algorithm.

Assume a regression algorithm $A$ that, given a dataset $D_{n} = \{(X_{i}, Y_{i})\}_{i = 1}^{n}$, produces a fitted regression function $\tilde{m} : \mathbb{R}^{p} \to \mathbb{R}$. That is, the algorithm takes as input the predictors $X_{i}$ and responses $Y_{i}$ from the training sample and outputs a function $\tilde{m}$ mapping any new covariate vector $X \in \mathbb{R}^{p}$ to a predicted outcome $\tilde{m} (X) \in \mathbb{R}$. Here, $\mathbb{R}^{p}$ denotes the predictor space and $\mathbb{R}$ the outcome space.

The algorithm $A$ must treat the data points symmetrically, that is,

A ((x_{π (1)}, y_{π (1)}), \dots, (x_{π (n)}, y_{π (n)})) = A ((x_{1}, y_{1}), \dots, (x_{n}, y_{n})) \tag{1}

for all $n \geq 1$, all permutations $π$ of $[n] = \{1, \dots, n\}$, and all datasets $\{(x_{i}, y_{i})\}_{i = 1}^{n}$. For each $y \in \mathbb{R}$, let

\tilde{m}^{y} = A ((X_{1}, Y_{1}), \dots, (X_{n}, Y_{n}), (X_{n + 1}, y))

denote the regression model fitted with the additional point $(X_{n + 1}, y)$. Define the conformity scores as

S_{i}^{y} = \begin{cases} | Y_{i} - \tilde{m}^{y} (X_{i}) |, & i = 1, \dots, n, \\ | y - \tilde{m}^{y} (X_{n + 1}) |, & i = n + 1, \end{cases} \tag{2}

where $\tilde{m}^{y}$ denotes the fitted regression function trained on the augmented dataset that includes the candidate point $(X_{n + 1}, y)$. Then, the conformal prediction interval for $X_{n + 1}$ is

\tilde{C}^{α} (X_{n + 1}; D_{n}) = \{y \in \mathbb{R} : S_{n + 1}^{y} \leq \mathrm{quant}_{1 - α} (\tilde{F}_{n + 1})\}, \tag{3}

where $\tilde{F}_{n + 1}$ denotes the empirical distribution of the conformity scores $\{S_{i}^{y}\}_{i = 1}^{n + 1}$, that is,

\tilde{F}_{n + 1} (r) = \frac{1}{n + 1} \sum_{i = 1}^{n + 1} \mathbf{1} \{S_{i}^{y} \leq r\} .

Here, $\mathrm{quant}_{1 - α} (\tilde{F}_{n + 1})$ is the empirical $(1 - α)$-quantile of the conformity scores $\{S_{i}^{y}\}_{i = 1}^{n + 1}$.

The full conformal method provides distribution-free coverage guarantees:

Theorem 1 Full Conformal Prediction²⁹

If the data points $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n}), (X_{n + 1}, Y_{n + 1})$ are i.i.d. (or more generally, exchangeable), and the algorithm treats data symmetrically as in (1), then the conformal prediction set (3) satisfies

P (Y_{n + 1} \in {\tilde{C}}^{α} (X_{n + 1}; D_{n})) \geq 1 - α .

The same guarantee holds for the split conformal method, which divides the data into a training sample, used to estimate the regression function, and a calibration sample, used to estimate the quantile of the conformity scores.

Conformal inference (both split and full) is often described in terms of conformity scores $\{S_{i}^{y}\}_{i = 1}^{n + 1}$, which quantify how well each data point aligns with a fitted regression function. In the standard case, the conformity score reduces to the absolute residual:

S_{i}^{y} = | Y_{i} - \tilde{m} (X_{i}) |, \quad i = 1, \dots, n, \tag{4}

with the candidate point scored as

S_{n + 1}^{y} = | y - \tilde{m} (X_{n + 1}) | .

Here, $\tilde{m}$ denotes the regression function fitted on the appropriate subset of the training data (in the split method) or the expanded dataset including $(X_{n + 1}, y)$ (in the full method).
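The split conformal method with absolute-residual scores can be sketched as follows in the unweighted i.i.d. setting. The function name is hypothetical, and a least-squares line stands in for the neural network purely to keep the example short. The rank $\lceil (n + 1)(1 - α) \rceil$ corresponds to taking the quantile of the $n + 1$ scores with the unknown test score treated as $+ \infty$.

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, X_new, alpha=0.05):
    """Prediction intervals m(x) +/- q, with q calibrated on held-out residuals."""
    scores = np.abs(np.asarray(y_cal, dtype=float) - predict(X_cal))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))       # finite-sample corrected rank
    q = np.sort(scores)[k - 1] if k <= n else np.inf
    pred = predict(X_new)
    return pred - q, pred + q

# Simulated i.i.d. data (hypothetical model: y = 2x + noise)
rng = np.random.default_rng(0)

def simulate(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, 2.0 * x + rng.normal(scale=0.5, size=n)

x_tr, y_tr = simulate(500)      # training split: fit the regression function
x_cal, y_cal = simulate(1000)   # calibration split: compute conformity scores
x_te, y_te = simulate(5000)     # fresh points used to check coverage

slope, intercept = np.polyfit(x_tr, y_tr, deg=1)
predict = lambda x: slope * np.asarray(x, dtype=float) + intercept

lo, hi = split_conformal_interval(predict, x_cal, y_cal, x_te, alpha=0.1)
coverage = float(np.mean((y_te >= lo) & (y_te <= hi)))   # should be close to 0.9
```

Because the calibration residuals are computed on data not used for fitting, the interval width adapts to the actual out-of-sample error of the fitted model.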

2.3.2. NHANES study design

The NHANES cohorts use a sophisticated multistage probability sampling design to represent the US civilian non-institutionalized population. Sampling begins with Primary Sampling Units (PSUs), typically counties or groups of counties, selected using a probability proportional to size. PSUs are subdivided, households are sampled, and individuals are chosen using stratification criteria such as age, sex, and race/ethnicity.

However, demographic and geographic heterogeneity between states, as well as differences in ethnic makeup, socioeconomic status, and health outcomes, may violate the exchangeability assumption. This poses challenges for applying conformal prediction algorithms that rely on exchangeability or i.i.d. assumptions for finite-sample validity.

2.3.3. Conformal prediction under covariate shift for survey data

We depart from exchangeability and assume a covariate-shift framework in which the conditional outcome law is stable between the observed (sample) and target (population) domains, while the covariate distribution may change. Formally,

\begin{aligned} (X_{1}, Y_{1}), \dots, (X_{n}, Y_{n}) & \sim P = P_{X} \cdot P_{Y | X}, \\ (X_{n + 1}, Y_{n + 1}) & \sim \tilde{P} = {\tilde{P}}_{X} \cdot P_{Y | X}, \end{aligned}

so that the marginal distributions of the covariates, $P_{X}$ and $\tilde{P}_{X}$, may differ, while the conditional law $P_{Y | X}$ remains the same.

To capture this discrepancy, we define the weighting function

w (x) = \frac{{\tilde{p}}_{X} (x)}{p_{X} (x)},

where $p_{X}$ and $\tilde{p}_{X}$ denote the densities of the covariate distributions $P_{X}$ and $\tilde{P}_{X}$, respectively, with respect to a common dominating measure (e.g. Lebesgue measure for continuous variables or counting measure for discrete variables).

This function $w (x)$, often referred to as the density ratio, quantifies how the target covariate distribution differs from the sample distribution and therefore depends only on the covariates $x$.

In the context of complex surveys, such as the NHANES, this weighting function is known by design. The survey designers derive a weight $w_{i}$ for each sampled unit; these weights are constructed to be proportional to the inverse inclusion probability and are subsequently adjusted for non-response and calibration. Thus, $w_{i}$ serves as an empirical realization of $w (X_{i})$, and the same survey mechanism determines $w (x)$ for a new prediction point, either directly from published tables or through the documented weighting rules. A standard assumption in this framework is that sampling is non-informative given $X$, meaning that inclusion may depend on covariates but not on the outcome conditional on covariates, so that $P_{Y | X}$ is common to $P$ and $\tilde{P}$.

To construct weighted conformal sets, let $\tilde{m}$ be a predictor fitted to the training data and define residual conformity scores

\begin{aligned} S_{i} & = | Y_{i} - \tilde{m} (X_{i}) |, \quad i = 1, \dots, n, \\ S_{n + 1} (y) & = | y - \tilde{m} (X_{n + 1}) | . \end{aligned}

Let $W (x) = \sum_{j = 1}^{n} w (X_{j}) + w (x)$. The weighted empirical distribution of residuals, including the test point, is

\tilde{F}_{n + 1}^{w} (s; x) = \frac{1}{W (x)} \left( \sum_{i = 1}^{n} w (X_{i}) \mathbf{1} \{S_{i} \leq s\} + w (x) \mathbf{1} \{S_{n + 1} (y) \leq s\} \right) .

Define the weighted quantile

Q_{1 - α}^{w} (x) = \inf \{s \in \mathbb{R}_{+} : \tilde{F}_{n + 1}^{w} (s; x) \geq 1 - α\},

and the conformal prediction set,

\tilde{C}^{α} (x) = \{y \in \mathbb{R} : S_{n + 1} (y) \leq Q_{1 - α}^{w} (x)\} .
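A minimal sketch of the weighted quantile is given below. It follows the common practical convention of placing the test point's mass $w (x)$ at $+ \infty$, which is an assumption of this example: the quantile is computed from the calibration scores alone and is never smaller than the one obtained from the definition above. The function names and toy values are hypothetical.

```python
import numpy as np

def weighted_conformal_quantile(scores, w_cal, w_new, alpha=0.05):
    """Smallest s with weighted CDF >= 1 - alpha; the test point's mass sits at +inf."""
    scores = np.asarray(scores, dtype=float)
    w_cal = np.asarray(w_cal, dtype=float)
    order = np.argsort(scores)
    total = w_cal.sum() + w_new                   # W(x) = sum_j w(X_j) + w(x)
    cum = np.cumsum(w_cal[order]) / total         # weighted CDF at each sorted score
    idx = int(np.searchsorted(cum, 1 - alpha, side="left"))
    return float(scores[order][idx]) if idx < len(scores) else float("inf")

def weighted_conformal_interval(pred_new, q):
    """Interval {y : |y - m(x)| <= q} for a point prediction pred_new = m(x)."""
    return pred_new - q, pred_new + q

# Toy calibration residuals (hypothetical values)
scores = [1.0, 2.0, 3.0, 4.0]
q_equal = weighted_conformal_quantile(scores, [1, 1, 1, 1], w_new=1, alpha=0.5)
q_heavy = weighted_conformal_quantile(scores, [1, 1, 1, 7], w_new=1, alpha=0.5)
lo, hi = weighted_conformal_interval(10.0, q_heavy)
```

With equal weights the result coincides with the unweighted split conformal quantile; giving a large weight to the unit with the largest residual widens the interval, as expected when that unit represents a large share of the target population.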

Theorem 2 Tibshirani et al.³⁰

Consider the data-generating scheme with a common conditional law $P_{Y | X}$ across the sample and target domains, and assume that both $P_{X}$ and $\tilde{P}_{X}$ admit densities $p_{X}$ and $\tilde{p}_{X}$ with respect to a dominating measure such that $\tilde{p}_{X} (x) > 0 \Rightarrow p_{X} (x) > 0$. Define

w (x) = \frac{\tilde{p}_{X} (x)}{p_{X} (x)} .

Suppose further that: (i) sampling is non-informative given $X$ (inclusion may depend on $X$ but not on $Y$ conditional on $X$); (ii) the design weights satisfy $w_{i} \propto w (X_{i})$ up to a common scaling constant (the construction of $\tilde{F}_{n + 1}^{w}$ is invariant to positive rescaling); and (iii) ties in quantiles are resolved via randomization.

Then,

P (Y_{n + 1} \in \tilde{C}^{α} (X_{n + 1})) \geq 1 - α,

where the probability is taken over $(X_{1 : n}, Y_{1 : n}) \sim P$ and $(X_{n + 1}, Y_{n + 1}) \sim \tilde{P}$.

For the classification case, which is the setting of the case study, we adapt the quantile classification algorithm of Cauchois et al.³¹ to handle survey weights and covariate shift (see Algorithm 1 for further details).

2.3.4. Conformal prediction beyond exchangeability

More generally, Barber et al.²⁸ introduced conformal methods for non-exchangeable data. A key concept is the coverage gap, which quantifies the deviation from the nominal level:

Coverage gap = (1 - α) - P (Y_{n + 1} \in {\tilde{C}}^{α} (X_{n + 1})) .

Let $Z = (Z_{1}, \dots, Z_{n + 1})$ denote the full data and $Z^{i}$ the same data with the $i$th point swapped with $(X_{n + 1}, Y_{n + 1})$. Using weights $w_{i} \in [0, 1]$ assigned to each point, the coverage gap can be bounded by:

Coverage gap \leq \frac{\sum_{i = 1}^{n} w_{i} \cdot d_{TV} (Z, Z^{i})}{1 + \sum_{i = 1}^{n} w_{i}},

where $d_{TV} (\cdot, \cdot)$ is the total variation distance.
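As a small numerical illustration of this bound, the sketch below evaluates its right-hand side for given weights and total variation distances. The function name and the input values used in the usage note are hypothetical; in practice the distances $d_{TV} (Z, Z^{i})$ are generally unknown and would have to be bounded or approximated.

```python
def coverage_gap_bound(weights, tv_distances):
    """Right-hand side of the bound: sum_i w_i * d_i / (1 + sum_i w_i)."""
    if len(weights) != len(tv_distances):
        raise ValueError("weights and tv_distances must have equal length")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("weights must lie in [0, 1]")
    numerator = sum(w * d for w, d in zip(weights, tv_distances))
    return numerator / (1.0 + sum(weights))
```

For example, with two points of weight 1 and distances 0.1 and 0.3, the bound equals $0.4 / 3 \approx 0.133$; when all distances are zero (exchangeable data), the bound is zero.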

In practice, this gap quantifies the extent to which complex survey sampling deviates from the ideal assumption of exchangeability. To formalize inference, it is common to adopt a super-population framework, where the finite population (such as the US residents captured by the NHANES at a given time) is viewed as one realization from a much larger, conceptual population distribution. Intuitively, we imagine that the observed sample comes from a very large population generated by the same underlying process. Within this framework, if the estimators are statistically consistent, the empirical coverage gap decreases with increasing sample size and converges to zero as the sample size grows, $n \to \infty$. This reflects the fact that, although the survey design breaks strict exchangeability in finite samples, its impact diminishes with larger $n$ and vanishes asymptotically, restoring nominal coverage.

3. Simulation study

The objective of this section is to verify and validate the implementation of the neural network and to compare its performance against a baseline linear logistic regression method. This comparison aims to highlight the flexibility of NNs when the logistic model is misspecified, particularly in settings where the true interaction structure between variables is unknown and the relationships are not merely additive and linear. The survey sampling scenarios considered here should not be interpreted as an attempt to exactly reproduce the NHANES design. Rather, they serve as a structured, transparent framework for assessing the robustness of the neural network approach under complex survey sampling conditions.

To this end, we conducted a simulation study employing a two-stage cluster sampling design, based on two distinct generative models for binary regression, where $Y \in \{0, 1\}$. In both scenarios, the predictor variable is defined as $X = (X_{1}, \dots, X_{5})^{⊤} \in R^{5}$, with $X \sim N_{5} (μ, Σ)$, where $N_{5} (μ, Σ)$ denotes a five-dimensional multivariate normal distribution with mean vector $μ$ and covariance matrix $Σ$. For simplicity, we set $μ = 0_{5 \times 1}$, a five-dimensional zero vector, and $Σ = I_{5 \times 5}$, the $5 \times 5$ identity matrix.

The conditional distribution of the response variable is given by $Y ∣ X \sim Ber (π (X))$ , where $π (X) = P (Y = 1 ∣ X) \in [0, 1]$ is the success probability, specified through the following two functional forms:

(a)
$π (X) = \frac{1}{1 + \exp (- 2 + \sum_{i = 1}^{3} X_{i} + X_{4} X_{5})}$ .
(b)
$π (X) = \frac{1}{1 + \exp (- 3 + \sum_{i = 1}^{3} X_{i})}$ .
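The two generative models can be sketched as follows. This is an illustrative sketch only, not the code used in the study: the function names and the seed are hypothetical, and the linear predictor is entered in the exponential exactly as printed in (a) and (b).

```python
import math
import random

def success_probability(x, scenario):
    """pi(X) for scenario 'a' (with X4*X5 interaction) or 'b' (main effects)."""
    main_effects = x[0] + x[1] + x[2]
    if scenario == "a":
        eta = -2.0 + main_effects + x[3] * x[4]
    elif scenario == "b":
        eta = -3.0 + main_effects
    else:
        raise ValueError("scenario must be 'a' or 'b'")
    return 1.0 / (1.0 + math.exp(eta))

def simulate(n, scenario, seed=0):
    """Draw n pairs (X, Y) with X ~ N_5(0, I) and Y | X ~ Ber(pi(X))."""
    rng = random.Random(seed)  # hypothetical fixed seed for reproducibility
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(5)]
        y = 1 if rng.random() < success_probability(x, scenario) else 0
        data.append((x, y))
    return data
```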

The experimental design is based on a hierarchical cluster structure. In the first stage, the groups correspond to states, each of which is independently selected with probability $4 / 5$ . In the second stage, clusters correspond to cities within the selected states, where each city in state $i$ is chosen with equal probability $1 / n_{i}$ , with $n_{i}$ denoting the total number of cities in state $i$ . Finally, within the sampled city, all patients are assumed to have the same probability of selection. Consequently, the overall probability of selecting an individual $r$ from the city $j$ in the state $i$ is
$π_{r} = \frac{4}{5} \cdot \frac{1}{n_{i}} \cdot \frac{1}{N_{i j}},$
where $N_{i j}$ is the number of individuals in the city $j$ of the state $i$ . The sampling mechanism and computation of the selection probabilities were implemented using the survey package in R.
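For illustration, the selection probability $π_{r}$ and the corresponding design weight (its inverse) can be computed as in the sketch below. The study itself used the survey package in R; this Python fragment, with hypothetical function names, only reproduces the arithmetic of the formula.

```python
def inclusion_probability(n_cities_in_state, n_individuals_in_city, p_state=4 / 5):
    """pi_r = P(state selected) * P(city | state) * P(individual | city)."""
    if n_cities_in_state <= 0 or n_individuals_in_city <= 0:
        raise ValueError("counts must be positive")
    return p_state * (1.0 / n_cities_in_state) * (1.0 / n_individuals_in_city)

def design_weight(n_cities_in_state, n_individuals_in_city, p_state=4 / 5):
    """Design weight defined as the inverse inclusion probability."""
    return 1.0 / inclusion_probability(n_cities_in_state, n_individuals_in_city, p_state)
```

For instance, an individual in a city of 50 people within a state of 4 cities has $π_{r} = 0.8 \cdot 0.25 \cdot 0.02 = 0.004$ and a design weight of 250.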

For each generative model (a and b), we replicated the experimental design 500 times ($b = 1, \dots, B = 500$) at varying sample sizes $n \in \{5000, 10,000, 20,000\}$. Performance was evaluated using several metrics: (i) area under the curve (AUC), (ii) accuracy, (iii) recall, (iv) precision, (v) $F 1$-score, and (vi) cross-entropy. Comparisons were drawn between the neural network approach and a competing method, specifically a survey-weighted logistic regression model.

The results, reported as mean $\pm$ standard deviation across the simulations, are summarized in Table 3. As expected, in scenario (b), where the underlying logistic regression model represents the ground truth, the accuracy of the neural network and the parametric logistic regression models are similar, particularly at larger sample sizes. However, in scenario (a), characterized by non-linear interactions, the neural network consistently outperforms the logistic model. Although the specific interaction $X_{4} X_{5}$ in scenario (a) could, in principle, be incorporated into a logistic regression model if known a priori, we deliberately restricted the logistic specification to main effects only, to emulate a realistic setting where the true interaction structure is unknown.

Table 3.
Summary of simulation results across scenarios (a) and (b) with varied sample sizes $N \in {5000, 10,000, 20,000}$ for the logistic regression (LR) and the multilayer perceptron (MLP) neural network.

		Scenario (a)			Scenario (b)
		$N = 5000$	$N = 10,000$	$N = 20,000$	$N = 5000$	$N = 10,000$	$N = 20,000$
LR	AUC	$0.9325 \pm 0.0182$	$0.9328 \pm 0.0161$	$0.9332 \pm 0.0142$	$0.9998 \pm 0.0011$	$0.9998 \pm 0.0008$	$0.9999 \pm 0.0007$
	Accuracy	$0.9039 \pm 0.0178$	$0.9041 \pm 0.0153$	$0.9048 \pm 0.0135$	$0.9980 \pm 0.0030$	$0.9985 \pm 0.0024$	$0.9989 \pm 0.0021$
	Precision	$0.8438 \pm 0.04458$	$0.8485 \pm 0.0390$	$0.8506 \pm 0.0344$	$0.9968 \pm 0.0071$	$0.9976 \pm 0.0057$	$0.9982 \pm 0.0048$
	Recall	$0.7660 \pm 0.0530$	$0.7675 \pm 0.0457$	$0.7692 \pm 0.0405$	$0.9970 \pm 0.0065$	$0.9978 \pm 0.0051$	$0.9983 \pm 0.0043$
	$F 1$ -score	$0.8017 \pm 0.0381$	$0.8050 \pm 0.0328$	$0.8070 \pm 0.0290$	$0.9969 \pm 0.0048$	$0.9977 \pm 0.0039$	$0.9982 \pm 0.0033$
	Cross-entropy	$0.2791 \pm 0.0368$	$0.2801 \pm 0.0325$	$0.2796 \pm 0.0288$	$0.0218 \pm 0.0373$	$0.0164 \pm 0.0301$	$0.0127 \pm 0.0257$
MLP	AUC	$0.9843 \pm 0.0096$	$0.9892 \pm 0.0087$	$0.9920 \pm 0.0081$	$0.9998 \pm 0.0003$	$0.9999 \pm 0.0003$	$0.9999 \pm 0.0002$
	Accuracy	$0.9495 \pm 0.0160$	$0.9580 \pm 0.0157$	$0.9644 \pm 0.0159$	$0.9923 \pm 0.0061$	$0.9940 \pm 0.0050$	$0.9949 \pm 0.0045$
	Precision	$0.9244 \pm 0.0370$	$0.9383 \pm 0.0336$	$0.9465 \pm 0.0312$	$0.9880 \pm 0.0135$	$0.9906 \pm 0.0113$	$0.9921 \pm 0.0099$
	Recall	$0.8741 \pm 0.0501$	$0.8969 \pm 0.0481$	$0.9878 \pm 0.0141$	$0.9878 \pm 0.0141$	$0.9904 \pm 0.0115$	$0.9920 \pm 0.0101$
	$F 1$ -score	$0.8975 \pm 0.0331$	$0.9163 \pm 0.0331$	$0.9294 \pm 0.0333$	$0.9878 \pm 0.0096$	$0.9904 \pm 0.0080$	$0.9920 \pm 0.0071$
	Cross-entropy	$0.1423 \pm 0.0340$	$0.1185 \pm 0.0360$	$0.1014 \pm 0.0384$	$0.0212 \pm 0.0097$	$0.0166 \pm 0.0089$	$0.0137 \pm 0.0085$

NHANES: National Health and Nutrition Examination Survey.

Mean and standard deviation are reported for the following performance metrics: (i) area under the curve (AUC); (ii) accuracy; (iii) recall; (iv) precision; (v) $F 1$ -score; and (vi) binary cross-entropy. All reported performance metrics are survey-weighted estimates based on the NHANES sampling design.

Consistency is observed across performance metrics for fixed sample sizes. Generally, increasing the sample size $n$ improves precision, reduces errors, and produces more stable results with lower variability.

3.1. Uncertainty quantification

In the context of conformal prediction for survey data, we focus again on generative models from simulation scenarios (a) and (b). However, the emphasis here is on evaluating model performance using empirical marginal coverage at different miscoverage levels, $α \in \{0.10, 0.20, 0.40\}$.

In each simulation run $b = 1, \dots, B$ , we generate an independent test sample $D_{test}$ of size $N = 5000$ . For every observation $(X_{i}, Y_{i}) \in D_{test}$ , we construct the prediction region ${\tilde{C}}_{b}^{α} (X_{i})$ corresponding to miscoverage rate $α$ , and record whether the true response is covered, that is,

1 {Y_{i} \in {\tilde{C}}_{b}^{α} (X_{i})} .

The empirical marginal coverage in simulation $b$ is then calculated as

{\hat{C}}^{(b)} (α) = \frac{1}{N} \sum_{i = 1}^{N} 1 {Y_{i} \in {\tilde{C}}_{b}^{α} (X_{i})} .

Repeating this procedure over $B$ replications produces a distribution of coverage estimates $\{{\hat{C}}^{(b)} (α)\}_{b = 1}^{B}$ for each nominal level $1 - α$. We summarize this distribution using boxplots (Figures 1 and 2), stratified by simulation scenario and sample size.
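The coverage computation for a single replication can be sketched as follows, assuming (for illustration only) that each prediction region for the binary response is represented as a Python set of labels; the function name is hypothetical.

```python
def empirical_coverage(labels, prediction_sets):
    """Fraction of test points whose true label lies in its prediction set."""
    if len(labels) != len(prediction_sets) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for y, region in zip(labels, prediction_sets) if y in region)
    return hits / len(labels)
```

For example, with labels `[0, 1, 1, 0]` and prediction sets `[{0}, {0, 1}, {0}, {0, 1}]`, three of the four labels are covered, giving an empirical coverage of 0.75. Applying the function to each of the $B$ test samples yields the values summarized in the boxplots.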

Figure 1.

Results of conformal simulation exercise for the generative model (a).

Figure 2.

Results of conformal simulation exercise for the generative model (b).

Our findings highlight the finite-sample validity of conformal prediction, which confirms that $P (Y \in {\tilde{C}}^{α} (X)) \geq 1 - α$ . As expected, the boxplots show convergence toward the nominal levels as the sample size increases, accompanied by a reduction in variability, thereby demonstrating stable and reliable coverage performance across the two simulation scenarios.

4. NHANES diabetes case study

4.1. Literature review: Predictive models for diabetes risk stratification

Various predictive models have been developed to stratify diabetes risk, including the Finnish (FINDRISC)³² and German (GDRS)³³ diabetes scores, formulated over a decade ago. These scores use logistic regression to predict the 10-year risk of diabetes or employ survival analysis techniques, such as Cox regression, to estimate the time to diabetes onset. They are based on easily accessible variables, such as age, sex, anthropometric measures, lifestyle, family medical history, and medication use.

Recent approaches have employed machine learning (ML) methodologies to predict the progression of diabetes in individuals with healthy or pre-diabetic states, yielding promising results.^34,35 However, these studies are primarily based on observational data, often exclude subjects with incomplete data, and thus warrant caution when applying these results clinically.

Our approach differs in that it focuses on predicting diabetes status at a specific point in time using current patient characteristics. This approach offers statistical advantages, increases case numbers, and mitigates the imbalance issues commonly encountered in longitudinal prediction models. Additionally, we employ a survey sampling design, enabling robust, generalizable predictions for the US population.

4.2. NHANES 2011–2014 data

To train and validate the neural network diabetes prediction model for the first time, we used data from the 2011–2014 NHANES waves,³⁶ a comprehensive survey targeting the civilian, non-institutionalized US population.

Data collection involved both interviews and clinical examinations. The interviews collected demographic, health, and nutritional information, while clinical examinations included physical measurements, blood pressure assessment, dental exams, and collection of blood and urine samples for laboratory analysis. The dataset analyzed comprises 5011 individuals aged 20 to 80 years (see Table 4 for more details on the demographic, anthropometric, and biochemical characteristics of the cohort).

Table 4.
Clinical characteristics of diabetes and non-diabetes patients.

Variable	Diabetes ( $32 %$ )	No diabetes ( $68 %$ )
Age (years)	$54.1 \pm 15.2$	$44.8 \pm 16.2$
Weight (kg)	$88.3 \pm 22.5$	$80.8 \pm 19.3$
Height (cm)	$169.1 \pm 10.1$	$169.4 \pm 9.6$
BMI (kg/m $^{2}$ )	$30.8 \pm 7.13$	$28.08 \pm 6.16$
Waist circumference (cm)	$96.9 \pm 14.9$	$105.6 \pm 16.8$
Diastolic blood pressure (mmHg)	$71.1 \pm 12.2$	$70.9 \pm 11.2$
Systolic blood pressure (mmHg)	$126.0 \pm 16.5$	$119.2 \pm 15.5$
Pulse (bpm)	$73.7 \pm 12.1$	$71.5 \pm 11.36$
Cholesterol (mmol/L)	$5.0 \pm 1.1$	$4.98 \pm 1.0$
Triglycerides (mmol/L)	$2.09 \pm 1.7$	$1.55 \pm 1.06$
Gender (male–female)	$47 %$ – $53 %$	$50 %$ – $50 %$
Glucose (mg/dL)	$123 \pm 48$	$89 \pm 10$
HbA1c (%)	$6.22 \pm 1.35$	$5.33 \pm 0.35$

BMI: body mass index; HbA1c: glycosylated hemoglobin.

The key variables considered in our study include age (categorical and continuous), race, gender, cancer or diabetes diagnoses (categorical), blood pressure, grip strength, body mass index (BMI), and biochemical biomarkers such as cholesterol and triglycerides (continuous).

Race categories were encoded as follows: 1 $=$ Mexican American, 2 $=$ Other Hispanic, 3 $=$ Non-Hispanic White, 4 $=$ Non-Hispanic Black, 5 $=$ Non-Hispanic Asian, and 6 $=$ Other Race, including multiracial. Table 4 summarizes these variables.

We used the NHANES 2011–2014 because these adjacent 2-year cycles provide: (i) comprehensive availability of key diagnostic biomarkers—FPG and HbA1c—measured under stable, well-documented laboratory protocols; (ii) the ability to pool cycles using standard analytic weights to increase precision, as recommended by the NHANES analytic guidelines; and (iii) a pre-pandemic, methodologically homogeneous window. Later data collections were interrupted by the COVID-19 suspension (2019–2020) and subsequently combined with 2017–2018 into a special file “2017-March 2020 pre-pandemic” with custom weighting, complicating direct comparability. This design choice maximizes internal consistency while maintaining population representativeness.^36–38

We also selected this cohort because it includes measures of physical activity, in addition to other variables related to functional status, allowing the development of more complex predictive models. This breadth of information makes the 2011–2014 cycles preferable to others that lack comparable medical examination data.

4.3. Diabetes definition

Our analysis involved 5011 participants, each assigned survey weights $w_{i}$ ( $i = 1, \dots, n = 5011$ ). The diabetes status was defined as binary, with “1” indicating the presence of diabetes and “0” its absence.

Diagnostic criteria for type-2 diabetes, according to the ADA guidelines,³⁹ include:

FPG $\geq$ 126 mg/dL (7.0 mmol/L), or

HbA1c $\geq$ 6.5% (48 mmol/mol), or

previous medical diagnosis of diabetes.
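The three criteria above translate into a simple labeling rule, sketched below. The function and argument names are hypothetical, and the treatment of missing laboratory values (a `None` measurement is simply skipped) is an assumption of this sketch rather than a documented choice of the study.

```python
def diabetes_label(fpg_mg_dl, hba1c_percent, prior_diagnosis):
    """Return 1 if any criterion is met, 0 otherwise.

    Missing measurements are passed as None and ignored (assumption).
    """
    if prior_diagnosis:
        return 1
    if fpg_mg_dl is not None and fpg_mg_dl >= 126.0:
        return 1
    if hba1c_percent is not None and hba1c_percent >= 6.5:
        return 1
    return 0
```

Note that a participant with a previous diagnosis is labeled 1 even when both FPG and HbA1c fall in the non-diabetic range.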

Although our models incorporate FPG and HbA1c, perfect predictive precision remains unattainable due to individuals previously diagnosed with diabetes who currently exhibit measurements in the non-diabetic range.

4.4. Experimental setup

The dataset was restricted to participants aged 20 to 80 years from the NHANES 2011–2014 survey. The dataset was randomly partitioned into $64 %$ training, $16 %$ validation, and $20 %$ testing subsets, using stratified sampling to preserve the class distribution of diabetic and non-diabetic cases across all splits.

Continuous predictors were evaluated both on the raw scale and after standardization (using the mean and standard deviation from the training split only). As no practical differences in predictive performance were observed between the two approaches, the reported experiments use the raw (non-standardized) values.

Categorical predictors were encoded as one-hot vectors, and this encoding was consistently applied across the training, validation, and test sets to ensure alignment of reference categories.

The NHANES survey weights were incorporated throughout the workflow: in model training via a weighted binary cross-entropy loss and in evaluation via survey-weighted performance metrics (AUC, accuracy, precision, recall, $F 1$ -score, and cross-entropy).
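A minimal sketch of how survey weights can enter both the loss and the evaluation metrics is given below. The function names are hypothetical, and normalizing by the total weight is an assumption of this sketch; the exact form of the loss used in the study is the one defined in the methodology section.

```python
import math

def weighted_bce(y, p, w, eps=1e-12):
    """Survey-weighted binary cross-entropy, normalized by the total weight."""
    loss = 0.0
    for yi, pi, wi in zip(y, p, w):
        pi = min(max(pi, eps), 1.0 - eps)  # guard against log(0)
        loss -= wi * (yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi))
    return loss / sum(w)

def weighted_classification_metrics(y, p, w, threshold=0.5):
    """Weighted accuracy, precision, recall, and F1 at a fixed threshold."""
    tp = fp = tn = fn = 0.0
    for yi, pi, wi in zip(y, p, w):
        pred = 1 if pi >= threshold else 0
        if pred == 1 and yi == 1:
            tp += wi
        elif pred == 1 and yi == 0:
            fp += wi
        elif pred == 0 and yi == 0:
            tn += wi
        else:
            fn += wi
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return {"accuracy": (tp + tn) / (tp + fp + tn + fn),
            "precision": precision, "recall": recall, "f1": f1}
```

Each cell of the confusion matrix accumulates weights rather than counts, so the resulting metrics estimate population-level rather than sample-level quantities.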

To ensure robust evaluation and reduce estimation bias, $k$ -fold cross-validation ( $k = 5$ ) was applied within the training split to tune hyperparameters, following the established practices in machine learning.⁴⁰ For each fold, the models were trained on $k - 1$ folds and validated on the remaining fold, with early stopping based on weighted validation loss and a patience parameter of $10$ epochs. Once the optimal hyperparameters were selected, the model was retrained on the combined training and validation sets and evaluated on the held-out test set.
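The early-stopping rule with a patience of 10 epochs can be sketched as below. This is a generic illustration with hypothetical names, operating on a pre-computed sequence of weighted validation losses rather than on an actual training loop.

```python
import math

def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both zero-based.

    Training stops once `patience` consecutive epochs pass without
    improving on the best validation loss seen so far.
    """
    best_loss = math.inf
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch
```

In a typical setup, the model parameters saved at `best_epoch` would be the ones retained for that fold.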

Model optimization was performed using the Adam algorithm,²⁶ a stochastic gradient-based optimizer that adaptively estimates first- and second-order moments of the gradients, providing robust convergence across a wide range of learning rates and network architectures. Because the loss function of the neural network is non-convex, the fitted model may depend on the choice of optimizer and the random initialization of the weights. To ensure reproducibility, all random seeds were fixed, and the same optimizer was used consistently throughout the experiments.

The neural network was trained using a grid search over multiple hyperparameter configurations for each scenario. Three architectures for hidden layers were evaluated, with layer sizes set at [ $32$ , $16$ , $8$ ], [ $64$ , $32$ , $16$ ], and [ $128$ , $64$ , $32$ ], respectively. Batch sizes were varied over ${16, 32, 64}$ , and dropout rates were tested within ${0.2, 0.4, 0.6}$ to mitigate overfitting. The learning rates were explored in ${0.01, 0.001, 0.0001}$ , while the weight decay values were sampled from ${0, 10^{- 3}, 10^{- 4}}$ to regulate the complexity of the model. The optimal hyperparameter configuration was selected via cross-validation, based on the model performance in the validation set. The configuration used for the experiments reported in Tables 5 and 6 can be found in Table 8 in Appendix A.
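To make the size of the search space explicit, the grid described above can be enumerated as in the following sketch. The dictionary keys and the helper name are hypothetical; the values are those listed in the text.

```python
from itertools import product

# Hyperparameter values as listed in the text; key names are illustrative.
GRID = {
    "hidden_layers": [(32, 16, 8), (64, 32, 16), (128, 64, 32)],
    "batch_size": [16, 32, 64],
    "dropout": [0.2, 0.4, 0.6],
    "learning_rate": [0.01, 0.001, 0.0001],
    "weight_decay": [0.0, 1e-3, 1e-4],
}

def enumerate_configs(grid):
    """Return every combination of hyperparameter values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

configs = enumerate_configs(GRID)
```

With three candidate values for each of the five hyperparameters, the full grid contains $3^{5} = 243$ configurations, each of which would be evaluated by five-fold cross-validation.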

Table 5.
Multilayer perceptron (MLP) neural network performance metrics and variable sets for Models 1 to 4.

Variable	Model 1	Model 2	Model 3	Model 4
Variable 1	Age	Height	Age	Age
Variable 2	Gender	Weight	Height	Height
Variable 3		BMI	Weight	Weight
Variable 4		Waist	BMI	BMI
Variable 5		Diastolic blood pressure	Waist	Waist
Variable 6		Systolic blood pressure	Diastolic blood pressure	Diastolic blood pressure
Variable 7		Pulse	Systolic blood pressure	Systolic blood pressure
Variable 8			Pulse	Pulse
Variable 9			Gender	Cholesterol
Variable 10				Triglycerides
Variable 11				Gender
Cost model	$€$ 0	$€$ 0	$€$ 0	$€$ 0.5
AUC	0.696	0.670	0.706	0.713
Accuracy	0.656	0.668	0.648	0.648
Precision	0.580	0.608	0.561	0.570
Recall	0.632	0.569	0.713	0.627
$F 1$ -score	0.605	0.588	0.628	0.597
Cross-entropy	0.617	0.618	0.606	0.604

BMI: body mass index; AUC: area under the curve; NHANES: National Health and Nutrition Examination Survey.

All reported performance metrics are survey-weighted estimates based on the NHANES sampling design.

Table 6.

Multilayer perceptron (MLP) neural network performance metrics and variable sets for Models 5 to 7.

Variable	Model 5	Model 6	Model 7
Variable 1	Age	Age	Age
Variable 2	Height	Height	Height
Variable 3	Weight	Weight	Weight
Variable 4	BMI	BMI	BMI
Variable 5	Waist	Waist	Waist
Variable 6	Diastolic blood pressure	Diastolic blood pressure	Diastolic blood pressure
Variable 7	Systolic blood pressure	Systolic blood pressure	Systolic blood pressure
Variable 8	Pulse	Pulse	Pulse
Variable 9	Cholesterol	Cholesterol	Cholesterol
Variable 10	Triglycerides	Triglycerides	Triglycerides
Variable 11	Gender	Gender	Gender
Variable 12	Glycosylated hemoglobin	Glucose	Glucose
Variable 13			Glycosylated hemoglobin
Cost model	$€$ 4.5	$€$ 2.1	$€$ 6.1
AUC	0.849	0.826	0.998
Accuracy	0.779	0.741	0.979
Precision	0.767	0.784	0.977
Recall	0.675	0.722	0.973
$F 1$ -score	0.718	0.75	0.975
Cross-entropy	0.446	0.479	0.064

BMI: body mass index; AUC: area under the curve; NHANES: National Health and Nutrition Examination Survey.

All reported performance metrics are survey-weighted estimates based on the NHANES sampling design.

4.5. Model performance results

We systematically evaluated seven predictive models of different complexity (in terms of measurement and economic cost to perform the corresponding clinical tests) in a holdout test sample to detect diabetes, which was not used for model training or hyperparameter tuning (see Tables 5 and 6). For this purpose, we considered discriminative metrics such as AUC, accuracy, recall, precision, $F 1$ -score, and cross-entropy, along with the economic cost of the variables. While $F 1$ -score and cross-entropy are commonly used to evaluate the technical fit of models, we prioritize AUC, accuracy, precision, and recall for interpreting results. These metrics correspond to discrimination and diagnostic accuracy measures with direct relevance to medical decision-making.⁴¹ In this context, the term “economic cost” refers to the direct monetary cost of obtaining the clinical variables required by each predictive model, rather than the computational complexity of the algorithm. Models relying solely on demographic and anthropometric predictors (e.g. age, sex, weight, height, BMI, blood pressure, or pulse) are considered cost-free, as these measurements are routinely collected in standard clinical practice and do not incur additional expenditure. In contrast, models that incorporate laboratory-based biomarkers, such as cholesterol, triglycerides, FPG, or HbA1c, entail non-negligible costs, typically ranging from approximately $€$ 2 to $€$ 6 per patient, depending on the set of assays included. This definition enables a systematic evaluation of the trade-off between predictive precision and economic feasibility, which is particularly relevant in large-scale population screening for diabetes mellitus. Note that the cross-entropy values reported correspond to the survey-weighted binary cross-entropy loss, as defined in Section 7, and are therefore on a different numerical scale from bounded metrics such as AUC or $F 1$ -score. 
These values are directly comparable across models within the same table, but cannot be interpreted on the same 0–1 scale as the discriminative metrics.

Models 1 to 4 gradually included more clinical variables, starting with basic demographics and extending to physiological and metabolic markers such as cholesterol and triglycerides. Model 4 achieved the highest AUC in this group ($AUC = 0.713$; $accuracy = 64.8 %$) by incorporating the mentioned metabolic markers. Model 5 further improved predictive performance, achieving an AUC of 0.849 and an accuracy of $77.9 %$, notably due to the inclusion of HbA1c. Model 7, the most comprehensive, achieved the highest performance ($AUC = 0.998$; $accuracy = 97.9 %$; $precision = 97.7 %$), benefiting from an extensive integration of clinical variables, including the biomarkers used in the diagnosis of diabetes mellitus.

Our analysis reveals a clear relationship between model complexity, accuracy, and operational costs. The significant performance gains observed from Model 1 to Model 7 underscore the importance of careful variable selection, highlighting the need to balance improved predictive accuracy with practical considerations of cost and feasibility in clinical settings when performing screening for diabetes mellitus. To facilitate comparison with batchwise matching, the Supplemental materials include results from a linear logistic regression model from the NHANES 2011–2014 that used the described seven models. Across cross-validation folds—especially for Models 5–7—the non-linear neural network models demonstrate better predictive discrimination than logistic regression. In particular, the AUC improves, reflecting performance across all decision thresholds, whereas other metrics, such as accuracy, are less informative because they depend on a single cutoff (often 0.5).

Model 6 (model measurement cost of $€$ 2.1 per individual examination), which incorporates FPG, showed a favorable overall performance at low cost, achieving an accuracy of $74.1 %$, a precision of $78.4 %$, and a recall of $72.2 %$. The models most clinically useful are Models 3 and 6. Model 3, developed exclusively using demographic and anthropometric variables from a population-based survey, achieved an accuracy of $64.8 %$, a precision of $56.1 %$, and a recall of $71.3 %$. This means that it correctly classifies about two-thirds of individuals and, in particular, detects approximately $71 %$ of true cases of diabetes, although with a probability of almost $29 %$ that a diabetic individual remains undetected. Thus, while Model 3 is a valuable tool for population screening, its use results in false positives that lead to unnecessary additional tests and false negatives that may delay diagnosis and clinical intervention.

For Model 6, these metrics indicate that the probability of a diabetic individual not being identified is reduced to $28 %$, with the model detecting around $72 %$ of cases (231 of the 320 in a cohort of 1000 individuals), while also substantially reducing the number of false positives compared to Model 3 (72 vs. 112 per 1000 individuals). Consequently, Model 6 achieves a favorable balance between recall and precision, making it a more robust and efficient tool for population screening, provided it is complemented with confirmatory diagnostic tests in clinical practice. However, it should be noted that this model includes people with a prior diagnosis of diabetes but with fasting glucose levels within the normal range, thereby preserving the representativeness of the population but potentially slightly influencing the model’s true discriminatory capacity. From a public health perspective, an efficient strategy could be to apply Model 3 as a first low-cost filter to broadly identify at-risk groups, followed by Model 6 as a second, more specific layer to refine detection and reduce false positives before clinical confirmation.

4.6. External validation: Comparing neural networks and logistic regression using NHANES Survey Data (2009–2010, 2015–2016)

To assess whether the added complexity of the model is justified for detecting diabetes, we externally validated a neural network model against a competing survey-weighted logistic regression (LR) model. To minimize the distribution shift relative to the development data—and to account for potential changes in diabetes prevalence and covariate distributions throughout the survey years—we selected temporally proximate NHANES cycles for external validation. The NHANES 2009–2010 and 2015–2016 survey cycles were strictly excluded from all training, hyperparameter tuning, and cross-validation. The resulting dataset comprised a total of 11,761 individuals.

Table 7.
Comparison results across Models 1–7 for the multilayer perceptron (MLP) neural network and the logistic regression (LR).

	MLP						LR
	AUC	Acc	Prec	Rec	$F 1$	CE	AUC	Acc	Prec	Rec	$F 1$	CE
Model 1	0.775 (0.764–0.785)	0.851	0.826	0.851	0.836	0.489	0.669 (0.655–0.682)	0.770	0.851	0.770	0.801	3.671
Model 2	0.780 (0.769–0.789)	0.718	0.871	0.718	0.766	0.532	0.709 (0.696–0.721)	0.772	0.864	0.772	0.805	3.642
Model 3	0.816 (0.806–0.826)	0.776	0.867	0.776	0.808	0.479	0.737 (0.725–0.750)	0.782	0.874	0.782	0.813	3.481
Model 4	0.830 (0.821–0.839)	0.800	0.867	0.800	0.825	0.483	0.741 (0.730–0.754)	0.781	0.875	0.781	0.813	3.495
Model 5	0.947 (0.940–0.953)	0.854	0.917	0.854	0.873	0.419	0.868 (0.858–0.875)	0.857	0.919	0.857	0.875	2.285
Model 6	0.906 (0.897–0.914)	0.893	0.907	0.893	0.898	0.411	0.829 (0.818–0.840)	0.847	0.906	0.847	0.866	2.447
Model 7	0.942 (0.935–0.948)	0.879	0.921	0.879	0.893	0.389	0.863 (0.854–0.872)	0.852	0.917	0.852	0.871	2.365

NHANES: National Health and Nutrition Examination Survey.

Mean values are reported for: (i) area under the curve (AUC); (ii) accuracy (Acc); (iii) precision (Prec); (iv) recall (Rec); (v) $F 1$-score ($F 1$); and (vi) binary cross-entropy (CE). All reported metrics are survey-weighted estimates based on the NHANES external validation set (2009–2010 and 2015–2016 cycles) and include all individuals aged 20–80 years. The AUC column additionally includes the 95% bootstrap confidence interval, computed with 1000 bootstrap replicates ($B = 1000$).

Although Table 7 reports multiple performance metrics for seven model specifications (Models 1–7), we emphasize the AUC metric in our detailed analysis because it provides a threshold-independent measure of discrimination.

Across the NHANES 2009–2010 and 2015–2016 samples, neural network models achieved higher AUC than logistic regression baselines, suggesting that flexible, non-linear architectures can adapt in a data-driven manner to complex functional relationships between predictors and diabetes risk. While logistic regression can accommodate non-linearities through pre-specified interaction terms, polynomial expansions, or spline transformations, such structures must be specified a priori. In contrast, neural networks can automatically learn these representations from the data, which may explain their improved discrimination in this setting.

From a clinical perspective, improved discrimination can support better identification and reclassification of at-risk individuals; however, these gains must be weighed against the risk of overfitting and the need for population-level generalizability. For the models with lower clinical complexity (Models 1–4), the neural network improved AUC by roughly 10–16% relative to logistic regression (absolute gains of 0.07–0.11). When diabetes-specific biomarkers were included (Models 5–7), the incremental benefit of the neural network was slightly smaller (about 9% in relative terms, or 0.08 in absolute terms). Both approaches generalized well, but the neural networks’ performance remained closer to that observed in the NHANES 2011–2014 cohort, while logistic regression showed a greater decline in the external cycles. Taken together, these findings support the use of flexible non-linear models to improve diabetes risk prediction when only basic predictors are available, while the advantage narrows when more direct biomarkers (e.g. HbA1c or FPG) are included.

Table 7 also reports 95% confidence intervals for the AUC, based on $B = 1000$ bootstrap resamples, for all models evaluated on the NHANES 2009–2010 and 2015–2016 cycles. Across configurations, the intervals are consistently narrow, indicating limited sampling variability. In particular, for Model 7 the multilayer perceptron (MLP) achieves an AUC of 0.942 with a 95% confidence interval of [0.935, 0.948], corresponding to a width of approximately 0.013. Similar behavior is observed for the remaining MLP configurations (widths of approximately 0.013 to 0.021) and for logistic regression (widths of approximately 0.017 to 0.027). Overall, the small dispersion of the bootstrap distributions supports the stability of the reported discriminative performance in the external validation setting and highlights the practical advantages of the neural network models in the validation cohort.
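A percentile bootstrap interval of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce Table 7: the helper names `weighted_auc` and `bootstrap_auc_ci` are hypothetical, and individuals are resampled independently with their weights attached, whereas a design-based bootstrap would typically resample primary sampling units within strata.

```python
import random


def weighted_auc(y, s, w):
    """Weighted AUC: weighted share of (positive, negative) pairs that are
    ranked correctly by the score s, with ties counted as one half."""
    order = sorted(range(len(y)), key=lambda i: s[i])
    neg_below = 0.0  # total weight of negatives with a strictly smaller score
    num = pos_total = neg_total = 0.0
    i, n = 0, len(order)
    while i < n:
        j, pos_w, neg_w = i, 0.0, 0.0
        # Gather the group of observations sharing the same score.
        while j < n and s[order[j]] == s[order[i]]:
            k = order[j]
            if y[k] == 1:
                pos_w += w[k]
            else:
                neg_w += w[k]
            j += 1
        num += pos_w * (neg_below + 0.5 * neg_w)
        neg_below += neg_w
        pos_total += pos_w
        neg_total += neg_w
        i = j
    if pos_total == 0 or neg_total == 0:
        return float("nan")  # AUC is undefined with a single class
    return num / (pos_total * neg_total)


def bootstrap_auc_ci(y, s, w, B=1000, level=0.95, seed=0):
    """Point estimate and percentile bootstrap interval for the weighted AUC."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        a = weighted_auc([y[i] for i in idx],
                         [s[i] for i in idx],
                         [w[i] for i in idx])
        if a == a:  # drop replicates in which the AUC is undefined (NaN)
            stats.append(a)
    stats.sort()
    tail = (1.0 - level) / 2.0
    lo = stats[int(tail * (len(stats) - 1))]
    hi = stats[int((1.0 - tail) * (len(stats) - 1))]
    return weighted_auc(y, s, w), (lo, hi)
```

The interval width discussed above is then simply `hi - lo` for each model.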

4.7. Uncertainty quantification in NHANES 2011–2014

A central challenge in predictive modeling is not only to provide an outcome label (e.g. diabetes = 1 vs. 0), but also to communicate how reliable this prediction is for a specific individual. In clinical decision-making, physicians rarely act on a binary prediction alone: they also need to understand the degree of uncertainty surrounding it. Two patients may both be classified as “at-risk for diabetes,” yet the confidence of the model in these predictions can differ substantially, with implications for follow-up testing or treatment decisions. Capturing and quantifying this confidence variability at the individual level is therefore essential to transition from population-wide models to personalized, clinically actionable tools.

Formally, conformal prediction provides individual-specific prediction sets. For each individual and each confidence level $α \in (0, 1)$, the procedure yields

$$\tilde{C}^{α}(X_{n + 1}; D_{n}) \subseteq \{\emptyset, 0, 1\},$$

which guarantees coverage but is difficult to interpret clinically. In particular, for binary outcomes, the set can contain both labels, $\{0, 1\}$, leaving the physician uncertain about how to act, and it does not directly explain how the individual’s covariates contribute to uncertainty. To overcome this limitation, we propose an individualized score $h_{i}$ that summarizes the width of the conformal prediction set into a single calibrated number, interpretable as a personalized “uncertainty index.”

Following Algorithm 1, for observation $i$, the conformal inference-based score is defined as

$$h_{i} = \tilde{s}(X_{i}, 1) - \tilde{q}_{α}(X_{i}) + Q_{1 - α}(S, I_{3}),$$

where $\tilde{s}(X_{i}, 1)$ denotes the predicted diabetes score for participant $i$, $\tilde{q}_{α}(X_{i})$ represents the quantile regression estimate of the score $\tilde{s}_{i}$, and $Q_{1 - α}(S, I_{3})$ is the corresponding calibration constant. For practical purposes, we select $α = 0.05$. By construction, $h_{i}$ is not a clinical biomarker itself, but rather a measure of the width of the conformal prediction set for this patient. A larger $h_{i}$ reflects greater uncertainty (lower model confidence), whereas a smaller $h_{i}$ reflects more decisive predictions. This index provides a direct, individualized quantification of uncertainty that complements the binary label, translating statistical guarantees into a form that can be meaningfully interpreted in clinical settings.
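To make the arithmetic behind $h_{i}$ concrete, the sketch below evaluates the formula once its three ingredients are available. It is a simplified illustration that does not reproduce Algorithm 1: it assumes that the conformity scores $S$ on the calibration split $I_{3}$ have already been computed, that the calibration constant is taken as a survey-weighted empirical quantile of those scores, and that no finite-sample correction is applied. The names `weighted_quantile` and `uncertainty_index` are hypothetical.

```python
def weighted_quantile(values, weights, q):
    """Smallest value v whose weighted empirical CDF is at least q."""
    pairs = sorted(zip(values, weights))
    total = float(sum(weights))
    acc = 0.0
    for v, wt in pairs:
        acc += wt
        if acc >= q * total:
            return v
    return pairs[-1][0]


def uncertainty_index(s_hat, q_hat, calib_scores, calib_weights, alpha=0.05):
    """Evaluate h_i = s(X_i, 1) - q_alpha(X_i) + Q_{1 - alpha}(S, I_3).

    s_hat: predicted diabetes scores s(X_i, 1) for the individuals of interest.
    q_hat: quantile-regression estimates q_alpha(X_i) for the same individuals.
    calib_scores, calib_weights: conformity scores and survey weights on the
    calibration split I_3 (assumed to be precomputed).
    """
    Q = weighted_quantile(calib_scores, calib_weights, 1.0 - alpha)
    return [s - q + Q for s, q in zip(s_hat, q_hat)]
```

Because the calibration constant is shared by all individuals, differences in $h_{i}$ between two patients are driven entirely by the gap between the predicted score and its estimated conditional quantile.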

To explore which patient characteristics are most strongly associated with predictive uncertainty, we modeled the individual uncertainty score $h_{i}$ as a function of FPG, heart rate (HR), and age. We adopted a generalized additive model (GAM) with smooth spline terms, allowing each covariate to contribute flexibly and non-linearly to the outcome. Formally, the model can be expressed as

$$h_{i} = β_{0} + f_{\mathrm{FPG}}(\mathrm{FPG}_{i}) + f_{\mathrm{HR}}(\mathrm{HR}_{i}) + f_{\mathrm{Age}}(\mathrm{Age}_{i}) + ε_{i},$$

where $β_{0}$ is the intercept, $f_{\mathrm{FPG}}(\cdot)$, $f_{\mathrm{HR}}(\cdot)$, and $f_{\mathrm{Age}}(\cdot)$ are smooth functions estimated from the data, and $ε_{i}$ represents the residual variation. This specification enables the model to capture complex associations beyond linear trends.
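The additive structure of this model can be illustrated with a small backfitting routine. The sketch below is deliberately simplified and is not the fitting procedure used in the paper: it replaces the spline smoothers with piecewise-constant bin averages, ignores survey weights, and uses hypothetical function names (`bin_smoother`, `backfit`). It only shows how each component function is updated on the partial residuals left by the others.

```python
def bin_smoother(x, r, n_bins=5):
    """Piecewise-constant smoother: average of r within equal-width bins of x.
    The fitted values are centred so that each component has mean zero."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant covariate
    idx = [min(int((xi - lo) / width), n_bins - 1) for xi in x]
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for b, ri in zip(idx, r):
        sums[b] += ri
        counts[b] += 1
    means = [sm / c if c else 0.0 for sm, c in zip(sums, counts)]
    fit = [means[b] for b in idx]
    centre = sum(fit) / len(fit)
    return [v - centre for v in fit]


def backfit(covariates, h, n_iter=20, n_bins=5):
    """Fit h_i = beta0 + sum_k f_k(x_ik) + error by backfitting.

    covariates: dict mapping a covariate name (e.g. "FPG", "HR", "Age")
    to its list of values; h: list of uncertainty scores.
    Returns the intercept and the fitted component values per covariate.
    """
    n = len(h)
    beta0 = sum(h) / n
    f = {name: [0.0] * n for name in covariates}
    for _ in range(n_iter):
        for name, x in covariates.items():
            # Partial residuals: remove the intercept and all other components.
            partial = [
                h[i] - beta0 - sum(f[o][i] for o in covariates if o != name)
                for i in range(n)
            ]
            f[name] = bin_smoother(x, partial, n_bins)
    return beta0, f
```

Plotting each fitted component against its covariate gives partial-effect curves analogous in spirit to those summarized in Figure 3, although spline smoothers would produce smooth rather than step-shaped estimates.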

The fitted spline terms (Figure 3) reveal several key epidemiological patterns:

Highest uncertainty was observed among older adults (50–70 years) with elevated heart rate and FPG values in the pre-diabetic range (100–125 mg/dL).

Lower uncertainty occurred in clearly normoglycemic or diabetic older adults, as well as in younger individuals with low HR and low FPG, where the model predictions were more reliable.

Clinical relevance: Conformal scores can therefore indicate subgroups where predictions are fragile and variable, such as older adults in the pre-diabetic range with elevated HR. For these individuals, additional diagnostic testing or closer monitoring may be warranted.

Figure 3.

Estimated covariate effects on the score function $h_{i}$, obtained with the proposed conformal prediction algorithm for survey data. LBXSGL_43 denotes the variable FPG. FPG: fasting plasma glucose.

5. Discussion

This work introduces a novel predictive framework based on neural network models that can accommodate a range of complex sampling designs, including stratified sampling, Bernoulli sampling, and maximum entropy sampling. A theoretical analysis sketched in Appendix A indicates that the proposed regression methods achieve universal consistency under certain regularity conditions on the survey design, analogous to the consistency obtained with i.i.d. data, leveraging recent advances in empirical process theory for survey data.

In the NHANES 2011–2014 case study, the simpler specifications (Models 1–4) yield modest gains in terms of statistical association; however, as model complexity increases (Models 5–7), the non-linear capabilities of the neural networks yield more pronounced advantages. In the external validation data (NHANES 2009–2010 and 2015–2016), the neural networks also outperform the linear logistic regression model, particularly in terms of AUC.

Accordingly, these findings should not be interpreted as a blanket endorsement of neural networks over traditional methods. Instead, our results indicate that the observed performance gains are most pronounced in the presence of non-linear associations and higher-order interaction effects. Neural networks can adaptively learn such structures from the data, while logistic regression generally relies on explicit pre-specification of non-linear transformations and interaction terms, which can be challenging to choose and may require prior substantive knowledge.

In the clinical application, which focused on predicting diabetes risk using the NHANES 2011–2014 data, we evaluated models of varying complexity and their associated economic costs. From a public health perspective, our findings are relevant because they quantify the risk of diabetes using models that differ in complexity, resource requirements, and predictive efficacy. In particular, simpler models demonstrated sufficient effectiveness in certain subgroups of patients, facilitating diabetes risk screening. Quantification of uncertainty revealed that age, elevated heart rate, and baseline glucose levels in the pre-diabetic range significantly contribute to the uncertainty of predictions. Even when the main diagnostic criteria are used as predictors, perfect predictive performance is unattainable because some individuals who were previously diagnosed with diabetes have normal glucose levels.

Beyond methodological relevance, the uncertainty quantification step has direct clinical implications. GAM-based analysis of conformal uncertainty scores highlights patient subgroups where predictions are less reliable, such as older adults with elevated heart rate and fasting glucose values in the pre-diabetic range. In practice, these individualized uncertainty estimates can be used to identify patients who may benefit from more frequent follow-up schedules or additional diagnostic tests, thus supporting risk-stratified decision-making and more efficient allocation of healthcare resources.

A model excluding laboratory measurements achieved an AUC of 0.71, comparable to traditional diabetes risk scores, which incorporate anthropometric and cardiac variables alongside basic demographic characteristics. A recent comment⁴² highlighted the inconsistency of diabetes scores derived from observational cohorts and limited experimental designs, a concern that aligns with our introduction and underscores the reproducibility crisis. Our proposed models offer improved population-level reproducibility. Future analyses may include creating patient phenotypes based on prediction uncertainty or identifying clinically interpretable subphenotypes where models demonstrate high discriminative capacity. For patients with high prediction uncertainty or poor predictive performance, personalized monitoring strategies, alternative measurements, or the integration of longitudinal data could be essential.

From a methodological perspective, despite extensive research on classical regression models in survey contexts, the application of machine learning remains limited. We introduce the first general NN-based framework that integrates regression and classification methods, along with a conformal-prediction-based uncertainty quantification algorithm. Although comparisons between machine learning methods and logistic regression are well established in the literature, the novelty of this work lies in extending neural network modeling to properly account for survey weighting, a capability that has long been available for logistic regression. The relative performance of the two approaches depends on how well the logistic form matches the true data-generating process. When the logistic model is correctly specified, it can outperform the neural network; however, when the model is misspecified, the greater flexibility of the neural network may yield improved predictive performance. All code and analytical scripts are publicly available on GitHub, facilitating the adoption of these methods to improve outcomes in precision public health and epidemiology. To highlight the relevance of this study, it is essential to note that more than 50,000 articles have been published to date using the NHANES data, all of which require the use of reliable statistical methods.

For future developments, several methodological extensions are recommended for the neural network framework. First, expanding the approach to handle time-to-event analyses could significantly improve predictions for censored outcomes. Second, developing novel diagnostic tools using conditional receiver operating characteristic analyses would enhance the model’s applicability. Third, incorporating functional data analysis techniques to manage random predictors in functional spaces could substantially benefit digital health applications, particularly when analyzing accelerometer data, minute-level measurements, or continuous glucose monitoring data.^7,43

Footnotes

ORCID iDs

Marcos Matabuena

Rahul Ghosal

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has received financial support from the Spanish Ministry of Science, Innovation and Universities (grants PID2023-149549NB-I00 and PDC2025-166312-I00).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix A

The primary goal of this paper is to model empirical diabetes data. However, the methods introduced here are asymptotically consistent, leveraging recent progress in empirical process theory for survey sampling. In this appendix, we sketch the essential ideas behind these developments, referring in particular to Han and Wellner.⁴⁴

Moreover, we connect these survey-based empirical-process techniques to established characterizations for i.i.d. data in the neural network literature. Specifically, existing results on metric bracketing and Vapnik–Chervonenkis-dimension arguments—which often assume i.i.d. samples—can be extended to the survey setting by substituting the Horvitz–Thompson measure in place of the usual empirical measure. This substitution preserves key properties of the empirical-process framework, ensuring the validity of asymptotic consistency and normality results for a neural network trained on survey data.
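For reference, the substitution described above can be written out explicitly. The notation below is an assumption introduced for illustration and is not taken from the main text: $\xi_{i}$ denotes the sampling indicator of unit $i$ in a finite population of size $N$, $\pi_{i}$ its first-order inclusion probability, $g_{θ}$ a neural network with parameters $θ$, and $\ell$ a generic loss function.

```latex
\mathbb{P}_{N} f = \frac{1}{N} \sum_{i=1}^{N} f(X_{i}, Y_{i})
\quad \longrightarrow \quad
\mathbb{P}_{N}^{\pi} f = \frac{1}{N} \sum_{i=1}^{N} \frac{\xi_{i}}{\pi_{i}} f(X_{i}, Y_{i}),
\qquad
\hat{\theta} \in \arg\min_{\theta} \; \mathbb{P}_{N}^{\pi} \, \ell\bigl(Y, g_{\theta}(X)\bigr).
```

Since $E[\xi_{i}] = \pi_{i}$ under the sampling design, the Horvitz–Thompson measure is design-unbiased for the finite-population average, which is the property that allows the i.i.d. arguments to carry over once uniform limit theorems for the weighted process are available.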

References

1. The crisis of reproducibility, the denominator problem and the scientific role of multi-scale modeling. Bull Math Biol 2018; 80: 3071–3080.

2. Begley, Ioannidis. Reproducibility in science: improving the standard for basic and preclinical research. Circ Res 2015; 116: 116–126.

3. Munafò, Chambers, Collins, et al. The reproducibility debate is an opportunity, not a crisis. BMC Res Notes 2022; 15: 43.

4. Yates. Sir Ronald Fisher and the design of experiments. Biometrics 1964; 20: 307–321.

5. Robertson, Lee, López-Kolkovska, et al. Response-adaptive randomization in clinical trials: from myths to practical considerations. Stat Sci 2023; 38: 185–208.

6. Matabuena, Félix, Hammouri ZAA, et al. Physical activity phenotypes and mortality in older adults: a novel distributional data analysis of accelerometry in the NHANES. Aging Clin Exp Res 2022; 34: 3107–3114.

7. Matabuena, Petersen. Distributional data analysis of accelerometer data from the NHANES database using nonparametric survey regression models. J R Stat Soc Ser C Appl Stat 2023; 72: 294–313.

8. Zou, Matabuena, Kosorok. Distributional random forests for complex survey designs on reproducing kernel Hilbert spaces, https://arxiv.org/abs/2512.08179 (2026).

9. Bradley, Nichols. Addressing selection bias in the UK Biobank neurological imaging cohort. MedRxiv:2022-01.

10. Swanson. The UK Biobank and selection bias. Lancet 2012; 380: 110.

11. Dahlöf. Cardiovascular disease risk factors: epidemiology and risk assessment. Am J Cardiol 2010; 105: 3A–9A.

12. Hoeyer. Data as promise: reconfiguring Danish public health through personalized medicine. Soc Stud Sci 2019; 49: 531–555.

13. Martin, Ruder, Escarce, et al. Developing predictive models of health literacy. J Gen Intern Med 2009; 24: 1211–1216.

14. Wiemken, Kelley. Machine learning in epidemiology and health outcomes research. Annu Rev Public Health 2020; 41: 21–36.

15. Hammouri ZAA, Mier, Félix, et al. Uncertainty quantification in medicine science: the next big step. Arch Bronconeumol 2023; 59: 760–761.

16. Parker, Lin, Mahoney, et al. Economic costs of diabetes in the U.S. in 2022. Diabetes Care 2024; 47: 26–43.

17. Turnbull, Firth, Wilkie, et al. Population screening requires robust evidence-genomics is no exception. Lancet 2024; 403: 583–586.

18. National Center for Chronic Disease Prevention and Health Promotion. National diabetes statistics report, 2020: estimates of diabetes and its burden in the United States. Technical report, National Center for Chronic Disease Prevention and Health Promotion, Division of Diabetes Translation, Centers for Disease Control and Prevention (U.S.), 2020.

19. Heilmann, Trenkamp, Möser, et al. Precise glucose measurement in sodium fluoride-citrate plasma affects estimates of prevalence in diabetes and prediabetes. Clin Chem Lab Med 2024; 62: 762–769.

20. Kotzaeridi, Blätter, Eppel, et al. Characteristics of gestational diabetes subtypes classified by oral glucose tolerance test values. Eur J Clin Invest 2021; 51: e13628.

21. Lumley, Scott. Fitting regression models to survey data. Stat Sci 2017; 32: 265–278.

22. Harms, Duchesne. On kernel nonparametric regression designed for complex survey data. Metrika 2010; 72: 111–138.

23. Fan. Local polynomial modelling and its applications, monographs on statistics and applied probability. 1st ed., Vol. 66. Boca Raton, FL: Chapman & Hall/CRC, 1996.

24. Fraser, Lipsitz, Sinha, et al. A note on median regression for complex surveys. Biostatistics 2021; 23: 1074–1082.

25. Fan, Zhong. A selective overview of deep learning. Stat Sci 2021; 36: 264–290.

26. Kingma. Adam: a method for stochastic optimization. In: Bengio Y and LeCun Y (eds) 3rd International conference on learning representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings, 2015.

27. Banerji, Chakraborti, Harbron, et al. Clinical AI tools must convey predictive uncertainty for each individual patient. Nat Med 2023; 29: 2996–2998.

28. Barber, Candes, Ramdas, et al. Conformal prediction beyond exchangeability. Ann Stat 2022; 52: 816–845.

29. Vovk, Gammerman, Shafer. Algorithmic learning in a random world. Vol. 29. New York: Springer, 2005.

30. Tibshirani, Foygel Barber, Candes, et al. Conformal prediction under covariate shift. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E and Garnett R (eds) Advances in neural information processing systems. Vol. 32. Curran Associates, Inc., 2019.

31. Cauchois, Gupta, Duchi. Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction. J Mach Learn Res 2021; 22: 1–42.

32. Makrilakis, Liatis, Grammatikou, et al. Validation of the Finnish diabetes risk score (FINDRISC) questionnaire for screening for undiagnosed type 2 diabetes, dysglycaemia and the metabolic syndrome in Greece. Diabetes Metab 2011; 37: 144–151.

33. Mühlenbruch, Paprott, Joost, et al. Derivation and external validation of a clinical version of the German Diabetes Risk Score (GDRS) including measures of HbA1c. BMJ Open Diabetes Res Care 2018; 6: e000524.

34. Cahn, Shoshan, Sagiv, et al. Prediction of progression from pre-diabetes to diabetes: development and validation of a machine learning model. Diabetes Metab Res Rev 2020; 36: e3252.

35. Cai, et al. Machine learning for predicting the 3-year risk of incident diabetes in Chinese adults. Front Public Health 2021; 9.

36. Johnson, Dohrmann, Burt, et al. National Health and Nutrition Examination Survey: sample design, 2011–2014. Technical Report 162, National Center for Health Statistics. DHHS Publication No. (PHS) 2014-1362, 2014.

37. National Center for Health Statistics. Analytic guidelines, 2011–2012 and continuous NHANES. Technical report, National Center for Health Statistics, Centers for Disease Control and Prevention (U.S.), 2013.

38. Stierman, Afful, Carroll, et al. NHANES 2017–March 2020 pre-pandemic data files: development and prevalence estimates. Technical Report 158, National Center for Health Statistics, Hyattsville, MD. National Health Statistics Reports No. 158, 2021.

39. American Diabetes Association. Classification and diagnosis of diabetes: standards of medical care in diabetes-2020. Diabetes Care 2020; 43: S14–S31.

40. Goodacre. On splitting training and validation set: a comparative study of cross-validation, bootstrap and systematic sampling for estimating the generalization performance of supervised learning. J Anal Test 2018; 2: 249–262.

41. Mallett, Halligan, Thompson, et al. Interpreting diagnostic accuracy studies for patient care. BMJ 2012; 345: e3999.

42. Mohsen, Al-Absi, Yousri, et al. A scoping review of artificial intelligence-based methods for diabetes risk prediction. npj Digit Med 2023; 6: 197.

43. Matabuena, Ghosal, Meiring, et al. Predicting distributions of physical activity profiles in the National Health and Nutrition Examination Survey database using a partially linear Fréchet single index model. Biostatistics 2025; 26: kxaf013.

44. Han, Wellner. Complex sampling designs: uniform limit theorems and applications. Ann Stat 2021; 49: 459–485.

45. Horvitz, Thompson. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 1952; 47: 663–685.

46. Bartlett, Harvey, Liaw, et al. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J Mach Learn Res 2019; 20: 1–17.

47. Rynkiewicz. Asymptotic statistics for multilayer perceptron with ReLU hidden units. Neurocomputing 2019; 342: 16–23.
