Abstract
Objective
Pneumonia remains a significant global health burden; however, accessible and robust predictors suitable for community screening are limited. This study aimed to investigate the individual and joint associations of the hemoglobin-to-red cell distribution width ratio, a marker of systemic inflammation, and sleep duration with pneumonia prevalence in a nationally representative sample.
Methods
We conducted a cross-sectional analysis of 39,156 adult participants from the National Health and Nutrition Examination Survey 2011–2018. To address potential selection bias from missing data, Multiple Imputation by Chained Equations (MICE) was employed to retain the full cohort. Logistic regression and restricted cubic spline analyses were used to assess independent associations. Machine learning (XGBoost) models were developed to evaluate predictive performance and feature interactions.
Results
Both lower hemoglobin-to-red cell distribution width ratio and shorter sleep duration were significantly associated with increased pneumonia prevalence. In the fully adjusted logistic model, the association between hemoglobin-to-red cell distribution width ratio and pneumonia was attenuated after adjustment for metabolic comorbidities, whereas the association with short sleep duration persisted. The XGBoost model demonstrated consistent predictive performance and identified a complex nonlinear interaction between hemoglobin-to-red cell distribution width ratio and sleep duration.
Conclusions
Short sleep duration was independently associated with pneumonia prevalence, whereas the association for low hemoglobin-to-red cell distribution width ratio may operate indirectly through metabolic pathways. Hemoglobin-to-red cell distribution width ratio and sleep metrics may serve as accessible, low-cost markers for pneumonia screening in community populations.
Keywords
Introduction
Pneumonia is a common infectious disease worldwide and the fourth leading cause of death globally. It represents a major public health issue, imposing a substantial economic burden and significant health impacts on society.1,2 Pneumonia is broadly defined by the presence of clinical symptoms combined with radiographic changes on chest X-ray.3 However, the diagnostic gold standard is the identification of respiratory pathogens cultured from specimens obtained from the lungs, typically via bronchoalveolar lavage, pleural fluid sampling, or lung biopsy.2,4 Despite the advent of advanced diagnostic imaging, no widely available population-level biomarker exists for early prediction of pneumonia risk. Because traditional inflammatory biomarkers such as C-reactive protein and interleukin (IL)-6 are costly for screening, hemoglobin-based indices from routine blood tests may provide a practical alternative. Therefore, identifying novel biomarkers capable of predicting the risk of pneumonia is crucial for population health management.
The hemoglobin-to-red cell distribution width (RDW) ratio (HRR) is an emerging biomarker. It is calculated as hemoglobin (g/dL) divided by RDW.5,6 Both parameters are obtained from a routine complete blood count. Hemoglobin is an iron-containing protein in red blood cells whose primary function is oxygen transport. RDW reflects the heterogeneity in peripheral red blood cell volume and may also indicate nutritional status and hematopoietic function.7 HRR was first introduced by Sun et al. in lung cancer research and has since been examined in relation to the onset, progression, and prognosis of various conditions, including heart failure, myocarditis, depression, and stroke.8–10 These studies support the clinical relevance of HRR. Parallel to the development of such biomarkers, recent advancements in artificial intelligence have demonstrated significant potential in clinical decision-making and respiratory health. Machine learning frameworks have been successfully applied to distinguish between different types of pneumonia to guide medical interventions11 and to achieve high-precision lung parenchyma segmentation in thoracic imaging.12 Furthermore, the development of heterogeneous network-based online learners has provided new perspectives for artificial intelligence–enabled modeling in complex clinical scenarios.13 Despite these clinical and methodological advancements, evidence regarding the relationship between HRR combined with sleep duration and pneumonia remains scarce, which makes it a promising subject for research within large-scale epidemiological datasets. Our study aims to examine the association of HRR and sleep duration with pneumonia, thereby providing insights for clinical decision-making.
Methods
Study population
This was a population-based observational study based on data from the National Health and Nutrition Examination Survey (NHANES). NHANES is a national, cross-sectional survey implemented by the Centers for Disease Control and Prevention and the National Center for Health Statistics (NCHS), utilizing a complex, multistage sampling methodology. It integrates household interviews, laboratory tests, and physical examinations to obtain a large, nationally representative clinical dataset. The study procedures were approved by the NCHS Research Ethics Review Board, and each participant provided written informed consent. Data from the 2011–2018 NHANES surveys were used in this study.
In this study, Multiple Imputation by Chained Equations (MICE) was applied to the entire dataset (N = 39,156) to handle missingness. By including the outcome variable in the imputation model, we leveraged the correlation structure among all available sociodemographic and clinical predictors to estimate missing values. This approach allowed us to retain all participants in the analysis and thereby preserve the national representativeness of the NHANES cohort.
Definition of pneumonia
We assessed questions related to respiratory symptoms. Participants were classified as having pneumonia if they answered “Yes” to the question “Did you have a head cold or chest cold that started during those 30 days?” or to the question “Did you have a stomach or intestinal illness with vomiting or diarrhea that started during those 30 days?”
HRR and sleep duration
The HRR was derived from the complete blood count, calculated as hemoglobin (g/dL) divided by the RDW. Blood samples were collected and processed at the NHANES mobile examination centers, following the detailed protocols in the NHANES laboratory/medical technologists’ procedures manual. Sleep duration data were sourced from the NHANES Sleep Disorders Questionnaire (SLQ) module.
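The ratio described above can be written as a one-line function. This is only a minimal sketch: the function name is hypothetical, RDW is assumed to be expressed in percent, and the example values are illustrative rather than taken from the study data.

```python
def hemoglobin_rdw_ratio(hemoglobin_g_dl: float, rdw_percent: float) -> float:
    """Return HRR = hemoglobin (g/dL) / RDW (assumed to be reported in %)."""
    if hemoglobin_g_dl <= 0 or rdw_percent <= 0:
        raise ValueError("hemoglobin and RDW must be positive")
    return hemoglobin_g_dl / rdw_percent


# Illustrative values only (not from NHANES): 14.0 g/dL and an RDW of 12.5%.
example_hrr = hemoglobin_rdw_ratio(14.0, 12.5)
```

A lower hemoglobin or a higher RDW both pull the ratio down, which is why a low HRR is read as the unfavorable direction throughout the paper.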
Covariates
To ensure the comprehensiveness of the study, we included several covariates based on the literature and the experience of our clinical team. These covariates encompassed demographic characteristics, lifestyle factors, and clinical disease information. Demographic characteristics included age (years); sex (male/female); race/ethnicity (Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, or Other Race); poverty-income ratio (PIR) categorized as ≤1, >1 to ≤3, and >3; marital status (married, widowed, divorced, separated, never married, and living with partner); educational level (less than 9th grade, 9–11th grade, high school graduate/GED, some college or AA degree, or college graduate or above); and body mass index (BMI) (kg/m²). BMI was calculated as weight in kilograms divided by height in meters squared and was categorized as underweight (<20 kg/m²), normal weight (≥20 to <25 kg/m²), overweight (≥25 to <30 kg/m²), and obese (≥30 kg/m²). Lifestyle factors included smoking status (current smoker/never smoker) and alcohol consumption (drinker/nondrinker, based on self-report). Clinical disease information included the status of hypertension, diabetes, and hyperlipidemia, which were obtained from self-reports by participants or their proxies.
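The BMI definition and cut-offs stated above translate directly into code. The sketch below uses hypothetical function names and simply encodes the four categories as given in the text.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value (kg/m^2) to the four categories used in this study."""
    if value < 20:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    return "obese"


# Illustrative participant: 70 kg and 1.75 m.
example_category = bmi_category(bmi(70.0, 1.75))
```

Using half-open intervals (lower bound inclusive, upper bound exclusive) guarantees that every value falls into exactly one category.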
Statistical analysis
In this descriptive analysis, continuous variables were presented as mean ± standard deviation, while categorical variables were expressed as frequency and percentage. Unpaired Student’s t-test and chi-square tests were used for comparisons.
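As a worked illustration of the unpaired Student's t-test, the sketch below recomputes the age comparison from the summary statistics reported in the Results (29.25 ± 24.16 years in 6438 participants with pneumonia vs. 32.83 ± 24.92 years in the remaining 39,156 − 6438 participants). The function name is hypothetical, and the calculation ignores survey weights and multiple imputation, so it is an approximation and not the authors' exact procedure.

```python
import math


def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic (equal-variance form) from summary statistics."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    degrees_of_freedom = n1 + n2 - 2
    return (mean1 - mean2) / standard_error, degrees_of_freedom


# Age (years): pneumonia group vs. non-pneumonia group, as reported in the Results.
t_stat, dof = pooled_t_statistic(29.25, 24.16, 6438, 32.83, 24.92, 39156 - 6438)
# t_stat is about -10.6, far beyond the threshold for p < 0.001 at this sample size.
```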
First, multivariable logistic regression models were employed to assess the relationships of HRR and sleep duration with pneumonia prevalence. The first HRR quartile (Q1) and individuals with sleep duration >7 h were used as reference groups. Trend tests were conducted based on HRR quartiles. Model 1 was not adjusted for covariates. Model 2 was adjusted for age, sex, and race. Model 3 was adjusted for age, sex, race, educational level, marital status, BMI, PIR, smoking status, alcohol consumption, diabetes, hypertension, and hyperlipidemia.
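For intuition, the unadjusted comparison of one exposure group against a reference group (as in Model 1) reduces to a crude odds ratio from a 2×2 table. The sketch below assigns quartiles and computes such an odds ratio with a Wald 95% confidence interval. All data and counts are hypothetical, the helper names are illustrative, and the adjusted Models 2 and 3 would require a full logistic regression fit that is not shown here.

```python
import math
from statistics import quantiles


def assign_quartile(value, cuts):
    """Return the quartile (1-4) of a value given the three quartile cut points."""
    return 1 + sum(value > cut for cut in cuts)


def crude_odds_ratio(cases_exp, noncases_exp, cases_ref, noncases_ref):
    """Odds ratio vs. the reference group with a Wald 95% confidence interval."""
    odds_ratio = (cases_exp * noncases_ref) / (noncases_exp * cases_ref)
    se_log_or = math.sqrt(
        1 / cases_exp + 1 / noncases_exp + 1 / cases_ref + 1 / noncases_ref
    )
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical HRR values, used only to show the quartile assignment.
hrr_values = [0.8, 0.9, 1.0, 1.05, 1.1, 1.15, 1.2, 1.3]
cuts = quantiles(hrr_values, n=4)
groups = [assign_quartile(v, cuts) for v in hrr_values]

# Hypothetical counts: Q4 has 30 cases / 70 non-cases, Q1 (reference) has 40 / 60.
odds_ratio, ci_low, ci_high = crude_odds_ratio(30, 70, 40, 60)
```

An odds ratio below 1 for a higher quartile corresponds to the inverse association the paper describes; whether it is significant depends on whether the interval excludes 1.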
Second, subgroup analyses were performed according to sex, age, race, BMI, and PIR. Interaction effects between HRR, sleep duration, and pneumonia were explored across different subgroups. Age was divided into two groups (≤50 years, >50 years). BMI was categorized into four groups (<20 kg/m², ≥20 to <25 kg/m², ≥25 to <30 kg/m², and ≥30 kg/m²). In subgroup analyses, models were not adjusted for the stratification variable itself.
Third, restricted cubic spline (RCS) regression with four knots was used to investigate potential nonlinear associations between HRR, sleep duration, and pneumonia.
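A restricted cubic spline with k knots expands a continuous exposure into one linear term plus k − 2 nonlinear terms that are constrained to be linear beyond the outermost knots. The sketch below implements the commonly used truncated-power parameterization (without the optional scaling constant). The knot positions are hypothetical, since the text specifies only that four knots were used.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis: [linear term, k-2 nonlinear terms]."""
    t = sorted(knots)
    if len(t) < 3:
        raise ValueError("at least three knots are required")

    def pos_cubed(u):
        # Truncated cubic: zero for u <= 0, u**3 otherwise.
        return max(u, 0.0) ** 3

    last_gap = t[-1] - t[-2]
    terms = [float(x)]
    for j in range(len(t) - 2):
        terms.append(
            pos_cubed(x - t[j])
            - pos_cubed(x - t[-2]) * (t[-1] - t[j]) / last_gap
            + pos_cubed(x - t[-1]) * (t[-2] - t[j]) / last_gap
        )
    return terms


# Hypothetical knots for sleep duration in hours (four knots -> three basis terms).
example_knots = [5.0, 6.5, 8.0, 9.5]
basis_at_7h = rcs_basis(7.0, example_knots)
```

Entering these terms into the logistic model and testing whether the nonlinear coefficients are jointly zero yields the kind of "P for nonlinearity" reported in the dose-response analysis.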
Finally, we applied the Extreme Gradient Boosting (XGBoost) algorithm to construct a prediction model and evaluate the predictive capability of input features for the study outcome. XGBoost is an ensemble learning method based on a gradient boosting framework. It iteratively fits weak learners (decision trees), each trained to correct the errors of the current ensemble, and sums their outputs into a strong classifier, thereby optimizing model performance in classification tasks.
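The boosting procedure described above can be summarized in standard notation. The symbols below are not taken from the paper and follow the usual textbook presentation of regularized gradient-boosted trees: at round t a new tree f_t is added to the previous prediction with learning rate η, and the tree is chosen to minimize the loss ℓ plus a complexity penalty Ω, where T is the number of leaves and w the vector of leaf weights.

```latex
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(x_i),
\qquad
\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda\,\lVert w \rVert^{2}
```

For a binary outcome such as pneumonia status, ℓ is typically the logistic loss, and the predicted probability is obtained by applying the sigmoid function to the summed tree outputs.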
To address the class imbalance inherent in the dataset (where pneumonia cases are relatively rare), we incorporated a cost-sensitive learning weight (scale_pos_weight) and utilized the area under the precision–recall curve (AUPRC) as the primary evaluation metric during model tuning, which provides a more rigorous assessment for imbalanced clinical outcomes than the standard area under the curve (AUC). The full dataset (N = 39,156) was randomly partitioned into a training set (70%) and an independent test set (30%). Five-fold cross-validation was performed on the training set to optimize critical hyperparameters, including the learning rate, maximum tree depth, and subsampling ratio. The model’s discriminative performance was further validated on the test set by calculating the AUC and AUPRC. Feature importance was ranked based on gain scores to identify the most contributive predictors.
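The weighting and evaluation metrics mentioned above can be illustrated without any machine learning library. In the sketch below, the positive-class weight is taken as the ratio of negative to positive cases, which is a common heuristic; the value the authors actually used is not stated, and whole-cohort counts are assumed here for illustration. The AUC and AUPRC functions are simple reference implementations (AUPRC computed as average precision, ignoring tied scores) applied to hypothetical toy scores.

```python
def scale_pos_weight(n_negative, n_positive):
    """Common heuristic for the positive-class weight: negatives / positives."""
    return n_negative / n_positive


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (ties not handled)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("no positive labels")
    return precision_sum / hits


# Whole-cohort counts from the paper: 6438 cases among 39,156 participants.
weight = scale_pos_weight(39156 - 6438, 6438)

# Hypothetical toy labels and predicted scores.
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_auprc = average_precision(toy_labels, toy_scores)
```

For a non-informative classifier the expected AUC is 0.5, whereas the expected AUPRC equals the outcome prevalence, which is why AUPRC is the more demanding yardstick when cases are a minority.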
Ethical considerations
The National Center for Health Statistics Research Ethics Review Board approved all NHANES survey protocols. All participants provided written informed consent. This study used publicly available, de-identified NHANES data, which exempted it from further ethical review.
Results
Baseline characteristics of participants
A total of 39,156 participants were included, with 6438 (16.4%) in the pneumonia group (Table 1). The pneumonia group was significantly younger than the non-pneumonia group (29.25 ± 24.16 vs. 32.83 ± 24.92 years, p < 0.001).
Table 1. Basic characteristics of the study population in NHANES 2011–2018.
NHANES: National Health and Nutrition Examination Survey; BMI: body mass index; PIR: poverty income ratio
Demographic and socioeconomic analysis revealed significant disparities (all p < 0.001). Pneumonia prevalence was higher among participants with lower educational level (<9th grade: 8.8%) and those in the lowest income bracket (PIR ≤ 1: 29.0%). Racial distribution also varied significantly, with Non-Hispanic Whites (31.9%) and Non-Hispanic Blacks (24.1%) comprising the majority of pneumonia cases.
Regarding lifestyle and clinical factors, significant associations were found for alcohol consumption (p = 0.005) and hypertension (p = 0.011). BMI demonstrated a strong association with pneumonia status (p < 0.001), with the highest proportion of pneumonia cases observed in the underweight group (BMI < 20 kg/m²: 32.8%). Interestingly, no significant differences were observed between groups in terms of smoking (p = 0.556), hyperlipidemia (p = 0.056), or diabetes (p = 0.638).
Relationship between HRR, sleep duration, and pneumonia
Table 2 presents the results of the multivariable logistic regression across three models, showing the associations of HRR and sleep duration with pneumonia. The inverse association between HRR and the prevalence of pneumonia was significant in Model 1 and Model 2 but became nonsignificant in the fully adjusted model. A significant positive association was observed between short sleep duration and the prevalence of pneumonia, which remained statistically significant across all models.
Table 2. Logistic regression analysis for the association of HRR and sleep duration with pneumonia risk and trend test.
HRR: hemoglobin-to-red cell distribution width ratio; OR: odds ratio; CI: confidence interval; Ref: reference group; Q1–Q4: quartiles 1–4; Model 1: no covariates adjusted; Model 2: Adjusted for age, sex, and race; Model 3: Adjusted for age, sex, race, educational level, poverty income ratio, smoking, alcohol consumption, body mass index, hypertension, hyperlipidemia, and diabetes.
Dose–response relationship between HRR, sleep duration, and pneumonia risk
This study employed RCS models to evaluate potential nonlinear associations of the HRR and sleep duration with the prevalence of pneumonia. No significant departure from linearity was detected for either HRR (P for nonlinearity = 0.494) or sleep duration (P for nonlinearity = 0.181), consistent with approximately linear relationships. The prevalence of pneumonia increased with decreasing HRR and with shorter sleep duration, as illustrated in Figures 1 and 2.

Flowchart of participant selection.

Nonlinear dose–response association between sleep duration and outcome risk.
Subgroup analysis of HRR, sleep, and pneumonia associations
To examine the potential associations of HRR and sleep duration with pneumonia in the study population, we conducted subgroup analyses and interaction tests for all baseline characteristics. After stratification, interactions for age, sex, race/ethnicity, PIR, BMI, hypertension, hyperlipidemia, and diabetes were not significant (all P for interaction > 0.05). These results indicate that the effects of HRR and sleep duration on pneumonia remained relatively consistent across most subgroups, as shown in Figures 3 and 4.

Subgroup analysis of the association between HRR and outcome risk. HRR: hemoglobin-to-red cell distribution width ratio.

Subgroup analysis of the association between sleep duration and outcome risk.
Machine learning model performance and feature importance
To further explore the complex associations between the HRR, sleep metrics, and pneumonia prevalence, we implemented the XGBoost algorithm on the full MICE-imputed national cohort (N = 39,156). This advanced machine learning approach was employed to specifically account for potential nonlinear relationships and high-dimensional interactions among the sociodemographic and clinical variables. The final hyperparameters are summarized in Table S1.
In the independent test set, the XGBoost model achieved an AUC of 0.619 and an AUPRC of 0.241. Because the expected AUPRC of a non-informative classifier equals the outcome prevalence (reported here as 16.1%), an AUPRC of 0.241 is roughly 1.5 times the no-skill baseline, indicating modest but nontrivial discriminative capacity for identifying high-risk individuals in an imbalanced, real-world dataset.
The feature importance analysis (based on information gain) revealed that sleep duration, age, and HRR were among the primary drivers of the model’s predictive performance. Notably, the interaction term between HRR and sleep duration (HRR_sleep) emerged as a notable contributor, suggesting that the combination of hematological markers and lifestyle factors provides additional information for pneumonia risk stratification (Figures 5 and 6). These results suggest that integrating multidimensional data via machine learning can help identify risk signals in large-scale epidemiological cohorts.

Variable importance ranking in the machine learning model.

SHAP-based interpretation of model predictors. SHAP: Shapley additive explanations.
Discussion
This cross-sectional study examined the relationship of the HRR and sleep duration with the prevalence of pneumonia. We analyzed data from 39,156 participants in the NHANES database between 2011 and 2018. Lower HRR and shorter sleep duration were each associated with a higher prevalence of pneumonia. Notably, sex, age, BMI, PIR, hyperlipidemia, hypertension, and diabetes did not significantly modify these associations. These findings suggest that HRR and sleep duration may serve as important markers for pneumonia prevalence. Sleep duration appears to be an independent, modifiable correlate of pneumonia, whereas HRR may reflect the inflammatory and metabolic environment underlying susceptibility to infection. The attenuation of the HRR association after adjusting for covariates suggests that it may be partially explained by chronic conditions such as diabetes or obesity, which alter red blood cell morphology and oxygen-carrying kinetics.
The lung’s core role in gas exchange requires direct exposure to the ambient air, an environment inherently laden with diverse microbes, including influenza virus, Streptococcus pneumoniae, and Staphylococcus aureus.14,15 Therefore, a dynamic and low-biomass microbial community is harbored in the healthy lung. A sophisticated respiratory barrier system, comprising physical–chemical, microbial, and cellular immune components, safeguards this interface.16,17 However, perturbations such as airway structural remodeling, ciliary dysfunction, or other compromises to these defenses can trigger a shift in the lung microbiome, creating an opportunity for pneumonia to develop. 18 The breach of this barrier prompts the massive release of proinflammatory mediators, including IL-6, type II interferon-γ, monocyte chemoattractant protein-1, and interferon gamma-induced protein 10, which in turn amplifies the inflammatory cascade and significantly heightens the risk of severe infection.
The core structural features of hemoglobin and its pathological alterations in coronavirus disease 2019 (COVID-19)
The core structure of hemoglobin is characterized by a globular tetramer composed of four subunits. Recent studies have demonstrated that in patients with COVID-19, this structure can be disrupted, leading to the release of carbon monoxide, free iron, and various degradation products. Through the formation of carboxyhemoglobin and the generation of toxic reactive oxygen species (ROS), such as O₂•⁻, H₂O₂, and •OH, the normal oxygen-carrying function of hemoglobin is compromised.19 Heme oxygenase-1, a rate-limiting enzyme in the degradation pathway of hemoglobin, has been found to be dysregulated in certain conditions. Deficiencies in its expression have been linked to the modulation of chronic inflammatory processes in humans.20 As another component of the hemoglobin-related response, RDW has been established as a biomarker of physiological dysregulation.21,22 The principal mechanisms involve promoting hematopoietic cell senescence, enhancing oxidative stress, driving inflammatory responses, and enriching apoptotic pathways. A meta-analysis has confirmed that elevated RDW values can predict both the incidence and severity of COVID-19.22 The HRR is a novel biomarker that integrates the clinical significance of both hemoglobin and RDW.23 It has already been associated with various conditions, including depression, stroke, and heart failure. A decreased HRR level often indicates disorders such as anemia and impaired erythrocyte maturation, reflecting underlying problems in pulmonary oxygen supply and transport. Persistent pulmonary inflammation and lung consolidation can exacerbate these issues, which may make HRR a sensitive indicator in the context of lung infections. The weakened association between HRR and pneumonia in the fully adjusted model suggests that its effects may be mediated through metabolic or inflammatory pathways overlapping with pneumonia or chronic comorbidities.
The relationship between sleep and immune status is widely acknowledged, with “getting a good night’s sleep” often regarded as a supportive measure against infections.24 Research by Späth-Schwalbe et al. has shown that increased levels of IL-6 can lead to reduced sleep duration.25 Concurrently, sleep deprivation has been found to impair the proliferative capacity of T and B cells and to diminish their ability to take up viruses, thereby weakening the overall immune response.26
This study has several notable strengths. To the best of our knowledge, it is the first investigation to jointly examine the association of the HRR and sleep duration with pneumonia prevalence. A major strength is the use of MICE on a large, nationally representative cohort, which effectively mitigated selection bias and ensured that our findings are generalizable to the broader US population. Furthermore, we incorporated a comprehensive range of sociodemographic and clinical covariates to control for potential confounding, thereby enhancing the robustness of the observed associations. The integration of advanced statistical techniques, including RCS to explore nonlinearity and XGBoost machine learning to evaluate predictive signals, provides a multi-dimensional perspective on pneumonia risk stratification.
However, several limitations warrant consideration. First, the cross-sectional design of the NHANES database precludes the determination of causal relationships between HRR, sleep duration, and pneumonia. Second, the diagnosis of pneumonia was based on participant self-reports, which may introduce recall bias, although such self-reported data in NHANES have been widely validated in epidemiological research. Regarding the machine learning component, although the XGBoost model demonstrated modest discriminative ability, it effectively captured complex interactions that are often overlooked by linear models. Our feature importance analysis further clarified the significant contributions of sleep duration and HRR to the model’s decision-making process. Future prospective longitudinal studies are needed to validate these findings and further explore the underlying biological mechanisms.
In subsequent research, we aim to further investigate the potential biological mechanisms linking HRR, sleep duration, and pneumonia. Although our fully adjusted logistic models suggested that the direct independent association between HRR and pneumonia prevalence may be attenuated by metabolic and inflammatory comorbidities, XGBoost analysis underscored a significant nonlinear interaction between HRR and sleep patterns. This suggests that HRR may serve as a critical physiological buffer whose predictive value is contingent upon lifestyle factors. Therefore, prospective studies are warranted to clarify the temporal role of HRR in pneumonia development and to determine whether optimizing sleep can mitigate the risks associated with low HRR. Additionally, future investigations should validate these findings across more diverse ethnic and clinical populations. Finally, large-scale longitudinal studies with extended follow-up periods are needed to establish clinically safe exposure thresholds and further elucidate the synergistic interrelationships among these multi-dimensional risk factors.
Conclusion
Our findings indicate that sleep duration was inversely associated with pneumonia prevalence across all models. In Models 1 and 2, HRR also showed a significant inverse relationship with pneumonia prevalence, although this association was no longer significant in the fully adjusted model. These results suggest the potential of both HRR and sleep duration to serve as simple and cost-effective indicators for screening pneumonia risk in community settings.
Supplemental Material
Supplemental material, sj-pdf-1-imr-10.1177_03000605261448082 for Hemoglobin-to-red cell distribution width ratio and sleep duration as predictors of pneumonia: Evidence from NHANES 2011–2018 by Yunhao Jiang and Chen Chen in Journal of International Medical Research
Footnotes
Acknowledgements
We would like to express our gratitude to the participants and the staff of the National Health and Nutrition Examination Survey (NHANES) for their valuable contributions to the data collection and management. We also thank the National Center for Health Statistics (NCHS) for making the data publicly available for research purposes.
Author contributions
Chen Chen: Writing–original draft and Writing–review & editing. Yunhao Jiang: Investigation, Software, and Writing–review & editing.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Data availability statement
Funding
The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References