Explainable residual ensemble modelling for EuroQol-5 dimensions-based quality-of-life assessment and stratification in patients with knee osteoarthritis

Abstract

Objective

To develop an interpretable machine-learning framework for supporting quality-of-life (QoL) assessment and stratification in patients with knee osteoarthritis (OA) by integrating linear and nonlinear modelling strategies.

Methods

This retrospective study utilised de-identified clinical data from 1,102 patients with knee OA collected at a university hospital in South Korea between September 2013 and January 2022 and made available via the AI Hub platform. QoL was assessed using the EuroQol-5 Dimensions (EQ-5D) index and dichotomised at 0.7. A residual ensemble model combining logistic regression (LR) and a Random Forest residual learner was developed and evaluated using stratified train–test split and three-fold cross-validation. Model performance was assessed using accuracy, F₁-score, ROC-AUC, and PR-AUC. Model interpretability was examined using LR coefficients and SHAP analysis.

Results

The proposed model achieved superior performance on the independent test set (accuracy = 0.88, precision = 0.85, recall = 0.83, F₁ = 0.84, ROC-AUC = 0.93, PR-AUC = 0.91), outperforming individual baseline models. Key predictors included functional limitation (WOMAC), pain severity (VAS), and surgical history (TKA). Incorporating interaction features further improved accuracy to 0.89 without compromising interpretability.

Conclusions

The proposed residual ensemble framework balances predictive performance and interpretability, providing a clinically meaningful basis for QoL assessment and risk stratification in knee OA. This approach supports the development of explainable decision-support tools in digital musculoskeletal health.

Keywords

knee osteoarthritis, quality of life (QoL), residual ensemble learning, explainable artificial intelligence (XAI), machine learning in healthcare

Introduction

Knee osteoarthritis (OA) is one of the most prevalent musculoskeletal disorders in older adults and a leading cause of chronic pain, functional limitations, and reduced quality of life (QoL).¹ In South Korea, its prevalence continues to rise with population ageing, affecting more than one-third of adults aged 65 years or older.² Globally, knee OA contributes substantially to disability, healthcare costs, and caregiver burden.³ Patient-reported measures of QoL, such as the EuroQol-5 Dimensions (EQ-5D) and Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), are widely used to assess disease burden and monitor rehabilitation outcomes.^4–6 Given these clinical and socioeconomic challenges, developing reliable and interpretable frameworks for assessing and stratifying QoL status in patients with knee OA remains an important step towards individualised and data-informed rehabilitation planning.

QoL outcomes in patients with knee OA are shaped by interrelated physical, psychological, and social factors, including pain intensity, mobility restriction, obesity, depression, and comorbid conditions.^7–9 These multidimensional influences often interact in nonlinear ways, which traditional statistical models cannot fully capture. Although logistic regression (LR) and other linear models are valued for their calibration stability and coefficient-level interpretability, they are limited in representing threshold effects—such as a sharp decline in QoL beyond a certain pain level—or higher-order interactions between factors, such as body-mass index and functional measures.^10–12

Tree-based ensemble methods, including Random Forests and gradient boosting machines, have shown strong predictive capacity in tabular clinical datasets. Compared with deep-learning architectures, they are computationally efficient, perform robustly with moderate sample sizes, and offer feature-importance measures that facilitate interpretability.^13–15 Deep-learning models provide greater flexibility in capturing complex feature interactions; however, their interpretability remains limited, and their computational demands may restrict clinical applicability.^16,17 In QoL assessment and stratification, both accuracy and transparency are critical; clinicians and patients must be able to understand the rationale behind model outputs to trust and act upon them.

Recent advances in explainable artificial intelligence, such as Shapley Additive Explanations (SHAP), have improved the interpretability of complex models. However, most knee OA-related applications have focused on diagnosis or disease progression rather than patient-reported outcomes.^18–20 To address this gap, the present study proposes a residual ensemble framework that integrates the interpretability of a logistic-regression baseline with the nonlinear adaptability of a Random Forest residual learner. Using real-world hospital data from 1,102 patients with knee OA, this study pursued three objectives: (1) to develop an interpretable framework for integrating multidimensional clinical and patient-reported information associated with QoL status; (2) to evaluate a residual logistic–forest ensemble; and (3) to interpret both linear and nonlinear contributions using explainable AI techniques.

By extending model evaluation from prediction to interpretation, this study seeks to provide practical insights into computer-aided QoL management and digital decision support for musculoskeletal rehabilitation. This integrative framework aims to contribute to the development of explainable, data-driven tools that enhance personalised care and support the broader goals of digital health transformation.

Methods

Study population and dataset

This retrospective observational study used de-identified clinical data collected at Jeju University Hospital in South Korea between September 2013 and January 2022 and subsequently released through the AI Hub platform as part of a government-supported AI development initiative. The dataset comprised 1,102 patients diagnosed with knee osteoarthritis (OA), confirmed by ICD-10 codes and radiographic evidence. Access was granted through the AI Hub platform following a formal application and approval process, and the data were used in accordance with the platform’s data usage policies.

The dataset contained 29 clinical variables, including demographic characteristics, radiographic severity, surgical history, pain and function scales, physical performance tests, joint range of motion, lower-limb strength measures, and stair-climbing performance (Table 1). No missing values were present, and all variables were numerical; therefore, no categorical encoding or imputation was required.

Table 1.

Description of input features used for model development.

No.	Feature name	Description
1	Age (years)	Patient’s age in years
2	Sex (male/female)	Patient’s biological sex
3	Height (cm)	Patient’s height in centimetres
4	Weight (kg)	Patient’s weight in kilograms
5	Body Mass Index (kg/m²)	Ratio of weight to height squared, an indicator of body fat
6	Left Knee Kellgren–Lawrence Grade	Radiographic osteoarthritis grade of the left knee
7	Right Knee Kellgren–Lawrence Grade	Radiographic osteoarthritis grade of the right knee
8	History of Total Knee Arthroplasty (TKA)	History of total knee replacement surgery
9	TKA Surgical Site	Side of the knee on which TKA was performed
10	WOMAC Function	Physical function subscore of the WOMAC
11	WOMAC Pain	Pain subscore of the WOMAC
12	WOMAC Stiffness	Stiffness subscore of the WOMAC
13	Visual Analogue Scale	Pain assessment scale ranging from no pain to worst imaginable pain
14	Timed Up and Go Test (s)	Time taken to stand up, walk, and sit down
15	6-Minute Walk Test (m)	Distance walked in 6 min
16	Left Knee Flexion Range of Motion (°)	Maximum degree to which the patient can flex the left knee
17	Left Knee Extension Range of Motion (°)	Maximum degree to which the patient can extend the left knee
18	Right Knee Flexion Range of Motion (°)	Maximum degree to which the patient can flex the right knee
19	Right Knee Extension Range of Motion (°)	Maximum degree to which the patient can extend the right knee
20	Left Knee Flexion Force (% body weight)	Strength of left knee flexor muscles
21	Left Knee Flexion Torque (N·m)	Torque generated during left knee flexion
22	Left Knee Extension Force (% body weight)	Strength of left knee extensor muscles
23	Left Knee Extension Torque (N·m)	Torque generated during left knee extension
24	Right Knee Flexion Force (% body weight)	Strength of right knee flexor muscles
25	Right Knee Flexion Torque (N·m)	Torque generated during right knee flexion
26	Right Knee Extension Force (% body weight)	Strength of right knee extensor muscles
27	Right Knee Extension Torque (N·m)	Torque generated during right knee extension
28	Stair Climb Test (SCT) Ascend (s)	Time taken to ascend stairs during the SCT
29	SCT Descend (s)	Time taken to descend stairs during the SCT

WOMAC, Western Ontario and McMaster Universities Osteoarthritis Index.

The target variable was the EQ-5D index score, a validated measure of health-related QoL developed by the EuroQol Group. The EQ-5D comprises five domains—mobility, self-care, usual activities, pain/discomfort, and anxiety/depression—each assessed on a three-point scale. Responses were converted into a single index ranging from zero (worst imaginable health state) to one (perfect health state). For modelling, the EQ-5D index was binarised using a threshold of 0.7, consistent with prior work identifying EQ-5D utility values below 0.7 as indicative of treatment failure in OA.²¹ Scores below 0.7 were assigned to Class 0 (lower QoL), and scores ≥ 0.7 to Class 1 (higher QoL). This resulted in an imbalanced distribution, with 705 patients (64.0%) in Class 0 and 397 patients (36.0%) in Class 1. During model training, class weights were applied to mitigate imbalance; no artificial resampling or data augmentation techniques were used.
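The labelling rule and the effect of class weighting can be illustrated with a minimal sketch. The example scores are hypothetical, and the weighting shown is the common "balanced" heuristic (n / (number of classes × class count)); the text does not state which weighting formula each model used, so this is an assumption.

```python
def binarise_eq5d(index_scores, threshold=0.7):
    """Class 1 (higher QoL) if score >= threshold, otherwise Class 0 (lower QoL)."""
    return [1 if s >= threshold else 0 for s in index_scores]


def balanced_class_weights(labels):
    """'Balanced' heuristic: weight = n / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


# Hypothetical EQ-5D index values, for illustration only.
example_scores = [0.35, 0.69, 0.70, 0.82, 1.0]
example_labels = binarise_eq5d(example_scores)

# Class counts reported in the text: 705 in Class 0 and 397 in Class 1.
reported_labels = [0] * 705 + [1] * 397
weights = balanced_class_weights(reported_labels)

# Ratio of majority to minority counts, a common choice for weighting
# the minority class in boosting libraries.
neg_pos_ratio = 705 / 397

print(example_labels)
print({c: round(w, 3) for c, w in weights.items()})
print(round(neg_pos_ratio, 2))
```

With the reported counts, the minority (higher-QoL) class receives a weight of roughly 1.39 and the majority class roughly 0.78. The count ratio of about 1.78 matches the scale_pos_weight value listed for XGBoost in Table 4.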

The Institutional Review Board (IRB) of Korea University Anam Hospital approved the study protocol (IRB No. 2022AN0110). All data were fully anonymised before release to the AI Hub, and the requirement for informed consent was waived by the IRB in accordance with national ethical and legal guidelines.

Modelling framework

A stepwise modelling framework was designed to balance interpretability and predictive performance, as illustrated in Figure 1. LR was employed as the linear baseline to capture direct and interpretable relations between clinical features and QoL outcomes. LR was chosen for its statistical efficiency, calibration stability, and coefficient-level interpretability, enabling clinically meaningful associations between predictors and outcomes to be readily identified. All input features were standardised using StandardScaler prior to model training. The scaler was fitted on the training data and applied consistently to the validation and test sets within a pipeline framework to prevent data leakage.
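A leakage-safe version of this baseline might look like the sketch below, which assumes scikit-learn and uses synthetic data in place of the clinical table. The step names and the synthetic dataset are illustrative; the LR settings follow Table 4. Because the scaler sits inside the pipeline, it is refitted on the training portion of every cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 29-feature clinical table (not the study data).
X, y = make_classification(
    n_samples=300, n_features=29, n_informative=8,
    weights=[0.64, 0.36], random_state=0,
)

# Scaler and classifier are chained so that scaling statistics are
# estimated only from the data the classifier is fitted on.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000)),
])

# Stratified three-fold cross-validation, scored with F1 on the positive class.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
f1_scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")

print(f1_scores.mean(), f1_scores.std())
```

Scaling the full dataset before splitting would let test-set means and variances influence training, which is the leakage the pipeline design avoids.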

Figure 1.

Stepwise workflow of the proposed QoL assessment and stratification framework.

To address limitations of the linear specification, several nonlinear learners with strong predictive capacity for tabular clinical data were considered, including Random Forest, XGBoost, LightGBM, CatBoost, and TabNet.

All these standalone nonlinear models were trained using the original feature set for comparative evaluation. In contrast, for the residual ensemble model, the nonlinear learner was trained on an extended feature set that included both the original features and the engineered interaction terms.

Each model was independently optimised and evaluated using three-fold cross-validation on the training data. Based on overall performance and stability, the best-performing model was selected for residual learning.

After fitting the LR baseline, predicted probabilities were obtained for each sample, and residuals were defined as the difference between the observed binary labels and the LR-predicted probabilities (residual = y − p_LR). The selected nonlinear model was then trained to predict these residuals, thereby learning a nonlinear correction term for the LR baseline rather than directly performing classification.

For the residual ensemble model, feature engineering was performed independently of residual modelling. Guided by both the LR coefficient analysis and clinical plausibility, three influential variables—WOMAC function, VAS pain score, and TKA surgical site—were selected, and pairwise interaction terms were constructed. These interaction features were appended to the original feature set to enhance the model’s ability to capture higher-order relationships.
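One plausible way to build such terms is sketched below. The text does not specify the functional form of the interactions, so the use of simple products is an assumption, and the column names and patient values are hypothetical.

```python
from itertools import combinations


def add_pairwise_interactions(row, names):
    """Return a copy of `row` with a product term for every pair in `names`."""
    extended = dict(row)
    for a, b in combinations(names, 2):
        extended[f"{a}_x_{b}"] = row[a] * row[b]
    return extended


# Hypothetical column names for the three selected variables.
selected = ["womac_function", "vas", "tka_site"]

# Hypothetical patient record, for illustration only.
patient = {"womac_function": 40.0, "vas": 6.0, "tka_site": 1.0, "age": 71.0}

extended = add_pairwise_interactions(patient, selected)
print(sorted(extended))
```

Three selected variables give three pairwise terms, so a 29-feature table would grow to 32 columns under this construction.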

Although some degree of collinearity may exist among clinical variables, particularly those related to pain and functional status, the use of L1-regularised logistic regression helps mitigate its impact by promoting sparse coefficient estimates. Furthermore, the tree-based residual learner is less sensitive to collinearity, as it relies on hierarchical feature partitioning rather than linear assumptions.

The final prediction was computed using a residual correction scheme:

p_final = clip(p_LR + η · r̂, 0, 1),

where p_LR denotes the probability predicted by the logistic regression model, r̂ represents the residual predicted by the nonlinear model, and η is a scaling factor. In this study, η was set to 0.1 to ensure stable and conservative correction of the LR predictions. This formulation enables the nonlinear model to act as a correction term that adjusts the LR baseline prediction, rather than serving as an independent classifier.

Finally, the proposed residual ensemble operates through the residual correction scheme described above, rather than through conventional averaging or stacking of model outputs.
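The residual correction scheme can be sketched end to end as follows, assuming scikit-learn and synthetic data. Several details are assumptions rather than statements of the study's implementation: the residual learner is written as a RandomForestRegressor, the residuals are computed in-sample on the training set, and interaction features are omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical table (not the study data).
X, y = make_classification(
    n_samples=400, n_features=29, n_informative=8,
    weights=[0.64, 0.36], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 1: fit the linear baseline (settings taken from Table 4).
lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000),
)
lr.fit(X_tr, y_tr)
p_lr_tr = lr.predict_proba(X_tr)[:, 1]

# Step 2: residual = y - p_LR on the training set.
residual_tr = y_tr - p_lr_tr

# Step 3: a forest regresses the residuals (assumed regressor variant).
rf = RandomForestRegressor(n_estimators=200, min_samples_split=5, random_state=0)
rf.fit(X_tr, residual_tr)

# Step 4: p_final = clip(p_LR + eta * r_hat, 0, 1) with eta = 0.1.
eta = 0.1
p_lr_te = lr.predict_proba(X_te)[:, 1]
p_final = np.clip(p_lr_te + eta * rf.predict(X_te), 0.0, 1.0)

print(float(np.max(np.abs(p_final - p_lr_te))))
```

Because the residuals lie between −1 and 1, the correction can move any LR probability by at most η = 0.1, which is what keeps the adjustment conservative.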

Model training and evaluation protocol

The dataset was randomly partitioned into a training set (80%) and a test set (20%), with stratification to preserve class distribution. During training, three-fold cross-validation was applied across the LR baseline and all nonlinear learners to mitigate overfitting and identify the best-performing nonlinear model for residual integration. Model performance was primarily assessed on the independent test set using the following standard classification metrics: accuracy, precision, recall, F₁-score, area under the receiver operating characteristic curve (ROC-AUC), and area under the precision-recall curve (PR-AUC). To address class imbalance in the target variable, class-weight adjustments were applied during training to penalise misclassification of the minority (high-QoL) class. In this study, the F₁-score was adopted as the primary selection criterion because it balances precision and recall, thereby reflecting the clinical priority of accurately identifying patients at risk of poor QoL, whereas accuracy alone could be misleading under class imbalance.
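As a small worked check of the selection metric, the F₁-score is the harmonic mean of precision and recall. The sketch below recomputes it for three rows of the test-set results in Table 6; because the published precision and recall values are rounded, agreement to two decimals is not guaranteed for every row.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from Table 6.
reported = {
    "LR": (0.80, 0.83, 0.81),
    "RandomForest": (0.70, 0.91, 0.79),
    "Proposed model": (0.85, 0.83, 0.84),
}

recomputed = {
    name: round(f1_from(p, r), 2) for name, (p, r, _) in reported.items()
}
print(recomputed)
```

For the Random Forest, the high recall (0.91) is pulled down by low precision (0.70), which is why its F₁ trails the proposed model despite the better recall.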

Interpretability and clinical translation

To enhance interpretability, a dual-layer explanation strategy aligned with the framework’s hybrid design was employed. First, the LR baseline provided coefficient estimates that could be directly translated into odds ratios, thereby offering global and clinically intuitive insights into risk and protective factors influencing QoL in patients with knee OA. These estimates highlighted transparent associations among demographic, functional, and biomechanical variables and the likelihood of reduced QoL.

Second, for the nonlinear residual component, a SHAP-based analysis was applied to quantify and visualise feature contributions to the residual corrections. This approach enabled the identification of complex interactions and higher-order effects that could not be captured by the linear model alone. Together, these complementary interpretive layers—global coefficients from the linear baseline and localised SHAP explanations from the residual learner—provide both trustworthy and actionable insights, supporting clinical decision-making in QoL management.
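The idea behind the SHAP attributions can be illustrated with a brute-force computation of exact Shapley values for a toy function. This sketch does not use the shap package and is not the study's analysis; the toy residual function, feature order, and baseline are hypothetical. It shows two properties that make the attributions useful: they sum to the difference between the prediction and the baseline output, and an interaction effect is shared between the interacting features.

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to baseline."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi


# Hypothetical residual correction with a pain x function interaction.
def toy_residual(z):
    pain, function, walk = z
    return -0.02 * pain * function + 0.01 * walk


x = [2.0, 3.0, 5.0]          # hypothetical patient: pain, function, walk
baseline = [0.0, 0.0, 0.0]   # hypothetical reference point
phi = exact_shapley(toy_residual, x, baseline)
print([round(v, 4) for v in phi])
```

The brute-force enumeration grows exponentially with the number of features, which is why practical tools rely on model-specific algorithms for tree ensembles.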

Statistical analysis

Descriptive statistics were used to summarise the class distribution of the study population. The positive class was defined as Class 1 (high QoL) for the calculation of class-specific performance metrics. Accordingly, precision, recall, F₁-score, and PR-AUC were computed with respect to the high-QoL class.

Cross-validation results are presented as mean ± standard deviation across folds. Odds ratios were calculated as exp(β) from logistic regression coefficients.

In addition to discrimination metrics, calibration performance was assessed using Brier score and log loss to evaluate the agreement between predicted probabilities and observed outcomes. Calibration curves were visually inspected on the independent test set.
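Both calibration metrics have simple definitions, shown here as a minimal sketch on hypothetical labels and probabilities. The Brier score is the mean squared difference between predicted probability and outcome, and log loss is the mean negative log-likelihood; lower is better for both.

```python
import math


def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical outcomes and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.4]

print(round(brier_score(y_true, p_pred), 4))
print(round(log_loss(y_true, p_pred), 4))
```

Log loss penalises confident errors more heavily than the Brier score, so reporting both gives a fuller picture of probability quality.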

Decision-curve analysis (DCA) was performed to evaluate the potential clinical utility of the proposed models by estimating net benefit across a range of threshold probabilities. No formal hypothesis testing was conducted, as the primary objective of this study was predictive modelling and model evaluation rather than inferential statistical comparison.
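The net benefit quantity that DCA plots can be sketched as follows, using the standard definition NB = TP/n − (FP/n) × p_t/(1 − p_t) at threshold probability p_t. The data are hypothetical, and the study's own DCA implementation is not described in the text, so this is an illustration of the general method only.

```python
def net_benefit(y_true, p_pred, threshold):
    """Net benefit of acting on everyone with predicted probability >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: act on every patient regardless of the model."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0, 1, 0, 0, 0]
p_pred = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.1, 0.05]

nb_model = net_benefit(y_true, p_pred, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(nb_model, nb_all)
```

A decision curve repeats this calculation over a grid of thresholds and compares the model against the treat-all and treat-none (net benefit of zero) strategies.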

Results

Baseline performance: LR

As shown in Table 2, the LR baseline model achieved consistent and well-calibrated performance in predicting QoL outcomes in patients with knee OA. Across the three-fold cross-validation, LR yielded an average accuracy of 0.85 (± 0.01) and F₁-score of 0.79 (± 0.03), reflecting balanced precision (0.84 ± 0.03) and recall (0.74 ± 0.05). Discriminative ability was high, with a ROC-AUC of 0.91 (± 0.02) and PR-AUC of 0.88 (± 0.01), indicating robust classification performance even under class imbalance. These findings support the suitability of LR as a clinically interpretable baseline model that provides both competitive performance and coefficient-level transparency.

Table 2.

Model comparison using three-fold cross-validation on the training dataset.

Models	Accuracy	Precision	Recall	F₁ score	ROC-AUC	PR-AUC
LR (base)	0.85 (0.01)	0.84 (0.03)	0.74 (0.05)	0.79 (0.03)	0.91 (0.02)	0.88 (0.01)

ROC-AUC, area under the receiver operating characteristic curve; PR-AUC, area under the precision-recall curve; LR, logistic regression.

Performance of nonlinear models

Table 3 summarises the predictive performances of the nonlinear learners evaluated using three-fold cross-validation. All models achieved competitive discrimination, with ROC-AUC values ranging from 0.91 to 0.93 and PR-AUC values from 0.87 to 0.90, indicating stable classification performance under class imbalance. F₁-scores (0.78–0.80) reflected balanced trade-offs between precision and recall.

Table 3.

Model comparison using three-fold cross-validation on the training dataset. All models were trained using the original feature set.

Models	Accuracy	Precision	Recall	F₁ score	ROC-AUC	PR-AUC
RandomForest	0.86 (0.01)	0.79 (0.01)	0.81 (0.06)	0.80 (0.02)	0.93 (0.02)	0.90 (0.01)
XGBoost	0.85 (0.01)	0.82 (0.03)	0.75 (0.06)	0.78 (0.02)	0.92 (0.01)	0.89 (0.01)
LightGBM	0.86 (0.01)	0.81 (0.03)	0.79 (0.05)	0.80 (0.01)	0.92 (0.01)	0.88 (0.01)
CatBoost	0.85 (0.02)	0.78 (0.01)	0.82 (0.06)	0.80 (0.03)	0.92 (0.01)	0.90 (0.02)
TabNet	0.84 (0.01)	0.76 (0.04)	0.83 (0.01)	0.79 (0.01)	0.91 (0.01)	0.87 (0.02)

ROC-AUC, area under the receiver operating characteristic curve; PR-AUC, area under the precision-recall curve.

Among the candidates, the Random Forest demonstrated the most favourable balance between performance and stability. It achieved the highest ROC-AUC (0.93 ± 0.02) and PR-AUC (0.90 ± 0.01) while maintaining consistent F₁ performance (0.80 ± 0.02) with relatively low variance across folds. By contrast, gradient boosting variants (XGBoost, LightGBM, and CatBoost) achieved comparable average metrics but exhibited slightly greater variability in recall, whereas TabNet, despite strong recall (0.83 ± 0.01), demonstrated lower precision (0.76 ± 0.04) and overall discrimination.

Based on its combination of robust discrimination, low variance, and interpretability through straightforward feature-importance measures, the Random Forest was selected as the nonlinear learner for integration into the residual ensemble with the LR baseline. This selection ensured that the ensemble captured nonlinear interactions while preserving stability and interpretability for clinical translation.

Hyperparameter optimisation

The hyperparameters for each predictive model were optimised using a stratified three-fold cross-validation strategy. Table 4 summarises the best-performing hyperparameter combinations identified for each model.

Table 4.

Optimised hyperparameters obtained for each predictive model.

Models	Hyperparameters
LR (base)	logreg_C: 0.1, logreg_class_weight: None, logreg_penalty: ‘l1’, logreg_solver: ‘saga’
RandomForest	class_weight: ‘balanced’, max_depth: None, min_samples_split: 5, n_estimators: 200
XGBoost	learning_rate: 0.2, max_depth: 8, n_estimators: 500, scale_pos_weight: 1.78
LightGBM	class_weight: None, learning_rate: 0.2, max_depth: -1, n_estimators: 100, num_leaves: 31
CatBoost	depth: 7, iterations: 200, learning_rate: 0.1, auto_class_weights: ‘balanced’
TabNet	n_d: 16, n_steps: 4, gamma: 1.0, lambda_sparse: 0.001, lr: 0.01

LR, Logistic regression.

Proposed ensemble model performance

The proposed model was evaluated using both the original features and engineered interaction features within the residual learning framework.

As summarised in Table 5, the proposed residual ensemble model, which integrates LR with Random Forest residual corrections, achieved an average accuracy of 0.85 (±0.01) and F₁-score of 0.79 (±0.03) across three-fold cross-validation. The model demonstrated high discrimination, with a ROC-AUC of 0.91 (±0.02) and PR-AUC of 0.88 (±0.02), indicating stable predictive performance comparable with the best-performing individual learners.

Table 5.

Model comparison using three-fold cross-validation on the training dataset. The proposed model was trained using both the original features and engineered interaction features within the residual learning framework.

Models	Accuracy	Precision	Recall	F₁ score	ROC-AUC	PR-AUC
Proposed model	0.85 (0.01)	0.84 (0.03)	0.74 (0.05)	0.79 (0.03)	0.91 (0.02)	0.88 (0.02)

ROC-AUC, area under the receiver operating characteristic curve; PR-AUC, area under the precision-recall curve.

To construct the ensemble, each component model was trained using previously optimised hyperparameters. This design choice ensured computational efficiency and reproducibility while avoiding the prohibitive cost of re-optimisation in a combined framework. By leveraging the best configurations from standalone model evaluations, the ensemble preserved the strengths of its constituents without incurring unnecessary overhead. Thus, the ensemble offers a balanced compromise among predictive accuracy, training efficiency, and interpretability, highlighting its potential for practical use in clinical QoL assessment and stratification.

Final evaluation on the independent test dataset

Table 6 presents the comparative performances of all models on the independent test dataset. Despite its simplicity, the LR achieved strong discrimination (ROC-AUC = 0.93) and a balanced profile of precision (0.80) and recall (0.83), underscoring its robustness as a clinically interpretable baseline. Among nonlinear learners, Random Forest achieved the highest recall (0.91), but at the expense of precision (0.70), whereas boosting-based models (XGBoost, LightGBM, and CatBoost) yielded moderate and balanced performance. TabNet showed competitive recall (0.85) but a comparatively lower ROC-AUC (0.89).

Table 6.

Model comparison on the test dataset. The proposed model was trained using both the original features and engineered interaction features within the residual learning framework.

Models	Accuracy	Precision	Recall	F₁ score	ROC-AUC	PR-AUC
LR	0.86	0.80	0.83	0.81	0.93	0.90
RandomForest	0.82	0.70	0.91	0.79	0.92	0.89
XGBoost	0.83	0.79	0.74	0.76	0.91	0.87
LightGBM	0.82	0.75	0.76	0.76	0.91	0.88
CatBoost	0.85	0.75	0.86	0.80	0.92	0.89
TabNet	0.84	0.74	0.85	0.79	0.89	0.85
Proposed model	0.88	0.85	0.83	0.84	0.93	0.91
Proposed model^*	0.89	0.87	0.81	0.84	0.93	0.91

ROC-AUC, area under the receiver operating characteristic curve; PR-AUC, area under the precision-recall curve; LR, logistic regression.

*Retrained with explicit interaction features.

The proposed residual ensemble model outperformed the individual models, achieving the highest accuracy (0.88) and a strongly balanced performance across all metrics (precision = 0.85, recall = 0.83, F₁ = 0.84, ROC-AUC = 0.93, PR-AUC = 0.91).

To evaluate the clinical reliability and practical utility of the proposed model, calibration and decision-curve analyses were conducted on the independent test dataset.

Calibration curves (Figure 2) showed that both the LR baseline and the proposed residual ensemble model produced reasonably well-aligned probability estimates relative to observed event rates. The proposed model achieved a slightly lower Brier score (0.0987 vs 0.0995) and log loss (0.3198 vs 0.3220) compared with the LR baseline, indicating marginal improvement in probability estimation while maintaining similar overall calibration characteristics.

Figure 2.

Calibration curves on the independent test set for the logistic regression (LR) baseline and the proposed residual ensemble model.

DCA (Figure 3) demonstrated that the proposed model provided comparable net benefit to the LR baseline across a wide range of threshold probabilities. Importantly, the proposed model consistently performed at least as well as the LR baseline, without evidence of reduced clinical utility. These findings suggest that the additional modelling complexity introduced by the residual ensemble does not compromise decision-making performance, while offering enhanced modelling flexibility.

Figure 3.

Decision-curve analysis (DCA) on the independent test set comparing the logistic regression (LR) baseline and the proposed residual ensemble model.

Furthermore, inspection of the LR coefficients (Table 7) highlighted WOMAC function, VAS pain score, and total knee arthroplasty (TKA) surgical site as the strongest predictors of QoL, with large negative associations indicating that high functional impairment, pain, and surgical history were strongly linked to low QoL. Motivated by these findings, we explicitly modelled the pairwise interactions among these variables in the ensemble framework, reasoning that their combined effect may exert nonlinear influences beyond their individual contributions. When retrained with these interaction features (Proposed model^*), the model reached an accuracy of 0.89 and a precision of 0.87, with a recall of 0.81 and unchanged discrimination (ROC-AUC = 0.93, PR-AUC = 0.91).

Table 7.

Logistic regression coefficients (standardised) and odds ratios (per +1 standard deviation) for the top 10 predictors of quality of life.

Rank	Feature	Coefficient	Odds ratio
1	WOMAC Function	-0.93	0.40
2	VAS	-0.85	0.43
3	TKA surgical site	-0.48	0.62
4	WOMAC Pain	-0.41	0.66
5	TUG	-0.24	0.79
6	6MWT	0.19	1.20
7	Rt. Knee Flex. ROM	0.10	1.11
8	WOMAC Stiffness	0.10	1.10
9	Age	0.09	1.10
10	Rt. Knee Ext. Force	-0.04	0.96

WOMAC, Western Ontario and McMaster Universities Osteoarthritis Index; TKA, Total Knee Arthroplasty; ROM, range of motion.
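The odds ratios in Table 7 follow from OR = exp(β). The sketch below recomputes them from the printed coefficients for the first six rows. Small differences in the second decimal are expected, presumably because the printed coefficients are themselves rounded.

```python
import math

# (feature, standardised coefficient, reported odds ratio) from Table 7.
table7 = [
    ("WOMAC Function", -0.93, 0.40),
    ("VAS", -0.85, 0.43),
    ("TKA surgical site", -0.48, 0.62),
    ("WOMAC Pain", -0.41, 0.66),
    ("TUG", -0.24, 0.79),
    ("6MWT", 0.19, 1.20),
]

for name, beta, reported_or in table7:
    recomputed_or = math.exp(beta)
    print(f"{name}: exp({beta}) = {recomputed_or:.3f} (reported {reported_or})")

# Largest absolute gap between recomputed and reported odds ratios.
max_gap = max(abs(math.exp(beta) - reported_or) for _, beta, reported_or in table7)
print(round(max_gap, 4))
```

Read this way, an odds ratio of about 0.40 for WOMAC function means that a one standard deviation increase in functional impairment is associated with roughly 60% lower odds of belonging to the higher-QoL class, holding the other standardised predictors fixed.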

Figure 4 illustrates the confusion matrix and ROC curve of Proposed model^*. The confusion matrix showed that the model achieved a balanced classification with high true-positive and true-negative rates, indicating reliable discrimination between patients with lower and higher QoL. The ROC curve further confirmed the strong discriminative ability, with an AUC of 0.93, which is consistent with the cross-validation and test results reported in Tables 5 and 6.

Figure 4.

Final model performance evaluation.

SHAP-based explainability of proposed model

Figure 5 summarises the SHAP-based interpretation of the proposed ensemble model. WOMAC function, VAS score, TKA surgical site, WOMAC pain, and TUG score were the most influential predictors of QoL, consistent with the LR analysis. The SHAP analysis further illustrates the direction of feature contributions at the individual patient level: higher WOMAC function and VAS scores (indicating greater disability and pain) were strongly associated with low predicted QoL (negative SHAP values), whereas higher 6MWT and knee range of motion values were associated with high predicted QoL (positive SHAP values). Notably, the TKA surgical site also emerged as an important nonlinear modifier, reinforcing the importance of considering surgical history in QoL assessment and stratification.

Figure 5.

Global feature importance based on Shapley Additive Explanations (SHAP) values.

Discussion

This study proposes an interpretable residual ensemble framework designed to support QoL assessment and stratification in patients with knee OA by integrating multiple clinically relevant dimensions, including pain, functional limitation, and surgical history. Rather than identifying novel determinants of QoL, the primary contribution of this work lies in providing a transparent modelling framework capable of capturing both linear and nonlinear relationships among established clinical factors.

While key variables such as pain and functional limitation are well established, the proposed approach focuses on modelling their interactions and higher-order effects that are not adequately captured by conventional regression models. This allows the model to extend, rather than replace, traditional statistical approaches by providing additional explanatory power while preserving clinical interpretability.

By combining LR with Random Forest residual corrections, the framework achieved a balanced and robust predictive performance (accuracy = 0.88, ROC-AUC = 0.93, PR-AUC = 0.91) while maintaining coefficient-level interpretability.

Building on the coefficient analysis of the baseline model, clinically plausible interaction features, particularly among functional limitations, pain, and surgical history, were introduced to yield a refined version of the ensemble. This enhanced model achieved the highest accuracy (0.89) and precision (0.87) with unchanged discrimination (ROC-AUC = 0.93, PR-AUC = 0.91) and a slightly lower recall (0.81 vs 0.83), suggesting that data-driven refinement guided by clinical interpretability can improve model performance. Collectively, these results suggest that integrating linear reasoning with nonlinear correction provides a practical and explainable framework for individualised QoL assessment and risk stratification in knee OA.

Previous studies on knee OA have primarily addressed disease diagnosis, radiographic grading, and surgical outcome prediction, with relatively limited attention paid to patient-reported QoL outcomes.²² Moreover, most prior machine learning applications have relied on single-model approaches, such as LR, random forest, or gradient boosting, which tend to prioritise either interpretability or predictive flexibility.²³ This study extends this line of research by demonstrating that a hybrid residual ensemble can reconcile this trade-off. By coupling a transparent logistic baseline with a nonlinear residual correction, the framework captured threshold and interaction effects, such as the compounded influence of pain severity and functional limitation, which linear models alone cannot represent. Simultaneously, it maintained direct interpretability through coefficient estimates and SHAP-based explanations, addressing the common limitations of opaque deep or ensemble learning methods. This integrative design advances prior QoL modelling efforts by providing strong discrimination together with transparent and clinically interpretable decision support, supporting its potential use in data-driven rehabilitation and personalised digital care.

These findings suggest that a hybrid approach may assist clinicians in understanding and managing patient-reported QoL outcomes in patients with knee OA. LR analysis identified pain, functional limitations, and surgical history as key correlates of reduced QoL, whereas the residual component captured additional nonlinear interactions that may reflect complex recovery dynamics.²⁴ From a clinical perspective, this structure can help practitioners stratify patients according to the risk of QoL deterioration and tailor rehabilitation strategies accordingly.²⁵ For instance, patients exhibiting high levels of pain and poor functional scores could be prioritised for early and intensive intervention, whereas those with favourable function may benefit from maintenance-oriented programmes.²⁶ Furthermore, SHAP-based feature attributions provide a transparent visualisation of individual-level influences, supporting shared decision-making and enhancing patient engagement with their rehabilitation plans.²⁷

However, these findings should be interpreted with caution. Several input variables, such as WOMAC scores and functional measures, are closely related to the QoL outcome itself. While these variables reflect clinically relevant dimensions that contribute to QoL, their strong association with the outcome may limit the extent to which the model provides independent predictive information. Therefore, the proposed framework is best viewed as a means of integrating multiple QoL-related domains into an interpretable assessment and stratification tool rather than as an independent predictor of QoL.
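One simple way to gauge how closely a predictor tracks the outcome is to compute their correlation before modelling. The sketch below does this with a plain Pearson coefficient; the values are made up for illustration and are not drawn from the study cohort.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up illustrative values (NOT study data): higher WOMAC, lower EQ-5D index.
womac = [10, 25, 40, 55, 70, 85]
eq5d = [0.92, 0.85, 0.74, 0.66, 0.52, 0.41]
r = pearson_r(womac, eq5d)
print(f"r = {r:.3f}")
```

A coefficient close to -1 or 1 would indicate that the predictor and the outcome largely measure the same construct, which is the situation this paragraph cautions about.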

Future studies incorporating longitudinal data or less directly related predictors may provide additional insight into causal relationships and improve clinical applicability. Despite these limitations, the findings illustrate how interpretable ensemble modelling can complement existing clinical assessment tools in a data-driven yet transparent manner.

From a methodological standpoint, the proposed residual ensemble provides an interpretable pathway for enhancing predictive accuracy without compromising transparency. The linear baseline offers global interpretability through coefficient estimates and odds ratios familiar to clinicians and is aligned with traditional statistical reasoning.²⁸,²⁹ By contrast, the Random Forest residual learner captures localised nonlinear corrections, modelling higher-order interactions or threshold effects that the linear specification cannot represent. The combination of these paradigms helps bridge the gap between statistical inference and data-driven representation learning. SHAP-based analyses further enrich interpretability by decomposing predictions into feature-level attributions and clarifying the clinical variables that most strongly influence individual outcomes.³⁰,³¹ Notably, the nonlinear residual corrections largely reinforced the relationships identified in the logistic baseline, such as the interplay between pain and function, suggesting that the residual layer added nuanced flexibility rather than unnecessary complexity. This layered transparency is particularly relevant in healthcare contexts, where trust, reproducibility, and interpretability are prerequisites for adoption.³²
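The layered design described above can be sketched with scikit-learn as a logistic baseline followed by a Random Forest fitted to the baseline's probability residuals. This is an illustrative reconstruction under stated assumptions: the residual definition (label minus predicted probability), the additive combination with clipping, the hyperparameters, and the synthetic data are all assumptions, since the exact formulation used in the study is not restated in this section.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_residual_ensemble(X, y, seed=0):
    """Fit a logistic baseline, then a forest on its probability residuals."""
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    residual = y - lr.predict_proba(X)[:, 1]  # what the linear model missed
    rf = RandomForestRegressor(
        n_estimators=200, max_depth=3, random_state=seed  # assumed settings
    ).fit(X, residual)
    return lr, rf


def predict_proba_ensemble(lr, rf, X):
    """Baseline probability plus residual correction, clipped to [0, 1]."""
    return np.clip(lr.predict_proba(X)[:, 1] + rf.predict(X), 0.0, 1.0)


# Synthetic stand-in data; this is NOT the study cohort.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

lr, rf = fit_residual_ensemble(X_tr, y_tr)
p_base = lr.predict_proba(X_te)[:, 1]
p_ens = predict_proba_ensemble(lr, rf, X_te)

# Brier score (mean squared error of probabilities) on the held-out split.
print("Brier baseline:", np.mean((p_base - y_te) ** 2))
print("Brier ensemble:", np.mean((p_ens - y_te) ** 2))
```

Because the forest is fitted to in-sample residuals in this sketch, it can overfit; computing the residuals from out-of-fold baseline predictions is a common alternative. The logistic coefficients in `lr.coef_` remain directly readable as log-odds, which is what preserves the global interpretability of the baseline.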

Although the performance improvement over the logistic regression baseline was modest, particularly in terms of ROC-AUC, the proposed framework offers additional value by enabling the modelling of nonlinear interactions while preserving interpretability. Given the already strong discriminative performance of the baseline logistic regression model, substantial gains in conventional performance metrics were inherently difficult to achieve. Importantly, the objective of the proposed framework was not to maximise predictive accuracy through increasingly complex architectures, but to extend a transparent statistical model with a lightweight nonlinear correction mechanism. By combining coefficient-based interpretation from logistic regression with SHAP-based explanations of the residual learner, the framework retains clinical transparency while providing additional flexibility to capture nonlinear relationships among clinically relevant variables.

The proposed framework may serve as an interpretable decision-support tool for QoL assessment and risk stratification in patients with knee osteoarthritis. In routine clinical practice, clinicians often need to integrate multiple patient-reported and clinical factors, including pain severity, functional limitation, and treatment history, when determining management priorities and rehabilitation strategies. The proposed framework provides a transparent mechanism for synthesising these factors into a single QoL-related risk estimate, which may facilitate identification of patients at risk of poor QoL, prioritisation of follow-up, and personalised rehabilitation planning.

Importantly, the intended role of the framework is not to replace clinical judgement but to support clinical decision-making through explainable risk assessment. This perspective is consistent with emerging evidence suggesting that the primary value of artificial intelligence in knee osteoarthritis lies in supporting assessment standardisation, facilitating interpretation of complex clinical information, and enhancing clinical decision-making rather than functioning as a fully autonomous system.³³ Recent studies have highlighted the potential of AI-assisted approaches to improve consistency in osteoarthritis assessment and to support integration of evidence-based recommendations into routine clinical workflows.³⁴ Accordingly, the proposed framework should be viewed as an interpretable extension of existing patient assessment strategies that supports, rather than replaces, clinician expertise. Future studies should evaluate whether implementation of explainable QoL assessment models can improve clinical decision-making, resource allocation, rehabilitation planning, and patient outcomes in real-world healthcare settings.

This study had several limitations. First, the dataset was derived from a single university hospital, which may limit the generalisability of the findings. The absence of external validation prevents direct assessment of model performance across different clinical settings and patient populations. The present study should therefore be interpreted as a proof of concept for an interpretable QoL assessment and stratification framework rather than a fully generalisable clinical prediction model. Future studies should incorporate independent multicentre datasets and temporal validation across different time periods to evaluate the robustness, transportability, and real-world clinical applicability of the proposed framework. Second, the model used only static clinical features measured at a single time point. As QoL evolves with changes in pain, mobility, and psychosocial state, longitudinal or time-series modelling approaches may better capture patient-specific trajectories. Third, the EQ-5D index was dichotomised using a threshold of 0.7. While this approach provides a clinically interpretable decision boundary, it may result in information loss compared with modelling QoL as a continuous outcome. The threshold-based formulation was selected to facilitate clinically interpretable risk stratification and to align with previously reported treatment-failure criteria in patients with OA. Nevertheless, continuous outcome modelling may better capture the full spectrum and granularity of patient-reported QoL. In addition, the selected threshold may not be universally applicable across different populations or clinical contexts. Future studies should explore regression-based approaches or alternative threshold definitions to better represent QoL variation and enhance generalisability. Finally, although the ensemble was designed to balance interpretability and efficiency, future studies could extend the feature representation to include imaging, wearable, or sensor-derived data, thereby enhancing the model’s scope for digital health integration. Continued investigations along these lines may strengthen both the scientific validity and real-world applicability of explainable ensemble learning for personalised QoL management.
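The information loss associated with dichotomisation can be made concrete with a short sketch. The 0.7 cut-off is the one reported in the study; treating values strictly below it as low QoL is an assumption, as the direction of the inequality is not stated here, and the index values are illustrative.

```python
THRESHOLD = 0.7  # EQ-5D index cut-off reported in the study


def label_low_qol(eq5d_index: float, threshold: float = THRESHOLD) -> int:
    """Return 1 for low QoL, 0 otherwise (strict '<' is an assumption)."""
    return int(eq5d_index < threshold)


# Illustrative index values, not study data.
scores = [0.69, 0.71, 0.95]
labels = [label_low_qol(s) for s in scores]
print(labels)  # [1, 0, 0]
```

The first two patients differ by 0.02 on the index yet receive opposite labels, whereas the second and third differ by 0.24 and receive the same label. A regression on the continuous index would retain that distinction.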

Conclusion

This study developed a residual ensemble framework that combines the interpretability of a linear baseline with the flexibility of a nonlinear learner to support quality-of-life assessment and stratification in patients with knee OA. This approach achieved competitive predictive accuracy while maintaining transparency and clinical interpretability, suggesting that such hybrid architectures may provide a feasible balance between conventional statistical models and complex machine learning systems. By demonstrating how coefficient-based insights can inform data-driven refinement, this study outlined a practical pathway for explainable and clinically interpretable assessment and stratification of patient-centred outcomes. Although additional validation and methodological refinement are required, the proposed framework can serve as a foundation for integrating explainable AI tools into rehabilitation planning and QoL management for digital musculoskeletal health.

Footnotes

Acknowledgements

This study utilised AI learning data produced by the AI Learning Data Construction Project, led by the Ministry of Science and ICT and implemented by the National Information Society Agency.

ORCID iDs

Jaehyuk Lee

Sejun Oh

Bo Ryun Kim

Ethical considerations

This study utilized AI learning data constructed as part of the AI Learning Data Construction Project led by the Ministry of Science and ICT and implemented by the National Information Society Agency. The Institutional Review Board (IRB) of Korea University Anam Hospital approved the study protocol (IRB No.: 2022AN0110).

Author contributions

Conceptualization, J.L. and S.O.; methodology, J.L.; software, S.O.; validation, J.C.; formal analysis, J.L.; investigation, J.L. and S.O.; resources, J.C.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, B.K.; visualization, J.L.; supervision, B.K. and S.L.; project administration, B.K. and S.L.; funding acquisition, B.K. and S.L.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (No. RS-2023-00275579), and in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00336696).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

This study used de-identified hospital records of patients with knee osteoarthritis (OA) provided to the national AI Hub platform. The dataset, titled “Resistance Exercise Prescription Data,” can be accessed through a data access application process on the AI Hub website.

References

1. GBD 2021 Osteoarthritis Collaborators. Global, regional, and national burden of osteoarthritis, 1990–2020 and projections to 2050: A systematic analysis for the Global Burden of Disease Study 2021. Lancet Rheumatol 2023; 5: e508–e522. https://doi.org/10.1016/S2665-9913(23)00163-7

2. Lee. Prevalence and risk factors of osteoarthritis in Korea: A cross-sectional study. Medicina (Kaunas) 2024; 60: 665. https://doi.org/10.3390/medicina60040665

3. Leifer, Katz, Losina. The burden of OA-health services and economics. Osteoarthr Cartil 2022; 30: 10–16. https://doi.org/10.1016/j.joca.2021.05.007

4. Chang, Yuan, et al. Health-related quality of life among patients with knee osteoarthritis in Guangzhou, China: A multicenter cross-sectional study. Health Qual Life Outcomes 2023; 21: 50. https://doi.org/10.1186/s12955-023-02133-x

5. Dardenne, Donneau, Bruyère. Mapping the Lequesne functional index into the EQ-5D-5L utility index in patients with knee osteoarthritis. Value Health 2024; 27: 1400–1407. https://doi.org/10.1016/j.jval.2024.06.017

6. Kawasaki, Muramatsu, Namba, et al. Efficacy and safety of magnetic resonance-guided focused ultrasound treatment for refractory chronic pain of medial knee osteoarthritis. Int J Hyperthermia 2021; 38: 46–55. https://doi.org/10.1080/02656736.2021.1955982

7. Zheng, Cicuttini, et al. Depression in patients with knee osteoarthritis: Risk factors and associations with joint symptoms. BMC Musculoskelet Disord 2021; 22: 40. https://doi.org/10.1186/s12891-020-03875-1

8. Wang. Depression in osteoarthritis: Current understanding. Neuropsychiatr Dis Treat 2022; 18: 375–389. https://doi.org/10.2147/NDT.S346183

9. Joseph, McCulloch, Nevitt, et al. The effect of interactions between BMI and sustained depressive symptoms on knee osteoarthritis over 4 years: Data from the osteoarthritis initiative. BMC Musculoskelet Disord 2023; 24: 27. https://doi.org/10.1186/s12891-023-06132-3

10. Fong, Permar. Change point testing in logistic regression models with interaction term. Stat Med 2015; 34: 1483–1494. https://doi.org/10.1002/sim.6419

11. Dumitrescu, Hué, Hurlin, et al. Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects. Eur J Oper Res 2022; 297: 1178–1192. https://doi.org/10.1016/j.ejor.2021.06.053

12. Loh. Logistic regression tree analysis. In: Pham (ed). Springer handbook of engineering statistics. Springer, 2023, pp. 593–604.

13. Hakkoum, Abnane, Idri. Interpretability in the medical field: A systematic mapping and review study. Appl Soft Comput 2022; 117: 108391. https://doi.org/10.1016/j.asoc.2021.108391

14. Parimbelli, Buonocore, Nicora, et al. Why did AI get this one wrong? Tree-based explanations of machine learning model predictions. Artif Intell Med 2023; 135: 102471. https://doi.org/10.1016/j.artmed.2022.102471

15. Chen, Wei, et al. Understanding heart failure patients EHR clinical features via SHAP interpretation of tree-based machine learning model predictions. AMIA Annu Symp Proc 2022, 2021.

16. Teng, Liu, Song, et al. A survey on the interpretability of deep learning in medical diagnosis. Multimedia Syst 2022; 28: 2335–2355. https://doi.org/10.1007/s00530-022-00960-4

17. Borisov, Leemann, Seßler, et al. Deep neural networks and tabular data: A survey. IEEE Trans Neural Netw Learn Syst 2024; 35: 7499–7519. https://doi.org/10.1109/TNNLS.2022.3229161

18. Hussain, Kim, Kwon, et al. Estimation of patient-reported outcome measures based on features of knee joint muscle co-activation in advanced knee osteoarthritis. Sci Rep 2024; 14: 12428. https://doi.org/10.1038/s41598-024-63266-7

19. Teoh, Othmani, Goh, et al. Deciphering knee osteoarthritis diagnostic features with explainable artificial intelligence: A systematic review. IEEE Access 2024.

20. Hossain, Zamzmi, Mouton, et al. Explainable AI for medical data: Current methods, limitations, and future directions. ACM Comput Surv 2025; 57: 1–46. https://doi.org/10.1145/3637487

21. Kiadaliri, Cronström, Dahlberg, et al. Patient acceptable symptom state and treatment failure threshold values for work productivity and activity Impairment and EQ-5D-5L in osteoarthritis. Qual Life Res 2024; 33: 1257–1266. https://doi.org/10.1007/s11136-024-03602-6

22. Kokkotis, Moustakidis, Papageorgiou, et al. Machine learning in knee osteoarthritis: A review. Osteoarthr Cartil Open 2020; 2: 100069. https://doi.org/10.1016/j.ocarto.2020.100069

23. Binvignat, Pedoia, Butte, et al. Use of machine learning in osteoarthritis research: A systematic literature review. RMD Open 2022; 8: e001998. https://doi.org/10.1136/rmdopen-2021-001998

24. et al. Development of machine learning models for predicting depressive symptoms in knee osteoarthritis patients. Sci Rep 2024; 14: 28603. https://doi.org/10.1038/s41598-024-79601-x

25. Huber, Kurz, Leidl. Predicting patient-reported outcomes following hip and knee replacement surgery using supervised machine learning. BMC Med Inform Decis Mak 2019; 19: 3. https://doi.org/10.1186/s12911-019-0759-2

26. Castagno, Birch, van der Schaar, et al. Predicting rapid progression in knee osteoarthritis: A novel and interpretable automated machine learning approach, with specific focus on young patients and early disease. Ann Rheum Dis 2025; 83: e1234. https://doi.org/10.1136/annrheumdis-2024-224567

27. Fan, Song, et al. XGBoost-SHAP-based interpretable diagnostic framework for knee osteoarthritis: A population-based retrospective cohort study. Arthritis Res Ther 2024; 26: 450. https://doi.org/10.1186/s13075-024-03450-2

28. Farah, Murris, Borget, et al. Assessment of performance, interpretability, and explainability in artificial intelligence–based health technologies: What healthcare stakeholders need to know. Mayo Clin Proc Digit Health 2023; 1: 120–138. https://doi.org/10.1016/j.mcpdig.2023.02.004

29. Hua, Stead, George, et al. Clinical risk prediction with logistic regression: Best practices, validation techniques, and applications in medical research. Academic Medicine and Surgery 2025. https://doi.org/10.62186/001c.131964

30. Ponce-Bobadilla, Schmitt, Maier, et al. Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development. Clin Transl Sci 2024; 17: e70056. https://doi.org/10.1111/cts.70056

31. Hur, Lee, Park, et al. Comparison of SHAP and clinician friendly explanations reveals effects on clinical decision behaviour. npj Digit Med 2025; 8: 578. https://doi.org/10.1038/s41746-025-01958-8

32. Abdullah TAA, Zahid MSM, Ali. A review of interpretable ML in healthcare: Taxonomy, applications, challenges, and future directions. Symmetry 2021; 13: 2439. https://doi.org/10.3390/sym13122439

33. Smolle, Goetz, Maurer, et al. Artificial intelligence-based computer-aided system for knee osteoarthritis assessment increases experienced orthopaedic surgeons' agreement rate and accuracy. Knee Surg Sports Traumatol Arthrosc 2023; 31: 1053–1062. https://doi.org/10.1007/s00167-022-07220-y

34. Carulli, Rossi SMP, Magistrelli, et al. Can artificial intelligence help orthopaedic surgeons in the conservative management of knee osteoarthritis? A consensus analysis. J Clin Med 2025; 14: 690. https://doi.org/10.3390/jcm14030690