HLPL: Hepatic Evaluation for Life Prediction based on Machine and Deep Learning

Abstract

Liver cancer survival prediction is challenged by high-dimensional, nonlinear clinical data and censoring effects. To address this, this paper proposes HLPL (Hepatic Evaluation for Life Prediction based on Machine and Deep Learning), an interpretable and scalable framework for accurate survival prediction. The proposed HLPL integrates data preprocessing, Cox-based feature selection, interpretable analysis, and machine learning (ML) and deep learning (DL) modeling. Specifically, the univariate Cox proportional hazards model is used for feature screening, while SHapley Additive exPlanations (Cox-SHAP) is applied solely for interpretability and feature refinement, without being used as a downstream model input. To capture patient heterogeneity, K-means clustering is performed exclusively on the training set, preventing data leakage. Experimental results show that the proposed HLPL framework achieves superior performance compared to conventional methods. The model attains an accuracy of 0.88. Furthermore, clustering analysis reveals distinct survival patterns across patient subgroups, while SHAP analysis identifies key prognostic factors such as tumor stage and treatment modality. These results demonstrate that HLPL provides an accurate, interpretable, and computationally efficient solution for survival prediction in liver cancer.

Keywords

Deep learning, feature engineering, K-means clustering, liver cancer, machine learning

1 Introduction

Liver cancer, predominantly hepatocellular carcinoma (HCC), remains one of the leading causes of cancer-related morbidity and mortality worldwide.¹ Its incidence is closely associated with chronic liver diseases, including hepatitis B or C infection, alcohol abuse, and metabolic disorders.^2,3 Due to the complex interplay of tumor burden, liver function, and patient-specific conditions, the prognosis of HCC is highly heterogeneous, posing significant challenges for accurate survival prediction and clinical decision-making.⁴ Therefore, developing reliable and interpretable prognostic models is of critical importance for personalized treatment planning and outcome optimization.

Traditionally, survival analysis in oncology has been dominated by statistical approaches such as the Cox proportional hazards model.^5,6 Although these models provide interpretable risk estimates, they rely on assumptions such as proportional hazards and linear relationships, which may not adequately capture the complex, nonlinear interactions present in high-dimensional clinical data. Extensions such as multivariable Cox regression⁷ offer incremental improvements but remain limited in modeling intricate feature dependencies and handling large-scale heterogeneous datasets.

With the rapid advancement of artificial intelligence, ML techniques have emerged as powerful alternatives for cancer prognosis prediction. ML models⁸ have demonstrated strong performance in survival analysis,^9,10 disease progression prediction,¹¹ and risk assessment.¹² In liver cancer research, approaches based on Random Forests and Gradient Boosting have shown improved predictive accuracy over conventional clinical scoring systems.¹³ However, most ML-based studies rely on single-model architectures and often lack mechanisms to address patient heterogeneity or provide clinically meaningful interpretations.

DL further enhances predictive modeling by leveraging hierarchical feature representations to capture complex patterns in large-scale data. Recent studies have applied DL to dynamic risk prediction,^14,15 treatment response modeling,^16,17 and survival prediction,^16,18,19 achieving promising results across multiple cancer types. Despite these advances, many DL models function as “black boxes,” limiting their interpretability and hindering clinical adoption.

In the literature, studies on liver cancer prognosis prediction have increasingly combined ML^20–28 and DL^9,29,30 techniques to enhance model performance and clinical utility. Wang et al.²⁰ applied Logistic Regression, Random Forest, and Gradient Boosting to predict postoperative outcomes in HCC. Xie et al.²¹ used these models to predict lymph node metastasis in intrahepatic cholangiocarcinoma, identifying key clinical predictors. However, these methods typically use tree-based models for feature selection, followed by classification and prediction. Features selected by a tree model tend to favor downstream predictors that share the same tree architecture, which biases the ML modeling and limits practical applicability.

In addition to supervised approaches, recent research has explored unsupervised and representation learning techniques to uncover latent structures in medical and healthcare data.^31–33 These methods aim to identify intrinsic patient subgroups without relying on labeled outcomes, thereby providing insights into disease heterogeneity. However, existing methods are typically designed for clustering or pattern discovery alone and are rarely integrated with downstream survival prediction and interpretability mechanisms, limiting their effectiveness in clinical decision support.

Recent studies have also emphasized the importance of combining predictive performance with model interpretability. For instance, Salehi et al.³⁴ demonstrated the effectiveness of scalable predictive analytics in healthcare systems, while Kalita et al.³⁵ highlighted the role of SHAP-based techniques in enhancing model transparency. These developments underscore the growing demand for explainable and clinically trustworthy predictive models.

Another important challenge in clinical prediction modeling is the potential bias and limited representativeness of real-world datasets. Clinical data collected from a single institution may reflect specific demographic distributions, treatment protocols, and prescribing patterns, which could introduce systematic biases into the model. These biases may affect the generalizability of the prediction results across different populations and healthcare settings.

Despite substantial progress, several critical gaps remain in the current literature. First, most existing approaches rely on single models or loosely coupled frameworks, failing to exploit the complementary strengths of ML and DL techniques. Second, patient heterogeneity is insufficiently addressed, which limits the effectiveness of subgroup-specific predictions. Third, interpretability is often overlooked, reducing the clinical applicability of advanced models. Finally, the integration of feature selection, patient stratification, and predictive modeling into a unified framework remains underexplored.

To address the above research gaps, this study introduces a novel ensemble prediction system, named HLPL, designed to improve predictive accuracy, interpretability, and robustness for liver cancer survival analysis. The proposed HLPL integrates ML and DL models through five phases: Data Preprocessing, Feature Engineering, Cox-SHAP Analysis, Pre-clustering, and Model Development and Training (MDT). The Data Preprocessing phase prepares raw data for analysis by addressing outliers, imputing missing values, normalizing features, and transforming categorical variables into numerical formats. In the Feature Engineering phase, significant predictors are identified using univariate Cox proportional hazards regression, which accounts for censored observations. The Cox-SHAP phase employs the Cox proportional hazards model to compute patient risk scores and leverages SHAP values to interpret feature contributions, enhancing model explainability. During Pre-clustering, K-means clustering is applied to group patients into similar subpopulations, addressing population heterogeneity. Lastly, the MDT phase trains ML and DL models on the clustered datasets and combines their outputs through model fusion to boost predictive performance and robustness.

In practical terms, HLPL supports real-world clinical applications by enabling risk stratification, personalized survival prediction, and treatment planning. Its interpretable design allows clinicians to identify high-risk patients, understand key prognostic factors, and compare outcomes under different therapies, thereby facilitating integration into clinical decision-support systems. The proposed HLPL framework provides a structured, interpretable, and powerful approach for liver cancer survival prediction, consistently outperforming existing methods. The main contributions of this paper are summarized as follows:

1) Integration of Cox Proportional Hazards Model with SHAP for Interpretability. The proposed HLPL combines the Cox proportional hazards model with SHAP values. This integration provides a dual benefit: evaluating the relative risk of clinical features for survival predictions while quantifying their contributions to individual predictions. This enhances both the model's predictive accuracy and interpretability, offering reliable insights into clinical decision-making.
2) Pre-Clustering for Enhancing Prognostic Precision. The proposed HLPL employs K-means clustering to segment liver cancer patients into homogeneous groups. This stratification approach addresses inter-patient variability and ensures that predictive models are applied to more uniform populations. By tailoring predictions to these clusters, the proposed method achieves improved sensitivity to individual patient characteristics and disease manifestations, leading to more precise and personalized prognostic assessments.
3) Ensemble Integration of ML and DL. The proposed HLPL integrates ML and DL models using a weighted average approach to form a robust hybrid predictive framework. This integration capitalizes on the strengths of both methodologies: ML excels in processing structured clinical data, while DL is adept at capturing and interpreting complex patterns in high-dimensional datasets. By uniting these complementary approaches, the proposed HLPL significantly enhances the accuracy and reliability of liver cancer prognosis predictions, offering a powerful tool for clinical decision-making.

The remainder of the paper is structured as follows. Section 2 provides a comprehensive overview and comparison of relevant previous studies. Section 3 presents the assumptions, notation definitions, and problem formulation. Section 4 describes the proposed ML and DL mechanisms in detail. Section 5 reports the simulation and performance evaluation of the proposed method, and Section 6 summarizes the paper and discusses future work.

2 Related work

This section reviews the application of ML and DL for prognosis prediction in cancer. It evaluates how ML models have enhanced survival predictions and discusses the advancements brought by DL techniques in analyzing complex clinical data to improve the accuracy of cancer prognostics.

2.1 ML in cancer prediction

In the evolving field of ML applications for predicting prognosis in liver cancer, several studies have progressively addressed the complexities of this challenge. Ji et al.²³ set the stage by using a gradient boosting machine (GBM) to predict disease-specific survival for early hepatocellular carcinoma (EHCC) patients, achieving a notable C-statistic over 0.72. Despite its effectiveness, this model's reliance on a single predictive approach raised concerns about its ability to capture diverse patient variations, potentially limiting its broader applicability.

Building on the need for a more nuanced approach, Li et al.²⁴ developed a machine learning-based prognostic model using lysosome-related genes to predict the prognosis and immune status of HCC patients. The model, which incorporates eight key genes identified through Random Survival Forest (RSF) and Lasso regression, demonstrated superior predictive accuracy with a high C-index. Despite its strengths, the study highlighted the need for further validation of the remaining genes and called for in vivo experiments to explore their roles in HCC. This research emphasizes the potential of integrating various predictive models but also points to the necessity of pre-clustering techniques to enhance generalizability and robustness in clinical settings.

Huang et al.²⁵ further refined the predictive modeling by employing XGBoost and a comprehensive dataset to analyze recurrence of HCC post-surgery. Their approach demonstrated significant advancements, but it left room for exploring subpatterns within patient subgroups to further enhance model accuracy. Each of these studies has advanced the field but also left gaps that offer opportunities for further enhancement. The current research aims to bridge these gaps with an integrated approach that combines ML and DL techniques. Using a weighted ensemble model, this approach incorporates the strengths of various predictive models, such as those used by Ji et al., Li et al., and Huang et al. Additionally, it introduces pre-clustering to systematically identify and use subtle patient subgroup characteristics before model application. This holistic approach is designed to improve the predictive accuracy and generalizability of survival time prognostics for liver cancer patients by capturing patient variability in prognosis prediction.

2.2 Deep learning in cancer prediction

In the advancement of predictive modeling for cancer, diverse approaches have demonstrated the potential of deep learning. Huang et al.²⁶ explored the integration of deep learning models—specifically Cox-nnet, DeepSurv, and AECOX—with the Cox proportional hazards model to enhance the prediction of liver cancer prognosis using RNA-seq data from The Cancer Genome Atlas. Their approach adeptly manages complex, high-dimensional transcriptomic data, highlighting DL's capability in handling such intricacies. However, they noted variability in predictive performance across different datasets, suggesting a need for models that can consistently perform well across diverse clinical scenarios.

Further extending the utility of DL, Bhambhvani et al.²⁷ advanced the application of deep learning by focusing on pediatric genitourinary rhabdomyosarcoma. They employed deep neural networks (DNNs) to predict 5-year survival rates using data from the SEER database. This demonstrated DL's effectiveness in accurately predicting cancer survival. However, a significant oversight in their study was the lack of external validation, which is crucial for verifying the applicability of their predictive models to broader clinical settings, including different cancer types such as liver cancer. This gap emphasized the critical role that external validations play in ensuring the reliability and generalizability of prognostic models.

Lai et al.²⁸ implemented a DNN to predict overall survival in non-small cell lung cancer patients by integrating microarray gene expression data with clinical information. Utilizing a systems biology approach, their study identified fifteen prognostic biomarkers that informed the DNN model, demonstrating the model's ability to handle complex, high-dimensional datasets effectively. This integration via bimodal learning highlights the potential of deep learning in enhancing survival predictions through the synergistic use of heterogeneous data types. Despite these advances, the study does not explore pre-clustering techniques that could potentially enhance the identification of intrinsic patient group characteristics before model application. Additionally, the integration of predictions from various models through methodologies such as weighted averaging, which could further stabilize and refine predictive performance across diverse clinical scenarios, is not discussed. These areas represent potential avenues for further research to optimize predictive accuracy and model robustness in the field of cancer prognosis.

Table 1 presents a comparative summary of the proposed HLPL framework and the related studies. The “methodology” column indicates the predictive models employed in each study, while “data type” specifies the nature of the input data used for prognosis prediction. The columns “interpretability,” “patient stratification,” and “model integration” denote whether the corresponding methods consider model explainability, patient subgroup analysis, and the integration of multiple models, respectively. The “limitations” column highlights the main shortcomings of each approach. Compared with the related works, most existing studies rely on single-model architectures and lack mechanisms for interpretability, patient stratification, and comprehensive model integration. In contrast, the proposed HLPL framework integrates ML and DL models through an ensemble strategy, incorporates SHAP-based interpretability, and employs pre-clustering to address patient heterogeneity. These advantages enable HLPL to provide more accurate, robust, and clinically interpretable predictions for liver cancer prognosis.

Table 1.
The comparisons of the proposed HLPL and related work.

Study	Methodology	Data type	Interpretability	Patient stratification	Model integration	Limitations
Ji et al.²³	GBM	Clinical	×	×	×	Single model
Li et al.²⁴	RSF + Lasso	Genomic	×	×	Partial	No clustering
Huang et al.²⁵	XGBoost	Clinical	×	×	×	No integration
Huang et al.²⁶	DL	RNA-seq	×	×	×	Dataset variability
Bhambhvani et al.²⁷	DNN	SEER	×	×	×	No external validation
Lai et al.²⁸	DNN + multimodal	Clinical + gene	×	×	×	No clustering
Proposed HLPL	ML + DL Ensemble	Multi-source	√	√	√	—

3 Assumption and problem statement

This section describes the definitions of notations, assumptions, and constraints utilized in this study, along with the formulation of the problem being addressed.

3.1 System model and notation definitions

The objective of this study is to analyze and model the determinants of survival duration in liver cancer patients. A dataset of liver cancer patients is denoted as $\tilde{D} = ({\tilde{X}}_{(q + 1) \times n}, {\tilde{Y}}_{(q + 1) \times 1})$ , where $\tilde{X}$ and $\tilde{Y}$ represent the collected data and their corresponding labels, respectively. Assume that table $\tilde{X}$ consists of $(q + 1) \times n$ elements, as shown in Equation (1).

\begin{aligned} \tilde{X} = \begin{bmatrix} {\tilde{X}}_{0} \\ \vdots \\ {\tilde{X}}_{q} \end{bmatrix} = \begin{bmatrix} {\tilde{x}}_{0, 1} & \dots & {\tilde{x}}_{0, n} \\ \vdots & \ddots & \vdots \\ {\tilde{x}}_{q, 1} & \dots & {\tilde{x}}_{q, n} \end{bmatrix} \end{aligned}

(1)

where table $\tilde{X}$ contains n columns, each representing a field describing a patient (e.g., tumor size, smoking history, alcohol consumption, active treatment status, etc.). This table has collected data on q patients, with each patient's data represented by a row in the table. Therefore, table $\tilde{X}$ has a total of q + 1 rows, with one row for the field names and the other q rows for the data of the q patients. Let ${\tilde{X}}_{0} = [{\tilde{x}}_{0, 1}, \dots, {\tilde{x}}_{0, n}]$ denote the set of n fields, where ${\tilde{x}}_{0, i}$ represents the i-th field name in ${\tilde{X}}_{0}$ . Similarly, let ${\tilde{X}}_{i} = [{\tilde{x}}_{i, 1}, \dots, {\tilde{x}}_{i, n}]$ denote the data of the i-th patient, $1 \leq i \leq q$ , where ${\tilde{x}}_{i, k}$ denotes the k-th element in the i-th row, $1 \leq k \leq n$ .

In addition to the dataset $\tilde{X}$ , it is assumed that the survival age of each patient has also been collected. The label data $\tilde{Y}$ can be represented as $\tilde{Y} = [{\tilde{y}}_{0}, \dots, {\tilde{y}}_{q}]$ , where ${\tilde{y}}_{0}$ represents the field name of survival age, and ${\tilde{y}}_{k}$ represents the survival age of the k-th patient for $1 \leq k \leq q$ .

Based on the above assumptions, several research hypotheses are formulated to guide the design of the proposed HLPL framework. Specifically, this study hypothesizes that clinical features are significantly associated with survival outcomes, that patient stratification can effectively reduce inter-patient heterogeneity, and that the integration of ML and DL models improves predictive performance. These hypotheses are grounded in prior clinical and methodological studies.

3.2 Objective

The objective of this study is to develop the best model, denoted $M^{b e s t}$ , that accurately predicts the survival age ${\tilde{y}}_{m}$ of a given liver cancer patient with collected data ${\tilde{X}}_{m} = [{\tilde{x}}_{m, 1}, \dots, {\tilde{x}}_{m, n}]$ . Although the investigated problem concerns the survival time of liver cancer patients, it can be transformed into a classification problem. Specifically, the survival time can be categorized into several classes, such as one year, two years, two to five years, five to ten years, and more than ten years. Let the survival time be divided into z classes. That is, the output of model M can be expressed as $Y^{M} = [y_{1}, \dots, y_{z}]$ , where $y_{i}$ denotes the i-th of the z classes. Let $δ_{i, j}$ be a Boolean variable representing the actual outcome of class $y_{j}$ for an input sample $X_{i} = [x_{i, 1}, \dots, x_{i, n}]$ . That is,

δ_{i, j} = \begin{cases} 1, & \text{if the label of } X_{i} \text{ is class } y_{j}, \\ 0, & \text{otherwise.} \end{cases}

Similarly, let ${\hat{δ}}_{i, j}$ denote a Boolean variable representing the predicted result of class $y_{j}$ for an input sample $X_{i} = [x_{i, 1}, \dots, x_{i, n}]$ obtained by applying prediction model M. That is,

{\hat{δ}}_{i, j} = \begin{cases} 1, & \text{if the prediction for } X_{i} \text{ is } y_{j}, \\ 0, & \text{otherwise.} \end{cases}
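As an illustration of the class mapping and the indicator $δ_{i, j}$ defined above, the following Python sketch bins a survival time into one of z classes and builds the corresponding 0/1 vector. The class boundaries in `CLASS_UPPER_BOUNDS`, the function names, and the use of years as the unit are assumptions made for this example; the text does not fix them at this point.

```python
from bisect import bisect_left

# Hypothetical upper bounds (in years) of the first z-1 classes.
# The last class (more than ten years) is open-ended.
CLASS_UPPER_BOUNDS = [1, 2, 5, 10]


def survival_class(years):
    """Return the 1-based class index j for a survival time given in years."""
    return bisect_left(CLASS_UPPER_BOUNDS, years) + 1


def indicator(class_index, z):
    """Return [delta_1, ..., delta_z] with a single 1 at position class_index."""
    return [1 if j == class_index else 0 for j in range(1, z + 1)]


z = len(CLASS_UPPER_BOUNDS) + 1          # five classes in this example
actual = indicator(survival_class(3.5), z)     # patient survived 3.5 years
predicted = indicator(survival_class(0.8), z)  # model predicted under one year
```

The same `indicator` helper can encode either the actual class or the predicted class, which gives $δ_{i, j}$ and ${\hat{δ}}_{i, j}$ respectively.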

When model M is applied to predict the result for an input sample $X_{i}$ , let $T P_{i, j}$ , $F P_{i, j}$ and $F N_{i, j}$ denote the True Positive, False Positive, and False Negative outcomes for class $y_{j}$ predicted by model M, respectively. The values of $T P_{i, j}$ , $F P_{i, j}$ and $F N_{i, j}$ can be measured by Equations (2) to (4), respectively.

\begin{aligned} T P_{i, j} & = δ_{i, j} \times {\hat{δ}}_{i, j}, \end{aligned}

(2)

\begin{aligned} F P_{i, j} & = (1 - δ_{i, j}) \times {\hat{δ}}_{i, j}, \text{ and} \end{aligned}

(3)

\begin{aligned} F N_{i, j} & = δ_{i, j} \times (1 - {\hat{δ}}_{i, j}) . \end{aligned}

(4)

Let $T P_{i}$ , $F P_{i}$ and $F N_{i}$ denote the prediction results summed over all classes $y_{j}$ , $1 \leq j \leq z$ , for an input sample $X_{i}$ . That is:

\begin{aligned} T P_{i} & = \sum_{j = 1}^{z} T P_{i, j}, \end{aligned}

(5)

\begin{aligned} F P_{i} & = \sum_{j = 1}^{z} F P_{i, j}, \end{aligned}

(6)

\begin{aligned} F N_{i} & = \sum_{j = 1}^{z} F N_{i, j} . \end{aligned}

(7)

Assume that a total of r samples are predicted by model M. Let $T P$ , $F P$ , and $F N$ denote the total numbers of true positives, false positives, and false negatives of model M, respectively. The values of $T P$ , $F P$ , and $F N$ can be derived by Equations (8) to (10), respectively.

\begin{aligned} T P & = \sum_{i = 1}^{r} \sum_{j = 1}^{z} T P_{i, j}, \end{aligned}

(8)

\begin{aligned} F P & = \sum_{i = 1}^{r} \sum_{j = 1}^{z} F P_{i, j}, \end{aligned}

(9)

\begin{aligned} F N & = \sum_{i = 1}^{r} \sum_{j = 1}^{z} F N_{i, j} . \end{aligned}

(10)

Let $R e c a l l^{M}$ , $P r e^{M}$ and $F_{1}^{M}$ denote the recall, precision, and F1-score of model M, respectively. The values of $R e c a l l^{M}$ , $P r e^{M}$ and $F_{1}^{M}$ can be derived from Equations (11) to (13).

\begin{aligned} R e c a l l^{M} & = \frac{T P}{T P + F N}, \end{aligned}

(11)

\begin{aligned} P r e^{M} & = \frac{T P}{T P + F P}, \end{aligned}

(12)

\begin{aligned} F_{1}^{M} & = \frac{2 * P r e^{M} * R e c a l l^{M}}{P r e^{M} + R e c a l l^{M}} . \end{aligned}

(13)
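Equations (2) to (13) can be traced with a short Python sketch. The function name `micro_metrics` and the toy labels are hypothetical; classes are represented by 1-based indices, and each sample is assumed to have exactly one actual and one predicted class.

```python
def micro_metrics(actual, predicted, z):
    """Accumulate TP, FP, FN over r samples and z classes, then derive
    recall, precision, and F1 as in Equations (2) to (13)."""
    tp = fp = fn = 0
    for a, p in zip(actual, predicted):
        for j in range(1, z + 1):
            d = 1 if a == j else 0        # delta_{i,j}: actual outcome
            d_hat = 1 if p == j else 0    # hat-delta_{i,j}: predicted outcome
            tp += d * d_hat               # Eq. (2)
            fp += (1 - d) * d_hat         # Eq. (3)
            fn += d * (1 - d_hat)         # Eq. (4)
    recall = tp / (tp + fn)               # Eq. (11)
    precision = tp / (tp + fp)            # Eq. (12)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0  # Eq. (13)
    return tp, fp, fn, recall, precision, f1


# Toy example with five samples and five classes; one sample is misclassified.
tp, fp, fn, recall, precision, f1 = micro_metrics(
    actual=[1, 2, 3, 3, 5], predicted=[1, 2, 3, 4, 5], z=5
)
```

Under the single-label assumption stated above, every misclassified sample contributes exactly one false positive and one false negative, so $F P = F N$ and the precision, recall, and F1-score all equal the fraction of correctly classified samples.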

Let $\mathcal{M}$ denote the set of all candidate models M for predicting the r samples $X_{i}$ , $1 \leq i \leq r$ . The objective of this paper is to design the best model $M^{b e s t}$ that satisfies Equation (14).

\begin{aligned} M^{b e s t} = \arg \max_{M \in \mathcal{M}} F_{1}^{M} . \end{aligned}

(14)

3.3 Constraints

The proposed HLPL is subject to several important constraints, which are presented as follows.

1) Data Validity Constraints

All features ${\tilde{x}}_{i, j}$ and survival times ${\tilde{y}}_{k}$ must be non-negative, reflecting the real-world nature of clinical data where variables such as tumor size, smoking duration, and patient age cannot be negative. This constraint ensures that:

{\tilde{x}}_{i, j} \geq 0, \quad {\tilde{y}}_{k} \geq 0 \quad \text{for all } 1 \leq i \leq q, \; 1 \leq j \leq n, \; 1 \leq k \leq q .

2) Classification Constraints

The survival time is categorized into z distinct classes, ensuring that the classification problem is well-defined. The number of classes is constrained to be an integer of at least two, that is, $z \in \mathbb{Z}$ and $z \geq 2$ . For probabilistic models, the sum of the predicted probabilities for all classes must equal one, ensuring a valid probability distribution, presented as Equation (15).

\begin{aligned} \sum_{j = 1}^{z} P (Y_{i} = y_{j} \mid X_{i}) = 1 \quad \text{for all } 1 \leq i \leq r . \end{aligned}

(15)

These constraints ensure that the data used in the model is valid and complete, and that the classification model produces interpretable and reliable results.

4 The proposed HLPL mechanism

The primary objective of this study is to develop a predictive model, denoted as $M^{b e s t}$ . This model is designed to accurately estimate the survival duration $y_{m}$ for liver cancer patients based on the collected clinical data ${\tilde{X}}_{m} = [{\tilde{x}}_{m, 1}, \dots, {\tilde{x}}_{m, n}]$ . Given the complex and high-dimensional nature of the dataset $\tilde{D} = (\tilde{X}, \tilde{Y})$ , traditional statistical methods often prove inadequate in capturing the nonlinear relationships and intricate patterns inherent in such data. To address these challenges, this study proposes an ensemble prediction system known as HLPL. The proposed HLPL consists of five phases: Data Preprocessing, Feature Engineering, Cox-SHAP, Pre-clustering, and MDT. The Data Preprocessing phase ensures raw data is suitable for training by handling outliers, imputing missing values, standardizing features, and encoding categorical variables. In the Feature Engineering phase, relevant predictors are identified through univariate Cox proportional hazards regression, which accounts for right-censored observations. The Cox-SHAP phase calculates patient risk scores using the Cox proportional hazards model and applies SHAP values to interpret the contributions of individual features. The Pre-clustering phase employs K-means clustering to group patients based on their clinical profiles, capturing latent data patterns. Finally, the MDT phase trains ML and DL models on the processed and clustered dataset, leveraging model fusion to enhance predictive accuracy and robustness. This systematic approach aims at a robust, interpretable, and effective model for liver cancer survival prediction. The following presents the details of each phase.

4.1 Data preprocessing phase

This phase focuses on transforming the collected dataset to prepare it for model training. Let $\tilde{X} = \{{\tilde{x}}_{i, j}\}$ and $\tilde{Y} = \{{\tilde{y}}_{i, j}\}$ denote the input dataset and labels, respectively. The phase consists of five steps: Outlier Detection and Handling, Missing Value Imputation, Feature Scaling and Standardization, Categorical Feature Encoding, and Label Transformation. Each preprocessing step systematically refines the elements of $\tilde{X}$ and $\tilde{Y}$ , resulting in cleaned, scaled, and encoded data ready for subsequent analysis.

(1) Outlier Detection and Handling

The raw input data $\tilde{X} = \{{\tilde{x}}_{i, j}\}$ may contain outliers that can distort model training. Let $Z_{i, j}$ denote the Z-score of ${\tilde{x}}_{i, j}$ . The value of $Z_{i, j}$ can be derived from Equation (16).

\begin{aligned} Z_{i, j} = \frac{{\tilde{x}}_{i, j} - μ_{{\tilde{X}}_{j}}}{σ_{{\tilde{X}}_{j}}}, \end{aligned}

(16)

where $μ_{{\tilde{X}}_{j}}$ and $σ_{{\tilde{X}}_{j}}$ represent the mean and standard deviation of the j-th feature in $\tilde{X}$ , respectively. If $| Z_{i, j} | > τ$ for a predefined threshold $τ$ , ${\tilde{x}}_{i, j}$ is considered an outlier and is either capped or removed. If capping is used, ${\tilde{x}}_{i, j}$ is transformed into a new value $x_{i, j}$ as:

\begin{aligned} x_{i, j} = \min (\max ({\tilde{x}}_{i, j}, μ_{{\tilde{X}}_{j}} - τ \cdot σ_{{\tilde{X}}_{j}}), μ_{{\tilde{X}}_{j}} + τ \cdot σ_{{\tilde{X}}_{j}}) . \end{aligned}

(17)

Otherwise, if removal is used, the entire row ${\tilde{X}}_{i}$ and its label ${\tilde{y}}_{i}$ are discarded, and no corresponding row exists in the final dataset.
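The capping rule of Equations (16) and (17) can be sketched for a single feature column as follows. The function name `cap_outliers`, the toy values, and the use of the population standard deviation are assumptions of this example; the threshold value is likewise illustrative.

```python
from statistics import mean, pstdev


def cap_outliers(column, tau=3.0):
    """Clip each value of one feature to [mu - tau*sigma, mu + tau*sigma].

    Values whose |Z-score| exceeds tau are moved to the nearest bound;
    all other values are returned unchanged.
    """
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (assumption)
    lower, upper = mu - tau * sigma, mu + tau * sigma
    return [min(max(x, lower), upper) for x in column]


# Nine typical values and one extreme value (hypothetical tumor sizes).
tumor_size = [10.0] * 9 + [100.0]
capped = cap_outliers(tumor_size, tau=2.0)
```

In this toy column the mean is 19 and the standard deviation is 27, so with $τ = 2$ the upper bound is 73 and only the extreme value is changed.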

(2) Missing Value Imputation

Missing values in $\tilde{X}$ (denoted as ${\tilde{x}}_{i, j} = NaN$ ) are imputed using K-nearest-neighbor (KNN) imputation. For each missing ${\tilde{x}}_{i, j}$ , the imputed value ${\tilde{x}}_{i, j}^{imputed}$ is the average of the j-th feature over the k nearest neighbors of patient i, where $N_{k} (i)$ denotes the index set of these neighbors. That is

\begin{aligned} {\tilde{x}}_{i, j}^{imputed} = \frac{1}{k} \sum_{l \in N_{k} (i)} {\tilde{x}}_{l, j} . \end{aligned}

(18)

The imputed value replaces the missing value in the dataset. That is,

\begin{aligned} x_{i, j} = {\tilde{x}}_{i, j}^{imputed} \quad \text{if } {\tilde{x}}_{i, j} = \mathrm{NaN} . \end{aligned}

(19)

This process ensures that all missing entries in $\tilde{X}$ are transformed into valid values in X, while $\tilde{Y} = Y$ remains unchanged.
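A minimal sketch of the KNN imputation in Equations (18) and (19) is given below. It is written in plain Python for transparency; missing entries are represented by `None`, and the distance (root mean squared difference over the features observed in both rows) is an assumption of this example, since the text does not specify the metric.

```python
import math


def knn_impute(rows, k=2):
    """Replace each None entry by the mean of that feature over the k
    nearest rows in which the feature is observed."""

    def distance(a, b):
        # Compare only features that are observed in both rows.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows where feature j is observed.
            candidates = [(distance(row, other), other[j])
                          for l, other in enumerate(rows)
                          if l != i and other[j] is not None]
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)  # Eq. (18)
    return imputed


# Four hypothetical patients with two features; the last value is missing.
patients = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
filled = knn_impute(patients, k=2)
```

Here the two rows closest to the incomplete patient are the first two, so the missing entry is filled with the average of 10 and 20.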

(3) Feature Scaling and Standardization

To ensure uniformity across features, each feature ${\tilde{x}}_{i, j}$ in $\tilde{X}$ is standardized:

\begin{aligned} x_{i, j} = \frac{{\tilde{x}}_{i, j} - μ_{{\tilde{X}}_{j}}}{σ_{{\tilde{X}}_{j}}} . \end{aligned}

(20)

This transformation guarantees that each feature in $X = \{x_{i, j}\}$ has a mean of zero and a standard deviation of one, making the features suitable for algorithms sensitive to scale.
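The standardization in Equation (20) can be checked on a single column with a few lines of Python. The function name `standardize` and the sample values are hypothetical, and the population standard deviation is assumed; a constant column is mapped to zeros to avoid division by zero, which is a choice of this sketch and is not stated in the text.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return the column rescaled to zero mean and unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant feature: no spread to scale
    return [(x - mu) / sigma for x in column]


scaled = standardize([2.0, 4.0, 6.0])
```

After the transformation, the mean of `scaled` is zero and its population standard deviation is one up to floating-point error.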

(4) Categorical Feature Encoding

Categorical variables in $\tilde{X}$ are converted into numerical representations through one-hot encoding. For a categorical feature ${\tilde{C}}_{j}$ , its one-hot encoded form replaces the original categorical value with binary indicators. Each categorical value is mapped to a vector of zeros and ones:

\begin{aligned} x_{i, j} = \mathrm{OHE} ({\tilde{C}}_{i, j}), \end{aligned}

(21)

where $\mathrm{OHE}$ represents the one-hot encoded matrix. This process converts $\tilde{X}$ into a numerical form suitable for modeling.
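The one-hot mapping of Equation (21) can be sketched as follows. The function name `one_hot_encode` and the treatment labels are hypothetical; categories are ordered alphabetically here, which is one possible convention.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with a single 1.

    Returns the ordered category list and one encoded row per input value.
    """
    categories = sorted(set(values))
    position = {c: k for k, c in enumerate(categories)}
    encoded = [[1 if position[v] == k else 0 for k in range(len(categories))]
               for v in values]
    return categories, encoded


treatment = ["surgery", "ablation", "surgery", "transplant"]
categories, encoded = one_hot_encode(treatment)
```

Each patient row receives one binary column per category, so a feature with three distinct values expands into three columns.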

(5) Label Transformation

Labels $\tilde{Y} = \{{\tilde{y}}_{i, j}\}$ are also adjusted during preprocessing, ensuring compatibility with the model's output format. In cases where labels represent classes or categories, they can be encoded similarly to categorical features using a one-hot encoding approach. Thus, for each original label ${\tilde{y}}_{i, j}$ , the transformed label $y_{(i, j)}$ may be expressed as Equation (22).

\begin{aligned} y_{(i, j)} = \begin{cases} 1, & \text{if } {\tilde{y}}_{i, j} = \text{target class}, \\ 0, & \text{otherwise.} \end{cases} \end{aligned}

(22)

After applying all these transformations, the dataset $\tilde{D} = (\tilde{X}, \tilde{Y})$ is fully transformed into $D = (X, Y)$ , where $X = \{x_{i, j}\}$ represents the cleaned, scaled, and encoded input data, and $Y = \{y_{i, j}\}$ corresponds to the processed labels, ready for model training.

4.2 Feature engineering phase

The Feature Engineering phase aims to identify the most relevant features for predicting the survival duration of liver cancer patients. Due to the presence of right-censored observations in survival data, traditional correlation-based metrics (e.g., Pearson or Spearman correlations) are not suitable, as they fail to account for censoring and may lead to biased feature selection. To address this issue, this phase employs univariate Cox proportional hazards regression, which explicitly models time-to-event data and properly incorporates censoring information, enabling a statistically sound assessment of feature relevance.

Following data preprocessing, let $X \in \mathbb{R}^{n \times p}$ denote the preprocessed feature matrix obtained from the raw data $\tilde{X}$ , where n and p denote the numbers of patients and features, respectively. Let $x_{i, k}$ denote the value of the $k$-th feature for the $i$-th patient, and let $X_{k}$ denote the $k$-th feature (column) of $X$.

To evaluate the individual prognostic value of each feature, a univariate Cox proportional hazards model³⁶ is fitted for each feature $X_{k} \in X$ . The hazard function for the $i$-th patient based on a single feature $X_{k}$ is expressed by Equation (23).

\begin{aligned} h_{i} (t | x_{i, k}) = h_{0} (t) \exp (β_{k} x_{i, k}), \end{aligned}

(23)

where $h_{0} (t)$ is the baseline hazard function and $β_{k}$ is the estimated regression coefficient for the feature $X_{k}$.

While the hazard function characterizes the relationship between the feature and survival risk, feature selection is performed based on the statistical significance of $β_{k}$ rather than the hazard value itself. Specifically, the Wald test is used to compute the p-value $p_{k}$ for each coefficient. A smaller p-value indicates stronger evidence of an association between the feature and patient survival time and event status.

After calculating the univariate Cox models for all features, the final set of important features, $X_{i m p}$ , is determined by selecting features whose p-values fall below a predefined statistical significance threshold. That is:

\begin{aligned} X_{imp} = \{X_{k} \in X \mid p_{k} < α\}, \end{aligned}

(24)

where $α$ is the predefined significance threshold.

It is important to note that this univariate Cox-based screening step focuses on identifying survival-relevant features rather than modeling complex nonlinear relationships. The modeling of nonlinear effects and higher-order feature interactions is handled in the subsequent ML stage. The resulting feature subset $X_{i m p}$ provides a survival-aware and statistically grounded foundation for downstream modeling, improving both computational efficiency and model interpretability.
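The screening rule of Equations (23) and (24) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a plain Newton-Raphson fit of the one-covariate partial likelihood with Breslow-style risk sets, a normal approximation for the Wald test, and a toy dataset; the function names, the data, and the threshold of 0.10 are all hypothetical.

```python
import math

def fit_univariate_cox(x, time, event, n_iter=25, tol=1e-9):
    """Fit h(t|x) = h0(t) * exp(beta * x) by Newton-Raphson.

    Returns (beta, standard_error, wald_p_value).
    """
    beta, info = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for i in range(n):
            if not event[i]:
                continue  # censored patients contribute only through risk sets
            # Risk set R(t_i): patients still under observation at time t_i.
            risk = [l for l in range(n) if time[l] >= time[i]]
            w = [math.exp(beta * x[l]) for l in risk]
            s0 = sum(w)
            s1 = sum(wl * x[l] for wl, l in zip(w, risk))
            s2 = sum(wl * x[l] ** 2 for wl, l in zip(w, risk))
            score += x[i] - s1 / s0               # first derivative of log-likelihood
            info += s2 / s0 - (s1 / s0) ** 2      # negative second derivative
        if info <= 0.0:
            raise ValueError("feature shows no variation within the risk sets")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald p-value
    return beta, se, p_value

def screen_features(features, time, event, alpha=0.05):
    """Keep the features whose Wald p-value is below alpha (Equation 24)."""
    p_values = {name: fit_univariate_cox(col, time, event)[2]
                for name, col in features.items()}
    return {name: p for name, p in p_values.items() if p < alpha}

# Toy data: eight patients, the last one right-censored.
time = [1, 2, 3, 4, 5, 6, 7, 8]
event = [1, 1, 1, 1, 1, 1, 1, 0]
features = {
    "stage": [4, 3, 4, 2, 3, 1, 2, 1],   # higher values tend to fail earlier
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],   # unrelated to survival
}
selected = screen_features(features, time, event, alpha=0.10)
print(selected)
```

The sketch omits safeguards that a production fit would need, such as step halving and handling of monotone likelihoods, so an established survival analysis library is preferable for real cohorts.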

4.3 Cox-SHAP analysis phase

At this point, the collected dataset has been preprocessed and the most relevant features have been identified. The selected feature set is:

\begin{aligned} F^{*} (X) = \{X_{1}^{f}, X_{2}^{f}, \dots, X_{m}^{f}\}, \quad X_{1}^{f}, \dots, X_{m}^{f} \in X_{imp}, \end{aligned}

(25)

where m is the number of features selected under the predefined p-value threshold $α$. These selected features are then used as inputs for the Cox proportional hazards model.

This phase aims to enhance both predictive accuracy and interpretability. Specifically, the Cox proportional hazards model is employed for feature refinement to obtain $F^{Cox}$ , while SHAP values are used exclusively for post-hoc interpretability. The SHAP values are not used as inputs to downstream ML or DL models. While the Cox model quantifies the relative risk for each feature, SHAP values further clarify how each feature contributes to individual predictions. Together, these tools provide a transparent framework that not only identifies critical risk factors but also enables healthcare professionals to make more informed and precise treatment decisions based on the factors most influencing patient survival outcomes.

Let D denote the output dataset from the Feature Engineering phase. D is a two-dimensional matrix represented as Equation (26).

\begin{aligned} D = \left[ \begin{array}{ccc|c} x_{1, 1} & \dots & x_{1, m} & y_{1} \\ \vdots & \ddots & \vdots & \vdots \\ x_{n, 1} & \dots & x_{n, m} & y_{n} \end{array} \right], \end{aligned}

(26)

where the columns of $D$ correspond to the m selected features $X_{1}^{f}, X_{2}^{f}, \dots, X_{m}^{f}$ followed by the label $Y$, and $d_{j} = (x_{j, 1}, x_{j, 2}, \dots, x_{j, m})$ represents the cancer data of the j-th patient. The Cox model estimates the hazard function $h (t | X_{i})$ for the i-th patient at time t, based on their feature set $X_{i} = (x_{i, 1}, x_{i, 2}, \dots, x_{i, m})$ , where m represents the number of features in $F^{*} (P (\tilde{X}))$ . The hazard function is expressed as Equation (27).

\begin{aligned} h (t | X_{i}) = h_{0} (t) \exp \left(\sum_{j = 1}^{m} β_{j} x_{i, j}\right), \end{aligned}

(27)

where $h_{0} (t)$ is the baseline hazard, and $β_{j}$ represents the regression coefficient for the j-th feature $x_{i, j}$ . The coefficients $β_{j}$ are estimated by maximizing the partial likelihood function, defined as Equation (28).

\begin{aligned} L (β) = \prod_{i = 1}^{n} {\left[\frac{\exp \left(\sum_{j = 1}^{m} β_{j} x_{i, j}\right)}{\sum_{l \in R (t_{i})} \exp \left(\sum_{j = 1}^{m} β_{j} x_{l, j}\right)}\right]}^{δ_{i}}, \end{aligned}

(28)

where $R (t_{i})$ is the set of patients still at risk at time $t_{i}$ , and $δ_{i}$ is the event indicator (1 if an event occurs for patient i at time $t_{i}$ , and 0 otherwise). Once the $β_{j}$ values are estimated, the hazard ratio for each feature $X_{j}$ is given by Equation (29).

\begin{aligned} {HR}_{j} = \exp (β_{j}) . \end{aligned}

(29)

The hazard ratio ${HR}_{j}$ provides an estimate of the relative risk associated with each feature, highlighting the features with significant impacts on survival. After applying the Cox model, the refined set of significant features is denoted as Equation (30).

\begin{aligned} F^{Cox} = \{X_{j_{1}}, X_{j_{2}}, \dots, X_{j_{q}}\}, \end{aligned}

(30)

where q represents the number of features retained after the Cox model analysis.

To enhance interpretability, SHAP values³⁷ are computed for each feature in $F^{Cox}$ . Let $ϕ_{j}$ denote the SHAP value for a feature $X_{j}$ . The value of $ϕ_{j}$ can be derived from Equation (31).

\begin{aligned} ϕ_{j} = \sum_{S \subseteq \{1, \dots, p\} \setminus \{j\}} \frac{| S |! \, (p - | S | - 1)!}{p!} \left[f (S \cup \{j\}) - f (S)\right], \end{aligned}

(31)

where S is a subset of features excluding $X_{j}$ , p here is the number of features in the explained model, and $f (S)$ represents the model's prediction using only features in S. The SHAP values $ϕ_{j}$ quantify both the individual contribution and interaction effects of each feature $X_{j}$ on the model's predictions.
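To make Equation (31) concrete, the sketch below enumerates every coalition S and computes exact Shapley values for a small model. It is illustrative only: the value function $f(S)$ is assumed to replace features outside S by a fixed baseline vector, the linear log-risk model and its coefficients are hypothetical, and exact enumeration is feasible only for a handful of features (practical SHAP tools rely on approximations).

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values phi_j for one patient vector x (Equation 31).

    predict  : function mapping a full feature list to a number
    baseline : reference values used for features outside the coalition S
    """
    p = len(x)

    def f(subset):
        # Features in `subset` take the patient's values; the rest take the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(p)]
        return predict(z)

    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for coalition in combinations(others, size):
                s = set(coalition)
                total += weight * (f(s | {j}) - f(s))
        phi.append(total)
    return phi

# Hypothetical linear log-risk model with three features.
coefficients = [0.8, -0.5, 0.1]

def log_risk(z):
    return sum(b * v for b, v in zip(coefficients, z))

patient = [3.0, 1.0, 2.0]
reference = [2.0, 0.0, 2.0]
phi = shapley_values(log_risk, patient, reference)
print(phi)
```

For an additive model such as this one, each $ϕ_{j}$ reduces to the coefficient times the feature's deviation from the baseline, and the values sum to the difference between the patient's prediction and the baseline prediction.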

The final output from this phase is the SHAP values for each feature in $F^{Cox}$ , which provide an interpretable explanation of how each feature influences the survival prediction. These SHAP values, alongside the Cox model's hazard ratios, ensure that the model's predictions are both accurate and transparent, facilitating clinical decision-making.

The integration of the Feature Engineering phase, Cox proportional hazards model, and SHAP values forms a coherent and interpretable pipeline for survival analysis. The Feature Engineering phase outputs a feature set $F^{*} (P (\tilde{X})) = \{X_{k_{1}}, X_{k_{2}}, \dots, X_{k_{m}}\}$ , which is refined through the Cox model into a final set of significant features $F^{Cox} = \{X_{j_{1}}, X_{j_{2}}, \dots, X_{j_{q}}\}$ . SHAP values are then calculated for features in $F^{Cox}$ , providing an interpretable framework for understanding how each feature affects individual predictions.

This workflow ensures a robust and transparent approach to predicting liver cancer patient survival, where both the statistical significance of features and their individual contributions to the prediction are clear. The combined model not only predicts survival outcomes but also provides insights into which features are most influential, supporting clinical decision-making. The Cox-SHAP module therefore functions as an interpretable feature selection mechanism before model training, rather than a feature transformation step for downstream prediction models.

4.4 Pre-clustering phase

Unlike conventional clustering used purely for data partitioning, the goal of K-means in this study is to identify latent patient subgroups with distinct clinical characteristics and survival patterns. These clusters may implicitly correspond to clinically meaningful categories, such as differences in tumor stage, treatment modality, or liver function status.

This phase employs the K-means clustering algorithm to group patients based on the Cox-selected feature set $F^{Cox} = \{X_{j_{1}}, X_{j_{2}}, \dots, X_{j_{q}}\}$ . This set contains q clinical features identified as having a significant influence on survival.

The input feature matrix $F^{Cox}$ represents feature vectors consisting of q selected features for all n patients. The value of $F^{Cox}$ can be defined as Equation (32).

\begin{aligned} F^{Cox} = \{X_{1}^{Cox}, X_{2}^{Cox}, \dots, X_{n}^{Cox}\}, \end{aligned}

(32)

where $X_{i}^{Cox}$ denotes the feature vector of the i-th patient. The value of $X_{i}^{Cox}$ can be derived from Equation (33).

\begin{aligned} X_{i}^{Cox} = (x_{i, j_{1}}, x_{i, j_{2}}, \dots, x_{i, j_{q}}) . \end{aligned}

(33)

Using these feature vectors, the K-means algorithm partitions the patients into k distinct clusters. The clustering process seeks to minimize the within-cluster sum of squares (WCSS),³⁸ where the goal is to assign each patient to a cluster such that the total distance between each patient's feature vector and the centroid of their respective cluster is minimized. The WCSS is mathematically expressed as Equation (34).

\begin{aligned} WCSS = \sum_{j = 1}^{k} \sum_{X_{i}^{Cox} \in C_{j}} ‖ X_{i}^{Cox} - μ_{j} ‖^{2}, \end{aligned}

(34)

where $C_{j}$ is the set of patients assigned to the j-th cluster and $μ_{j}$ is the centroid of the j-th cluster, calculated as Equation (35).

\begin{aligned} μ_{j} = \frac{1}{| C_{j} |} \sum_{X_{i}^{Cox} \in C_{j}} X_{i}^{Cox} . \end{aligned}

(35)

The K-means algorithm iteratively assigns each patient $X_{i}^{Cox}$ to the nearest centroid and updates the centroid positions until convergence is reached, i.e., until the cluster memberships stabilize. The optimal number of clusters k is determined using the elbow method, which plots WCSS against different values of k to identify the point of diminishing returns.
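The assignment and update steps, together with the WCSS of Equation (34), can be sketched as follows. This is a minimal illustration under stated assumptions: centroids are initialized by sampling k training points with a fixed seed, the toy points stand in for the vectors $X_{i}^{Cox}$, and the helper `assign` (a hypothetical name) labels held-out patients by their nearest training centroid so that the clustering is fitted on training data only.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids, wcss)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # memberships have stabilized
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

def assign(points, centroids):
    """Label new (e.g. validation or test) patients without re-fitting."""
    return [nearest(p, centroids) for p in points]

# Toy two-dimensional training vectors forming two well-separated groups.
train_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Elbow method: inspect how WCSS changes as k grows.
for k in range(1, 5):
    print(k, kmeans(train_points, k)[2])

labels, centroids, wcss = kmeans(train_points, 2)
test_labels = assign([(9, 9), (1, 1)], centroids)
```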

Once the clustering process is complete, each patient is assigned a cluster label $C_{i}$ , corresponding to the cluster to which they belong. These cluster labels are appended to the original feature matrix $F^{Cox}$ , creating an augmented dataset that incorporates both the selected clinical features and the newly assigned cluster memberships. This augmented dataset can be represented as Equation (36).

\begin{aligned} F^{Cox} \cup \{C_{1}, C_{2}, \dots, C_{n}\}, \end{aligned}

(36)

where each pair $(X_{i}^{Cox}, C_{i})$ represents a patient's clinical feature vector combined with their assigned cluster label. The final output of the Pre-clustering phase is the dataset $D_{clustered}$ , defined as Equation (37).

\begin{aligned} D_{clustered} = \{(X_{1}^{Cox}, C_{1}), (X_{2}^{Cox}, C_{2}), \dots, (X_{n}^{Cox}, C_{n})\} . \end{aligned}

(37)

This enriched dataset integrates both the detailed clinical features and the cluster memberships, forming a comprehensive basis for the subsequent modeling phase. By including the cluster information $C_{i}$ , the model can leverage not only individual patient features but also patterns captured by the clustering, which may highlight underlying group structures within the patient population. This approach aims to enhance the model's predictive accuracy and generalizability in forecasting survival outcomes for liver cancer patients.

In the next phase, the predictive model is trained on the clustered dataset $D_{clustered}$ . The integration of cluster information alongside the selected features enables the model to capture complex relationships that may not be immediately apparent from clinical features alone, ultimately improving the prediction of survival outcomes.

4.5 MDT phase

Following the Pre-clustering phase, the dataset $D_{clustered} = \{(X_{i}^{Cox}, C_{i})\}_{i = 1}^{n}$ is ready for model training. In this dataset, $X_{i}^{Cox}$ denotes the clinical features selected via the Cox proportional hazards model, and $C_{i}$ represents the cluster label assigned to the i-th patient.

This phase employs specific ML and DL algorithms to predict the survival duration $Y_{i}$ of liver cancer patients using the processed and clustered dataset. The ML algorithms include Support Vector Machines (SVM) for robust classification, Random Forest for ensemble-based feature aggregation, and Gradient Boosting for sequential error correction. The SVM model³⁹ aims to find the optimal hyperplane separating data points into distinct classes, represented as Equation (38).

\begin{aligned} f (X_{i}^{Cox}) = w^{⊤} X_{i}^{Cox} + b, \end{aligned}

(38)

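where w denotes the weight vector, and b is the bias term. The optimization problem to find w and b is formulated as Equation (39).

\begin{aligned} \min_{w, b, ξ} \; \frac{1}{2} ‖ w ‖^{2} + λ \sum_{i = 1}^{n} ξ_{i}, \quad \text{subject to} \quad y_{i} (w^{⊤} X_{i}^{Cox} + b) \geq 1 - ξ_{i}, \; ξ_{i} \geq 0, \end{aligned}

(39)

where $ξ_{i}$ represents the slack variables for misclassification and $λ > 0$ is the penalty parameter (commonly written C) that controls the cost of margin violations.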

The Random Forest model aggregates predictions from multiple decision trees $T_{j} (X_{i}^{Cox})$ , with the final prediction presented as Equation (40).

\begin{aligned} {\hat{Y}}_{i}^{RF} = \frac{1}{m} \sum_{j = 1}^{m} T_{j} (X_{i}^{Cox}), \end{aligned}

(40)

where m is the number of trees in the forest.

Gradient Boosting iteratively refines predictions by building models sequentially, each correcting the errors of the previous one, described as Equation (41).

\begin{aligned} F_{j + 1} (X_{i}^{Cox}) = F_{j} (X_{i}^{Cox}) + γ_{j} h_{j} (X_{i}^{Cox}), \end{aligned}

(41)

where $h_{j} (X_{i}^{Cox})$ is the j-th weak learner, and $γ_{j}$ is the step size determined by minimizing a loss function.
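The additive update of Equation (41) can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the paper: a squared-error loss, single-split regression stumps on one feature as the weak learners $h_{j}$, and a constant shrinkage factor `step` in place of a per-round line search for $γ_{j}$. The toy data are hypothetical.

```python
def fit_stump(x, residual):
    """Least-squares regression stump (one split) on a single feature."""
    thresholds = sorted(set(x))[:-1]
    if not thresholds:
        raise ValueError("need at least two distinct feature values")
    best = None
    for threshold in thresholds:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=50, step=0.1):
    """Return a predictor F(x) = F_0 + step * sum_j h_j(x)."""
    f0 = sum(y) / len(y)                    # initial constant model F_0
    pred = [f0] * len(y)
    learners = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residual = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, residual)
        learners.append(h)
        pred = [pi + step * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + step * sum(h(xi) for h in learners)

# Hypothetical one-dimensional regression data.
x_demo = [1, 2, 3, 4, 5, 6]
y_demo = [1.0, 1.0, 2.0, 4.0, 5.0, 5.0]
model = gradient_boost(x_demo, y_demo, n_rounds=200, step=0.1)
print([round(model(xi), 2) for xi in x_demo])
```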

The DL algorithm is a fully connected DNN designed to capture non-linear patterns in the high-dimensional clinical data, where the activations at layer $l + 1$ are calculated as Equation (42).

\begin{aligned} h^{(l + 1)} = σ (W^{(l)} h^{(l)} + b^{(l)}), \end{aligned}

(42)

where $W^{(l)}$ and $b^{(l)}$ are the weight matrix and bias vector for layer l, and $σ$ is the activation function. The output layer provides the final prediction as Equation (43).

\begin{aligned} {\hat{Y}}_{i}^{DL} = W^{(L)} h^{(L)} + b^{(L)}, \end{aligned}

(43)

where L denotes the final layer in the network.
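A forward pass following Equations (42) and (43) can be sketched in plain Python. The choice of ReLU for $σ$, the layer sizes, and the weights below are assumptions made for illustration; the paper does not specify them here.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def dense(weights, bias, h):
    """Compute W h + b for one layer (weights is a list of rows)."""
    return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, bias)]

def forward(x, layers, activation=relu):
    """Hidden layers apply Equation (42); the last layer is linear (Equation 43)."""
    hidden, (w_out, b_out) = layers[:-1], layers[-1]
    h = x
    for weights, bias in hidden:
        h = activation(dense(weights, bias, h))
    return dense(w_out, b_out, h)

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # hidden layer (W, b)
    ([[2.0, 1.0]], [0.5]),                      # output layer (W, b)
]
prediction = forward([1.0, 2.0], layers)
print(prediction)
```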

To capitalize on the complementary strengths of the ML and DL models, a model fusion strategy is implemented. The predictions from the ML model ${\hat{Y}}_{i}^{ML}$ and the DL model ${\hat{Y}}_{i}^{DL}$ are combined using the weighted average in Equation (44).

\begin{aligned} {\hat{Y}}_{i} = α {\hat{Y}}_{i}^{ML} + (1 - α) {\hat{Y}}_{i}^{DL}, \end{aligned}

(44)

where $α \in [0, 1]$ is a weighting factor (distinct from the significance threshold $α$ in Equation (24)) optimized through cross-validation to enhance the overall prediction accuracy.
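The fusion rule of Equation (44) and one simple way to choose the weight can be sketched as follows. The grid of candidate weights, the mean squared error criterion, and the single validation set are simplifying assumptions for this example; the text above states only that the weight is optimized through cross-validation.

```python
def fuse(y_ml, y_dl, alpha):
    """Weighted average of ML and DL predictions (Equation 44)."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(y_ml, y_dl)]

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def select_alpha(y_true, y_ml, y_dl, grid=None):
    """Pick the weight from `grid` with the lowest validation error."""
    if grid is None:
        grid = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    return min(grid, key=lambda a: mean_squared_error(y_true, fuse(y_ml, y_dl, a)))

# Hypothetical validation targets and model outputs (e.g. months of survival).
y_val = [10.0, 20.0, 30.0]
ml_val = [12.0, 22.0, 32.0]   # this model overestimates
dl_val = [8.0, 18.0, 28.0]    # this model underestimates

best_alpha = select_alpha(y_val, ml_val, dl_val)
fused = fuse(ml_val, dl_val, best_alpha)
print(best_alpha, fused)
```

In this toy case the two models err in opposite directions, so an intermediate weight outperforms either model alone, which is the situation in which fusion is expected to help.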

The dataset is split into training and testing subsets, with $ρ$ denoting the fraction of patients assigned to training. The training set $X_{train}$ consists of $ρ \times n$ patients, while the test set $X_{test}$ includes the remaining $(1 - ρ) \times n$ patients, reserved for model evaluation. Requiring $0 < ρ < 1$ keeps both subsets non-empty, so that both training and testing can be carried out.

The predictive model $M$ is trained using the training set $X_{train}$ , mapping the augmented feature vectors $(x_{j, 1} \dots x_{j, p}, C_{j})$ to predicted outcomes ${\hat{Y}}_{j}$ as Equation (45).

\begin{aligned} {\hat{Y}}_{j} = M_{train} (x_{j, 1} \dots x_{j, p}, C_{j}), \end{aligned}

(45)

where $1 \leq j \leq ρ \times n$ indexes the patients in the training set. The inclusion of cluster information $C_{j}$ enhances the model's ability to capture additional patterns not apparent from clinical features alone, thereby improving predictive power.

Upon completion of the training, the model's performance is evaluated by applying $M_{train}$ to the test set $X_{test}$ , which the model has not seen during training. Because the relationship ${\hat{Y}}_{j} = M_{train} (x_{j, 1} \dots x_{j, p}, C_{j})$ integrates the effects of clinical features and cluster memberships, this evaluation assesses the model's generalizability and its potential to provide accurate survival time predictions for new liver cancer patients.

4.6 Algorithm of the proposed HLPL

The overall workflow of the proposed HLPL mechanism is summarized in Table 2. The algorithm outlines the order in which the equations are applied and the sequence of steps from raw data to final survival prediction.

Table 2.
The algorithm of the proposed HLPL.

Algorithm: HLPL

Inputs: Raw clinical dataset $\tilde{D} = (\tilde{X}, \tilde{Y})$ .

Output: Predicted survival durations $\hat{Y}$ .

Data Preprocessing Phase:

1. Handle outliers: compute $Z_{i, j}$ (Equation 16) and cap/remove (Equation 17)

2. Impute missing values via KNN (Equations 18–19)

3. Standardize features $x_{i, j}$ , one-hot encode categorical features, and transform labels $\tilde{Y}$ to a compatible format (Equations 20–22)

4. Output: preprocessed dataset $D = (X, Y)$

Feature Engineering Phase:

5. For each feature $X_{k}$ : fit a univariate Cox model (Equation 23)

6. Compute p-values $p_{k}$ using the Wald test

7. Select significant features $X_{imp}$ (Equation 24)

8. Output: survival-relevant feature subset $X_{imp}$

Cox-SHAP Analysis Phase:

9. Define feature set $F^{*} (X) = \{X_{1}^{f}, X_{2}^{f}, \dots, X_{m}^{f}\}$ (Equation 25)

10. Fit Cox proportional hazards model, obtaining coefficients $β_{j}$ (Equations 26–29)

11. Select refined features $F^{Cox}$ (Equation 30)

12. Compute SHAP values $ϕ_{j}$ for interpretability (Equation 31)

13. Output: $F^{Cox}$ and $ϕ_{j}$

Pre-clustering Phase:

14. Construct feature vectors $X_{i}^{Cox}$ for all patients (Equation 33)

15. Apply K-means clustering to assign cluster labels $C_{i}$ (Equations 34–35)

16. Output: cluster-augmented dataset $D_{clustered}$ (Equations 36–37)

Model Development & Training Phase:

17. Split $D_{clustered}$ into training $X_{train}$ and testing $X_{test}$ sets

18. Train ML models: SVM (Equations 38–39), Random Forest (Equation 40), Gradient Boosting (Equation 41)

19. Train deep learning model: fully connected DNN (Equations 42–43)

20. Fuse ML and DL predictions via weighted average (Equation 44)

21. Output: final predicted survival durations $\hat{Y}$

5 Performance evaluation

This section assesses the performance improvement of the proposed HLPL over traditional ML and DL approaches for liver cancer survival prediction. Conventional methods, such as Cox proportional hazards models, struggle with high-dimensional and heterogeneous data, often leading to suboptimal survival predictions. Furthermore, while some ML models, such as Random Forest and Gradient Boosting, offer improved accuracy, they lack the robustness needed to address the diverse characteristics present within the patient population. The proposed HLPL combines ML and DL techniques and integrates K-means pre-clustering to account for population heterogeneity. Pre-clustering segments patients into more homogeneous groups before predictive modeling, allowing for more targeted predictions. Additionally, the system's ensemble model combines predictions from Support Vector Machines, Random Forests, and DNNs, leveraging the strengths of each to achieve higher prediction accuracy and reliability. The following sections provide an overview of the simulation environment, evaluation metrics, and a comparative analysis of simulation results between HLPL and baseline methods.

5.1 Simulation environmental parameters and dataset

Table 3 summarizes the simulation parameters used throughout this study. The experiment is conducted on a system running Windows 10, with Anaconda as the primary development platform. The hardware includes 16GB of RAM and an NVIDIA GeForce RTX 3060 Laptop GPU with 6GB of memory.

Table 3.
Experimental parameters.

Parameters	Values
Number of Patients	673
Number of Fields per Patient	127
Risk Model Input Dimensions	67 Dimensions
Multiclass Labels	8 categories

The dataset consists of 673 liver cancer patients collected between 2015 and 2022 from our institutional database. This single-center retrospective cohort includes patients with pathologically confirmed HCC. Inclusion criteria included histologically confirmed HCC, complete clinical and laboratory records, and follow-up ≥1 month. Exclusion criteria included missing survival data, prior liver transplantation, and concurrent malignancies other than HCC. The dataset includes 127 features across several domains. No genomic data was included in this study. Table 4 summarizes patient demographics, tumor characteristics, and survival event rates.

Table 4.

Patient demographics, tumor characteristics, and survival event rates.

Characteristic	Value
Age (mean ± SD)	61.5 ± 10.2 years
Sex (M/F)	452/221
Tumor stage (I/II/III/IV)	120/210/230/113
Primary site surgery	None: 210, Ablation: 180, Resection: 283
Median follow-up	36 months
Event rate (death)	45%

The dataset was randomly split into training (70%), validation (15%), and testing (15%) sets. All preprocessing, feature selection, and K-means clustering were applied only to the training set to avoid data leakage, and learned transformations were applied to the validation and test sets without re-fitting. Five-fold cross-validation was applied on the training set for hyperparameter tuning, while the test set was reserved for final unbiased evaluation. Class imbalance was addressed by applying class weights during training, while the test set remained unchanged.
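The 70/15/15 partition described above can be sketched as an index split. The function name, the fixed seed, and the rounding rule are assumptions for illustration; the point is that every later fitting step (scaling, feature screening, clustering) would see only the training indices.

```python
import random

def split_indices(n, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle patient indices and cut them into train / validation / test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(train_frac * n)
    n_val = round(val_frac * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remaining patients
    return train, val, test

train_idx, val_idx, test_idx = split_indices(673)
print(len(train_idx), len(val_idx), len(test_idx))
```

A stratified split by outcome class is a common refinement when classes are imbalanced; the sketch above does not implement it.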

5.2 Simulation results

To evaluate the impact of training dataset size and clustering on model performance, Figures 1, 2, and 3 illustrate the Accuracy, Precision, and F1-score of the proposed HLPL framework, respectively, as the ratio of training data and the number of clusters are varied. As shown in Figures 1–3, all metrics exhibit a consistent upward trend as the training data size increases. This is because a larger training dataset enables the model to learn more comprehensive and representative patterns of patient survival characteristics, thereby improving predictive performance and generalization ability.

Figure 1.

Impact of data volume and clustering on accuracy.

Figure 2.

Impact of data volume and clustering on precision.

Figure 3.

Impact of data volume and clustering on F1-score.

In addition, increasing the number of clusters further enhances Precision and F1-score. The reason is that clustering partitions patients into more homogeneous subgroups, reducing inter-patient heterogeneity and allowing the model to learn subgroup-specific patterns more effectively. This leads to more accurate classification and better balance between precision and recall. Overall, the proposed HLPL framework achieves improved performance across all evaluation metrics under larger data volumes and appropriate clustering settings. This superior performance can be attributed to the integration of clustering and ensemble learning, which jointly enhances the model's ability to capture complex clinical patterns and improve robustness in multi-class survival prediction.

Figures 4, 5, and 6 illustrate the survival distributions of liver cancer patients under different surgical interventions, including no resection, tumor ablation, and resection surgery, respectively. As shown in Figures 4–6, survival outcomes vary significantly across treatment types. Patients without resection show the poorest prognosis, with a large proportion in the shortest survival category, due to the lack of effective tumor control. In contrast, tumor ablation improves short- and mid-term survival, as it can effectively suppress tumor progression in selected patients. Among all groups, resection surgery achieves the best long-term survival outcomes, since complete tumor removal reduces disease burden and improves prognosis. Overall, these results indicate that surgical intervention, particularly resection, plays a critical role in improving survival outcomes, which is consistent with clinical observations.

Figure 4.

Survival distribution for patients without primary site resection.

Figure 5.

Survival distribution for patients with tumor ablation.

Figure 6.

Survival distribution for patients with resection surgery.

Figures 7, 8, and 9 compare the performance of the proposed HLPL framework in terms of Accuracy, Precision, and F1-score by varying the training data size across different surgical groups. As shown in Figures 7–9, all metrics increase with the size of training data, as larger datasets enable the model to learn more representative survival patterns and improve generalization. In addition, performance differs across surgical types, with resection patients achieving the best results, followed by tumor ablation, while non-surgical patients show lower performance. This is because surgical groups exhibit clearer survival patterns, making them easier to model, whereas non-surgical cases are more heterogeneous. Overall, the proposed HLPL framework demonstrates robust performance across different treatment groups, with improved accuracy and balance as data volume increases. This performance gain is attributed to the model's ability to integrate heterogeneous clinical information and leverage both ML and DL techniques, enabling accurate and stable survival prediction across diverse clinical scenarios.

Figure 7.

Impact of data volume and surgical type on accuracy.

Figure 8.

Impact of data volume and surgical type on precision.

Figure 9.

Impact of data volume and surgical type on F1-score.

To validate the clinical relevance of clustering, Figure 10 presents the relationship between the number of clusters and the Silhouette Score, a standard measure of clustering quality. The score stabilizes at four clusters, indicating an optimal balance between intra-cluster cohesion and inter-cluster separation. This clustering strategy improves the model's ability to capture patient heterogeneity by grouping individuals into more homogeneous subpopulations, enabling more accurate and personalized survival predictions. Furthermore, distinct survival trends are observed across clusters, suggesting that the clustering process captures meaningful prognostic stratification rather than merely performing mathematical partitioning.
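For reference, the Silhouette Score used here can be computed as in the sketch below, which assumes Euclidean distance, at least two clusters, and the common convention of scoring singleton clusters as zero. The toy points and labelings are hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (requires >= 2 clusters)."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)   # singleton cluster convention
            continue
        a = sum(same) / len(same)            # mean intra-cluster distance
        b = min(                             # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])   # compact, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])    # clusters that interleave
print(good, bad)
```

Scores close to 1 indicate compact, well-separated clusters, while values near or below zero indicate overlapping or misassigned groups.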

Figure 10.

Relationship between cluster count and silhouette score.

Figure 11.

Feature importance analysis using random forest base model.

The significance of clinical features was assessed using SHAP values across three base models: Random Forest, Gradient Boosting, and XGBoost. SHAP values provide an interpretable measure of each feature's contribution to model predictions, enhancing both clinical insight and model transparency. Across all models, clinical cancer stage, primary site surgery type, tumor size, AFP level, and liver function indicators consistently emerged as the most influential predictors. In Figure 11 (Random Forest), clinical cancer stage and primary site surgery ranked highest across patients, indicating strong associations with survival outcomes. Figures 12 and 13 (Gradient Boosting and XGBoost) confirmed these findings, demonstrating the robustness of feature importance across different models. Clinically, these top features correspond to well-established prognostic factors in liver cancer: higher cancer stages are associated with poorer survival; curative surgery improves outcomes; larger tumors and elevated AFP levels indicate higher risk; and abnormal liver function reflects compromised hepatic reserve, which affects both prognosis and treatment tolerance. Moreover, SHAP analysis reveals potential nonlinear effects and interactions among features—for instance, the impact of AFP on survival may depend on tumor size or liver function status. These interpretable insights provide clinicians with actionable information for individualized risk assessment, prognostic refinement, and treatment planning.

Figure 12.

Feature importance analysis using gradient boosting base model.

Figure 13.

Feature importance analysis using XGBoost base model.

Table 5 compares the performance of the proposed HLPL framework with several baseline methods for liver cancer survival prediction. Traditional statistical models, such as the Cox proportional hazards model,⁵ achieve relatively low performance due to assumptions like proportional hazards and limited ability to capture nonlinear relationships. Classical ML methods (SVM,¹⁰ Random Forest¹³) improve performance, while ensemble methods like Gradient Boosting²³ further enhance results. DNN²⁸ achieves higher accuracy and F1-score, and the hybrid ML-DL model²⁹ improves further. Notably, the proposed HLPL framework attains the best performance across all metrics. This superior performance can be attributed to Cox-SHAP feature evaluation, pre-clustering to address patient heterogeneity, and ensemble integration of ML and DL models, which together enhance predictive accuracy and robustness.

Table 5.

Performance comparison of different models on liver cancer survival prediction.

Method	Accuracy	Precision	Recall	F1-score
Cox Proportional Hazards⁵	0.68	0.65	0.63	0.64
SVM¹⁰	0.74	0.72	0.71	0.71
Random Forest¹³	0.78	0.76	0.75	0.75
Gradient Boosting²³	0.80	0.78	0.77	0.77
DNN²⁸	0.82	0.80	0.79	0.79
Hybrid ML-DL²⁹	0.84	0.82	0.81	0.81
Proposed HLPL	0.88	0.86	0.85	0.85
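The four metrics reported in Table 5 can be computed for a multi-class problem as sketched below. Macro-averaging over classes is an assumption of this example, since the averaging scheme is not stated above; the labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }

# Hypothetical true and predicted survival categories.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```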

6 Conclusion

This study presents HLPL, a comprehensive survival analysis framework that accurately predicts the survival duration of liver cancer patients. By combining rigorous data preprocessing and feature engineering, the framework ensures data quality and model robustness. The integration of the Cox proportional hazards model with SHAP value analysis provides interpretable insights into key risk factors, such as tumor stage, liver function, and surgical interventions, linking model predictions directly to underlying physiological and clinical characteristics. K-means clustering captures latent patient subgroups with distinct survival profiles, reflecting the heterogeneity in tumor biology and treatment responses. The combination of ML and DL models achieves high predictive accuracy while maintaining robustness, highlighting the potential of data-driven approaches to complement clinical decision-making.

Despite these promising results, several limitations should be noted. The dataset is derived from a single institution, which may limit population diversity and introduce potential biases related to local clinical practices, thereby affecting model generalizability. In addition, external validation on independent cohorts has not yet been conducted, limiting the assessment of model robustness. Although the HLPL framework is designed to be flexible and extensible, its generalization to other clinical scenarios requires further validation. Additionally, the number of clusters is determined empirically, which may influence model stability, and the current formulation does not fully capture the temporal dynamics of survival processes.

Future work will address these limitations by performing multi-center validation to assess robustness and generalizability, integrating multimodal data such as imaging and genomics to improve predictive performance and interpretability, and developing more advanced survival modeling techniques and real-time clinical decision support systems for practical deployment.

Footnotes

Ethics approval and consent to participate

Not applicable to this study.

Author contributions

ChihYung Chang: Conceptualization, methodology, formal analysis, writing-original-draft, and experiment design. Chin-Hwa Kuo: Conceptualization, methodology, formal analysis, writing-original-draft, and experiment design. Tsung-Jung Lin: Methodology, writing-original-draft, and experiment design. Yeh Chen: Writing-original-draft and experiment design. Diptendu Sinha Roy: Writing-review-editing and supervision.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

The datasets generated during the current study are not publicly available but are available from the corresponding author.

References

1. Wakil, Mazzaferro, et al. Trends of hepatocellular carcinoma (HCC) inpatients mortality and financial burden from 2011 to 2017: a nationwide analysis. J Clin Gastroenterol 2024; 58: 85–90.
2. Huang, Geller, et al. Macrophage metabolism, phenotype, function, and therapy in hepatocellular carcinoma (HCC). J Transl Med 2023; 21: 815.
3. Toh, Wong EYT, Wong, et al. Global epidemiology and genetics of hepatocellular carcinoma. Gastroenterology 2023; 164: 766–782.
4. Mauro, Forner. Barcelona clinic liver cancer 2022 update: linking prognosis prediction and evidence-based treatment recommendation with multidisciplinary clinical decision-making. Liver Int 2022; 42: 488.
5. McLernon, Giardiello, Van Calster, et al. Assessing performance and clinical usefulness in prediction models with survival outcomes: practical guidance for Cox proportional hazards models. Ann Intern Med 2023; 176: 105–114.
6. Qiu, Gao, Yang, et al. A comparison study of machine learning (random survival forest) and classic statistic (Cox proportional hazards) for predicting progression in high-grade glioma after proton and carbon ion radiotherapy. Front Oncol 2020; 10: 551420.
7. Ouyang, Pan, et al. A robust twelve-gene signature for prognosis prediction of hepatocellular carcinoma. Cancer Cell Int 2020; 20: 207.
8. Deo. Machine learning in medicine. Circulation 2015; 132: 1920–1930.
9. Moncada-Torres, van Maaren, Hendriks, et al. Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival. Sci Rep 2021; 11: 6968.
10. Spooner, Chen, Sowmya, et al. A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Sci Rep 2020; 10: 20410.
11. Kashif, Bakhtawar, Akhtar, et al. Treatment response prediction in hepatitis C patients using machine learning techniques. International Journal of Technology, Innovation and Management (IJTIM) 2021; 1: 79–89.
12. Park, Cho, Kim, et al. Machine learning prediction of incidence of Alzheimer’s disease using large-scale administrative health data. npj Digit Med 2020; 3: 46.
13. Schoenberg, Bucher, Koch, et al. A novel machine learning algorithm to predict disease free survival after resection of hepatocellular carcinoma. Ann Transl Med 2020; 8: 434.
14. Morales Ferez, Mill, Juhl, et al. Deep learning framework for real-time estimation of in-silico thrombotic risk indices in the left atrial appendage. Front Physiol 2021; 12: 694945.
15. Cai, Lambregts, Beets, et al. An automated deep learning pipeline for EMVI classification and response prediction of rectal cancer using baseline MRI: a multi-centre study. npj Precision Oncology 2024; 8: 17.
16. Zhang, Yang, Chen, et al. Histopathology images-based deep learning prediction of prognosis and therapeutic response in small cell lung cancer. npj Digit Med 2024; 7: 15.
17. Park, Silva, Singhal, et al. A deep learning model of tumor cell architecture elucidates response and resistance to CDK4/6 inhibitors. Nature Cancer 2024; 5: 996–1009.
18. Keyl, Hosch, Berger, et al. Deep learning-based assessment of body composition and liver tumour burden for survival modelling in advanced colorectal cancer. J Cachexia Sarcopenia Muscle 2023; 14: 545–552.
19. Zhao, You, Bai, et al. Machine learning-based construction of a ferroptosis and necroptosis associated lncRNA signature for predicting prognosis and immunotherapy response in hepatocellular cancer. Front Oncol 2023; 13: 1171878.
20. Wang, et al. Predicting postoperative liver cancer death outcomes with machine learning. Curr Med Res Opin 2021; 37: 629–634.
21. Xie, Hong, Liu, et al. Interpretable machine learning-based clinical prediction model for predicting lymph node metastasis in patients with intrahepatic cholangiocarcinoma. BMC Gastroenterol 2024; 24: 137.
22. Zhang, et al. Development and experimental validation of a machine learning-based disulfidptosis-related ferroptosis score for hepatocellular carcinoma. Apoptosis 2024; 29: 103–120.
23. Sun, et al. Machine learning to improve prognosis prediction of early hepatocellular carcinoma after surgical resection. Journal of Hepatocellular Carcinoma 2021; 8: 913–923.
24. Wang, et al. Machine learning-based prognostic modeling of lysosome-related genes for predicting prognosis and immune status of patients with hepatocellular carcinoma. Front Immunol 2023; 14: 1169256.
25. Huang, Chen, Zeng, et al. Development and validation of a machine learning prognostic model for hepatocellular carcinoma recurrence after surgical resection. Front Oncol 2021; 10: 593741.
26. Huang, Johnson, Han, et al. Deep learning-based cancer survival prognosis from RNA-Seq data: approaches and evaluations. BMC Med Genet 2020; 13: 41.
27. Bhambhvani, Zamora, Velaer, et al. Deep learning enabled prediction of 5-year survival in pediatric genitourinary rhabdomyosarcoma. Surg Oncol 2021; 36: 23–27.
28. Lai, Chen, Hsu, et al. Overall survival prediction of non-small cell lung cancer by integrating microarray and clinical data with deep learning. Sci Rep 2020; 10: 4679.
29. Vale-Silva, Rohr. Long-term cancer survival prediction using multimodal deep learning. Sci Rep 2021; 11: 13505.
30. Wulczyn, Steiner, et al. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS One 2020; 15: e0233678.
31. Asif, Arif, Mukheimer. A data-driven approach with explainable artificial intelligence for customer churn prediction in the telecommunications industry. Results in Engineering 2025; 26: 104629.
32. Arif, Mukheimer, Asif. Enhancing the early detection of chronic kidney disease: a robust machine learning model. Big Data Cogn Comput 2023; 7: 144.
33. Arif, Rehman, Asif. Explainable machine learning model for chronic kidney disease prediction. Algorithms 2024; 17: 443.
34. Salehi, Saadatfar, Oyelere, et al. Enhancing healthcare outcome with scalable processing and predictive analytics via cloud healthcare API. Front Digit Health 2025; 7: 1687131.
35. Kalita, El Aouifi, Kukkar, et al. LSTM-SHAP based academic performance prediction for disabled learners in virtual learning environments: a statistical analysis approach. Soc Netw Anal Min 2025; 15: 65.
36. Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 1972; 34: 187–202.
37. Lundberg, Lee. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst 2017; 30: 1–10.
38. Darken, Moody. Fast adaptive k-means clustering: some empirical results. In: International Joint Conference on Neural Networks (IJCNN), IEEE, 1990.
39. Cortes, Vapnik. Support-vector networks. Mach Learn 1995; 20: 273–297.

Algorithm: HLPL
Inputs: Raw clinical dataset $\tilde{D} = (\tilde{X}, \tilde{Y})$.
Output: Predicted survival durations $\hat{Y}$.
	Data Preprocessing Phase:
1.	Handle outliers: compute $Z_{i,j}$ (Equation 16) and cap/remove (Equation 17)
2.	Impute missing values via KNN (Equations 18–19)
3.	Standardize features $x_{i,j}$ and transform labels $\tilde{Y}$ to a compatible format (Equations 20–22)
4.	Output: preprocessed dataset $D = (X, Y)$
	Feature Engineering Phase:
5.	For each feature $X_{k}$:
6.		Compute p-value $p_{k}$ using the Wald test
7.	Select significant features $X_{\mathrm{imp}}$ (Equation 24)
8.	Output: survival-relevant feature subset $X_{\mathrm{imp}}$
	Cox-SHAP Analysis Phase:
9.	Define feature set $F^{*}(X) = \{X_{1}^{f}, X_{2}^{f}, \dots, X_{m}^{f}\}$ (Equation 25)
10.	Fit Cox proportional hazards model, obtaining coefficients $\beta_{j}$ (Equations 26–29)
11.	Select refined features $F^{\mathrm{Cox}}$ (Equation 30)
12.	Compute SHAP values $\phi_{j}$ for interpretability (Equation 31)
13.	Output: $F^{\mathrm{Cox}}$ and $\phi_{j}$
	Pre-clustering Phase:
14.	Construct feature vectors $X_{i}^{\mathrm{Cox}}$ for all patients (Equation 33)
15.	Apply K-means clustering to assign cluster labels $C_{i}$ (Equations 34–35)
16.	Output: cluster-augmented dataset $D_{\mathrm{clustered}}$ (Equations 36–37)
	Model Development & Training Phase:
17.	Split $D_{\mathrm{clustered}}$ into training $X_{\mathrm{train}}$ and testing $X_{\mathrm{test}}$ sets
18.	Train ML models: SVM (Equations 38–39), Random Forest (Equation 40), Gradient Boosting (Equation 41)
19.	Train deep learning model: fully connected DNN (Equations 42–43)
20.	Fuse ML and DL predictions via weighted average (Equation 44)
21.	Output: final predicted survival durations $\hat{Y}$
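
The clustering and fusion steps can be illustrated with a small sketch. It assumes, in line with the abstract's statement that K-means is fitted on the training set only, that centroids are learned from training rows and that test rows simply receive the label of their nearest centroid. The fusion function is a plain two-model weighted average; the exact form of Equation 44 and the weight `w_ml` are not reproduced from the paper, and all function names and toy values are hypothetical.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids: List[List[float]], x: Vector) -> int:
    """Index of the centroid closest to x, used as the cluster label."""
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], x))


def fit_kmeans(rows: List[Vector], k: int, iters: int = 50, seed: int = 0) -> List[List[float]]:
    """Lloyd's algorithm; centroids are estimated from `rows` only."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(list(rows), k)]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for r in rows:
            groups[nearest(centroids, r)].append(r)
        # Recompute each centroid as the column-wise mean; keep it if its group is empty.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def augment(rows: List[Vector], centroids: List[List[float]]) -> List[List[float]]:
    """Append the cluster label as an extra feature to every row."""
    return [list(r) + [float(nearest(centroids, r))] for r in rows]


def fuse(ml_pred: Sequence[float], dl_pred: Sequence[float], w_ml: float = 0.5) -> List[float]:
    """Weighted average of ML and DL predictions (w_ml is an assumed weight)."""
    return [w_ml * m + (1.0 - w_ml) * d for m, d in zip(ml_pred, dl_pred)]


# Toy data (hypothetical): two well-separated groups of patients.
train = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
test = [[0.1, 0.1], [5.1, 5.0]]

centroids = fit_kmeans(train, k=2)      # fitted on training rows only
train_aug = augment(train, centroids)
test_aug = augment(test, centroids)     # test rows never influence the centroids

fused = fuse([10.0, 20.0], [20.0, 40.0], w_ml=0.25)
print(train_aug, test_aug, fused)
```

Fitting the centroids before the train/test split would let test patients shape the cluster boundaries, so the sketch splits first and clusters second.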



Table 3. Experimental parameters.

Parameters	Values
Number of Patients	673
Number of Fields per Patient	127
Risk Model Input Dimensions	67
Multiclass Labels	8 categories