Integrative multi-modal feature extraction from ECG signals using FusionHeartNet for robust heart disease prediction

Abstract

Electrocardiogram (ECG)-based diagnostics are pivotal in early cardiac disorder detection, yet existing models often fail to integrate temporal, spectral, and spatial dynamics inherent in complex arrhythmic patterns. Most traditional approaches are unimodal, relying either on time-domain signal processing or spatially limited CNN models, thereby overlooking cross-domain dependencies and subtle morphological cues. Addressing this gap, this research proposes FusionHeartNet, a unified deep learning framework that fuses signal- and image-based representations using a dual-spectrum feature embedding (DSFE) strategy. DSFE synergistically extracts morphological descriptors and spectral signatures via Fourier and wavelet transforms, while spatial morphology is preserved through GAF and CWT scalograms. These dual-domain features are refined by a multi-focus attention module (MFAM) and classified through the heart fusion classifier (HFC), which is optimized using Bayesian optimization with adaptive learning rate scheduling (BO-ALRS). Experimental validation on the MIT-BIH Arrhythmia Database demonstrates an accuracy of 98.47%, F1-score of 91.67%, and kappa of 0.9311, significantly outperforming baseline models. FusionHeartNet sets a new benchmark for robust, multi-dimensional ECG analysis, offering clinically viable precision in early heart disease detection.

Keywords

Electrocardiogram (ECG); cardiac disorder detection; FusionHeartNet; dual-spectrum feature embedding; Gramian angular fields (GAF); continuous wavelet transform (CWT) scalograms

Introduction

Heart failure, which is defined by the heart’s diminished capacity to circulate blood effectively, is a significant global health issue. Conventional diagnostic practices have long been constrained by subjective clinical judgments and a narrow spectrum of data inputs, which have hindered progress in early detection and the development of personalized treatment pathways.¹ The intricate nature of cardiovascular diseases demands a more integrated and detailed diagnostic framework. Relying solely on single-modality data has proven inadequate for capturing the dynamic and multifactorial aspects of cardiac physiology, as increasingly recognized in recent medical literature.² In response, contemporary healthcare systems are shifting toward data-intensive diagnostic models that synthesize diverse physiological signals and imaging modalities, aiming to construct a more comprehensive and accurate representation of cardiac health.³ Recent advances in artificial intelligence and machine learning (ML) are opening transformative avenues in cardiovascular diagnostics. These technologies enable the integration of diverse data streams, such as electrocardiographic (ECG) signals, photoplethysmography (PPG) data, and structural cardiac imaging, into unified diagnostic models. Such multimodal approaches offer a significant leap in accuracy and resilience compared to traditional methods that rely on isolated, single-source inputs.⁴

ML and AI aim to develop predictive models by learning from data, emulating certain aspects of human cognitive processes.⁵ Historically, conventional ML algorithms have been limited to single-modality data, such as either medical imaging or clinical textual records. This narrow focus diverges from the way humans interpret the world, which inherently involves the simultaneous integration of multiple sensory inputs, like visual cues and auditory signals, to form coherent judgments.⁶ To enhance model performance beyond what single-modality approaches can achieve, researchers have increasingly focused on developing techniques that combine various types of data, such as visual and auditory inputs. The core principle behind multimodal ML is that each data type offers unique and complementary insights into a particular event or subject, such as recognizing emotions, identifying objects, or diagnosing illnesses.⁷ Multimodal data encompasses diverse formats and sources, and the objective of fusion techniques is to transform these heterogeneous inputs, often differing in scale and distribution, into a unified feature space. This integrated representation can then be utilized for tasks like classification and prediction.⁸

In recent years, the volume of cardiac-related data has surged dramatically, posing challenges to effective feature extraction.⁹ To address these challenges, deep learning (DL) has emerged as a powerful tool for predicting heart disease. Deep neural networks stand out for their ability to automatically detect and extract features, delivering concise and accurate outcomes, especially in tasks like heartbeat classification.¹⁰ Researchers have integrated a wide range of advanced DL methods to harness this capability, utilizing not only single-modal but also multi-modal and hybrid frameworks.¹¹ The layered architecture of deep neural networks captures and transforms features at multiple levels, enhancing model performance. Models such as recurrent neural networks (RNN),¹² long short-term memory (LSTM),¹³ convolutional neural networks (CNN),¹⁴ and various hybrid combinations¹⁵ have been extensively used to overcome the limitations of traditional ML approaches, which often relied on manual and error-prone feature selection.

Although hybrid models may introduce increased computational cost and face limitations due to insufficient high-quality datasets, these issues are often outweighed by their advantages. Accurate heartbeat classification and reliable arrhythmia detection demand large datasets, making the benefits of DL-driven feature automation far more impactful in real-world applications.¹⁶ The key contributions of this research are outlined below to highlight its novelty, technical rigor, and application significance:

Novel DSFE integrates handcrafted temporal descriptors and frequency-domain features with deep-learned spatial representations to capture complex physiological and morphological characteristics of ECG signals.

Multi-modal signal-image fusion with attention is proposed to enhance cross-domain feature interaction by selectively emphasizing clinically salient patterns across signal and image modalities.

Optimization-driven classification strategy employs BO-ALRS to fine-tune model hyperparameters, ensuring stability, convergence, and diagnostic generalization across arrhythmic classes.

Extensive experimentation on the MIT-BIH Arrhythmia dataset validates the proposed FusionHeartNet, achieving superior accuracy, F1-score, and class-wise robustness, particularly for hard-to-detect arrhythmias.

The structure of the paper is as follows: Section 2 reviews recent research developments and challenges in ECG signal processing and multimodal DL techniques. Section 3 outlines the proposed approach, detailing the FusionHeartNet architecture. Section 4 presents experimental findings, including performance metrics, k-fold cross-validation results, ablation analyses, and relevant visual interpretations. Section 5 concludes the paper by summarizing the main contributions, discussing practical applications, and suggesting future research directions. The “References” section provides a comprehensive list of sources that support the theoretical and technical aspects of this work.

Literature survey

Multimodal learning, which trains and analyzes models on several data modalities rather than a single one, is a promising direction for medical diagnostics. Ahmad et al.¹⁷ introduced two efficient multimodal fusion strategies for ECG heartbeat classification: multimodal image fusion (MIF) and multimodal feature fusion (MFF). Both approaches transform raw ECG signals into three image types: Gramian angular field (GAF), recurrence plot (RP), and Markov transition field (MTF). The MIF method fuses these images into a single composite image and processes it using a CNN to extract features. In contrast, MFF extracts deep features from the penultimate layers of separate CNNs and combines them to capture both distinct and shared information. These combined features are then used to train an SVM classifier for heartbeat classification. The authors validated both methods on the MIT-BIH Arrhythmia dataset for arrhythmia detection and the PTB diagnostic ECG dataset for myocardial infarction (MI) classification, demonstrating strong performance across multiple conditions.

Liu et al.¹⁸ proposed a method for detecting coronary artery disease (CAD) using multi-domain feature fusion from multi-channel heart sound signals. Their approach combines entropy and cross-entropy features to capture complex signal patterns linked to CAD. The study involved 36 participants, 21 with CAD and 15 healthy controls, with heart sounds recorded over 5 min using a five-channel setup. They extracted features from the time, frequency, entropy, and cross-entropy domains and used the optimal set to train a support vector machine (SVM) classifier. The classification accuracy improved from 78.75% with single-channel data to 86.70% using multi-channel input, and further reached 90.92% after incorporating entropy-based features.

Cardiac arrhythmias remain a serious health concern, with ECG serving as the primary diagnostic tool. However, manual ECG interpretation is often time-consuming and prone to inefficiencies. While machine learning has been widely adopted for ECG analysis, it typically demands lengthy training and manual feature engineering. To address these limitations, Irfan et al.¹⁹ proposed an innovative deep learning framework that stacks similar network layers to form a unified and robust model. Their approach achieved impressive results, with a sensitivity of 98.37%, a specificity of 99.59%, a positive predictive value of 98.41%, and an overall accuracy of 99.35%. These metrics surpassed those of conventional models, both in performance and computational efficiency.

Zheng et al.²⁰ proposed a transfer learning-based CatBoost model using phonocardiogram (PCG) signals for non-invasive detection of left ventricular diastolic dysfunction. They generated four types of spectrograms to capture key patterns within PCG signals and employed four pre-trained CNNs to extract deep, domain-specific features. To improve classification performance, they applied dimensionality reduction techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) on selected feature subsets. These features were then fused and used as input to a CatBoost classifier. For comparison, the study also evaluated three additional ML models. The proposed method outperformed all others, achieving an AUC of 0.911, accuracy of 0.882, sensitivity of 0.821, specificity of 0.927, and an F1-score of 0.892.

Cheng et al.²¹ analyzed 2D and Doppler transthoracic echocardiography (TTE) images from 1932 pediatric patients across two clinical cohorts at Beijing Children’s Hospital between 2018 and 2022. They developed a deep learning framework that automatically identifies standard cardiac views, integrates multimodal and multiview data, highlights anatomically high-risk regions, and predicts the presence of congenital heart defects (CHDs) such as atrial septal defect (ASD) and ventricular septal defect (VSD). The system achieved a mean accuracy of 0.989 in classifying cardiac views and 0.996 in screening for CHDs using five standardized TTE views across both 2D and Doppler modalities. Additionally, it reached a classification accuracy of 0.991 in both within-center and cross-center evaluations for distinguishing healthy cases from those with ASD or VSD. By leveraging diverse imaging perspectives, the proposed model significantly enhances the precision and accessibility of non-invasive CHD screening in pediatric care.

Acute coronary syndromes (ACS) remain a leading cause of mortality worldwide, yet conventional cardiovascular risk scoring (CVRS) systems frequently neglect environmental contributors such as air pollution. To bridge this gap, Zhang et al.²² introduced TabulaTime, an advanced multimodal DL framework designed to integrate traditional clinical risk factors with environmental exposure data, particularly air pollution metrics. This approach enhances the predictive capability of ACS risk models by incorporating non-traditional yet impactful determinants of cardiovascular health. TabulaTime introduces three innovations: multimodal feature integration, PatchRWKV for automatic extraction of complex temporal patterns, and enhanced interpretability using attention mechanisms. The experimental evaluation showed TabulaTime achieves a 20.5% improvement in accuracy compared to traditional models, surpassing other models by 20.5%–32.2%. The integration of air pollution data improves accuracy by 10.1%, emphasizing the importance of environmental factors.

The CardioNet+ framework developed by Adeyi et al.²³ is a significant advancement in DL for heart failure detection, combining functional and structural characteristics. It achieved an accuracy rate of 99.1%, an F1-score of 98.3%, and an AUC–ROC of 99.0%, outperforming single-modal models. The framework uses multi-head attention to capture temporal ECG/PPG features and ResNet-50’s pre-trained architecture to learn spatial features from X-ray images. The application of the synthetic minority oversampling technique (SMOTE) addresses class imbalance, thereby improving model generalizability, particularly for underrepresented categories such as heart failure cases. To facilitate clinical interpretability and trust, the framework incorporates Grad-CAM for visual explanation of chest X-ray predictions and attention-based heatmaps for ECG and PPG signals. These interpretability mechanisms support transparent decision-making and promote seamless integration of the model into routine clinical workflows.

Proposed methodology

ECG signals are a cornerstone in cardiovascular diagnostics, providing a non-invasive means of capturing detailed physiological and pathological information related to cardiac function. Acknowledging the shortcomings of traditional diagnostic methods, this study presents FusionHeartNet, a unified deep learning framework specifically designed for the early detection of heart disease. By leveraging a multimodal DL architecture, the proposed system enables precise and timely classification of arrhythmias, as illustrated in Figure 1. FusionHeartNet addresses both data heterogeneity and diagnostic complexity, offering a robust solution for enhancing clinical decision-making in cardiology.

Figure 1.

Architecture of the proposed framework.

Traditional electrocardiographic analysis techniques, often confined to either time-domain or frequency-domain processing, frequently fall short in capturing the intricate interplay between temporal transitions and spectral variations, both of which are vital for detecting nuanced, early-stage arrhythmic signatures. To bridge this gap, the proposed FusionHeartNet framework introduces a DSFE strategy that extracts complementary time-based morphological indicators (e.g. QRS duration, R–R intervals) alongside frequency-domain descriptors derived from Fourier and wavelet transformations. These enriched features are then modeled using a temporal convolutional network (TCN), which effectively captures long-range dependencies and subtle temporal fluctuations inherent in ECG signals. A further shortcoming of conventional approaches lies in their predominantly one-dimensional analytical perspective, which often disregards spatial morphology, an essential factor in enhancing pathological pattern recognition. To address this, FusionHeartNet transforms ECG signals into image-based representations using GAF and CWT scalograms. These spatial encodings are subsequently analyzed by EfficientNet-B0, a deep convolutional neural network that excels in identifying fine-grained visual patterns and transient anomalies.

Moreover, existing methodologies frequently suffer from the absence of a cohesive fusion mechanism to integrate multi-domain features, leading to fragmented interpretations and reduced diagnostic reliability. FusionHeartNet overcomes this challenge by incorporating a multi-focus attention module (MFAM), which leverages a multi-head attention architecture to dynamically prioritize the most salient features across both signal-based and image-based modalities. The final diagnostic decision is made through the HFC, an optimized ensemble classification network whose performance is enhanced using BO-ALRS. This process optimizes hyperparameters, including learning rate, dropout rate, and attention depth, to enhance model generalization and robustness. In summary, FusionHeartNet represents a comprehensive and unified diagnostic paradigm that seamlessly integrates temporal, spectral, and spatial feature domains. By doing so, it significantly advances the state of automated cardiovascular diagnostics, enabling accurate and early detection of heart disease with high clinical relevance.

Dual-spectrum feature embedding (DSFE)

Conventional ECG classification methods often rely on either time-domain or frequency-domain analysis in isolation, which limits their ability to capture the intricate interdependencies between temporal dynamics and spectral shifts—especially critical in detecting transient or evolving arrhythmic episodes. Moreover, these approaches typically treat ECG signals as one-dimensional time series, thereby overlooking morphological characteristics that are more effectively represented and interpreted through spatially structured formats. This limitation restricts the capacity of such models to recognize complex patterns indicative of pathological states.

The DSFE strategy mitigates these shortcomings by employing a bimodal feature extraction approach, wherein the ECG signal is concurrently analyzed as a raw time-series waveform in the signal domain and as a transformed spatial representation in the image domain, as illustrated in Figure 2. This approach captures temporal-frequency descriptors alongside morphological textures, generating a hybrid representation that forms a dual-spectrum embedding. As a result, the framework facilitates comprehensive learning from diverse physiological signal characteristics, improving the detection of subtle and complex cardiac anomalies.

Figure 2.

Flow diagram of the proposed DSFE method.

Signal domain representation

The signal domain processing of the proposed DSFE module is crafted to extract diagnostically relevant information by blending handcrafted physiological descriptors with learned deep features. This dual-pathway design addresses a notable limitation in existing methods that often rely solely on either rule-based feature extraction or end-to-end learning, failing to incorporate domain knowledge into the learning loop.

Let $X (t) \in ℝ^{T}$ denote a single-channel, denoised ECG time series of length $T$ . The process begins with morphological analysis, which extracts key interval-based descriptors that accurately reflect underlying electrophysiological health.

These descriptors are the QRS duration $d_{QRS}$, PR interval $d_{PR}$, ST segment $d_{ST}$, RR intervals $\{r_{i}\}_{i = 1}^{N}$, and amplitude features $\{a_{P}, a_{R}, a_{T}\}$, where $N$ is the number of beats and $a_{P}$, $a_{R}$, $a_{T}$ represent the amplitudes of the P-wave, R-peak, and T-wave, respectively.

These handcrafted features form the morphological feature vector:

f_{\mathrm{morph}} = [d_{QRS}, d_{PR}, d_{ST}, μ_{r}, σ_{r}, a_{P}, a_{R}, a_{T}] \in ℝ^{8}

(1)

Here, $μ_{r}$ and $σ_{r}$ represent the mean and standard deviation of the RR intervals, respectively.
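As an illustration of Eq. (1), the following minimal sketch assembles the eight-dimensional morphological vector from interval and amplitude measurements that are assumed to be already available. The function name `morph_features`, the example values, and the use of the population standard deviation are assumptions made for illustration; the delineation step that produces the intervals is not shown.

```python
import numpy as np

def morph_features(d_qrs, d_pr, d_st, rr_intervals, a_p, a_r, a_t):
    """Assemble the 8-dimensional morphological vector of Eq. (1)."""
    rr = np.asarray(rr_intervals, dtype=float)
    mu_r = rr.mean()      # mean RR interval
    sigma_r = rr.std()    # population standard deviation (ddof=0), an assumed choice
    return np.array([d_qrs, d_pr, d_st, mu_r, sigma_r, a_p, a_r, a_t])

# Hypothetical measurements (seconds and millivolts), for illustration only.
f_morph = morph_features(0.09, 0.16, 0.10, [0.80, 0.82, 0.78, 0.80], 0.15, 1.10, 0.30)
```

The ordering of the entries follows Eq. (1), so the RR statistics occupy positions four and five of the vector.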

To complement the temporal morphology, frequency-based features are extracted using two distinct signal decomposition techniques:

First, the short-time Fourier transform (STFT) is applied to extract spectral energy localized in both time and frequency. The signal is windowed by a function $w(t)$ and transformed as:

\mathrm{STFT}_{X}(t, ω) = \int_{-\infty}^{\infty} X(τ) \, w(τ - t) \, e^{-jωτ} \, dτ

(2)

This yields a 2D time-frequency energy map $S_{f} \in ℝ^{T' \times F}$, where $T'$ denotes the number of time windows and $F$ the number of frequency bins.
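A discrete counterpart of Eq. (2) can be sketched as follows. The Hann window, the window length, the hop size, and the synthetic 32 Hz test tone are all assumptions chosen for illustration and are not taken from the experimental setup; the function name `stft_magnitude` is hypothetical.

```python
import numpy as np

def stft_magnitude(x, win_len=64, hop=32):
    """Return |STFT| as a (T', F) array using a Hann window (an assumed choice)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    # Slice the signal into overlapping frames and apply the window to each one.
    frames = np.stack([x[i * hop:i * hop + win_len] * window for i in range(n_frames)])
    # One-sided FFT of every frame gives F = win_len // 2 + 1 frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 256                                   # hypothetical sampling rate in Hz
t = np.arange(512) / fs
S_f = stft_magnitude(np.sin(2 * np.pi * 32 * t))   # 32 Hz test tone
```

With these settings the bin spacing is `fs / win_len` = 4 Hz, so the energy of the test tone concentrates in bin 8 of every frame.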

The discrete wavelet transform (DWT) subsequently decomposes the signal into multiple frequency sub-bands through multiresolution analysis.

\mathrm{DWT}_{X} = \sum_{j = 1}^{J} \sum_{k} c_{j, k} ψ_{j, k}(t)

(3)

Where $ψ_{j, k} (t)$ denotes wavelet basis functions at scale $j$ and translation $k$ , and $c_{j, k}$ are the detail coefficients. These coefficients form the multi-resolution feature vector $f_{d w t}$ .
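The multiresolution decomposition can be sketched with the Haar wavelet, which is assumed here only because it is the simplest orthonormal basis; the wavelet family actually used is not stated. Summarizing each sub-band by its energy is likewise an illustrative assumption about how $f_{dwt}$ might be formed, and the function names are hypothetical.

```python
import numpy as np

def haar_dwt(x, levels=3):
    """Multi-level Haar DWT: returns the final approximation and per-level details."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) % 2:                      # pad odd lengths by repeating the last sample
            approx = np.append(approx, approx[-1])
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # detail coefficients c_{j,k}
        approx = (even + odd) / np.sqrt(2)          # low-pass part passed to the next level
    return approx, details

def dwt_feature_vector(x, levels=3):
    """Energy of each detail sub-band followed by the approximation energy."""
    approx, details = haar_dwt(x, levels)
    return np.array([np.sum(c ** 2) for c in details] + [np.sum(approx ** 2)])

f_dwt = dwt_feature_vector(np.arange(16.0), levels=3)
```

Because the Haar basis is orthonormal, the sub-band energies sum to the energy of the input whenever no padding is needed, which gives a quick sanity check on the decomposition.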

A temporal convolutional network (TCN) processes the fused temporal and frequency-based features, leveraging causal and dilated convolutions to model long-range sequential dependencies with high temporal fidelity. For a given input sequence $x = [x_{1}, x_{2}, \dots, x_{T}]$ , the TCN layer output at position $t$ is computed as:

y_{t} = \sum_{i = 0}^{k - 1} w_{i} \cdot x_{t - d \cdot i}

(4)

Where $k$ is the filter size, $d$ is the dilation factor, and $w_{i}$ are the learnable convolution weights.

The dilation factor increases exponentially with the layer index $l$, such that $d_{l} = 2^{l}$, enabling exponentially growing receptive fields without loss of resolution. This enables the model to capture long-range dependencies and beat-to-beat variations, which are crucial for identifying sparsely occurring arrhythmic transitions.
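The causal dilated convolution of Eq. (4) and the resulting receptive-field growth can be sketched in a few lines. The sketch assumes zero padding before the first sample, one convolution per layer, and a layer index starting at $l = 0$; these details are not specified above, and the function names are hypothetical.

```python
import numpy as np

def causal_dilated_conv(x, w, d):
    """y_t = sum_i w_i * x_{t - d*i}, treating x as zero before t = 0."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for i, w_i in enumerate(w):
        shift = d * i
        if shift < len(x):
            # Output position t only sees x_{t - shift}, never future samples.
            y[shift:] += w_i * x[:len(x) - shift]
    return y

def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked layers with dilations 2^0, 2^1, ..., 2^(L-1)."""
    return 1 + (kernel_size - 1) * sum(2 ** l for l in range(num_layers))

y = causal_dilated_conv([1, 2, 3, 4, 5], w=[1.0, 1.0], d=2)   # y_t = x_t + x_{t-2}
```

Under these assumptions, four layers with kernel size 3 already cover 31 samples, which illustrates why the receptive field grows exponentially with depth while the parameter count grows only linearly.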

The final signal domain embedding is:

F_{\mathrm{signal}} = \mathrm{TCN}([f_{\mathrm{morph}}, f_{\mathrm{dwt}}, \mathrm{flatten}(S_{f})])

(5)

Where $F_{\mathrm{signal}} \in ℝ^{D}$ is a dense feature vector learned from both handcrafted and spectral embeddings. This representation not only captures the markers explicitly encoded by clinicians but also retains fine-grained temporal-spectral correlations that are difficult to model in RNN-only or CNN-only systems. It effectively bridges the gap between physiological interpretability and deep representation power.

While the signal domain captures temporal-spectral patterns as given in Figure 3, essential for functional diagnosis, spatial image representations more effectively reveal structural and morphological distortions in the ECG, such as waveform bifurcations and baseline drifts. Hence, the next module focuses on image-domain embeddings derived from signal-to-image transformations.

Figure 3.

Process flow for signal domain-based features.

Image domain representation

The ECG signal is transformed into two-dimensional image representations, and two complementary techniques are employed to recover spatial and morphological details that are lost in one-dimensional sequences. This study introduces a dual-image encoding strategy to address the critical gap in recognizing spatially distributed morphological distortions, such as waveform bifurcations, baseline wander, and arrhythmic spikes that are often undetectable in traditional one-dimensional signal analysis.

This image-based module, given in Figure 4, augments temporal signal features by embedding ECG signals in a two-dimensional space, enabling advanced convolutional feature extraction and enhanced visual pattern learning. Let the pre-processed ECG signal be denoted as $X(t) \in ℝ^{T}$. To enable image-based representation, two complementary transformations are applied: GAF and CWT. These transformations reshape the time series into structured 2D matrices, enabling CNNs to learn spatially-aware feature representations.

Figure 4.

Process flow diagram of the image domain-based feature module.

The GAF transformation maps the normalized time series into a polar coordinate system, defined by:

x'_{i} = \frac{X_{i} - \min(X)}{\max(X) - \min(X)}

(6)

θ_{i} = \cos^{-1}(x'_{i})

(7)

r_{i} = \frac{i}{T}, i = 1, \dots, T

(8)

The Gramian angular summation field is then constructed as:

G_{i, j} = \cos(θ_{i} + θ_{j}), \quad \forall i, j \in \{1, \dots, T\}

(9)

This operation preserves the temporal dependency and encodes correlations in angular space, producing a symmetric image $G \in ℝ^{T \times T}$ in which phase continuity and signal morphology are captured.
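Equations (6), (7), and (9) translate directly into a short NumPy sketch. The function name `gasf` is hypothetical, the input is assumed to be non-constant so that the min-max scaling is well defined, and the radial coordinate of Eq. (8) is omitted because it does not enter the summation field.

```python
import numpy as np

def gasf(x):
    """Gramian angular summation field of a 1D series, following Eqs. (6), (7), (9)."""
    x = np.asarray(x, dtype=float)
    x_scaled = (x - x.min()) / (x.max() - x.min())      # Eq. (6): rescale to [0, 1]
    theta = np.arccos(np.clip(x_scaled, 0.0, 1.0))      # Eq. (7): angular encoding
    return np.cos(theta[:, None] + theta[None, :])      # Eq. (9): pairwise angle sums

G = gasf([0.0, 1.0, 2.0, 3.0])
```

The clipping step only guards against rounding slightly outside $[0, 1]$; the output is a symmetric $T \times T$ matrix whose main diagonal equals $\cos(2θ_{i})$.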

Simultaneously, CWT scalograms are generated to provide a time-frequency representation that effectively captures frequency variations and short-duration spikes. The CWT, based on a chosen mother wavelet $ψ (t),$ is mathematically expressed as:

W(i, j) = \int_{-\infty}^{\infty} X(t) \frac{1}{\sqrt{|i|}} ψ^{*}\left(\frac{t - j}{i}\right) dt

(10)

Where $i \in ℝ^{+}$ is the scale (inversely related to frequency), $j \in ℝ$ is the translation (time shift), and $ψ^{*}$ denotes the complex conjugate of the wavelet function.

The resulting scalogram matrix $W \in ℝ^{A \times B}$ encodes amplitude variations across time-frequency planes, highlighting localized abnormalities like atrial and ventricular ectopics.
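A discretized version of Eq. (10) can be sketched by convolving the signal with scaled copies of a mother wavelet. The Ricker (Mexican hat) wavelet is assumed here purely for illustration, since the mother wavelet is not specified; it is real and symmetric, so the complex conjugate has no effect and convolution coincides with correlation. The function names and the truncation of the wavelet support at five scale units are also assumptions.

```python
import numpy as np

def ricker(half_width, scale):
    """Ricker (Mexican hat) wavelet sampled on 2*half_width + 1 integer points."""
    t = np.arange(-half_width, half_width + 1, dtype=float)
    norm = 2.0 / (np.sqrt(3.0 * scale) * np.pi ** 0.25)   # includes the 1/sqrt(scale) factor
    return norm * (1.0 - (t / scale) ** 2) * np.exp(-0.5 * (t / scale) ** 2)

def cwt_scalogram(x, scales):
    """Return |W| with one row per scale and one column per time shift."""
    x = np.asarray(x, dtype=float)
    max_half = (len(x) - 1) // 2          # keep the wavelet no longer than the signal
    rows = []
    for s in scales:
        half = max(1, min(int(5 * s), max_half))
        rows.append(np.convolve(x, ricker(half, s), mode="same"))
    return np.abs(np.array(rows))

x = np.zeros(128)
x[64] = 1.0                               # unit impulse as a test signal
W = cwt_scalogram(x, scales=[2, 4, 8])
```

For the impulse input, every row of the scalogram peaks at the impulse location, which illustrates how short-duration spikes remain localized in time across scales.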

The GAF image $G$ and CWT image $W$ are then fed in parallel to a shared EfficientNet-B0 encoder. EfficientNet-B0 is selected for its compound scaling strategy, which uniformly scales network depth, width, and resolution using a fixed coefficient $ϕ$ , as defined by:

\begin{array}{l} d = x^{ϕ}, \quad w = y^{ϕ}, \quad r = z^{ϕ}, \\ \text{subject to } x \cdot y^{2} \cdot z^{2} \approx 2, \quad x, y, z > 0 \end{array}

(11)

This controlled scaling mechanism preserves computational efficiency while enabling the model to effectively learn complex hierarchical features. Then, all input images are passed through a convolutional backbone, resulting in the generation of latent feature maps:

F_{\mathrm{gaf}} = \mathrm{EffNet}_{0}(G), \quad F_{\mathrm{cwt}} = \mathrm{EffNet}_{0}(W)

(12)

These representations are then concatenated as follows:

F_{\mathrm{image}} = [F_{\mathrm{gaf}} ∥ F_{\mathrm{cwt}}] \in ℝ^{M}

(13)

Where ∥ denotes vector concatenation and $M$ is the resulting dimension after flattening.

This dual-view embedding overcomes the limitation of existing image-based methods that typically depend on a single image modality, which often fails to fully utilize both phase and frequency information. By combining GAF and CWT, the approach captures the ECG signal’s spatiotemporal and spectral morphology from multiple perspectives, thereby enhancing the accuracy of classifying complex arrhythmic patterns. Having derived a comprehensive representation from both signal and image domains, the next logical step is the effective integration of these heterogeneous features. The following subsection describes the multi-modal fusion and classification pipeline, FusionNet, which synergistically aligns and discriminates these complementary embeddings for robust early detection.

FusionNet

After obtaining dual-stream features from the signal and image domains, FusionNet performs alignment, selection, optimization, and classification of these heterogeneous features in an end-to-end trainable pipeline. Most existing models use simple concatenation to fuse multimodal features, without modeling their relative importance or resolving dimensional misalignment. Furthermore, these models lack a principled strategy for hyperparameter tuning, leading to non-optimal configurations and overfitting, especially in limited medical datasets. FusionNet combines multimodal attention-based fusion, Bayesian-guided hyperparameter tuning, and a deep neural classification architecture. This integration facilitates both effective feature fusion and adaptive learning, dynamically aligning with the relevance of input features and the size of the dataset.

Multi-focus attention module (MFAM)

Integrating heterogeneous features from both signal and image domains remains a critical challenge in multimodal ECG analysis. Traditional fusion strategies, such as direct concatenation or fixed-weight averaging, often overlook the distinct significance and complex interrelations inherent to each modality. As a result, they tend to introduce redundant information and compromise diagnostic precision. This limitation underscores the need for a more sophisticated fusion mechanism, one capable of adaptively highlighting the most salient features from each domain while maintaining their complementary synergy.

The MFAM addresses this critical limitation by employing a multi-head self-attention mechanism designed to independently and jointly learn attention maps over signal-domain features $F_{\mathrm{signal}} \in ℝ^{D}$ and image-domain features $F_{\mathrm{image}} \in ℝ^{M}$. Formally, MFAM generates modality-specific attention vectors $A_{\mathrm{signal}}$ and $A_{\mathrm{image}}$ alongside a cross-domain interaction matrix $A_{\mathrm{cross}}$, which collectively enhance feature discrimination.

Each attention head $h$ computes:

Q_{h} = W_{h}^{Q} F

(14)

K_{h} = W_{h}^{K} F

(15)

V_{h} = W_{h}^{V} F

(16)

where $W_{h}^{Q}$, $W_{h}^{K}$, and $W_{h}^{V}$ are learned projection matrices and $F$ is the concatenated feature vector $[F_{\mathrm{signal}} ∥ F_{\mathrm{image}}]$.

The scaled dot-product attention for each head is computed as:

\mathrm{Attention}_{h}(Q_{h}, K_{h}, V_{h}) = \mathrm{softmax}\left(\frac{Q_{h} K_{h}^{⊤}}{\sqrt{d_{k}}}\right) V_{h}

(17)

Here, $d_{k}$ denotes the dimensionality of the key vectors, included to ensure numerical stability during the attention computation.

The multi-head attention outputs are then concatenated across all $H$ heads and projected via $W^{O}$:

\mathrm{MFAM}(F) = \mathrm{Concat}(\mathrm{Attention}_{1}, \dots, \mathrm{Attention}_{H}) W^{O}

(18)

The MFAM architecture is designed to capture diverse regions of importance across signal and image modalities, effectively modeling intricate dependencies, such as temporal fluctuations that correspond with localized spatial anomalies. In contrast to simplistic fusion approaches that uniformly weight all inputs, MFAM employs domain-aware adaptive weighting, dynamically adjusting attention scores based on modality-specific relevance. This enables the model to downregulate noisy or conflicting features from one domain when stronger, corroborating signals are present in the other. Furthermore, it strengthens cross-modal synergy by aligning temporally significant signal patterns with spatial anomalies in the image domain. Importantly, MFAM maintains the unique discriminative strengths of each modality while learning their interrelationships, an aspect often neglected by conventional fusion techniques.

This results in a robust fused feature vector $F_{\mathrm{fused}}$ that maximizes informative content and minimizes interference:

F_{\mathrm{fused}} = \mathrm{MFAM}([F_{\mathrm{signal}} ∥ F_{\mathrm{image}}])

(19)
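Equations (14) to (19) can be illustrated with a small NumPy sketch of multi-head scaled dot-product attention. Several assumptions are made that are not stated above: the signal and image embeddings are taken to be already projected to a common width `d_model` and stacked as two tokens, so that attention can weight one modality against the other; the weights are random placeholders rather than trained parameters; and a row-vector convention (`tokens @ W`) is used in place of $W F$.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(tokens, W_q, W_k, W_v, W_o):
    """tokens: (n, d_model); W_q, W_k, W_v: (H, d_model, d_k); W_o: (H * d_k, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv          # Eqs. (14)-(16)
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))        # Eq. (17)
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1) @ W_o                  # Eq. (18)

rng = np.random.default_rng(0)
d_model, H, d_k = 16, 4, 4                       # hypothetical sizes
tokens = rng.normal(size=(2, d_model))           # row 0: signal token, row 1: image token
W_q, W_k, W_v = (rng.normal(size=(H, d_model, d_k)) * 0.1 for _ in range(3))
W_o = rng.normal(size=(H * d_k, d_model)) * 0.1
fused_tokens = multi_head_attention(tokens, W_q, W_k, W_v, W_o)
F_fused = fused_tokens.reshape(-1)               # Eq. (19): flattened fused vector
```

Treating each modality as a token is one plausible reading of the fusion step; other tokenizations (for example, splitting each embedding into several patches) would fit the same equations.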

After feature fusion, the fused feature set $F_{\mathrm{fused}}$ is processed by the hybrid fully connected classifier (HFC), which consists of fully connected layers followed by a softmax output layer to classify the ECG signals into arrhythmia categories. The classifier benefits from the rich, selectively fused embeddings, improving sensitivity to subtle early-stage abnormalities.

To achieve optimal learning efficiency, key hyperparameters, including learning rate, dropout rate, convolutional filter dimensions, and the number of attention heads, are systematically optimized using BO-ALRS. This adaptive tuning framework addresses the inefficiencies of manual hyperparameter selection, promoting faster convergence and improved model generalization. By uniting domain-aware attention fusion with automated classifier optimization, FusionNet establishes a robust and unified pipeline that fully exploits the richness of multimodal ECG features, thereby enabling more accurate and earlier detection of cardiac abnormalities.

BO-ALRS hyperparameter optimization

Deep architectures like FusionNet, which process multimodal inputs from both signal and image domains, involve a highly complex hyperparameter space. Parameters such as dilation rates in TCNs, convolutional kernel sizes, number of attention heads, dropout rates, and learning rate schedules all play critical roles in determining model convergence, generalization capacity, and diagnostic performance. Conventional tuning strategies such as manual adjustment, grid search, or random sampling are often computationally intensive and overlook the intricate interdependencies among these parameters. To overcome this, the present study introduces BO-ALRS, a dual-stage optimization mechanism that is both data-driven and sensitive to convergence dynamics, thereby addressing a key shortcoming in existing model optimization practices.

Bayesian optimization (BO) provides a principled approach to optimizing black-box functions with expensive evaluations, which makes it well suited to tuning DL pipelines. Let the objective function $f : X \to ℝ$ represent a validation metric (e.g. classification accuracy) over the hyperparameter space $X$. BO constructs a surrogate model $\hat{f}$ using a Gaussian process (GP) prior:

\hat{f} (x) \sim \mathcal{GP} (m (x), k (x, x^{'}))

(20)

where $m (\cdot)$ is the mean function and $k (\cdot, \cdot)$ is the kernel measuring similarity between hyperparameter vectors $x, x^{'} \in X$ .

An acquisition function $α (x; \hat{f})$ , such as Expected Improvement (EI), selects the next point for evaluation:

x_{next} = \arg \max_{x \in X} α (x; \hat{f})

(21)

The optimized parameters are the dilation rate in the TCN layers ($d_{TCN}$), the filter sizes in the CNN blocks ($f_{CNN}$), the number of attention heads in MFAM, the dropout rate ($r_{drop}$), and the initial learning rate ($η_{0}$).

BO iteratively updates the GP posterior after each evaluation, narrowing in on optimal configurations with fewer evaluations than exhaustive methods.
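As a minimal sketch of the selection step in equation (21), the snippet below evaluates Expected Improvement over a finite set of candidate configurations. It assumes the GP posterior mean and standard deviation at each candidate are already available; the candidate values, the exploration margin `xi`, and the function names are hypothetical and are not taken from the study.

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected Improvement for maximisation, given the GP posterior mean `mu`
    and standard deviation `sigma` at one candidate and the best score so far."""
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf

def pick_next(candidates, best):
    """Equation (21) restricted to a finite candidate set.
    Each candidate is a tuple (name, posterior_mean, posterior_std)."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))

# Hypothetical posterior summaries for three configurations (illustrative only).
candidates = [("a", 0.970, 0.001), ("b", 0.965, 0.020), ("c", 0.950, 0.005)]
next_point = pick_next(candidates, best=0.972)
```

In this toy case the uncertain candidate `b` is selected even though its posterior mean is lower than that of `a`, which is the exploration behaviour that lets BO avoid exhaustive search.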

While BO optimizes static hyperparameters, dynamic adaptation during training is governed by ALRS, which adjusts the learning rate $η (t)$ based on real-time convergence signals such as validation loss $ℒ_{v} (t)$ . A decay rule inspired by cosine annealing is implemented:

η (t) = η_{\min} + \frac{1}{2} (η_{0} - η_{\min}) (1 + \cos (\frac{π t}{T}))

(22)

where $T$ is the total number of epochs and $η_{\min}$ is the minimum allowable learning rate. Furthermore, early-stopping triggers based on plateau detection are incorporated to halt training if no improvement is observed within $Δ t$ epochs, reducing computational overhead. The study introduces BO-ALRS in multi-modal ECG fusion, addressing the lack of personalized hyperparameter tuning in cardiovascular AI. It optimizes feature-extraction complexity, accelerates training convergence, and provides personalized model behavior, enhancing clinical generalizability and making FusionHeartNet suitable for real-world ECG-based early heart disease prediction.
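The schedule in equation (22) and the plateau-based stopping rule can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the `PlateauStopper` class, its `min_delta` tolerance, the learning-rate bounds, and the validation-loss trace are all hypothetical.

```python
import math

def cosine_lr(t, total_epochs, eta0, eta_min):
    """Learning rate at epoch t, following equation (22)."""
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * t / total_epochs))

class PlateauStopper:
    """Signals a stop once the validation loss has failed to improve
    for `patience` consecutive epochs (the role of Δt in the text)."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale = 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience

# Hypothetical validation-loss trace, for illustration only.
val_losses = [0.40, 0.30, 0.25, 0.26, 0.25, 0.27, 0.24]
stopper = PlateauStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    lr = cosine_lr(epoch, total_epochs=100, eta0=1e-3, eta_min=1e-5)
    if stopper.step(loss):
        stopped_at = epoch
        break
```

The schedule starts at $η_{0}$, decays smoothly, and reaches $η_{\min}$ at $t = T$; the stopper is independent of the schedule, so either piece can be swapped out.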

Heart fusion classifier (HFC)

The final classification phase in the FusionHeartNet pipeline is managed by the HFC, which consumes fused, attention-refined, and hyperparameter-optimized representations derived from dual-modal ECG feature sources. This classifier operates not just as a terminal decision layer but as an architecturally aware module that complements the fused representation’s statistical and clinical structure. While prior works often employ generic fully connected layers with SoftMax activation as end-classifiers, these designs lack compatibility with multi-modal fusion pipelines, often resulting in information bottlenecks or feature dilution. HFC overcomes this by being structurally synchronized with the upstream components, namely the MFAM and BO-ALRS, which ensures that the features passed into it are contextually aligned, de-noised, and statistically harmonized.

The classifier receives a fused vector $z_{f} \in ℝ^{d}$ , where $d$ denotes the concatenated dimensionality of modality-specific embeddings. The following operations are then performed:

Non-linear feature projection

The first dense layer performs an affine transformation followed by a ReLU activation:

h_{1} = ReLU (W_{1} z_{f} + b_{1})

(23)

where $W_{1} \in ℝ^{d^{'} \times d}, b_{1} \in ℝ^{d^{'}}$ , and $d^{'}$ is the intermediate projection dimension.

Dropout regularization

To minimize overfitting, a dropout layer is applied:

h_{1}^{drop} = Dropout (h_{1}, p)

(24)

where $p$ is the dropout probability, typically optimized by BO-ALRS.

Secondary projection layer

Another dense layer with ReLU compresses the representation:

h_{2} = ReLU (W_{2} h_{1}^{drop} + b_{2})

(25)

SoftMax output layer

Finally, multi-class predictions are made using SoftMax activation:

{\hat{y}}_{i} = \frac{e^{(w_{i}^{⊤} h_{2} + b_{i})}}{\sum_{j = 1}^{C} e^{(w_{j}^{⊤} h_{2} + b_{j})}}, i = 1, \dots, C

(26)

Here, $C$ denotes the number of arrhythmia classes (e.g. PVC, LBBB, RBBB, AF, NSR), and ${\hat{y}}_{i}$ represents the predicted probability for class $i$ .
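The four operations in equations (23) to (26) can be traced end to end with a small forward-pass sketch. It uses plain Python lists so that each step is explicit; the layer sizes, the random placeholder weights, the default dropout probability, and the helper names are assumptions for illustration and do not reflect the trained model.

```python
import math
import random

def dense(weights, bias, x):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def dropout(values, p, training, rng):
    """Inverted dropout (0 <= p < 1): active only while training, identity at inference."""
    if not training or p == 0.0:
        return list(values)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]

def softmax(logits):
    """Numerically stable SoftMax: subtract the largest logit before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hfc_forward(z_f, params, p=0.3, training=False, rng=None):
    rng = rng or random.Random(0)
    h1 = relu(dense(params["W1"], params["b1"], z_f))             # equation (23)
    h1_drop = dropout(h1, p, training, rng)                       # equation (24)
    h2 = relu(dense(params["W2"], params["b2"], h1_drop))         # equation (25)
    return softmax(dense(params["W_out"], params["b_out"], h2))   # equation (26)

def random_params(d, d1, d2, num_classes, seed=0):
    """Untrained placeholder weights, useful only for checking shapes."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(d1, d), "b1": [0.0] * d1,
        "W2": matrix(d2, d1), "b2": [0.0] * d2,
        "W_out": matrix(num_classes, d2), "b_out": [0.0] * num_classes,
    }

params = random_params(d=8, d1=6, d2=4, num_classes=5)
z_f = [0.1 * i for i in range(8)]
y_hat = hfc_forward(z_f, params)
```

Here `y_hat` holds one probability per class and sums to one, which is the property that gives the classifier output its probabilistic reading.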

HFC addresses misalignment between fused feature design and classification objectives in existing ECG classification architectures. It introduces a modality-aligned classifier design, reducing feature collapse and improving class separation. HFC consistently improves F1-scores and AUC values compared to traditional classifiers, confirming its clinical relevance and robustness. The final classification output provides high-accuracy predictions, maintains probabilistic interpretability, and allows explainability and compliance with modern AI governance frameworks in healthcare.

Results and discussion

The FusionHeartNet framework is evaluated for diagnostic efficacy, generalization capability, and comparative advantage over existing methods. Performance metrics comprise accuracy, precision, recall, F1-score, and AUC. The framework’s dual-domain design and BO-ALRS optimization strategy are also validated to assess their contribution to robustness in real-world cardiac diagnostic scenarios.

Implementation setup

The FusionHeartNet framework was implemented and evaluated on a 64-bit Windows 11 workstation featuring an Intel® Core™ i9-13900K CPU (3.00 GHz), 64 GB of DDR5 RAM, and an NVIDIA RTX 4090 GPU with 24 GB of VRAM, ensuring efficient training of deep neural networks. The development environment utilized Python 3.10, leveraging key open-source libraries including NumPy, SciPy, Matplotlib, Scikit-learn, PyTorch, and TensorFlow-Keras for model construction, optimization, and evaluation. ECG signal preprocessing and feature extraction were supported by the WFDB Toolkit and PyWavelets, while BO and learning rate scheduling were implemented using the Optuna and PyTorch-LRScheduler libraries, respectively.

The end-to-end implementation follows a structured modular flow. Initially, raw ECG signals undergo preprocessing that includes denoising through wavelet thresholding and baseline wander correction using high-pass filtering. The denoised signals are passed into the DSFE module, where both time-frequency features and image-based representations are extracted. Signal-domain features are learned via a temporal convolutional network (TCN), while image-domain features are extracted from GAF and CWT images using EfficientNet-B0. These modality-specific embeddings are fused using the MFAM, which ensures optimal cross-domain interaction. The fused features are then passed into the HFC, which classifies the input into one of the predefined arrhythmia classes. During training, hyperparameters including dilation rate, number of TCN filters, CNN depth, dropout rate, and learning rate decay are automatically optimized using Bayesian optimization with adaptive learning rate scheduling (BO-ALRS), ensuring optimal performance with minimal manual tuning.
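To make the image-domain encoding concrete, the sketch below builds a Gramian angular field from a one-dimensional segment. It assumes the summation variant (GASF) with min-max rescaling to [-1, 1]; the text does not state which GAF variant or rescaling was used, and the sample values are hypothetical.

```python
import math

def gramian_angular_summation_field(series):
    """Encode a 1D segment as a 2D matrix with entries cos(phi_i + phi_j)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so that arccos is defined for every sample.
    scaled = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in series]
    scaled = [max(-1.0, min(1.0, v)) for v in scaled]  # guard against float drift
    phi = [math.acos(v) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]

# Hypothetical six-sample segment around an R-peak, for illustration only.
beat = [0.0, 0.1, 0.9, 0.2, -0.3, 0.0]
gasf = gramian_angular_summation_field(beat)
```

The result is a symmetric matrix whose size equals the segment length, so temporal relations between pairs of samples become spatial structure that a 2D CNN such as EfficientNet-B0 can consume.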

Dataset description

FusionHeartNet’s performance was evaluated using the MIT-BIH Arrhythmia Database, a publicly available benchmark for ECG research hosted at PhysioNet. The dataset includes 48 half-hour two-lead ambulatory ECG recordings sampled at 360 Hz, each annotated by clinical experts following the AAMI EC57 standard. For this study, a subset of the dataset was selected and mapped into five clinically relevant classes following standard class aggregation strategies. These include:

Normal sinus rhythm (NSR)

Premature ventricular contraction (PVC)

Atrial fibrillation (AF)

Left bundle branch block (LBBB)

Right bundle branch block (RBBB)

Class-wise distribution is summarized in Table 1 below, indicating a sufficient number of beat instances for each category. Because the distribution is dominated by the NSR class, class imbalance was addressed during model training using resampling strategies and weighted loss functions.

Table 1.

Class distribution table in the dataset.

Class label	ECG category	Annotation symbol(s)	Number of beats
Class 0	NSR	N, L, R	90,744
Class 1	PVC	V	7232
Class 2	AF	A, F	4690
Class 3	LBBB	L	8079
Class 4	RBBB	R	7205

AF: atrial fibrillation; LBBB: left bundle branch block; NSR: normal sinus rhythm; PVC: premature ventricular contraction; RBBB: right bundle branch block.
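One common way to derive weights for a weighted loss from the counts in Table 1 is inverse class frequency, sketched below. The weighting formula is an assumption, since the text does not specify which scheme was used; only the beat counts come from Table 1.

```python
# Beat counts per class, taken from Table 1.
beats_per_class = {"NSR": 90744, "PVC": 7232, "AF": 4690, "LBBB": 8079, "RBBB": 7205}

def inverse_frequency_weights(counts):
    """Weight w_c = N / (K * n_c), so rarer classes receive larger weights
    and every class contributes equally to the total weighted count."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}

weights = inverse_frequency_weights(beats_per_class)
```

Under this scheme the NSR class receives a weight below one, while the rarest class (AF) receives the largest weight.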

To ensure reliable validation, a patient-wise 10-fold cross-validation was performed to prevent information leakage between training and testing sets. Additionally, pre-segmentation into fixed-length windows centered on R-peaks was used for consistent input generation across all model branches.
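A patient-wise split can be sketched as follows: folds are assigned per recording rather than per beat, so no recording contributes beats to both the training and the test side. The round-robin assignment and the beat-to-record list are hypothetical; the text does not describe how records were allocated to folds.

```python
from collections import defaultdict

def patient_wise_folds(beat_records, n_folds):
    """Yield (train_idx, test_idx) pairs of beat indices such that every
    record's beats fall entirely on one side of each split."""
    records = sorted(set(beat_records))
    fold_of = {rec: i % n_folds for i, rec in enumerate(records)}
    by_fold = defaultdict(list)
    for idx, rec in enumerate(beat_records):
        by_fold[fold_of[rec]].append(idx)
    for k in range(n_folds):
        test_idx = list(by_fold[k])
        train_idx = [i for f in range(n_folds) if f != k for i in by_fold[f]]
        yield train_idx, test_idx

# Hypothetical record label for each beat, for illustration only.
beat_records = ["100", "100", "101", "103", "103", "103", "105", "106", "106", "107"]
splits = list(patient_wise_folds(beat_records, n_folds=3))
```

Splitting at the beat level instead would let beats from one recording appear in both partitions, which is the leakage the patient-wise protocol is meant to prevent.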

Output visualization and analysis

To qualitatively evaluate the effectiveness of the pre-processing pipeline, representative ECG waveforms were analyzed both before and after the denoising stage. Three diverse ECG segments were selected from different patients, each demonstrating varying levels of baseline drift, power-line interference, and motion-induced noise artifacts frequently encountered in ambulatory cardiac monitoring scenarios. This visual assessment in Figure 5 offers valuable insight into the signal enhancement process and highlights its ability to recover diagnostically relevant features from contaminated inputs.

Figure 5.

Output after pre-processing.

In the first sample, raw ECG signals displayed a visible low-frequency drift likely due to patient movement. The denoising pipeline effectively removed this baseline wander using a high-pass Butterworth filter, while preserving the QRS morphology. In the second sample, the original waveform suffered from superimposed high-frequency noise; wavelet-based thresholding eliminated these components, revealing a clean R-peak signature critical for temporal analysis. The third instance involved a noisy signal with irregular amplitude fluctuations and poorly defined P-waves. After denoising, key morphological components such as P-waves and T-waves became more distinct, improving the interpretability of rhythm-related features. These preprocessing steps are foundational to the accuracy of downstream modules. The denoised signals exhibit clearer temporal landmarks and better spectral concentration, which directly benefits the signal-domain analysis via TCN, as well as the fidelity of image-domain transformations such as CWT and GAF. This visual validation confirms the robustness of the preprocessing module and justifies its use as the first stage of the FusionHeartNet pipeline.
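The wavelet-thresholding step can be illustrated with a deliberately small example: a single-level Haar transform whose detail coefficients are soft-thresholded before reconstruction. The wavelet family, decomposition depth, and threshold value are assumptions chosen for brevity; the text does not state the settings used in the study, and the sample signal is hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)

def haar_dwt(signal):
    """Single-level Haar transform: pairwise scaled sums and differences."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * _S, (a - d) * _S])
    return out

def soft_threshold(coeffs, thr):
    """Shrink each coefficient toward zero by thr; small ones become zero."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]

def denoise(signal, thr):
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    approx, detail = haar_dwt(signal)
    return haar_idwt(approx, soft_threshold(detail, thr))

# Hypothetical noisy segment with one large deflection, for illustration only.
noisy = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 1.0, 0.9]
clean = denoise(noisy, thr=0.2)
```

With this threshold the small sample-to-sample fluctuations are removed while the large deflection is kept, which mirrors the intent of preserving QRS morphology while suppressing high-frequency noise.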

Performance metrics

The confusion matrix presented in Figure 6 illustrates the classification performance of the proposed FusionHeartNet model across five arrhythmia classes. Class 0 (normal sinus rhythm) exhibits exceptional classification accuracy, with 7452 correct predictions and only minor misclassifications: nine to Class 1, 10 to Class 2, three to Class 3, and two to Class 4. Class 1 (premature ventricular contractions) is predicted correctly in 224 instances, with small misclassifications of 26 samples into Class 0 and 2 into Class 2. Class 2 (atrial fibrillation) shows high fidelity with 708 correct predictions, while 15 samples are misclassified as Class 0 and 3 as Class 3. For Class 3 (left bundle branch block), the model correctly classifies 53 instances but misclassifies 17 as Class 0 and 12 as Class 2, indicating some confusion likely due to overlapping morphological patterns. Lastly, Class 4 (right bundle branch block) achieves 102 accurate predictions, with only six and two samples misclassified as Classes 0 and 2, respectively. Overall, the confusion matrix illustrates the model’s strong capability to accurately differentiate between major rhythm types and localized arrhythmic patterns. The consistently low off-diagonal entries indicate minimal misclassifications, underscoring the framework’s clinical reliability and its effectiveness in maintaining clear separation across diagnostic categories.

Figure 6.

Confusion matrix.

The receiver operating characteristic (ROC) curves for the FusionHeartNet framework in Figure 7 demonstrate consistently high classification performance across all five arrhythmia categories, reflecting the model’s strong discriminative capability. Each ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for varying classification thresholds. Remarkably, the area under the curve (AUC) for all classes reaches 1.0, signifying perfect separability between positive and negative samples for each class. An AUC of 1.0 means that, for a given class, the model scores every positive sample above every negative sample, so a decision threshold exists at which sensitivity and specificity are both at their maximum. Such a result is extremely rare in real-world ECG classification tasks, suggesting that the combination of dual-domain feature learning, attention-driven fusion, and optimized classification has resulted in a highly generalized and precise diagnostic model. This level of performance indicates that the FusionHeartNet architecture not only avoids underfitting or overfitting but also captures discriminative features robustly from both signal and image domains. The ROC plot thus serves as strong empirical evidence of the model’s clinical readiness for real-time arrhythmia detection.

Figure 7.

ROC plot.
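For reference, the per-class AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below assumes a one-vs-rest treatment of the multi-class problem; the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(prob_rows, y_true, num_classes):
    """One AUC per class, treating that class as positive and all others as negative."""
    return [
        roc_auc([row[c] for row in prob_rows], [1 if y == c else 0 for y in y_true])
        for c in range(num_classes)
    ]

# Hypothetical binary scores, for illustration only.
perfect = roc_auc([0.9, 0.8, 0.7, 0.3], [1, 1, 0, 0])
imperfect = roc_auc([0.9, 0.8, 0.35, 0.6, 0.2], [1, 1, 1, 0, 0])
```

An AUC of 1.0 therefore describes a perfect ranking of scores within each one-vs-rest problem; it is computed from the predicted probabilities rather than from the hard class decisions summarized in a confusion matrix.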

The training curves presented in Figure 8 demonstrate the convergence behavior and generalization ability of the proposed FusionHeartNet model over 100 epochs. The accuracy plot shows that the training accuracy steadily improves from 89.5% in the initial epoch to 99.3% by epoch 100. Correspondingly, the validation accuracy follows a similar trend, reaching 98.8%, indicating excellent generalization with minimal overfitting. A slight dip observed in both accuracy curves near epoch 45 indicates a brief phase of learning instability. However, this is rapidly corrected in the following epochs, attributable to the adaptive adjustments made by the learning rate scheduler. In the loss curve, the training loss exhibits a sharp decline from 0.38 in the initial epoch to under 0.03 by the end, while the validation loss closely parallels this trend, stabilizing around 0.06. A transient spike around epoch 45 in the validation loss aligns with the dip in accuracy, underscoring the model’s sensitivity to dynamic training conditions. This short-lived fluctuation further emphasizes the effectiveness of the BO-ALRS strategy in restoring training stability. Collectively, the plotted curves reflect smooth convergence, strong generalization, and minimal overfitting, confirming the robustness and reliability of the proposed model in processing complex multi-modal ECG data.

Figure 8.

(a) Accuracy and (b) loss curves during training and validation.

The FusionHeartNet framework delivers exemplary performance, as shown in Table 2, in the early detection of cardiac abnormalities using ECG data. With an overall classification accuracy of 98.47%, the model demonstrates a strong ability to differentiate between normal and pathological cardiac conditions. A precision score of 94.36% indicates that the system effectively reduces false positives, which is critical in clinical practice to avoid unnecessary diagnostics or interventions. The recall of 89.29% confirms that the model reliably detects true cardiac cases, including those with subtle or early-stage arrhythmic indicators. Balancing these metrics, the F1-score of 91.67% underscores the system’s capability to maintain diagnostic sensitivity without sacrificing specificity. Notably, the specificity of 99.85% reflects an exceptional ability to correctly identify healthy patients, minimizing false alarms. Additionally, the mean squared error (MSE) of 0.0559 indicates that the predicted class probabilities deviate little from the true labels. The kappa coefficient of 0.9311 suggests a near-perfect agreement between model predictions and actual clinical labels, well beyond chance, confirming diagnostic consistency and robustness. These results affirm the effectiveness of FusionHeartNet’s architectural design, anchored by DSFE, temporal convolutional networks (TCNs), image-based spatial feature extraction using GAF and CWT, and the MFAM. This unified and synergistic framework offers a comprehensive diagnostic tool capable of supporting reliable, interpretable, and early-stage cardiac screening in real-world settings. The final classification stage, optimized via BO-ALRS, ensures fine-tuning of hyperparameters for peak performance. Overall, FusionHeartNet significantly advances early cardiac diagnosis by merging temporal, spectral, and spatial insights into a cohesive, high-performing framework.

Table 2.

Performance of the proposed method.

Metric	Value
Accuracy	0.9847
Precision	0.9436
Recall (sensitivity)	0.8929
F1-score	0.9167
Specificity	0.9985
MSE	0.0559
Kappa coefficient	0.9311

MSE: mean squared error.
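The metrics in Table 2 can all be derived from a multi-class confusion matrix. The sketch below assumes macro averaging over classes and Cohen's kappa for the agreement statistic; the text does not state the averaging scheme, and the 3-class matrix is a hypothetical toy example rather than the study's data.

```python
def classification_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    precision, recall, f1, specificity = [], [], [], []
    for c in range(k):
        tp = cm[c][c]
        fp = col_sums[c] - tp
        fn = row_sums[c] - tp
        tn = total - tp - fp - fn
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
        specificity.append(tn / (tn + fp) if tn + fp else 0.0)

    accuracy = sum(cm[c][c] for c in range(k)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row_sums[c] * col_sums[c] for c in range(k)) / (total * total)
    kappa = (accuracy - expected) / (1.0 - expected)

    def macro(values):
        return sum(values) / len(values)

    return {
        "accuracy": accuracy,
        "precision": macro(precision),
        "recall": macro(recall),
        "f1": macro(f1),
        "specificity": macro(specificity),
        "kappa": kappa,
    }

# Hypothetical 3-class confusion matrix, for illustration only.
toy_cm = [[8, 1, 1], [2, 6, 2], [0, 1, 9]]
metrics = classification_metrics(toy_cm)
```

Note that a macro-averaged F1-score is the mean of per-class F1 values, so it need not equal the harmonic mean of the macro-averaged precision and recall.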

Class-wise performance of the proposed model

The classification performance metrics in Table 3 indicate the high discriminative capability of the proposed FusionHeartNet model across all arrhythmia classes. Class 0, representing the majority normal beats, achieved an exceptionally high precision of 99.55%, recall of 99.92%, and an F1-score of 99.74%, reflecting near-perfect classification on a large support of 67,535 samples. Class 1, with 2294 samples, also performed robustly with an F1-score of 95.25%, though its recall dipped to 91.72%, suggesting a few missed positive cases. Class 2 (support: 6402) reported precision, recall, and F1-score values of 98.13%, 95.95%, and 97.03%, respectively, demonstrating excellent detection despite its moderate class size. Class 3, the most challenging due to its limited support of 720 samples, showed reduced metrics with an F1-score of 81.00%, primarily driven by lower recall (76.39%), indicating room for improvement in rare class generalization. Conversely, Class 4 achieved an F1-score of 99.06% with both precision and recall above 98%, even on a relatively small sample size of 872. The model’s overall accuracy reached 99.39%, while macro-averaged and weighted-average F1-scores were 95.64% and 99.38%, respectively, underscoring its strong and balanced generalization across imbalanced class distributions.

Table 3.

Class-wise performance of the proposed model.

Class	Precision	Recall	F1-score
0	0.9955	0.9992	0.9974
1	0.9906	0.9172	0.9525
2	0.9813	0.9594	0.9703
3	0.8635	0.7639	0.8100
4	0.9962	0.9851	0.9906
Accuracy	0.9939
Macro Avg	0.9864	0.9242	0.9564
Weighted Avg	0.9939	0.9939	0.9938

K-fold validation of the proposed model

The 10-fold cross-validation results in Table 4 demonstrate the robustness and generalizability of FusionHeartNet across diverse data splits. With a mean accuracy of 98.47% (±0.14%), the framework consistently achieves high precision (94.46% ± 0.28%) and recall (89.15% ± 0.34%), indicating a balanced ability to correctly identify positive cases while minimizing false positives. The F1-score, averaging 91.74%, further confirms the effective trade-off between precision and recall. Notably, specificity remains exceptionally high at 99.86% (±0.04%), underscoring the model’s proficiency in avoiding false alarms in normal rhythm detection. The kappa coefficient of 0.9331 (±0.0043) indicates near-perfect agreement beyond chance, reflecting reliable classification performance. The low standard deviations across all metrics validate the model’s stability and reproducibility, essential for clinical deployment where consistency is critical.

Table 4.

K-fold validation of the proposed model.

Fold	Accuracy	Precision	Recall (sensitivity)	F1-score	Specificity	Kappa coefficient
1	0.9812	0.9401	0.8850	0.9119	0.9978	0.9254
2	0.9850	0.9465	0.8923	0.9189	0.9987	0.9359
3	0.9837	0.9442	0.8901	0.9156	0.9983	0.9316
4	0.9861	0.9488	0.8955	0.9206	0.9989	0.9383
5	0.9829	0.9413	0.8874	0.9129	0.9980	0.9271
6	0.9853	0.9472	0.8940	0.9196	0.9988	0.9361
7	0.9845	0.9459	0.8927	0.9180	0.9986	0.9337
8	0.9839	0.9437	0.8894	0.9149	0.9984	0.9300
9	0.9857	0.9481	0.8963	0.9213	0.9989	0.9388
10	0.9848	0.9450	0.8919	0.9169	0.9987	0.9342
Mean ± Std	0.9847 ± 0.0014	0.9446 ± 0.0028	0.8915 ± 0.0034	0.9174 ± 0.0033	0.9986 ± 0.0004	0.9331 ± 0.0043

Ablation study

The ablation study in Table 5 quantifies the incremental benefit contributed by each major component of FusionHeartNet. The baseline model, relying solely on signal-domain features, achieves moderate accuracy (92.74%) but is limited by reduced recall (81.56%) and specificity (96.37%), highlighting its inability to fully capture spatial and spectral nuances. Incorporating image-domain representations via GAF and CWT elevates accuracy to 95.16%, improving the model’s sensitivity to morphological patterns. The introduction of DSFE further enriches feature diversity, pushing accuracy to 96.23% and enhancing the balance between precision and recall. Employing the MFAM for fusion increases accuracy to 97.58%, demonstrating the efficacy of modality-aware feature weighting. Finally, the complete FusionHeartNet framework, optimized with BO-ALRS and utilizing the HFC, achieves a peak accuracy of 98.47%, together with the highest precision (94.36%), recall (89.29%), and specificity (99.85%), validating the synergistic effect of the integrated architecture and hyperparameter tuning on classification performance.

Table 5.

Ablation study of the proposed model.

Model variant	Accuracy	Precision	Recall (sensitivity)	F1-score	Specificity	Kappa coefficient
Baseline (signal domain only)	0.9274	0.8941	0.8156	0.8524	0.9637	0.8723
Baseline + image domain (GAF + CWT)	0.9516	0.9187	0.8569	0.8867	0.9789	0.9065
Baseline + DSFE	0.9623	0.9292	0.8721	0.8996	0.9832	0.9179
Baseline + DSFE + MFAM (attention fusion)	0.9758	0.9386	0.8837	0.9096	0.9911	0.9278
Full FusionHeartNet (with BO-ALRS and HFC)	0.9847	0.9436	0.8929	0.9167	0.9985	0.9311

Comparative study

Figure 9 illustrates a comparative analysis of classification accuracy across various DL algorithms for ECG-based heart disease detection. Among all evaluated models, the proposed FusionHeartNet method outperforms others with an accuracy close to 99.5%, demonstrating its superior diagnostic capability. Traditional architectures like CNN, 1D-CNN, and 2D-CNN maintain high performance, ranging between 98.6% and 99.2%, while advanced methods like ResNet-18, GANs, and DANet also show competitive results. However, certain combinations, such as PSO + CNN (96.7%) and particularly SRCNN (88.5%), show a significant drop in accuracy, highlighting limitations in capturing complex ECG signal patterns. Overall, the proposed method’s consistently high accuracy validates the effectiveness of its multi-modal fusion strategy and optimized classification pipeline in delivering reliable and precise heart disease detection.

Figure 9.

Comparative analysis based on classification accuracy.

The recall illustrated in Figure 10 shows the model’s sensitivity in correctly identifying true positive cases across the five arrhythmic classes: normal (N), atrial (A), ventricular (V), fusion (F), and f-wave (f). The proposed FusionHeartNet significantly outperforms both HFF and HMFF, particularly in the challenging F class, where it achieves a recall of over 80%, while HFF and HMFF exhibit drastic drops below 50% and 20%, respectively. For commonly occurring classes such as N, A, and V, the proposed model consistently maintains recall rates above 95%, indicating its robustness in identifying both prevalent and rare cardiac abnormalities. The f class also benefits from this improvement, with the proposed method maintaining high sensitivity, showcasing its effectiveness in handling subtle rhythm variations. This performance underscores the superiority of the dual-domain feature extraction and fusion strategies integrated into FusionHeartNet.

Figure 10.

Comparative analysis based on recall.

The precision shown in Figure 11 demonstrates the model’s capability to avoid false positives while classifying arrhythmic beats across the same five categories. The proposed method consistently achieves superior precision across all classes, particularly showing marked improvements in classes A, F, and f. Notably, in the F class, typically difficult due to its overlapping features, the proposed approach maintains a high precision of ~95%, significantly surpassing HFF and HMFF, which remain below 85%. This reduction in misclassification stems from the framework’s use of spatially enhanced image features via GAF and wavelet scalograms, coupled with attention-based feature integration. The near-perfect precision observed in the N and V classes highlights the model’s reliability and its potential for clinical deployment, where minimizing false alerts is crucial.

Figure 11.

Comparative analysis based on precision.

The F1-score, representing the harmonic mean of precision and recall, offers a balanced evaluation of the model’s classification performance as shown in Figure 12. The proposed FusionHeartNet maintains the highest F1-scores across all arrhythmic classes, with particularly striking performance in the V and F categories, where traditional models such as HFF and HMFF falter. For the F class, the proposed model achieves an F1-score above 80%, compared to HFF and HMFF, which drop to ~55% and 30%, respectively, demonstrating its robustness in handling difficult-to-detect classes. The performance in N, A, and f classes remains uniformly high, reaffirming the model’s stability and reliability across both balanced and imbalanced datasets. These results validate the effectiveness of the MFAM and the HFC in enhancing class-wise decision boundaries, ultimately improving the generalization of the model.

Figure 12.

Comparative analysis based on F1-score.

Conclusion

This study introduces FusionHeartNet, a novel multi-modal ECG diagnostic framework that directly addresses the limitations of existing unimodal, one-dimensional signal classifiers. By bridging the domain gap between temporal dynamics, spectral variance, and spatial morphology, the model effectively captures high-resolution, class-discriminative features crucial for early arrhythmia detection. The incorporation of the DSFE mechanism ensures comprehensive physiological characterization, while Gramian and scalogram-based encodings inject spatial intelligence into the feature space. The MFAM further refines these heterogeneous features, and the HFC, optimized via BO-ALRS, ensures scalable generalization even under data imbalance. Extensive validation demonstrates superior performance across recall, precision, and F1-score, especially in underrepresented arrhythmic classes, proving its real-world applicability. This work not only resolves long-standing challenges in ECG-based diagnostics but also establishes a replicable, clinically integrable architecture for future cardiovascular AI systems.

Footnotes

Acknowledgements

The authors would like to thank the Deanship of Sathyabama Institute of Science and Technology for supporting this work.

Author contributions

The authors confirm contribution to the paper as follows and all authors reviewed the results and approved the final version of the manuscript.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Ethical considerations

Institutional Review Board approval was not required.

Consent to participate

All contributors agreed and gave consent to participate.

Consent for publication

All contributors agreed and gave consent to publish.

ORCID iD

Anita Shivaji Gunjal

Data availability statement

No data, models, or code were generated or used during the study.

References

1. Ramesh, Lakshmanna. A novel early detection and prevention of coronary heart disease framework using hybrid deep learning model and neural fuzzy inference system. IEEE Access 2024; 12: 26683–26695.

2. Omkari, Shaik. An integrated two-layered voting (TLV) framework for coronary artery disease prediction using machine learning classifiers. IEEE Access 2024; 12: 56275–56290.

3. Rahman, Alsenani, Zafar, et al. Enhancing heart disease prediction using a self-attention-based transformer model. Sci Rep 2024; 14(1): 514.

4. Ingole, Ramineni, Bangad, et al. Advancements in heart disease prediction: a machine learning approach for early detection and risk assessment. arXiv preprint arXiv:2410.14738. 2024.

5. Rezk, Alshathri, Sayed, et al. XAI-augmented voting ensemble models for heart disease prediction: a SHAP and LIME-based approach. Bioengineering 2024; 11(10): 1016.

6. Mondal, Maity, Omo, et al. An efficient computational risk prediction model of heart diseases based on dual-stage stacked machine learning approaches. IEEE Access 2024; 12: 7255–7270.

7. Milosevic, Jin, Singh, et al. Applications of AI in multi-modal imaging for cardiovascular disease. Front Radiol 2024; 3: 1294068.

8. Chan, Parker, Bennett, et al. MedTsLLM: leveraging LLMs for multimodal medical time series analysis. arXiv preprint arXiv:2408.07773. 2024.

9. Soenksen, Zeng, et al. Integrated multimodal artificial intelligence framework for healthcare applications. NPJ Digit Med 2022; 5(1): 149.

10. Rahim, Rasheed, Azam, et al. An integrated machine learning framework for effective prediction of cardiovascular diseases. IEEE Access 2021; 9: 106575–106588.

11. El-Hasnony, Elzeki, Alshehri, et al. Multi-label active learning-based machine learning model for heart disease prediction. Sensors 2022; 22(3): 1184.

12. Bhavekar, Goswami AD. A hybrid model for heart disease prediction using recurrent neural network and long short term memory. Int J Inf Technol 2022; 14(4): 1781–1789.

13. Revathi, Balasubramaniam, Sureshkumar, et al. An improved long short-term memory algorithm for cardiovascular disease prediction. Diagnostics 2024; 14(3): 239.

14. Nandakumar, Subhashini. Heart disease prediction using convolutional neural network with elephant herding optimization. Comput Syst Sci Eng 2024; 48(1): 22–38.

15. Darolia, Chhillar, Alhussein, et al. Enhanced cardiovascular disease prediction through self-improved Aquila optimized feature selection in quantum neural network & LSTM model. Front Med 2024; 11: 1414637.

16. Priyanga, Pattankar, Sridevi. A hybrid recurrent neural network-logistic chaos-based whale optimization framework for heart disease prediction with electronic health records. Comput Intell 2021; 37(1): 315–343.

17. Ahmad, Tabassum, Guan, et al. ECG heartbeat classification using multimodal fusion. IEEE Access 2021; 9: 100615–100626.

18. Liu, et al. Detection of coronary artery disease using multi-domain feature fusion of multi-channel heart sound signals. Entropy 2021; 23(6): 642.

19. Irfan, Anjum, Althobaiti, et al. Heartbeat classification and arrhythmia detection using a multi-model deep-learning technique. Sensors 2022; 22(15): 5606.

20. Zheng, Guo, Yang, et al. Phonocardiogram transfer learning-based CatBoost model for diastolic dysfunction identification using multiple domain-specific deep feature fusion. Comput Biol Med 2023; 156: 106707.

21. Cheng, Wang, Liu, et al. Development and validation of a deep-learning network for detecting congenital heart disease from multi-view multi-modal transthoracic echocardiograms. Research 2024; 7: 0319.

22. Zhang, Han, White, et al. TabulaTime: a novel multimodal deep learning framework for advancing acute coronary syndrome prediction through environmental and clinical data integration. arXiv preprint arXiv:2502.17049. 2025.

23. Adeyi, Xiaoling, Uko, et al. CardioNet+: revolutionizing heart failure diagnosis with multi-modal learning, 2025. SSRN 5132486.