Abstract
Electrocardiogram (ECG)-based diagnostics are pivotal in early cardiac disorder detection, yet existing models often fail to integrate temporal, spectral, and spatial dynamics inherent in complex arrhythmic patterns. Most traditional approaches are unimodal, relying either on time-domain signal processing or spatially limited CNN models, thereby overlooking cross-domain dependencies and subtle morphological cues. Addressing this gap, this research proposes FusionHeartNet, a unified deep learning framework that fuses signal- and image-based representations using a dual-spectrum feature embedding (DSFE) strategy. DSFE synergistically extracts morphological descriptors and spectral signatures via Fourier and wavelet transforms, while spatial morphology is preserved through GAF and CWT scalograms. These dual-domain features are refined by a multi-focus attention module (MFAM) and classified through the heart fusion classifier (HFC), which is optimized using Bayesian optimization with adaptive learning rate scheduling (BO-ALRS). Experimental validation on the MIT-BIH Arrhythmia Database demonstrates an accuracy of 98.47%, F1-score of 91.67%, and kappa of 0.9311, significantly outperforming baseline models. FusionHeartNet sets a new benchmark for robust, multi-dimensional ECG analysis, offering clinically viable precision in early heart disease detection.
Introduction
Heart failure, which is defined by the heart’s diminished capacity to circulate blood effectively, is a significant global health issue. Conventional diagnostic practices have long been constrained by subjective clinical judgments and a narrow spectrum of data inputs, which have hindered progress in early detection and the development of personalized treatment pathways. 1 The intricate nature of cardiovascular diseases demands a more integrated and detailed diagnostic framework. Relying solely on single-modality data has proven inadequate for capturing the dynamic and multifactorial aspects of cardiac physiology, as increasingly recognized in recent medical literature. 2 In response, contemporary healthcare systems are shifting toward data-intensive diagnostic models that synthesize diverse physiological signals and imaging modalities, aiming to construct a more comprehensive and accurate representation of cardiac health. 3 Recent advances in artificial intelligence and machine learning (ML) are opening transformative avenues in cardiovascular diagnostics. These technologies enable the integration of diverse data streams, such as electrocardiographic (ECG) signals, photoplethysmography (PPG) data, and structural cardiac imaging, into unified diagnostic models. Such multimodal approaches offer a significant leap in accuracy and resilience compared to traditional methods that rely on isolated, single-source inputs. 4
ML and AI aim to develop predictive models by learning from data, emulating certain aspects of human cognitive processes. 5 Historically, conventional ML algorithms have been limited to single-modality data, such as either medical imaging or clinical textual records. This narrow focus diverges from the way humans interpret the world, which inherently involves the simultaneous integration of multiple sensory inputs, like visual cues and auditory signals, to form coherent judgments. 6 To enhance model performance beyond what single-modality approaches can achieve, researchers have increasingly focused on developing techniques that combine various types of data, such as visual and auditory inputs. The core principle behind multimodal ML is that each data type offers unique and complementary insights into a particular event or subject, such as recognizing emotions, identifying objects, or diagnosing illnesses. 7 Multimodal data encompasses diverse formats and sources, and the objective of fusion techniques is to transform these heterogeneous inputs, often differing in scale and distribution, into a unified feature space. This integrated representation can then be utilized for tasks like classification and prediction. 8
In recent years, the volume of cardiac-related data has surged dramatically, posing challenges to effective feature extraction. 9 To address these challenges, deep learning (DL) has emerged as a powerful tool for predicting heart disease. Deep neural networks stand out for their ability to automatically detect and extract features, delivering concise and accurate outcomes, especially in tasks like heartbeat classification. 10 Researchers have integrated a wide range of advanced DL methods to harness this capability, utilizing not only single-modal but also multi-modal and hybrid frameworks. 11 The layered architecture of deep neural networks captures and transforms features at multiple levels, enhancing model performance. Models such as recurrent neural networks (RNN), 12 long short-term memory (LSTM), 13 convolutional neural networks (CNN), 14 and various hybrid combinations 15 have been extensively used to overcome the limitations of traditional ML approaches, which often relied on manual and error-prone feature selection.
Although hybrid models may introduce increased computational cost and face limitations due to insufficient high-quality datasets, these issues are often outweighed by their advantages. Accurate heartbeat classification and reliable arrhythmia detection demand large datasets, making the benefits of DL-driven feature automation far more impactful in real-world applications. 16 The key contributions of this research are outlined below to highlight its novelty, technical rigor, and application significance:
A novel DSFE strategy integrates handcrafted temporal descriptors and frequency-domain features with deep-learned spatial representations to capture complex physiological and morphological characteristics of ECG signals.
A multi-modal signal-image fusion with attention enhances cross-domain feature interaction by selectively emphasizing clinically salient patterns across signal and image modalities.
An optimization-driven classification strategy employs BO-ALRS to fine-tune model hyperparameters, supporting stability, convergence, and diagnostic generalization across arrhythmic classes.
Extensive experimentation on the MIT-BIH Arrhythmia dataset validates the proposed FusionHeartNet, which achieves superior accuracy, F1-score, and class-wise robustness, particularly for hard-to-detect arrhythmias.
The structure of the paper is as follows: Section 2 reviews recent research developments and challenges in ECG signal processing and multimodal DL techniques. Section 3 outlines the proposed approach, detailing the FusionHeartNet architecture. Section 4 presents experimental findings, including performance metrics, k-fold cross-validation results, ablation analyses, and relevant visual interpretations. Section 5 concludes the paper by summarizing the main contributions, discussing practical applications, and suggesting future research directions. The “References” section provides a comprehensive list of sources that support the theoretical and technical aspects of this work.
Literature survey
Multimodal learning, which combines several data representations instead of training on a single one, is gaining importance in medical diagnostics. Ahmad et al. 17 introduced two efficient multimodal fusion strategies for ECG heartbeat classification: multimodal image fusion (MIF) and multimodal feature fusion (MFF). Both approaches transform raw ECG signals into three image types: Gramian angular field (GAF), recurrence plot (RP), and Markov transition field (MTF). The MIF method fuses these images into a single composite image and processes it using a CNN to extract features. In contrast, MFF extracts deep features from the penultimate layers of separate CNNs and combines them to capture both distinct and shared information. These combined features are then used to train an SVM classifier for heartbeat classification. The authors validated both methods on the MIT-BIH Arrhythmia dataset for arrhythmia detection and the PTB diagnostic ECG dataset for myocardial infarction (MI) classification, demonstrating strong performance across multiple conditions.
Liu et al. 18 proposed a method for detecting coronary artery disease (CAD) using multi-domain feature fusion from multi-channel heart sound signals. Their approach combines entropy and cross-entropy features to capture complex signal patterns linked to CAD. The study involved 36 participants, 21 with CAD and 15 healthy controls, with heart sounds recorded over 5 min using a five-channel setup. They extracted features from the time, frequency, entropy, and cross-entropy domains and used the optimal set to train a support vector machine (SVM) classifier. The classification accuracy improved from 78.75% with single-channel data to 86.70% using multi-channel input, and further reached 90.92% after incorporating entropy-based features.
Cardiac arrhythmias remain a serious health concern, with ECG serving as the primary diagnostic tool. However, manual ECG interpretation is often time-consuming and prone to inefficiencies. While machine learning has been widely adopted for ECG analysis, it typically demands lengthy training and manual feature engineering. To address these limitations, Irfan et al. 19 proposed an innovative deep learning framework that stacks similar network layers to form a unified and robust model. Their approach achieved impressive results, with a sensitivity of 98.37%, a specificity of 99.59%, a positive predictive value of 98.41%, and an overall accuracy of 99.35%. These metrics surpassed those of conventional models, both in performance and computational efficiency.
Zheng et al. 20 proposed a transfer learning-based CatBoost model using phonocardiogram (PCG) signals for non-invasive detection of left ventricular diastolic dysfunction. They generated four types of spectrograms to capture key patterns within PCG signals and employed four pre-trained CNNs to extract deep, domain-specific features. To improve classification performance, they applied dimensionality reduction techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) on selected feature subsets. These features were then fused and used as input to a CatBoost classifier. For comparison, the study also evaluated three additional ML models. The proposed method outperformed all others, achieving an AUC of 0.911, accuracy of 0.882, sensitivity of 0.821, specificity of 0.927, and an F1-score of 0.892.
Cheng et al. 21 analyzed 2D and Doppler transthoracic echocardiography (TTE) images from 1932 pediatric patients across two clinical cohorts at Beijing Children’s Hospital between 2018 and 2022. They developed a deep learning framework that automatically identifies standard cardiac views, integrates multimodal and multiview data, highlights anatomically high-risk regions, and predicts the presence of congenital heart defects (CHDs) such as atrial septal defect (ASD) and ventricular septal defect (VSD). The system achieved a mean accuracy of 0.989 in classifying cardiac views and 0.996 in screening for CHDs using five standardized TTE views across both 2D and Doppler modalities. Additionally, it reached a classification accuracy of 0.991 in both within-center and cross-center evaluations for distinguishing healthy cases from those with ASD or VSD. By leveraging diverse imaging perspectives, the proposed model significantly enhances the precision and accessibility of non-invasive CHD screening in pediatric care.
Acute coronary syndromes (ACS) remain a leading cause of mortality worldwide, yet conventional cardiovascular risk scoring (CVRS) systems frequently neglect environmental contributors such as air pollution. To bridge this gap, Zhang et al. 22 introduced TabulaTime, an advanced multimodal DL framework designed to integrate traditional clinical risk factors with environmental exposure data, particularly air pollution metrics. This approach enhances the predictive capability of ACS risk models by incorporating non-traditional yet impactful determinants of cardiovascular health. TabulaTime introduces three innovations: multimodal feature integration, PatchRWKV for automatic extraction of complex temporal patterns, and enhanced interpretability using attention mechanisms. The experimental evaluation showed TabulaTime achieves a 20.5% improvement in accuracy compared to traditional models, surpassing other models by 20.5%–32.2%. The integration of air pollution data improves accuracy by 10.1%, emphasizing the importance of environmental factors.
The CardioNet+ framework developed by Adeyi et al. 23 is a significant advancement in DL for heart failure detection, combining functional and structural characteristics. It achieved an accuracy rate of 99.1%, an F1-score of 98.3%, and an AUC–ROC of 99.0%, outperforming single-modal models. The framework uses multi-head attention to capture temporal ECG/PPG features and ResNet-50’s pre-trained architecture to learn spatial features from X-ray images. The application of SMOTE effectively addresses class imbalance, thereby improving model generalizability, particularly for underrepresented categories such as heart failure cases. To facilitate clinical interpretability and trust, the framework incorporates Grad-CAM for visual explanation of chest X-ray predictions and attention-based heatmaps for ECG and PPG signals. These interpretability mechanisms support transparent decision-making and promote seamless integration of the model into routine clinical workflows.
Proposed methodology
ECG signals are a cornerstone in cardiovascular diagnostics, providing a non-invasive means of capturing detailed physiological and pathological information related to cardiac function. Acknowledging the shortcomings of traditional diagnostic methods, this study presents FusionHeartNet, a unified deep learning framework specifically designed for the early detection of heart disease. By leveraging a multimodal DL architecture, the proposed system enables precise and timely classification of arrhythmias, as illustrated in Figure 1. FusionHeartNet addresses both data heterogeneity and diagnostic complexity, offering a robust solution for enhancing clinical decision-making in cardiology.

Figure 1. Architecture of the proposed framework.
Traditional electrocardiographic analysis techniques, often confined to either time-domain or frequency-domain processing, frequently fall short in capturing the intricate interplay between temporal transitions and spectral variations, both of which are vital for detecting nuanced, early-stage arrhythmic signatures. To bridge this gap, the proposed FusionHeartNet framework introduces a DSFE strategy that extracts complementary time-based morphological indicators (e.g. QRS duration, R–R intervals) alongside frequency-domain descriptors derived from Fourier and wavelet transformations. These enriched features are then modeled using a temporal convolutional network (TCN), which effectively captures long-range dependencies and subtle temporal fluctuations inherent in ECG signals. A further shortcoming of conventional approaches lies in their predominantly one-dimensional analytical perspective, which often disregards spatial morphology, an essential factor in enhancing pathological pattern recognition. To address this, FusionHeartNet transforms ECG signals into image-based representations using GAF and CWT scalograms. These spatial encodings are subsequently analyzed by EfficientNet-B0, a deep convolutional neural network that excels in identifying fine-grained visual patterns and transient anomalies.
Moreover, existing methodologies frequently suffer from the absence of a cohesive fusion mechanism to integrate multi-domain features, leading to fragmented interpretations and reduced diagnostic reliability. FusionHeartNet overcomes this challenge by incorporating a multi-focus attention module (MFAM), which leverages a multi-head attention architecture to dynamically prioritize the most salient features across both signal-based and image-based modalities. The final diagnostic decision is made through the HFC, an optimized ensemble classification network whose performance is enhanced using BO-ALRS. This process optimizes hyperparameters, including learning rate, dropout rate, and attention depth, to enhance model generalization and robustness. In summary, FusionHeartNet represents a comprehensive and unified diagnostic paradigm that seamlessly integrates temporal, spectral, and spatial feature domains. By doing so, it significantly advances the state of automated cardiovascular diagnostics, enabling accurate and early detection of heart disease with high clinical relevance.
Dual-spectrum feature embedding (DSFE)
Conventional ECG classification methods often rely on either time-domain or frequency-domain analysis in isolation, which limits their ability to capture the intricate interdependencies between temporal dynamics and spectral shifts—especially critical in detecting transient or evolving arrhythmic episodes. Moreover, these approaches typically treat ECG signals as one-dimensional time series, thereby overlooking morphological characteristics that are more effectively represented and interpreted through spatially structured formats. This limitation restricts the capacity of such models to recognize complex patterns indicative of pathological states.
The DSFE strategy mitigates these shortcomings by employing a bimodal feature extraction approach, wherein the ECG signal is concurrently analyzed as a raw time-series waveform in the signal domain and as a transformed spatial representation in the image domain, as illustrated in Figure 2. This approach captures temporal-frequency descriptors alongside morphological textures, generating a hybrid representation that forms a dual-spectrum embedding. As a result, the framework facilitates comprehensive learning from diverse physiological signal characteristics, improving the detection of subtle and complex cardiac anomalies.

Figure 2. Flow diagram of the proposed DSFE method.
Signal domain representation
The signal domain processing of the proposed DSFE module is crafted to extract diagnostically relevant information by blending handcrafted physiological descriptors with learned deep features. This dual-pathway design addresses a notable limitation in existing methods that often rely solely on either rule-based feature extraction or end-to-end learning, failing to incorporate domain knowledge into the learning loop.
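Let $x(t)$ denote the pre-processed ECG signal. Handcrafted physiological descriptors, such as the QRS duration and the R–R interval, are computed from each beat.
These handcrafted features form the morphological feature vector:
$$F_{morph} = [f_1, f_2, \dots, f_m]$$
Here, $f_i$ denotes the $i$-th handcrafted descriptor and $m$ is the number of descriptors.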
To complement temporal morphology, frequency-based features are extracted using two complementary signal decomposition techniques:
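The process begins by applying the short-time Fourier transform (STFT) to extract spectral energy localized in both time and frequency. The signal $x(t)$ is windowed by a function $w(t)$ centered at time $\tau$ and then Fourier transformed:
$$X(\tau, f) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j 2 \pi f t}\, dt$$
This yields a 2D time-frequency energy map $S(\tau, f) = |X(\tau, f)|^2$.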
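The discrete wavelet transform (DWT) subsequently decomposes the signal into multiple frequency sub-bands through multiresolution analysis:
$$x(t) = \sum_{k} a_{J,k}\, \phi_{J,k}(t) + \sum_{j=1}^{J} \sum_{k} d_{j,k}\, \psi_{j,k}(t)$$
where $\phi_{J,k}$ and $\psi_{j,k}$ are the scaling and wavelet functions, $a_{J,k}$ are the approximation coefficients at the coarsest level $J$, and $d_{j,k}$ are the detail coefficients at level $j$.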
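The two decompositions described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the Hann window, the window length and hop size, the Haar wavelet, and the three decomposition levels are all assumptions chosen for brevity, since the paper does not state these settings here.

```python
import numpy as np

def stft_energy(x, win_len=64, hop=16):
    """Return the STFT energy map |X|^2 with shape (n_frames, win_len // 2 + 1)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(win_len)  # assumed window function
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def haar_dwt(x, levels=3):
    """Multi-level Haar DWT; returns [a_L, d_L, ..., d_1]."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) % 2:  # pad odd lengths by repeating the last sample
            approx = np.append(approx, approx[-1])
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail (high-pass) band
        approx = (even + odd) / np.sqrt(2.0)         # approximation (low-pass) band
    return [approx] + details[::-1]

# Synthetic 2-second test tone at the MIT-BIH sampling rate (360 Hz).
fs = 360
t = np.arange(2 * fs) / fs
tone = np.sin(2 * np.pi * 10 * t)
energy = stft_energy(tone)
coeffs = haar_dwt(tone, levels=3)
```

Because the Haar transform is orthonormal, the coefficient energy equals the signal energy for even-length inputs, which is a convenient sanity check before feeding sub-band statistics to a downstream model.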
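A temporal convolutional network (TCN) processes the fused temporal and frequency-based features, leveraging causal and dilated convolutions to model long-range sequential dependencies with high temporal fidelity. For a given input sequence $x$ and a filter $f$ of size $k$, the dilated causal convolution at time step $s$ is:
$$F(s) = \sum_{i=0}^{k-1} f(i)\, x_{s - d \cdot i}$$
where $d$ is the dilation factor and the index $s - d \cdot i$ refers only to current and past samples, which keeps the operation causal.
The dilation factor $d$ sets the spacing between filter taps, so larger values widen the receptive field without adding parameters.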
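A small NumPy sketch can make the causal, dilated convolution concrete. It assumes zero padding on the left and a single channel; the function names and the example dilation schedule are illustrative and not taken from the paper.

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """y[s] = sum_i kernel[i] * x[s - dilation * i], with x[<0] treated as 0."""
    x = np.asarray(x, dtype=float)
    k = len(kernel)
    pad = dilation * (k - 1)
    padded = np.concatenate([np.zeros(pad), x])  # left padding keeps the op causal
    y = np.zeros_like(x)
    for i, w in enumerate(kernel):
        start = pad - dilation * i
        y += w * padded[start:start + len(x)]
    return y

def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal layers (one conv per layer)."""
    return 1 + (kernel_size - 1) * sum(dilations)

x = np.arange(8, dtype=float)
y = causal_dilated_conv(x, kernel=[1.0, 1.0], dilation=2)  # y[s] = x[s] + x[s-2]
rf = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

With exponentially growing dilations, four layers with kernel size 2 already cover 16 samples, which is why dilation is attractive for long ECG windows.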
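The final signal domain embedding is:
$$F_{signal} = \mathrm{TCN}\left(\left[F_{morph} \,\Vert\, F_{STFT} \,\Vert\, F_{DWT}\right]\right)$$
where $F_{morph}$ is the morphological feature vector, $F_{STFT}$ and $F_{DWT}$ are the STFT- and DWT-derived features, and $\Vert$ denotes concatenation.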
While the signal domain captures temporal-spectral patterns as given in Figure 3, essential for functional diagnosis, spatial image representations more effectively reveal structural and morphological distortions in the ECG, such as waveform bifurcations and baseline drifts. Hence, the next module focuses on image-domain embeddings derived from signal-to-image transformations.

Figure 3. Process flow for signal domain-based features.
Image domain representation
The ECG signal is transformed into two-dimensional image representations, and two complementary techniques are employed to recover spatial and morphological details that are lost in one-dimensional sequences. This study introduces a dual-image encoding strategy to address the critical gap in recognizing spatially distributed morphological distortions, such as waveform bifurcations, baseline wander, and arrhythmic spikes that are often undetectable in traditional one-dimensional signal analysis.
This image-based module, given in Figure 4, augments temporal signal features by embedding ECG signals in a two-dimensional space, enabling advanced convolutional feature extraction and enhanced visual pattern learning. Let the pre-processed ECG signal be denoted as $x = \{x_1, x_2, \dots, x_N\}$.

Figure 4. Process flow diagram of the image domain-based feature module.
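The GAF transformation first rescales the time series to $[-1, 1]$ and then maps the normalized series $\tilde{x}$ into a polar coordinate system, defined by:
$$\phi_i = \arccos(\tilde{x}_i), \quad r_i = \frac{t_i}{N}, \quad -1 \le \tilde{x}_i \le 1$$
where $t_i$ is the time stamp of sample $i$ and $N$ is the number of samples.
The Gramian angular summation field is then constructed as:
$$G_{ij} = \cos(\phi_i + \phi_j)$$
This operation preserves the temporal dependency and encodes correlations in angular space, producing a symmetric image $I_{GAF} \in \mathbb{R}^{N \times N}$.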
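The Gramian angular summation field can be computed directly from its definition. The sketch below assumes min-max rescaling to [-1, 1] and a non-constant input segment; the function name `gasf` and the 64-sample test segment are illustrative.

```python
import numpy as np

def gasf(x):
    """Gramian angular summation field of a 1D series (assumes x is not constant)."""
    x = np.asarray(x, dtype=float)
    x_tilde = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0  # rescale to [-1, 1]
    x_tilde = np.clip(x_tilde, -1.0, 1.0)                      # guard against rounding
    phi = np.arccos(x_tilde)                                   # angular encoding
    return np.cos(phi[:, None] + phi[None, :])                 # G_ij = cos(phi_i + phi_j)

segment = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
image = gasf(segment)
```

The main diagonal equals $2\tilde{x}_i^2 - 1$, so the normalized series can be recovered up to sign from the image, and the matrix is symmetric by construction.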
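Simultaneously, CWT scalograms are generated to provide a time-frequency representation that effectively captures frequency variations and short-duration spikes. The CWT, based on a chosen mother wavelet $\psi(t)$, is defined as:
$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt$$
where $a > 0$ is the scale, $b$ is the translation, and $\psi^{*}$ is the complex conjugate of $\psi$.
The resulting scalogram matrix $I_{CWT} = |W(a, b)|^2$ represents the signal energy over time and scale.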
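As a rough illustration of how a scalogram responds to a short-duration spike, the sketch below evaluates a CWT by direct convolution. The Ricker (Mexican hat) wavelet, the scale set, and the kernel support of about ten times the scale are assumptions made for this example; the paper does not specify its mother wavelet at this point.

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican hat) wavelet sampled on `points` samples at scale `a`."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt_scalogram(x, scales):
    """Return |W(a, b)|^2 with shape (len(scales), len(x))."""
    x = np.asarray(x, dtype=float)
    rows = []
    for a in scales:
        points = int(10 * a) | 1  # odd support so the kernel is centered
        if points > len(x):
            raise ValueError("scale too large for the signal length")
        # The Ricker wavelet is real and symmetric, so convolution equals correlation.
        rows.append(np.convolve(x, ricker(points, a), mode="same"))
    return np.abs(np.array(rows)) ** 2

spike = np.zeros(256)
spike[128] = 1.0  # idealized transient
scalogram = cwt_scalogram(spike, scales=[2, 4, 8])
```

Each row of the scalogram peaks at the spike location, with the energy spreading over a wider time span as the scale grows.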
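The GAF image $I_{GAF}$ and the CWT scalogram $I_{CWT}$ are then processed by EfficientNet-B0, whose compound scaling jointly balances network depth, width, and input resolution.
This controlled scaling mechanism preserves computational efficiency while enabling the model to effectively learn complex hierarchical features. Then, all input images are passed through a convolutional backbone, resulting in the generation of latent feature maps:
$$F_{GAF} = \mathrm{EffNet}(I_{GAF}), \quad F_{CWT} = \mathrm{EffNet}(I_{CWT})$$
These representations are then concatenated as follows:
$$F_{image} = F_{GAF} \,\Vert\, F_{CWT}$$
where $\Vert$ denotes vector concatenation and $F_{image}$ is the resulting image-domain embedding.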
This dual-view embedding overcomes the limitation of existing image-based methods that typically depend on a single image modality, which often fails to fully utilize both phase and frequency information. By combining GAF and CWT, the approach captures the ECG signal’s spatiotemporal and spectral morphology from multiple perspectives, thereby enhancing the accuracy of classifying complex arrhythmic patterns. Having derived a comprehensive representation from both signal and image domains, the next logical step is the effective integration of these heterogeneous features. The following subsection describes the multi-modal fusion and classification pipeline, FusionNet, which synergistically aligns and discriminates these complementary embeddings for robust early detection.
FusionNet
After obtaining dual-stream features from the signal and image domains, FusionNet performs alignment, selection, optimization, and classification of these heterogeneous features in an end-to-end trainable pipeline. Most existing models use simple concatenation to fuse multimodal features, without modeling their relative importance or resolving dimensional misalignment. Furthermore, these models lack a principled strategy for hyperparameter tuning, leading to non-optimal configurations and overfitting, especially in limited medical datasets. FusionNet combines multimodal attention-based fusion, Bayesian-guided hyperparameter tuning, and a deep neural classification architecture. This integration facilitates both effective feature fusion and adaptive learning, dynamically aligning with the relevance of input features and the size of the dataset.
Multi-focus attention module (MFAM)
Integrating heterogeneous features from both signal and image domains remains a critical challenge in multimodal ECG analysis. Traditional fusion strategies, such as direct concatenation or fixed-weight averaging, often overlook the distinct significance and complex interrelations inherent to each modality. As a result, they tend to introduce redundant information and compromise diagnostic precision. This limitation underscores the need for a more sophisticated fusion mechanism, one capable of adaptively highlighting the most salient features from each domain while maintaining their complementary synergy.
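The MFAM addresses this critical limitation by employing a multi-head self-attention mechanism designed to independently and jointly learn attention maps over signal-domain features $F_{signal}$ and image-domain features $F_{image}$. Let $X$ denote the stacked signal- and image-domain features.
Each attention head $i$ computes its own queries, keys, and values:
$$Q_i = X W_i^{Q}, \quad K_i = X W_i^{K}, \quad V_i = X W_i^{V}$$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are learnable projection matrices.
The scaled dot-product attention for each head is computed as:
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$
Here, $d_k$ is the dimensionality of the key vectors.
The multi-head attention features are then concatenated across all $h$ heads and projected via an output matrix $W^{O}$:
$$F_{fused} = \left[\mathrm{head}_1 \,\Vert\, \cdots \,\Vert\, \mathrm{head}_h\right] W^{O}$$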
The MFAM architecture is designed to capture diverse regions of importance across signal and image modalities, effectively modeling intricate dependencies, such as temporal fluctuations that correspond with localized spatial anomalies. In contrast to simplistic fusion approaches that uniformly weight all inputs, MFAM employs domain-aware adaptive weighting, dynamically adjusting attention scores based on modality-specific relevance. This enables the model to downregulate noisy or conflicting features from one domain when stronger, corroborating signals are present in the other. Furthermore, it strengthens cross-modal synergy by aligning temporally significant signal patterns with spatial anomalies in the image domain. Importantly, MFAM maintains the unique discriminative strengths of each modality while learning their interrelationships, an aspect often neglected by conventional fusion techniques.
This results in a robust fused feature vector $F_{fused}$ that maximizes informative content and minimizes interference.
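The attention-based fusion can be sketched with plain NumPy. This example treats the signal and image embeddings as two tokens of equal width, which is an assumption made for illustration; the embedding size, the number of heads, and the random projection matrices are likewise hypothetical and stand in for learned weights.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(tokens, w_q, w_k, w_v, w_o, n_heads):
    """Multi-head scaled dot-product self-attention over `tokens` (n_tokens, d_model)."""
    n_tokens, d_model = tokens.shape
    d_k = d_model // n_heads

    def split(m):  # (n_tokens, d_model) -> (n_heads, n_tokens, d_k)
        return m.reshape(n_tokens, n_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k), axis=-1)
    heads = weights @ v                                        # (n_heads, n_tokens, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4                  # hypothetical sizes
f_signal = rng.normal(size=d_model)       # stand-in for the signal-domain embedding
f_image = rng.normal(size=d_model)        # stand-in for the image-domain embedding
tokens = np.stack([f_signal, f_image])    # one token per modality (assumption)
w_q, w_k, w_v, w_o = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                      for _ in range(4))
fused_tokens, attn = multi_head_attention(tokens, w_q, w_k, w_v, w_o, n_heads)
f_fused = fused_tokens.reshape(-1)        # flattened fused feature vector
```

Each row of `attn[h]` shows how strongly one modality attends to itself and to the other modality in head `h`, which is the quantity a domain-aware weighting scheme would adapt during training.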
After feature fusion, the fused feature set $F_{fused}$ is passed to the HFC for classification.
To achieve optimal learning efficiency, key hyperparameters, including learning rate, dropout rate, convolutional filter dimensions, and the number of attention heads, are systematically optimized using BO-ALRS. This adaptive tuning framework addresses the inefficiencies of manual hyperparameter selection, promoting faster convergence and improved model generalization. By uniting domain-aware attention fusion with automated classifier optimization, FusionNet establishes a robust and unified pipeline that fully exploits the richness of multimodal ECG features, thereby enabling more accurate and earlier detection of cardiac abnormalities.
BO-ALRS hyperparameter optimization
Deep architectures like FusionNet, which process multimodal inputs from both signal and image domains, involve a highly complex hyperparameter space. Parameters such as dilation rates in TCNs, convolutional kernel sizes, number of attention heads, dropout rates, and learning rate schedules all play critical roles in determining model convergence, generalization capacity, and diagnostic performance. Conventional tuning strategies such as manual adjustment, grid search, or random sampling are often computationally intensive and overlook the intricate interdependencies among these parameters. To overcome this, the present study introduces BO-ALRS, a dual-stage optimization mechanism that is both data-driven and sensitive to convergence dynamics, thereby addressing a key shortcoming in existing model optimization practices.
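Bayesian optimization (BO) provides a principled approach to optimizing black-box functions with expensive evaluations, which makes it well suited to tuning DL pipelines. Let the objective function $f(\theta)$ denote the validation objective obtained with hyperparameter configuration $\theta$. BO places a Gaussian process (GP) prior over it:
$$f(\theta) \sim \mathcal{GP}\big(m(\theta),\, k(\theta, \theta')\big)$$
where $m(\theta)$ is the mean function and $k(\theta, \theta')$ is the covariance kernel.
An acquisition function $\alpha(\theta)$ then selects the next configuration to evaluate, given the observations $\mathcal{D}_t$ collected so far:
$$\theta_{t+1} = \arg\max_{\theta}\, \alpha(\theta \mid \mathcal{D}_t)$$
The parameters optimized include the learning rate, dropout rate, TCN dilation rate, convolutional filter dimensions, and number of attention heads.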
BO iteratively updates the GP posterior after each evaluation, narrowing in on optimal configurations with fewer evaluations than exhaustive methods.
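The GP-update loop can be illustrated on a one-dimensional toy problem. Everything specific here is an assumption: the RBF kernel and its length scale, the expected-improvement acquisition function, the grid search over candidates, and `toy_validation_loss`, which is a hypothetical stand-in for an actual training run. The study itself reports using Optuna for this step.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """GP posterior mean and std at x_new (constant prior mean = mean of y_obs)."""
    y_mean = y_obs.mean()
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_oo, k_on)
    mean = y_mean + solved.T @ (y_obs - y_mean)
    var = 1.0 - np.sum(k_on * solved, axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mean, std, best):
    """Expected improvement for minimization."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf

def toy_validation_loss(u):
    # Hypothetical objective: u is a normalized hyperparameter in [0, 1].
    return (u - 0.3) ** 2

def bayes_opt(objective, n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)  # candidate configurations
    x_obs = rng.uniform(0.0, 1.0, size=n_init)
    y_obs = np.array([objective(u) for u in x_obs])
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, grid)  # refit GP on all observations
        u_next = grid[np.argmax(expected_improvement(mean, std, y_obs.min()))]
        x_obs = np.append(x_obs, u_next)
        y_obs = np.append(y_obs, objective(u_next))
    best = int(np.argmin(y_obs))
    return x_obs[best], y_obs[best]

best_u, best_loss = bayes_opt(toy_validation_loss)
```

In practice each call to the objective would train and validate the network once, so the small number of evaluations is the point of the method.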
While BO optimizes static hyperparameters, dynamic adaptation during training is governed by ALRS, which adjusts the learning rate $\eta_t$ at each epoch $t$ according to the observed convergence behavior.
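The exact ALRS update rule is not recoverable from the text here, so the following is only one plausible reading: a reduce-on-plateau rule that halves the learning rate when the validation loss stops improving. The function name, `factor`, `patience`, `tol`, and the example loss values are all hypothetical.

```python
def adaptive_lr_schedule(val_losses, lr0=1e-3, factor=0.5, patience=2,
                         min_lr=1e-6, tol=1e-4):
    """Return the learning rate used after each epoch's validation loss.

    The rate is multiplied by `factor` whenever the loss has failed to improve
    by more than `tol` for `patience` consecutive epochs.
    """
    lr, best, stale = lr0, float("inf"), 0
    history = []
    for loss in val_losses:
        if loss < best - tol:
            best, stale = loss, 0          # meaningful improvement: reset counter
        else:
            stale += 1
            if stale >= patience:          # plateau detected: decay the rate
                lr, stale = max(lr * factor, min_lr), 0
        history.append(lr)
    return history

losses = [0.90, 0.70, 0.69995, 0.70, 0.65, 0.66, 0.66]  # made-up validation losses
rates = adaptive_lr_schedule(losses)
```

For the made-up losses above, the rate stays at its initial value for three epochs, drops once after the first plateau, and drops again after the second.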
Heart fusion classifier (HFC)
The final classification phase in the FusionNet pipeline is managed by the HFC, which consumes fused, attention-refined, and hyperparameter-optimized representations derived from dual-modal ECG feature sources. This classifier operates not just as a terminal decision layer but as an architecturally aware module that complements the fused representation’s statistical and clinical structure. While prior works often employ generic fully connected layers with SoftMax activation as end-classifiers, these designs lack compatibility with multi-modal fusion pipelines, often resulting in information bottlenecks or feature dilution. HFC overcomes this by being structurally synchronized with the upstream components—namely the MFAM and BO-ALRS, which ensures that the features passed into it are contextually aligned, de-noised, and statistically harmonized.
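The classifier receives a fused vector $F_{fused} \in \mathbb{R}^{d}$ and processes it through the following stages.
Non-linear feature projection
The first dense layer performs an affine transformation followed by a ReLU activation:
$$h_1 = \mathrm{ReLU}(W_1 F_{fused} + b_1)$$
where $W_1$ and $b_1$ are the weight matrix and bias vector of the first dense layer.
Dropout regularization
To minimize overfitting, a dropout layer is applied:
$$\tilde{h}_1 = \mathrm{Dropout}(h_1;\, p)$$
where $p$ is the dropout rate.
Secondary projection layer
Another dense layer with ReLU compresses the representation:
$$h_2 = \mathrm{ReLU}(W_2 \tilde{h}_1 + b_2)$$
SoftMax output layer
Finally, multi-class predictions are made using SoftMax activation:
$$\hat{y} = \mathrm{SoftMax}(W_3 h_2 + b_3)$$
Here, $\hat{y} \in \mathbb{R}^{C}$ holds the predicted probabilities for the $C$ arrhythmia classes.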
HFC addresses misalignment between fused feature design and classification objectives in existing ECG classification architectures. It introduces a modality-aligned classifier design, reducing feature collapse and improving class separation. HFC consistently improves F1-scores and AUC values compared to traditional classifiers, confirming its clinical relevance and robustness. The final classification output provides high-accuracy predictions, maintains probabilistic interpretability, and allows explainability and compliance with modern AI governance frameworks in healthcare.
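The four classifier stages described above (dense projection, dropout, secondary projection, SoftMax) amount to a short forward pass. The sketch below uses randomly initialized weights and hypothetical layer widths (32, 64, 32); only the five output classes come from the dataset description. Inverted dropout is an implementation choice assumed here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def hfc_forward(f_fused, params, drop_rate=0.3, training=False, rng=None):
    """Dense+ReLU -> dropout -> dense+ReLU -> softmax over arrhythmia classes."""
    h1 = relu(params["W1"] @ f_fused + params["b1"])
    if training:
        mask = rng.random(h1.shape) >= drop_rate
        h1 = h1 * mask / (1.0 - drop_rate)  # inverted dropout keeps the expectation
    h2 = relu(params["W2"] @ h1 + params["b2"])
    return softmax(params["W3"] @ h2 + params["b3"])

rng = np.random.default_rng(0)
d_in, d_h1, d_h2, n_classes = 32, 64, 32, 5  # hypothetical widths; 5 classes
params = {
    "W1": rng.normal(scale=0.1, size=(d_h1, d_in)), "b1": np.zeros(d_h1),
    "W2": rng.normal(scale=0.1, size=(d_h2, d_h1)), "b2": np.zeros(d_h2),
    "W3": rng.normal(scale=0.1, size=(n_classes, d_h2)), "b3": np.zeros(n_classes),
}
f_fused = rng.normal(size=d_in)             # stand-in for the MFAM output
probs = hfc_forward(f_fused, params)        # inference: dropout disabled
probs_train = hfc_forward(f_fused, params, training=True, rng=rng)
```

The output is a probability vector, so the predicted class is its arg max and the remaining entries can be reported as confidence scores.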
Results and discussion
The FusionHeartNet framework is evaluated for diagnostic efficacy, generalization capability, and comparative advantage over existing methods. Performance metrics comprise accuracy, precision, recall, F1-score, and AUC. The contributions of the framework’s dual-domain design and BO-ALRS optimization strategy are also assessed to establish its robustness in real-world cardiac diagnostic scenarios.
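Most of the reported metrics follow directly from a confusion matrix. The sketch below computes accuracy, per-class precision, recall, and F1, the macro-averaged F1, and Cohen's kappa (reported in the abstract). The 3x3 matrix is made up for illustration, rows are taken to be true classes and columns predictions, and every class is assumed to occur at least once. AUC is omitted because it requires predicted scores rather than counts.

```python
import numpy as np

def classification_metrics(cm):
    """Metrics from a confusion matrix with true classes on rows."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)  # column sums = predicted counts
    recall = tp / cm.sum(axis=1)     # row sums = true counts
    f1 = 2.0 * precision * recall / (precision + recall)
    accuracy = tp.sum() / total
    # Agreement expected by chance, from the marginal distributions.
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (accuracy - expected) / (1.0 - expected)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "macro_f1": f1.mean(), "kappa": kappa}

cm = [[50, 2, 0],
      [3, 40, 2],
      [1, 4, 30]]  # hypothetical counts, not results from the paper
metrics = classification_metrics(cm)
```

Kappa discounts the agreement that class frequencies alone would produce, which makes it a useful complement to accuracy on imbalanced beat distributions.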
Implementation setup
The FusionHeartNet framework was implemented and evaluated on a 64-bit Windows 11 workstation featuring an Intel® Core™ i9-13900K CPU (3.00 GHz), 64 GB of DDR5 RAM, and an NVIDIA RTX 4090 GPU with 24 GB of VRAM, ensuring efficient training of deep neural networks. The development environment utilized Python 3.10, leveraging key open-source libraries including NumPy, SciPy, Matplotlib, Scikit-learn, PyTorch, and TensorFlow-Keras for model construction, optimization, and evaluation. ECG signal preprocessing and feature extraction were supported by the WFDB Toolkit and PyWavelets, while BO and learning rate scheduling were implemented using Optuna and PyTorch’s learning rate schedulers, respectively.
The end-to-end implementation follows a structured modular flow. Initially, raw ECG signals undergo preprocessing that includes denoising through wavelet thresholding and baseline wander correction using high-pass filtering. The denoised signals are passed into the DSFE module, where both time-frequency features and image-based representations are extracted. Signal-domain features are learned via a temporal convolutional network (TCN), while image-domain features are extracted from GAF and CWT images using EfficientNet-B0. These modality-specific embeddings are fused using the MFAM, which ensures optimal cross-domain interaction. The fused features are then passed into the HFC, which classifies the input into one of the predefined arrhythmia classes. During training, hyperparameters including dilation rate, number of TCN filters, CNN depth, dropout rate, and learning rate decay are automatically optimized using Bayesian optimization with adaptive learning rate scheduling (BO-ALRS), ensuring optimal performance with minimal manual tuning.
Dataset description
FusionHeartNet’s performance was evaluated using the MIT-BIH Arrhythmia Database, a publicly available benchmark for ECG research hosted at PhysioNet. The dataset includes 48 half-hour two-lead ambulatory ECG recordings sampled at 360 Hz, each annotated by clinical experts following the AAMI EC57 standard. For this study, a subset of the dataset was selected and mapped into five clinically relevant classes following standard class aggregation strategies. These include:
Normal sinus rhythm (NSR)
Premature ventricular contraction (PVC)
Atrial fibrillation (AF)
Left bundle branch block (LBBB)
Right bundle branch block (RBBB)
The class-wise distribution is summarized in Table 1, indicating a sufficient number of beat instances for each category. The remaining class imbalance is addressed during model training with resampling strategies and weighted loss functions.
Class distribution table in the dataset.
AF: atrial fibrillation; LBBB: left bundle branch block; NSR: normal sinus rhythm; PVC: premature ventricular contraction; RBBB: right bundle branch block.
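As an illustration of how a weighted loss can counter class imbalance, the sketch below derives inverse-frequency class weights. The weighting rule (total / (number of classes × class count)) is a common heuristic and is an assumption here, since the text does not state which rule was used. The counts are the per-class supports reported with the class-wise results and may differ from the Table 1 counts.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: w_c = N / (K * n_c).

    counts maps class label -> number of samples. Rare classes receive
    larger weights so that each class contributes equally to the loss.
    """
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Per-class supports as reported in the class-wise results (assumed here
# to be representative of the training distribution).
counts = {"NSR": 67535, "PVC": 2294, "AF": 6402, "LBBB": 720, "RBBB": 872}
weights = balanced_class_weights(counts)

for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

The resulting weights could be passed, for example, as the per-class weight vector of a cross-entropy loss.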
To ensure reliable validation, a patient-wise 10-fold cross-validation was performed to prevent information leakage between training and testing sets. Additionally, pre-segmentation into fixed-length windows centered on R-peaks was used for consistent input generation across all model branches.
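The two steps above can be sketched as follows, under stated assumptions: patients are assigned to folds round-robin, and windows that would run past the signal boundary are skipped. The record and patient identifiers, the helper names, and the window half-width are hypothetical and are not taken from the study.

```python
def patient_wise_folds(record_patients, n_folds):
    """Yield (train_records, test_records) pairs with no patient in both.

    record_patients maps record id -> patient id. All records of a
    patient land in the same fold, which prevents leakage.
    """
    patients = sorted(set(record_patients.values()))
    fold_of = {p: i % n_folds for i, p in enumerate(patients)}
    for fold in range(n_folds):
        test = [r for r, p in record_patients.items() if fold_of[p] == fold]
        train = [r for r, p in record_patients.items() if fold_of[p] != fold]
        yield train, test


def window_around_r_peaks(signal, r_peaks, half_width):
    """Cut windows of length 2 * half_width around each R-peak index.

    Peaks too close to either end of the signal are skipped so that
    every returned window has the same length.
    """
    windows = []
    for r in r_peaks:
        start, stop = r - half_width, r + half_width
        if start >= 0 and stop <= len(signal):
            windows.append(signal[start:stop])
    return windows


# Hypothetical records: r1 and r2 belong to the same patient.
record_patients = {"r1": "p1", "r2": "p1", "r3": "p2", "r4": "p3", "r5": "p4"}
folds = list(patient_wise_folds(record_patients, n_folds=2))

signal = list(range(100))  # stand-in for a sampled ECG lead
windows = window_around_r_peaks(signal, r_peaks=[5, 40, 70, 98], half_width=10)
```

Splitting by patient rather than by beat matters because beats from the same patient are highly correlated, so a beat-wise split would tend to inflate test scores.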
Output visualization and analysis
To qualitatively evaluate the effectiveness of the pre-processing pipeline, representative ECG waveforms were analyzed both before and after the denoising stage. Three diverse ECG segments were selected from different patients, each demonstrating varying levels of baseline drift, power-line interference, and motion-induced noise artifacts frequently encountered in ambulatory cardiac monitoring scenarios. This visual assessment in Figure 5 offers valuable insight into the signal enhancement process and highlights its ability to recover diagnostically relevant features from contaminated inputs.

Output after pre-processing.
In the first sample, raw ECG signals displayed a visible low-frequency drift likely due to patient movement. The denoising pipeline effectively removed this baseline wander using a high-pass Butterworth filter, while preserving the QRS morphology. In the second sample, the original waveform suffered from superimposed high-frequency noise; wavelet-based thresholding eliminated these components, revealing a clean R-peak signature critical for temporal analysis. The third instance involved a noisy signal with irregular amplitude fluctuations and poorly defined P-waves. After denoising, key morphological components such as P-waves and T-waves became more distinct, improving the interpretability of rhythm-related features. These preprocessing steps are foundational to the accuracy of downstream modules. The denoised signals exhibit clearer temporal landmarks and better spectral concentration, which directly benefits the signal-domain analysis via TCN, as well as the fidelity of image-domain transformations such as CWT and GAF. This visual validation confirms the robustness of the preprocessing module and justifies its use as the first stage of the FusionHeartNet pipeline.
Performance metrics
The confusion matrix presented in Figure 6 illustrates the classification performance of the proposed FusionHeartNet model across five arrhythmia classes. Class 0 (normal sinus rhythm) exhibits exceptional classification accuracy, with 7452 correct predictions and only minor misclassifications: 9 to Class 1, 10 to Class 2, 3 to Class 3, and 2 to Class 4. Class 1 (premature ventricular contraction) is predicted correctly in 224 instances, with 26 samples misclassified as Class 0 and 2 as Class 2. Class 2 (atrial fibrillation) shows high fidelity with 708 correct predictions, while 15 samples are misclassified as Class 0 and 3 as Class 3. For Class 3 (left bundle branch block), the model correctly classifies 53 instances but misclassifies 17 as Class 0 and 12 as Class 2, indicating some confusion likely due to overlapping morphological patterns. Lastly, Class 4 (right bundle branch block) achieves 102 correct predictions, with only 6 and 2 samples misclassified as Classes 0 and 2, respectively. Overall, the confusion matrix illustrates the model’s strong capability to differentiate between major rhythm types and localized arrhythmic patterns. With the exception of Class 3, the off-diagonal entries are small relative to the diagonal, indicating few misclassifications and clear separation across most diagnostic categories.

Confusion matrix.
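The counts quoted for Figure 6 are enough to recompute per-class precision and recall. The sketch below does so with rows as true classes and columns as predicted classes. Entries that the text does not mention are assumed to be zero, and the function name is illustrative.

```python
# Rows: true class 0-4; columns: predicted class 0-4.
# Cells not mentioned in the text are assumed to be zero.
cm = [
    [7452, 9, 10, 3, 2],    # Class 0: normal sinus rhythm
    [26, 224, 2, 0, 0],     # Class 1: premature ventricular contraction
    [15, 0, 708, 3, 0],     # Class 2: atrial fibrillation
    [17, 0, 12, 53, 0],     # Class 3: left bundle branch block
    [6, 0, 2, 0, 102],      # Class 4: right bundle branch block
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall) tuples, one per class."""
    k = len(matrix)
    col_sums = [sum(matrix[r][c] for r in range(k)) for c in range(k)]
    metrics = []
    for i in range(k):
        tp = matrix[i][i]
        row_sum = sum(matrix[i])
        recall = tp / row_sum if row_sum else 0.0
        precision = tp / col_sums[i] if col_sums[i] else 0.0
        metrics.append((precision, recall))
    return metrics


total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total
metrics = per_class_metrics(cm)

for i, (p, r) in enumerate(metrics):
    print(f"Class {i}: precision={p:.4f}, recall={r:.4f}")
print(f"Accuracy on this matrix: {accuracy:.4f}")
```

Reading recall along rows and precision along columns makes it clear that the Class 3 errors are mostly missed detections (low recall) rather than false alarms.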
The receiver operating characteristic (ROC) curves for the FusionHeartNet framework in Figure 7 demonstrate consistently high classification performance across all five arrhythmia categories, reflecting the model’s strong discriminative capability. Each ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for varying classification thresholds. Remarkably, the reported area under the curve (AUC) for all classes reaches 1.0, signifying that the predicted scores rank positive samples above negative samples for each class. For every class, there is therefore a threshold that separates the two groups without a trade-off between sensitivity and specificity. This does not by itself imply error-free labels at the operating point actually used, as the off-diagonal entries of the confusion matrix in Figure 6 show. An AUC of 1.0 across all classes is an extremely rare outcome in real-world ECG classification tasks, suggesting that the combination of dual-domain feature learning, attention-driven fusion, and optimized classification has resulted in a highly discriminative diagnostic model. This level of performance is consistent with an architecture that captures discriminative features robustly from both signal and image domains. The ROC plot thus supports the model’s potential for real-time arrhythmia detection.

ROC plot.
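To make the meaning of AUC concrete, the sketch below computes it for one class as the fraction of positive-negative pairs in which the positive sample receives the higher score (ties count as one half). A one-vs-rest treatment of each class is assumed, and the scores are toy values, not model outputs.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative.

    scores: predicted scores for the class of interest.
    labels: 1 for samples of that class, 0 for all other samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]

# Perfect ranking: every positive scores above every negative.
perfect_scores = [0.45, 0.40, 0.30, 0.20]
auc_perfect = binary_auc(perfect_scores, labels)

# At a fixed 0.5 threshold, the same scores label every sample negative.
preds_at_half = [1 if s >= 0.5 else 0 for s in perfect_scores]

# One positive is ranked below one negative.
mixed_scores = [0.9, 0.4, 0.6, 0.2]
auc_mixed = binary_auc(mixed_scores, labels)
```

The first example has an AUC of 1.0 even though a 0.5 threshold misses both positives, which is why AUC and the confusion matrix answer different questions.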
The training curves presented in Figure 8 demonstrate the convergence behavior and generalization ability of the proposed FusionHeartNet model over 100 epochs. The accuracy plot shows that the training accuracy steadily improves from 89.5% in the initial epoch to 99.3% by epoch 100. Correspondingly, the validation accuracy follows a similar trend, reaching 98.8%, indicating excellent generalization with minimal overfitting. A slight dip observed in both accuracy curves near epoch 45 indicates a brief phase of learning instability. However, this is rapidly corrected in the following epochs, attributable to the adaptive adjustments made by the learning rate scheduler. In the loss curve, the training loss exhibits a sharp decline from 0.38 in the initial epoch to under 0.03 by the end, while the validation loss closely parallels this trend, stabilizing around 0.06. A transient spike around epoch 45 in the validation loss aligns with the dip in accuracy, underscoring the model’s sensitivity to dynamic training conditions. This short-lived fluctuation further emphasizes the effectiveness of the BO-ALRS strategy in restoring training stability. Collectively, the plotted curves reflect smooth convergence, strong generalization, and minimal overfitting, confirming the robustness and reliability of the proposed model in processing complex multi-modal ECG data.

(a) Accuracy and (b) loss plot while training and validation.
The FusionHeartNet framework delivers strong performance in the early detection of cardiac abnormalities using ECG data, as shown in Table 2. With an overall classification accuracy of 98.47%, the model demonstrates a strong ability to differentiate between normal and pathological cardiac conditions. A precision of 94.36% indicates that the system effectively limits false positives, which is critical in clinical practice to avoid unnecessary diagnostics or interventions. The recall of 89.29% shows that the model detects most true cardiac cases, including those with subtle or early-stage arrhythmic indicators. Balancing these two metrics, the F1-score is 91.67%. Notably, the specificity of 99.85% reflects an exceptional ability to correctly identify negative cases, minimizing false alarms. Additionally, the mean squared error (MSE) of 0.0559 indicates that the predicted class probabilities deviate little from the true labels. The kappa coefficient of 0.9311 indicates near-perfect agreement between model predictions and actual clinical labels, well beyond chance, confirming diagnostic consistency. These results affirm the effectiveness of FusionHeartNet’s architectural design, which combines DSFE, a temporal convolutional network (TCN) for signal-domain features, image-based spatial feature extraction using GAF and CWT, and the MFAM. The final classification stage, optimized via BO-ALRS, tunes the hyperparameters for peak performance. This unified framework offers a comprehensive diagnostic tool capable of supporting reliable, interpretable, early-stage cardiac screening in real-world settings. Overall, FusionHeartNet advances early cardiac diagnosis by merging temporal, spectral, and spatial insights into a cohesive, high-performing framework.
Performance of the proposed method.
MSE: mean squared error.
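The kappa coefficient reported above corrects observed agreement for the agreement expected by chance. A minimal sketch of Cohen's kappa computed from a confusion matrix follows. The 2 × 2 matrix is a toy example chosen so that the arithmetic is easy to follow by hand, and it is not data from this study.

```python
def cohen_kappa(matrix):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement (trace / total) and p_e is the chance
    agreement implied by the row and column marginals.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[r][c] for r in range(k)) for c in range(k)]
    p_o = sum(matrix[i][i] for i in range(k)) / total
    p_e = sum(row_sums[i] * col_sums[i] for i in range(k)) / (total * total)
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: p_o = 35/50 = 0.7, p_e = (25*30 + 25*20)/2500 = 0.5.
toy = [
    [20, 5],
    [10, 15],
]
kappa = cohen_kappa(toy)
```

In the toy case, 70% raw agreement reduces to a kappa of 0.4 once the 50% chance agreement is removed, which shows why kappa is more informative than accuracy on imbalanced data.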
Class-wise performance of the proposed model
The classification performance metrics in Table 3 indicate the high discriminative capability of the proposed FusionHeartNet model across all arrhythmia classes. Class 0, representing the majority normal beats, achieved an exceptionally high precision of 99.55%, recall of 99.92%, and an F1-score of 99.74%, reflecting near-perfect classification on a large support of 67,535 samples. Class 1, with 2294 samples, also performed robustly with an F1-score of 95.25%, though its recall dipped to 91.72%, suggesting a few missed positive cases. Class 2 (support: 6402) reported precision, recall, and F1-score values of 98.13%, 95.95%, and 97.03%, respectively, demonstrating excellent detection despite its moderate class size. Class 3, the most challenging due to its limited support of 720 samples, showed reduced metrics with an F1-score of 81.00%, primarily driven by lower recall (76.39%), indicating room for improvement in rare class generalization. Conversely, Class 4 achieved an F1-score of 99.06% with both precision and recall above 98%, even on a relatively small sample size of 872. The model’s overall accuracy reached 99.39%, while macro-averaged and weighted-average F1-scores were 95.64% and 99.38%, respectively, underscoring its strong and balanced generalization across imbalanced class distributions.
Class-wise performance of the proposed model.
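Because the F1-score is the harmonic mean of precision and recall, the class-wise values can be cross-checked directly. The sketch below does this for the two classes whose precision, recall, and F1-score are all quoted in the text (Classes 0 and 2). The helper name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as quoted in the text.
reported = {
    "Class 0": (99.55, 99.92, 99.74),
    "Class 2": (98.13, 95.95, 97.03),
}

recomputed = {}
for name, (p, r, f1_reported) in reported.items():
    recomputed[name] = f1_score(p, r)
    gap = abs(recomputed[name] - f1_reported)
    print(f"{name}: recomputed F1 = {recomputed[name]:.3f}, gap = {gap:.3f}")
```

For both classes the recomputed value lies within 0.01 percentage points of the reported F1-score, which is the gap expected when the inputs are themselves rounded to two decimals.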
K-fold validation of the proposed model
The 10-fold cross-validation results in Table 4 demonstrate the robustness and generalizability of FusionHeartNet across data splits. With a mean accuracy of 98.47% (±0.14%), the framework consistently achieves high precision (94.46% ± 0.28%) and recall (89.15% ± 0.34%), indicating that it identifies most positive cases while keeping false positives low. The mean F1-score of 91.74% reflects the resulting trade-off between precision and recall. Notably, specificity remains exceptionally high at 99.86% (±0.04%), underscoring the model’s proficiency in avoiding false alarms. The kappa coefficient of 0.9331 (±0.0043) indicates strong agreement beyond chance, reflecting reliable classification performance. The low standard deviations across all metrics support the model’s stability and reproducibility, which are essential for clinical deployment where consistency is critical.
K-fold validation of the proposed model.
Ablation study
The ablation study in Table 5 quantifies the incremental benefit contributed by each major component of FusionHeartNet. The baseline model, relying solely on signal-domain features, achieves moderate accuracy (92.74%) but is limited by reduced recall (81.56%) and specificity (96.37%), highlighting its inability to fully capture spatial and spectral nuances. Incorporating image-domain representations via GAF and CWT raises accuracy to 95.16%, improving the model’s sensitivity to morphological patterns. The introduction of DSFE further enriches feature diversity, raising accuracy to 96.23% and improving the balance between precision and recall. Employing the MFAM for fusion increases accuracy to 97.58%, demonstrating the efficacy of modality-aware feature weighting. Finally, the complete FusionHeartNet framework, optimized with BO-ALRS and using the HFC, achieves the peak accuracy of 98.47% together with the highest precision (94.36%), recall (89.29%), and specificity (99.85%), validating the combined effect of the integrated architecture and hyperparameter tuning on classification performance.
Ablation study of the proposed model.
Comparative study
Figure 9 illustrates a comparative analysis of classification accuracy across various deep learning (DL) algorithms for ECG-based heart disease detection. Among all evaluated models, the proposed FusionHeartNet method outperforms the others with an accuracy close to 99.5%, demonstrating its superior diagnostic capability. Traditional architectures such as CNN, 1D-CNN, and 2D-CNN maintain high performance, ranging between 98.6% and 99.2%, while advanced methods such as ResNet-18, GANs, and DANet also show competitive results. However, certain methods, such as PSO + CNN (96.7%) and particularly SRCNN (88.5%), show a significant drop in accuracy, highlighting limitations in capturing complex ECG signal patterns. Overall, the proposed method’s consistently high accuracy supports the effectiveness of its multi-modal fusion strategy and optimized classification pipeline in delivering reliable and precise heart disease detection.

Comparative analysis based on classification accuracy.
The recall illustrated in Figure 10 shows the model’s sensitivity in correctly identifying true positive cases across the five arrhythmic classes: normal (N), atrial (A), ventricular (V), fusion (F), and f-wave (f). The proposed FusionHeartNet significantly outperforms both HFF and HMFF, particularly in the challenging F class, where it achieves a recall of over 80%, while HFF and HMFF exhibit drastic drops below 50% and 20%, respectively. For commonly occurring classes such as N, A, and V, the proposed model consistently maintains recall rates above 95%, indicating its robustness in identifying both prevalent and rare cardiac abnormalities. The f class also benefits from this improvement, with the proposed method maintaining high sensitivity, showcasing its effectiveness in handling subtle rhythm variations. This performance underscores the superiority of the dual-domain feature extraction and fusion strategies integrated into FusionHeartNet.

Comparative analysis based on recall.
The precision shown in Figure 11 demonstrates the model’s capability to avoid false positives while classifying arrhythmic beats across the same five categories. The proposed method consistently achieves superior precision across all classes, particularly showing marked improvements in classes A, F, and f. Notably, in the F class, typically difficult due to its overlapping features, the proposed approach maintains a high precision of ~95%, significantly surpassing HFF and HMFF, which remain below 85%. This reduction in misclassification stems from the framework’s use of spatially enhanced image features via GAF and wavelet scalograms, coupled with attention-based feature integration. The near-perfect precision observed in the N and V classes highlights the model’s reliability and its potential for clinical deployment, where minimizing false alerts is crucial.

Comparative analysis based on precision.
The F1-score, representing the harmonic mean of precision and recall, offers a balanced evaluation of the model’s classification performance as shown in Figure 12. The proposed FusionHeartNet maintains the highest F1-scores across all arrhythmic classes, with particularly striking performance in the V and F categories, where traditional models such as HFF and HMFF falter. For the F class, the proposed model achieves an F1-score above 80%, compared to HFF and HMFF, which drop to ~55% and 30%, respectively, demonstrating its robustness in handling difficult-to-detect classes. The performance in N, A, and f classes remains uniformly high, reaffirming the model’s stability and reliability across both balanced and imbalanced datasets. These results validate the effectiveness of the MFAM and the HFC in enhancing class-wise decision boundaries, ultimately improving the generalization of the model.

Comparative analysis based on F1-score.
Conclusion
This study introduces FusionHeartNet, a novel multi-modal ECG diagnostic framework that directly addresses the limitations of existing unimodal, one-dimensional signal classifiers. By bridging the domain gap between temporal dynamics, spectral variance, and spatial morphology, the model effectively captures high-resolution, class-discriminative features crucial for early arrhythmia detection. The incorporation of the DSFE mechanism ensures comprehensive physiological characterization, while Gramian and scalogram-based encodings inject spatial intelligence into the feature space. The MFAM further refines these heterogeneous features, and the HFC, optimized via BO-ALRS, ensures scalable generalization even under data imbalance. Extensive validation demonstrates superior performance across recall, precision, and F1-score, especially in underrepresented arrhythmic classes, proving its real-world applicability. This work not only resolves long-standing challenges in ECG-based diagnostics but also establishes a replicable, clinically integrable architecture for future cardiovascular AI systems.
Footnotes
Acknowledgements
The authors would like to thank the Deanship of Sathyabama Institute of Science and Technology for supporting this work.
Author contributions
The authors confirm contribution to the paper as follows and all authors reviewed the results and approved the final version of the manuscript.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Ethical considerations
Institutional Review Board approval was not required.
Consent to participate
All contributors agreed and gave consent to participate.
Consent for publication
All contributors agreed and gave consent to publish.
Data availability statement
No data, models, or code were generated or used during the study.
