Classification of serum protein and immunofixation electrophoresis images by computer vision-based deep learning models: A single- and dual-modality study

Abstract

Objectives

The objective of this study was to develop deep learning models for the automated classification of serum protein electrophoresis (SPE) and immunofixation electrophoresis (IFE) images into oncologic, non-oncologic, and healthy categories, and to compare the predictive performance of single- and dual-modality approaches.

Methods

We retrospectively collected SPE and IFE images from 1,919 patients who underwent both tests at Kartal Dr Lütfi Kırdar City Hospital. MobileNetV2-based models were developed using transfer learning. Single-modality (SPE or IFE) and dual-modality (SPE and IFE) models were trained, and their performance was evaluated using accuracy, precision, recall, specificity, F1-score, and ROC-AUC. Grad-CAM visualizations were generated to assess model interpretability.

Results

The dual-modality model achieved higher accuracy and robustness compared to single-modality models. Oncologic cases were detected with near-perfect recall and ROC-AUC, while single-modality models demonstrated moderate performance in distinguishing non-oncologic and healthy groups. Grad-CAM outputs confirmed that the models focused on diagnostically relevant electrophoretic bands.

Conclusion

Deep learning models can reliably classify electrophoresis images into oncologic, non-oncologic, and healthy categories. Combining SPE and IFE improves diagnostic performance and may assist laboratory specialists and hematologists in reducing subjectivity, particularly in borderline cases. Validation in larger, multicenter cohorts is warranted prior to clinical implementation.

Keywords

Electrophoresis, artificial intelligence, computer vision, deep learning

Introduction

Serum protein electrophoresis (SPE) separates the major serum proteins into characteristic bands (albumin and the alpha, beta, and gamma globulins) by applying an electric field, which drives migration according to each protein’s size and net charge.¹ It is frequently employed in the evaluation of monoclonal or polyclonal gammopathies, such as multiple myeloma (MM) and Waldenström’s macroglobulinemia (WM), as well as protein-losing enteropathies, chronic inflammatory conditions, liver diseases, and nephrotic syndrome.¹ Immunofixation electrophoresis (IFE) builds on a similar principle: antibodies bind selectively to immunoglobulins in serum, revealing the presence and type of monoclonal proteins, including heavy (IgG, IgA, IgM) and light chains (kappa, lambda), and allowing precise identification of monoclonal components. It is primarily used to determine the type of immunoglobulin responsible for a monoclonal band observed on SPE. IFE holds significant clinical value in diagnosing various hemato-oncological disorders, particularly MM and WM.^2,3

Although the principles of SPE and IFE are long established, they remain central to modern hematology diagnostics. However, accurate interpretation depends on the proficiency and experience of laboratory specialists. This reliance on subjective assessment may introduce variability in result interpretation, thereby increasing the margin of error and potentially leading to clinical misjudgments. Furthermore, reliance on expert evaluation compromises laboratory workflow efficiency, leading to prolonged turnaround times and increased workload.^1,3,4 One notable interpretive challenge is analytical interference. Electrophoretic patterns can be distorted by interfering substances, ranging from endogenous proteins like hemoglobin and fibrinogen to external agents such as contrast media, antibiotics, or therapeutic antibodies. These interferences may mimic monoclonal bands and complicate interpretation.^4,5

In recent years, artificial intelligence (AI) techniques—including deep learning and computer vision—have been increasingly integrated into various domains of medicine. These technologies have shown particular promise in diagnostic workflows, radiological and pathological image analysis, and clinical decision support systems.^6,7,8 Their adoption has been associated with improved diagnostic accuracy, reduced workload, and enhanced operational efficiency.^9,10,11,12 Despite these advances, the application of AI to electrophoresis interpretation remains limited in the current literature.

Emerging studies have begun to explore AI integration with electrophoresis. Shen et al. developed a MobileNetV2-based convolutional neural network (CNN) to detect low-concentration M-proteins in SPE, successfully identifying M-spikes often missed during manual evaluation.¹³ Chabrun et al. introduced the SPECTR system, comprising four deep learning models for fractionation, anomaly detection, peak localization, and hemolysis identification, which collectively reduced observer-dependent variability.¹⁴ Hu et al. applied a two-stage AI architecture to IFE analysis, classifying images as normal or abnormal and subsequently identifying immunoglobulin patterns using VGG-16, ResNet-18, and MobileNetV2. Their model achieved expert-level performance and incorporated Score-CAM for interpretability.¹⁵

While these studies demonstrate promising advances, most remain technique-specific and are not yet fully integrated into routine clinical workflows. To address this gap, the present study aims to develop an AI-assisted classification model that leverages image processing and deep learning to categorize SPE and IFE images into oncological, non-oncological, and healthy groups. Additionally, the study evaluates the diagnostic utility of applying SPE and IFE individually or in combination. The overarching goal is to establish a laboratory support system that enhances diagnostic accuracy, reduces interpretation errors, streamlines workflows, and supports clinicians in diagnosis, treatment planning, and patient follow-up.

Materials and methods

In this retrospective study, data from 1,919 patients aged 18 years and older who underwent both SPE and IFE testing between September and December 2024 were used. SPE was performed on the Sebia Minicap system using the Protein(e) 6 kit, and IFE was carried out on the Hydrasys 2 instrument with the Hydragel 9 IF kit (Sebia S.A.S., France). To eliminate therapeutic antibody–related interference, the Sebia Hydrashift 2/4 Daratumumab kit was employed; when necessary, patient urine samples were analyzed simultaneously to confirm the results. Demographic data (age and sex), ICD-10 diagnostic codes, and corresponding SPE and IFE images were retrieved from the Hospital Information Management System and Laboratory Information System. After all data were anonymized, patients were categorized into three groups based on ICD-10 codes: oncological (n = 596), non-oncological (n = 663), and healthy (n = 660). In accordance with common practices in artificial intelligence applications in medicine, the dataset was randomly divided into 70% for training, 20% for validation, and 10% for testing.^16,17 The training set was used to fit the model and update its parameters, whereas the validation set was used during training to tune hyperparameters and monitor performance in order to limit overfitting. This study was approved by the local ethics committee (25/12/2024-2024/010.99/11/6). Since this is a retrospective study, no consent was obtained; however, permission for data acquisition was obtained from the hospital administration.
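As a rough illustration of the 70/20/10 random split described above, the following sketch shuffles patient indices and cuts them into three lists. The function name `split_indices`, the fixed seed, and the rounding rule (truncate the training and validation sizes, assign the remainder to testing) are assumptions made for this example; the study does not report these details.

```python
import random


def split_indices(n_items, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle item indices and cut them into train / validation / test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumed)
    n_train = int(n_items * train_frac)
    n_val = int(n_items * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# 1,919 patients, as in the study cohort
train_idx, val_idx, test_idx = split_indices(1919)
print(len(train_idx), len(val_idx), len(test_idx))
```

A stratified split, which preserves the class proportions in each subset, would be a reasonable alternative when class sizes differ more strongly than they do here.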

Model architecture

Three models were developed: one using only SPE images, one using only IFE images, and a third combining both as input. All models shared a similar architecture to ensure consistent comparison. A pre-trained MobileNetV2—a convolutional neural network (CNN)-based model—was employed via transfer learning.^18,19 The top classification layer was removed, and the input size was set to 384 × 384 × 3 pixels to enable the capture of fine-grained details, particularly those arising from variations in electrophoresis practice. To balance generalization and adaptability, only the last 50 layers were made trainable, enhancing the model’s ability to distinguish borderline cases. Training was conducted on a CPU, with each model requiring roughly 1 h and 15 min.
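The transfer-learning setup described above can be sketched in Keras roughly as follows. This is a minimal single-modality sketch, not the authors' implementation: the classification head (global average pooling followed by a softmax layer), the helper names `trainable_flags` and `build_classifier`, and the choice to count the "last 50 layers" over the MobileNetV2 backbone are all assumptions. TensorFlow is imported inside the function so that the helper above it can be used without it.

```python
def trainable_flags(n_layers, trainable_last=50):
    """Return one boolean per layer: True only for the last `trainable_last` layers."""
    n_frozen = max(n_layers - trainable_last, 0)
    return [i >= n_frozen for i in range(n_layers)]


def build_classifier(num_classes=3, input_size=384, trainable_last=50):
    """Build a MobileNetV2-based classifier (hypothetical head, see lead-in)."""
    import tensorflow as tf  # imported lazily; requires TensorFlow to be installed

    # Pre-trained backbone without its original classification layer
    base = tf.keras.applications.MobileNetV2(
        input_shape=(input_size, input_size, 3),
        include_top=False,
        weights="imagenet",
    )
    # Freeze everything except the last `trainable_last` backbone layers
    flags = trainable_flags(len(base.layers), trainable_last)
    for layer, flag in zip(base.layers, flags):
        layer.trainable = flag

    inputs = tf.keras.Input(shape=(input_size, input_size, 3))
    features = base(inputs)
    pooled = tf.keras.layers.GlobalAveragePooling2D()(features)  # assumed head
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(pooled)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

How the dual-modality model fuses the SPE and IFE inputs is not specified in the text, so no fusion scheme is sketched here.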

Data processing and augmentation

All images were standardized to 384 × 384 pixels, and their intensity values were rescaled to a 0–1 range, ensuring compatibility with the neural network input requirements. To increase data diversity and model robustness, the training images were augmented through geometric and intensity-based modifications, including rotation, zoom, shifts, shear, brightness adjustment, and flipping. These augmentation strategies were designed to enhance model robustness and generalization, enabling accurate interpretation of real-world variations—especially in samples with low band intensity.²⁰ For the validation and test sets, only rescaling was applied to maintain the reliability of evaluation outcomes.
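The preprocessing described above maps naturally onto the Keras `ImageDataGenerator` options, as in the sketch below. Every numeric range is a placeholder chosen for illustration, since the study does not report the augmentation magnitudes; the flip direction and the assumption of 8-bit images (rescaling by 1/255) are likewise guesses.

```python
def augmentation_kwargs():
    """Training-time settings: rescaling plus geometric and intensity changes."""
    return dict(
        rescale=1.0 / 255,            # map 8-bit intensities to the 0-1 range
        rotation_range=10,            # degrees (assumed value)
        zoom_range=0.1,               # assumed value
        width_shift_range=0.05,       # assumed value
        height_shift_range=0.05,      # assumed value
        shear_range=0.05,             # assumed value
        brightness_range=(0.8, 1.2),  # assumed value
        horizontal_flip=True,         # flip direction is an assumption
    )


def evaluation_kwargs():
    """Validation and test settings: rescaling only, no augmentation."""
    return dict(rescale=1.0 / 255)


def make_generators():
    """Create the two generators; requires TensorFlow/Keras to be installed."""
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_gen = ImageDataGenerator(**augmentation_kwargs())
    eval_gen = ImageDataGenerator(**evaluation_kwargs())
    return train_gen, eval_gen
```

Keeping the evaluation settings in a separate function makes it harder to apply augmentation to validation or test images by accident.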

Training settings and hyperparameters

Each network was trained for 30 epochs using mini-batches of 8 samples. Optimization was performed with Adam at a learning rate of 1e-4, and categorical cross-entropy served as the loss function.^12,21 To improve training stability and mitigate overfitting, callbacks such as early stopping, ReduceLROnPlateau, and model checkpointing were implemented.²² Class weighting was applied to address class imbalance, ensuring adequate representation of minority classes and enhancing overall performance.²³ Class weights were calculated from the inverse frequency of each class in the training dataset and incorporated into the loss function during model training.
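One common way to compute inverse-frequency class weights is sketched below. The normalization used here (total samples divided by the number of classes times the class count) is an assumption, as the text does not state the exact formula. The counts are the whole-cohort group sizes, used only for illustration, because the per-class training-set counts are not reported.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rarer classes weigh more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Whole-cohort group sizes (illustrative stand-in for training-set counts)
counts = {"oncological": 596, "non-oncological": 663, "healthy": 660}
weights = inverse_frequency_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.4f}")
```

With this normalization the weighted class counts sum back to the dataset size, so the overall scale of the loss is unchanged.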

Statistical analysis

Statistical analyses were performed using Python 3.10 (Python Software Foundation, DE, USA) with the libraries NumPy 1.23.5, Pandas 1.5.3, TensorFlow/Keras 2.11, Scikit-learn 1.2.2, SciPy 1.10.1, and Matplotlib 3.7.1. A 3 × 3 confusion matrix was generated for each model to examine which classes it tends to confuse; correct classifications appear along the diagonal of the matrix. Model predictions were benchmarked against ground truth labels using these confusion matrices, from which accuracy, specificity, precision, recall, and F1-scores were derived for each class.²⁴ Predictions were categorized as follows: a true positive (TP) occurs when the model correctly assigns a case to its actual class; a true negative (TN) when it correctly excludes a case from a class; a false positive (FP) when it incorrectly assigns a sample to a class; and a false negative (FN) when it fails to identify a case that belongs to a class.
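The per-class definitions above can be made concrete with a short sketch that derives one-versus-rest TP, FN, FP, and TN counts from a confusion matrix whose rows are actual classes and whose columns are predicted classes. The function name is hypothetical, and the example matrix is the SPE confusion matrix reported in Table 1. The sketch assumes every class has at least one actual and one predicted sample, so no division by zero is handled.

```python
def per_class_metrics(matrix, k):
    """One-vs-rest metrics (in %) for class index k; rows = actual, columns = predicted."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # actual k, predicted as another class
    fp = sum(row[k] for row in matrix) - tp   # predicted k, actually another class
    tn = total - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    values = {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }
    return {name: round(100 * value, 2) for name, value in values.items()}


# SPE confusion matrix from Table 1 (order: oncological, non-oncological, healthy)
spe_matrix = [
    [39, 16, 3],
    [5, 46, 6],
    [3, 50, 8],
]
oncological = per_class_metrics(spe_matrix, 0)
print(oncological)
```

For the oncological class this reproduces the values listed in Table 2 (accuracy 84.66, precision 82.98, recall 67.24, specificity 93.22, F1 74.29).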

The models’ discriminative ability was further examined using receiver operating characteristic (ROC) analysis. For the three-class setting, a one-versus-rest strategy was employed, generating a separate curve for each class of each model. An area under the curve (AUC) closer to 1 indicates better performance.²⁵
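To illustrate the one-versus-rest strategy, the sketch below computes a per-class AUC as the probability that a randomly chosen sample of the class receives a higher score than a randomly chosen sample of any other class, with ties counted as one half. This rank-based formulation is equivalent to the area under the ROC curve. The probabilities and labels are invented toy data, not study results, and the function names are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probabilities, labels, n_classes=3):
    """Treat each class in turn as positive and all remaining classes as negative."""
    return [
        binary_auc([row[k] for row in probabilities], [y == k for y in labels])
        for k in range(n_classes)
    ]


# Toy predicted probabilities (hypothetical), one row per sample
toy_probabilities = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
    [0.1, 0.5, 0.4],
    [0.2, 0.3, 0.5],
]
toy_labels = [0, 0, 1, 1, 2, 2]
aucs = one_vs_rest_auc(toy_probabilities, toy_labels)
print(aucs)
```

The pairwise loop is quadratic in the number of samples, which is fine for a test set of a few hundred images; a library routine would be preferable for larger sets.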

To visualize the regions the model focuses on during decision-making, the Gradient-weighted Class Activation Mapping (Grad-CAM) method was employed. This technique generates heatmaps (attention maps) using the gradients of the class score with respect to the last convolutional layer. These maps, overlaid on the input images, highlight the regions influencing classification decisions and aid interpretation of the model’s outputs.²⁶
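The core Grad-CAM computation can be sketched on plain nested lists, assuming the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been extracted from the network. Each feature map is weighted by the spatial mean of its gradient, the weighted maps are summed, negative values are clipped to zero, and the result is scaled to the 0-1 range. The tiny 2 × 2 inputs are invented for illustration.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM heatmap from K feature maps and their gradients."""
    # Channel weights: spatial average of each gradient map
    weights = [
        sum(sum(row) for row in grad) / (len(grad) * len(grad[0]))
        for grad in gradients
    ]
    height, width = len(activations[0]), len(activations[0][0])
    # Weighted sum over channels, followed by ReLU
    heatmap = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap


# Two toy 2 x 2 feature maps and their gradients (hypothetical values)
toy_activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
print(toy_heatmap)
```

In practice the heatmap is then resized to the input resolution and overlaid on the image.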

Web application

After training and evaluation, each model was converted to TensorFlow Lite (TFLite) to enable efficient, low-latency performance in mobile and web environments.¹⁸ The three TFLite models were integrated into a web application (https://www.aiphoresis.com.tr) featuring a user-friendly interface that allows users to select a model, upload a single SPE or IFE image, and receive predicted class labels along with probability scores. The Python-based backend operates via a REST API hosted on https://Render.com. User data is processed temporarily, with no permanent records stored.¹⁷ This setup ensures secure and rapid deployment of model outputs, supporting both research applications and practical use cases.
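A minimal sketch of two pieces of such a deployment is given below: converting a trained Keras model to TFLite and turning a probability vector into the kind of response described above. The function names, the response field names, and the class order are assumptions; the actual backend of the web application is not documented in the text.

```python
CLASS_LABELS = ("oncological", "non-oncological", "healthy")  # assumed output order


def format_prediction(probabilities, labels=CLASS_LABELS):
    """Turn a probability vector into a predicted label plus per-class scores."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return {
        "predicted_class": labels[best],
        "probabilities": {
            label: round(float(p), 4) for label, p in zip(labels, probabilities)
        },
    }


def convert_to_tflite(keras_model, output_path):
    """Convert a trained Keras model to a TFLite file; requires TensorFlow."""
    import tensorflow as tf  # imported lazily

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    tflite_bytes = converter.convert()
    with open(output_path, "wb") as handle:
        handle.write(tflite_bytes)
    return output_path


example = format_prediction([0.7, 0.2, 0.1])
print(example)
```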

Results

SPE model performance

The class-wise 3 × 3 confusion matrix and performance metrics of the SPE model are presented in Tables 1 and 2, respectively. The model showed strong performance in identifying oncological cases, with high accuracy (84.66%) and ROC-AUC (0.9192), although recall was moderate (67.24%) (Figure 1(a)). In contrast, classification performance for non-oncological and healthy cases was notably lower. The non-oncological class achieved high recall but suffered from low precision, while the healthy class exhibited very low recall, indicating substantial difficulty in correctly identifying these samples.

Table 1.

The 3 × 3 Confusion Matrix of SPE Model.

Actual class	Predicted oncological	Predicted non-oncological	Predicted healthy
Oncological	39	16	3
Non-oncological	5	46	6
Healthy	3	50	8

Table 2.

Performance Scores of the SPE Model.

	Accuracy (%)	Precision (%)	Recall (%)	Specificity (%)	F1 score (%)	ROC-AUC
Oncological	84.66	82.98	67.24	93.22	74.29	0.9192
Non-oncological	56.25	41.07	80.70	44.54	54.44	0.7171
Healthy	64.77	47.06	13.11	92.17	20.51	0.7148
Macro scores	68.56	57.04	53.68	76.64	49.75	0.7837
Weighted scores	68.57	56.96	52.84	77.10	49.22	0.7829
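The macro and weighted rows can be read as an unweighted mean over the three classes and a mean weighted by the number of actual test samples per class, respectively. That reading is an assumption, but the sketch below shows that it reproduces the recall column of Table 2 when the class supports are taken from the row totals of Table 1.

```python
def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean over classes, weighted by the number of actual samples per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# Per-class recall (%) from Table 2 and row totals from Table 1
recall = [67.24, 80.70, 13.11]   # oncological, non-oncological, healthy
support = [58, 57, 61]

macro_recall = round(macro_average(recall), 2)
weighted_recall = round(weighted_average(recall, support), 2)
print(macro_recall, weighted_recall)
```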

Figure 1.

Receiver operating characteristic (ROC) curves for the oncological, non-oncological, and healthy prediction performance of the computer vision-based deep learning models using (a) SPE, (b) IFE, and (c) SPE + IFE images.

IFE model performance

The IFE model’s results are summarized in Tables 3 and 4. It demonstrated good performance in detecting oncological cases, with accuracy reaching 80.83% and ROC-AUC at 0.7711 (Figure 1(b)). However, recall for oncological cases was lower (50.00%) compared to the SPE model. Similar classification challenges were observed for the non-oncological and healthy classes: the former showed moderate recall but low precision, while the latter continued to exhibit both low recall and precision.

Table 3.

Confusion Matrix of IFE Model.

Actual class	Predicted oncological	Predicted non-oncological	Predicted healthy
Oncological	30	18	12
Non-oncological	5	36	26
Healthy	2	38	26

Table 4.

Performance Scores of the IFE Model.

	Accuracy (%)	Precision (%)	Recall (%)	Specificity (%)	F1 score (%)	ROC-AUC
Oncological	80.83	81.08	50.00	94.74	51.86	0.7711
Non-oncological	54.92	39.13	53.73	55.56	45.28	0.6286
Healthy	59.59	40.62	39.39	70.08	40.00	0.6651
Macro scores	65.11	53.61	47.71	73.46	45.71	0.6883
Weighted scores	64.57	52.68	47.67	72.71	45.52	0.6854

Dual-input model performance

The combined model, integrating both SPE and IFE inputs, yielded the strongest results overall (Tables 5 and 6). It achieved very high accuracy (98.85%), perfect recall (100%) for oncological cases, and a ROC-AUC of 0.9892 (Figure 1(c)), clearly demonstrating its reliability in identifying oncological samples. Despite this improvement, classification of non-oncological and healthy cases remained moderate. The non-oncological class showed higher recall (60.71%) than the healthy class (40.98%), though precision was limited for both. Across all models, healthy cases consistently posed the greatest classification challenge, underscoring the need for further refinement in distinguishing non-oncological from healthy samples.

Table 5.

Confusion Matrix of Dual Model.

Actual class	Predicted oncological	Predicted non-oncological	Predicted healthy
Oncological	57	0	0
Non-oncological	2	34	20
Healthy	0	36	25

Table 6.

Performance Scores of the Dual Model.

	Accuracy (%)	Precision (%)	Recall (%)	Specificity (%)	F1 score (%)	ROC-AUC
Oncological	98.85	96.61	100.00	98.29	98.28	0.9892
Non-oncological	66.67	48.57	60.71	69.49	53.97	0.7432
Healthy	67.82	55.56	40.98	52.30	47.17	0.7779
Macro scores	77.78	66.91	67.23	73.36	66.47	0.8368
Weighted scores	77.61	66.76	66.66	72.90	66.10	0.8359

Examples of SPE and IFE images, shown both alone and with Grad-CAM–generated visualizations from the respective models, are presented in Figures 2(a) and (b). These attention maps highlight the regions most influential in the model’s classification decisions, offering insight into feature localization and interpretability. The visualizations further supported our findings: oncological cases exhibited focused activation over diagnostically relevant regions, while attention maps for healthy and non-oncological samples were more diffuse, reflecting model uncertainty and reduced feature saliency.

Figure 2.

SPE (a) and IFE (b) images, shown both alone and with Grad-CAM–generated visualizations of the respective computer vision–based deep learning models.

Discussion

The results of this study demonstrate that deep learning models can effectively classify SPE and IFE images into oncological, non-oncological, and healthy categories, with particularly strong performance in identifying oncological cases. These findings align with prior research highlighting the utility of artificial intelligence in enhancing electrophoresis interpretation; however, our approach introduces several novel dimensions that warrant further discussion.

Shen et al. developed a MobileNetV2-based model to detect low-concentration M-proteins in SPE, achieving high recall and successfully identifying M-spikes often missed during manual evaluation.¹³ While their model focused exclusively on SPE and employed binary classification (presence vs absence of M-protein), our study expands the scope by incorporating both SPE and IFE modalities and implementing a three-class classification framework. Notably, our dual-modality model achieved near-perfect accuracy in identifying oncological cases, suggesting that combining modalities may offer superior diagnostic granularity compared to SPE alone.

Chabrun et al. introduced the SPECTR system, which utilized four specialized deep learning models to segment protein fractions, detect anomalies, localize peaks, and identify hemolysis in SPE data.¹⁴ Their emphasis was on interpretive standardization and reducing observer variability. In contrast, our study focused on diagnostic classification rather than interpretive reporting. Nonetheless, both approaches share a common objective: minimizing subjectivity and enhancing reproducibility in electrophoresis analysis.

Hu et al. applied a two-stage architecture to IFE images, initially classifying them as normal or abnormal and subsequently identifying specific immunoglobulin patterns using multiple CNNs.¹⁵ They reported performance comparable to specialists and enhanced interpretability using Score-CAM visualizations. Our study similarly employed computer vision techniques but aimed to classify broader clinical categories rather than specific immunoglobulin patterns. The methodological overlap—particularly the use of MobileNetV2 and multi-stage architectures—underscores the adaptability of these models across electrophoretic modalities.

A key distinction in our findings is the observed difficulty in differentiating non-oncological and healthy groups. Unlike oncological cases, which exhibited distinct electrophoretic features, the profiles of non-oncological and healthy samples often overlapped, particularly in conditions involving subtle biochemical alterations such as vitamin deficiencies. This nuance was not addressed in the referenced studies, which primarily focused on binary or anomaly-based classification. Our results highlight the need for more granular labeling and the potential integration of additional clinical parameters to improve differentiation between these groups.

Study limitations

This study has several limitations. The dataset, while tailored and clinically relevant, was drawn from a single institution within a limited study period, which may restrict the broader applicability of the findings. Diagnostic labeling relied solely on ICD-10 codes, potentially overlooking clinical nuances. Additionally, the exclusive use of MobileNetV2 and the absence of external validation or integration of clinical/laboratory metadata may constrain broader applicability.

Conclusion

In summary, while our study builds upon foundational work, it advances the field by introducing a clinically oriented, multi-class classification model that leverages both SPE and IFE data. The strong performance in oncological detection suggests promising potential for real-world implementation as a decision support tool. However, further refinement is needed to improve accuracy in more ambiguous clinical categories. Future AI-based models integrating biochemical test results with expanded multi-center datasets may enhance diagnostic granularity.

Footnotes

ORCID iDs

Poyraz Doğan

Özlem Çakır Madenci

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval

■■■.

Guarantor

■■■.

Contributorship

■■■.

Appendix

References

1. Dasgupta, Wahed (eds). Protein electrophoresis and immunofixation. In: Clinical Chemistry, Immunology and Laboratory Quality Control: A Comprehensive Review for Board Preparation, Certification and Clinical Practice. 2nd ed. Amsterdam: Elsevier, 2021, pp. 391–406.

2. Lakshminarayanan, Janatpour, et al. Detection by immunofixation of M proteins in hypogammaglobulinemic patients with normal serum protein electrophoresis results. Am J Clin Pathol 2007; 127: 746–751. https://doi.org/10.1309/QJ3PY18PMMJ8AYEH

3. Oliveros Conejero, Pascual Usandizaga, Garrido Chércoles. Optimization of workflow and screening panels for the detection of malignant monoclonal gammopathies. Adv Lab Med 2020; 1: 20200042. https://doi.org/10.1515/almed-2020-0042

4. Aita, Arantes, Aita, et al. Comparison between immunofixation and electrophoresis for the early detection of relapsed multiple myeloma. J Bras Patol Med Lab 2015; 51. https://doi.org/10.5935/1676-2444.20150057

5. McCudden, Jacobs JFM, Keren, et al. Recognition and management of common, rare, and novel serum protein electrophoresis and immunofixation interferences. Clin Biochem 2018; 51: 72–79. https://doi.org/10.1016/j.clinbiochem.2017.08.013

6. Baloda, McCreary, Goscicki, et al. Tixagevimab plus cilgavimab does not affect the interpretation of electrophoretic and free light chain assays. Am J Clin Pathol 2023; 159(1): 10–13. https://doi.org/10.1093/ajcp/aqac137

7. A review of the role of artificial intelligence in healthcare. Int J Med Inf 2023; 176: 105065.

8. Radha, Midunkumar, Muralibabu, et al. Role of artificial intelligence in big data analytics. Int J Adv Res Sci Commun Technol 2024; 4: 586–591. https://doi.org/10.48175/IJARSCT-170

9. Lin, Paul, Guerra, et al. The frontiers of smart healthcare systems. Health Care 2024; 12(23): 2330. https://doi.org/10.3390/healthcare12232330

10. Khosravi, Fuchs. Artificial intelligence–driven cancer diagnostics: enhancing radiology and pathology through reproducibility, explainability, and multimodality. Cancer Res 2025; 85(13): 2356–2367. https://doi.org/10.1158/0008-5472.CAN-24-3630

11. Lawrence, Wang, Mendez. Improving diagnostic accuracy and efficiency through AI in medical image interpretation. Lancet Digit Health 2025; 6(3): e210–e219. https://doi.org/10.1016/j.lanwd.2025.02.007

12. Shorten, Khoshgoftaar. A survey on image data augmentation for deep learning. J Big Data 2019; 6: 60. https://doi.org/10.1186/s40537-019-0197-0

13. Chen, et al. Artificial intelligence for diagnostic workflow optimization: a systematic review and meta-analysis. npj Digit Med 2024; 7(1): 28. https://doi.org/10.1038/s41746-024-01328-w

14. Shen, et al. Achieving a new artificial intelligence system for serum protein electrophoresis to recognize M-spikes. ACS Omega 2025; 10(4): 5770–5777. https://doi.org/10.1021/acsomega.4c09327

15. Chabrun, Dieu, Ferre, et al. Achieving expert-level interpretation of serum protein electrophoresis through deep learning driven by human reasoning. Clin Chem 2021; 67(10): 1406–1414. https://doi.org/10.1093/clinchem/hvab133

16. Jiang, et al. Expert-level immunofixation electrophoresis image recognition based on explainable and generalizable deep learning. Clin Chem 2023; 69(2): 130–139. https://doi.org/10.1093/clinchem/hvac190

17. Oualil, Bencherif, Said Ouatik. Shallow and deep learning classifiers in medical image analysis. Eur Radiol Exp 2024; 8(1): 31. https://doi.org/10.1186/s41747-024-00428-2

18. Aldamani, Abuhani, Shanableh. LungVision: X-Ray imagery classification for On-Edge diagnosis applications. Algorithms 2024; 17(7): 280. https://doi.org/10.3390/a17070280

19. Kim, Cosa-Linan, Santhanam, et al. Transfer learning for medical image classification: a literature review. BMC Med Imag 2022; 22: 69. https://doi.org/10.1186/s12880-022-00793-7

20. Salehi, Khan, Gupta, et al. A study of CNN and transfer learning in medical imaging: advantages, challenges, future scope. Sustainability 2023; 15(7): 5930. https://doi.org/10.3390/su15075930

21. Kingma. Adam: a method for stochastic optimization. ■■■, arXiv preprint, 2014. arXiv:1412.6980.

22. Goodfellow, Bengio, Courville. Deep learning. Cambridge, MA: MIT Press, 2016, pp. 190–192.

23. Prechelt. Early stopping - but when? In: Montavon, Orr, Müller (eds). Neural Networks: Tricks of the Trade. 2nd ed. Berlin: Springer, 2012, pp. 53–67. https://doi.org/10.1007/978-3-642-35289-8_5

24. Johnson, Khoshgoftaar. Survey on deep learning with class imbalance. J Big Data 2019; 6: 27. https://doi.org/10.1186/s40537-019-0192-5

25. Rainio, Teuho, Klén. Evaluation metrics and statistical tests for machine learning. Sci Rep 2024; 14: 6086. https://doi.org/10.1038/s41598-024-56706-x

26. Suara, Jha, Sinha, et al. Is Grad-CAM explainable in medical images? In: Computer Vision and Image Processing. Communications in Computer and Information Science. Springer, 2024; 2009: 124–135. https://doi.org/10.1007/978-3-031-58181-6_11