Artificial intelligence-driven eye tracker models for Alzheimer's disease diagnosis: A systematic review and meta-analysis

Abstract

Background

Diagnosis of Alzheimer's disease (AD) is crucial for effective intervention and care planning. Recently, artificial intelligence-driven eye-tracking (AI-driven ET) tools have emerged as promising diagnostic aids.

Objective

To evaluate the diagnostic accuracy of AI-driven ET models for AD detection.

Methods

A systematic review and meta-analysis were conducted according to the PRISMA 2020 guidelines. Multiple databases and grey literature sources were searched up to March 2025. Data were analyzed with Meta-Disc 1.4 and R software. This meta-analysis was registered in PROSPERO (CRD420251020284).

Results

Ten papers were included in the narrative synthesis and eight in the meta-analysis. Our systematic review found that most studies reported moderate to good accuracy of AI-driven ET tools in AD detection. The meta-analysis revealed that AI-driven ET tools achieved a sensitivity of 0.75 [95% CI: 0.67; 0.79], specificity of 0.75 [95% CI: 0.67; 0.81], positive likelihood ratio of 3.29 [95% CI: 2.36; 4.59], negative likelihood ratio of 0.36 [95% CI: 0.27; 0.48], diagnostic odds ratio of 10.40 [95% CI: 5.58; 19.39], and area under the ROC curve of 0.81. Deep learning seems to have better performance than supervised machine learning (SML). Among classification algorithms, support vector machines appear most robust across studies. The meta-regression identified population size, patient preparation, measurement systems, AI techniques, and SML algorithms as significant sources of heterogeneity.

Conclusions

Based on the available case-control studies, AI-driven ET tools show moderate to good diagnostic accuracy for distinguishing AD patients from healthy controls. However, evidence for effective screening in broader populations is lacking. Further research is needed to confirm these results across diverse clinical settings and strengthen model robustness.

Keywords

Alzheimer's disease, artificial intelligence, deep learning, diagnosis, eye tracking, machine learning

Introduction

Alzheimer's disease (AD) is a progressive neurodegenerative disorder characterized by an insidious onset and gradual decline in cognitive and behavioral functions, including memory, comprehension, language, attention, reasoning, and judgment.¹ As the global population ages, AD has become a major public health concern, affecting millions of individuals worldwide.¹ The incidence of the disease doubles with each passing decade.¹ Furthermore, AD has recently emerged as one of the leading causes of death among individuals aged 70 and older.² The main challenge for clinicians is the difficulty of timely and precise clinical diagnosis of AD, particularly in its early stages.³ Given that therapeutic interventions are most effective before neuronal degeneration occurs in the brain, the need for earlier diagnosis of AD has become urgent.⁴ Over the past decade, significant progress has been made in this field. Economical and non-invasive tests, including blood, cerebrospinal fluid, saliva, urine, speech, and ocular tests, have been proposed as methods for the early detection of AD.^2,5,6 Among ocular tests, eye-tracking (ET) methods have recently emerged as promising tools for the diagnosis and classification of AD.^7–16 This method objectively quantifies eye movements and tracks the position of a subject's gaze.^7–16 Indeed, recent studies have demonstrated that AD patients exhibit distinct patterns in eye movement behaviors, such as difficulties in maintaining stable fixation, deficient smooth pursuit, impaired saccades, and delayed vergence.¹⁷ Furthermore, the integration of artificial intelligence (AI), regardless of the specific technique or model, has enhanced the diagnostic capabilities of ET data, supporting earlier intervention in AD.^7–16

However, existing systematic reviews and meta-analyses have mainly explored the use of AI-based ET technologies in dementia, with no prior studies focusing exclusively on AD.¹⁸ Thus, to date, the effectiveness of AI-driven ET methods for AD diagnosis remains insufficiently supported by robust evidence. Considering this gap in the literature, this meta-analysis aims to synthesize the available evidence and offer a clearer understanding of the potential of this emerging tool in the detection of AD.

Methods

This systematic review and meta-analysis followed the recommendations provided in the Cochrane Handbook for Systematic Reviews of Interventions and complied with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines. The clinical question was formulated using the PICO framework (Population, Intervention, Comparison, Outcome) (Supplemental Table 1). This meta-analysis was registered in PROSPERO (CRD420251020284).

Search strategy

Two researchers independently performed a literature search on PubMed, Google Scholar, Scopus, PsycINFO, Embase, Institute of Electrical and Electronics Engineers (IEEE), and the Cochrane Library to identify relevant studies. The search string was as follows: ((“Alzheimer's disease” OR “cognitive impairment” OR “dementia” OR “Cognitive dysfunction” OR “Cognitive decline” OR “cognitive disorders” OR “Major neurocognitive disorder”) AND (“eye-tracking” OR “gaze-tracking” OR “eye movement” OR “saccade” OR “eye task”) AND (“Artificial intelligence” OR “Deep learning” OR “Machine learning” OR “Neural Networks” OR “algorithms” OR “Natural Language Processing”)). We examined the reference lists of relevant review articles and systematic reviews, with or without meta-analysis, to identify additional studies. We also searched OpenGrey for conference papers, unpublished papers, theses, and dissertations. Technology-related grey literature databases such as Science.gov, arXiv, medRxiv, and ProQuest, as well as AI-focused journals from the ACM Digital Library, were explored. No restrictions were applied regarding the date or language. The search covered all records from inception up to March 2025. Records were imported into Mendeley to remove duplicates. Screening was conducted in two phases using the Covidence platform (https://www.covidence.org/). In the first phase, two independent authors manually screened the titles and abstracts. In the second phase, full texts were independently assessed for eligibility by the same two authors. Any discrepancies were resolved through discussion between the researchers.

Eligibility criteria

We included all studies that met the following criteria: (i) Case-control, cohort, or cross-sectional studies; (ii) Adults diagnosed with AD using any recognized diagnostic criteria; (iii) Utilization of AI-driven tools related to ET for AD diagnosis, regardless of the type, model of AI, or ET technique; (iv) Availability of data for qualitative or quantitative analysis, including accuracy, or sensitivity (Recall or True Positive Rate), or specificity (True Negative Rate), or precision (Positive Predictive Value), or negative likelihood ratio (LR-), or positive likelihood ratio (LR+), or diagnostic odds ratio (DOR), or area under the ROC curve (AUC).

We excluded from our study: (i) Review articles, systematic reviews, or case reports; (ii) Adults with cognitive disorders other than AD; (iii) Animal models with AD; (iv) Studies using only ET for AD diagnosis without AI; (v) Studies using AI for AD diagnosis with techniques other than ET.

Data extraction

Data extraction was performed using Microsoft Excel Office 16. The relevant information and variables of interest were: (i) Authors, date of publication, origin, data source, study design and setting; (ii) Characteristics of AD cases (age, gender, disease duration, education years, diagnostic criteria); (iii) ET tools; (iv) AI models and algorithms; (v) Available findings including accuracy, sensitivity, specificity, F1 score, precision, AUC, true positive (TP), true negative (TN), false positive (FP), and false negative (FN).

When TP, TN, FP, and FN were unavailable, we calculated them using various formulas¹⁹ based on the available data and entered the results into Microsoft Excel Office 16 (Supplemental Table 2).
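The back-calculation described above can be sketched as follows. This is a minimal illustration, not the exact procedure of reference 19 or Supplemental Table 2: it assumes that sensitivity, specificity, and both group sizes are reported, and the function name `counts_from_rates` is hypothetical. The sensitivity (0.86) and group sizes (38 AD, 68 HC) are taken from Tables 1 and 2 for Zuo et al.; the specificity of 0.90 is an assumed input chosen for illustration.

```python
def counts_from_rates(sensitivity, specificity, n_cases, n_controls):
    """Back-calculate a 2x2 table from reported rates and group sizes.

    Assumes the rates were computed on the full case and control groups;
    rounding to whole participants can shift a cell by one.
    """
    tp = round(sensitivity * n_cases)      # AD cases correctly flagged
    fn = n_cases - tp                      # AD cases missed
    tn = round(specificity * n_controls)   # controls correctly cleared
    fp = n_controls - tn                   # controls wrongly flagged
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Illustrative inputs: specificity 0.90 is an assumption, not a reported value.
table = counts_from_rates(0.86, 0.90, n_cases=38, n_controls=68)
print(table)
```

When only accuracy and precision are reported, a different set of identities is needed, so a reconstruction of this kind should always be checked against the reported accuracy.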

Quality assessment and publication bias

Two reviewers independently evaluated the methodological quality of the studies using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool. This tool was selected because it targets diagnostic accuracy studies, aligning closely with the goals of our systematic review and meta-analysis. QUADAS-2 covers four domains: patient selection, index test, reference standard, and flow and timing. It assesses risk of bias within each domain and examines the applicability of findings in the first three domains. The reviewers rated each signaling question within the domains as “low”, “high”, or “unclear” risk of bias. Conflicts were resolved through consensus. The quality assessment was carried out using R software version 4.4.1 (R Foundation for Statistical Computing, Vienna, Austria).²⁰ Deeks' funnel plot and Egger's method were used to evaluate potential publication bias when the study count exceeded 10.²⁰ The Egger test significance level was set at 0.1.

Data analysis

As stated in the Cochrane Handbook for Systematic Reviews of Interventions, at least two studies are required to perform a meta-analysis.²¹ Given the potential differences among studies, we opted for a random-effects model with inverse variance weighting instead of a fixed-effects model. To generate the pooled values for sensitivity, specificity, LR+, LR-, and DOR, we used Meta-Disc software version 1.4, with the TP, TN, FP, and FN counts as input. Because sensitivity and specificity are closely related, changing the test cut-off to improve one usually lowers the other. To address this, a bivariate model was applied to jointly estimate sensitivity and specificity, using R version 4.4.1 (R Foundation for Statistical Computing, Vienna, Austria) with the mada package and the reitsma() function.²⁰ A hierarchical summary receiver-operating characteristic (HSROC) curve was also generated using R version 4.4.1.²⁰ The bivariate and HSROC models are more robust because they account for both within-study and between-study variability. The AUC was estimated from the bivariate analysis and the HSROC curve to evaluate diagnostic performance. To assess heterogeneity across studies, the Cochran Q test (with a significance level of p < 0.1) and the I² statistic were used. The levels of heterogeneity were categorized as follows: minimal (I², 0–25%), low (25–50%), moderate (50–75%), and high (>75%). To evaluate the influence of individual studies on the overall pooled estimates, we conducted a leave-one-out sensitivity analysis, sequentially removing one study at a time and re-running the meta-analysis to check whether any single study disproportionately affected the pooled estimates. A subgroup analysis according to AI models and algorithms, along with patient preparation, measurement systems, tasks performed, and data collected, was conducted if high heterogeneity was observed.
In the subgroup analysis, sensitivity and specificity were estimated using a bivariate model implemented in R software. The threshold effect was analyzed using the Spearman correlation coefficient. Additionally, meta-regression was performed to identify the sources of heterogeneity between the studies.²⁰ The significance level for the pooled effect was set at p < 0.05. An effect was considered statistically significant when p = 0.05, provided that the 95% confidence interval (95% CI) did not include 0.
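As a rough illustration of the univariate part of this analysis, the sketch below computes per-study measures from TP, FP, TN, and FN and pools the log DOR with inverse-variance random-effects weights, reporting Cochran's Q and I². It assumes the DerSimonian-Laird estimator of between-study variance and a 0.5 continuity correction for tables with a zero cell; the source does not state either choice, and the function names are hypothetical. It is not a substitute for Meta-Disc or the bivariate model fitted with mada. The three 2x2 tables are the counts listed in Table 2 for Li et al., Zuo et al., and Biondi et al.

```python
import math


def accuracy_measures(tp, fp, tn, fn, correction=0.5):
    """Per-study diagnostic measures from a 2x2 table.

    If any cell is zero, `correction` is added to every cell so the
    ratios and the log-DOR variance stay finite (assumed convention).
    """
    if 0 in (tp, fp, tn, fn):
        tp, fp, tn, fn = (x + correction for x in (tp, fp, tn, fn))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
        "dor": (tp * tn) / (fp * fn),
        "var_log_dor": 1 / tp + 1 / fp + 1 / tn + 1 / fn,
    }


def pool_random_effects(effects, variances):
    """Inverse-variance random-effects pooling (DerSimonian-Laird)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # I² in percent
    return {"pooled": pooled, "se": se, "q": q, "i2": i2, "tau2": tau2}


# (TP, FP, TN, FN) as listed in Table 2 for three of the included studies.
tables = [(136, 10, 97, 30), (33, 7, 61, 5), (24, 6, 37, 2)]
measures = [accuracy_measures(*t) for t in tables]
result = pool_random_effects(
    [math.log(m["dor"]) for m in measures],
    [m["var_log_dor"] for m in measures],
)
pooled_dor = math.exp(result["pooled"])
ci_low = math.exp(result["pooled"] - 1.96 * result["se"])
ci_high = math.exp(result["pooled"] + 1.96 * result["se"])
print(round(pooled_dor, 2), round(ci_low, 2), round(ci_high, 2), round(result["i2"], 1))
```

Pooling is done on the log scale because the sampling distribution of the log DOR is closer to normal; the same routine can be applied to log LR+ and log LR- with their own variance formulas.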

Results

Search outcomes

As shown in Figure 1, a total of 203 records were retrieved from electronic databases, 14 records were sourced from grey literature databases and reference searches, and 542 from Google Scholar. After excluding duplicates, the remaining 93 titles and abstracts were independently reviewed by two investigators, leading to the evaluation of 16 full texts. Of these, 2 papers could not be retrieved. Among the remaining 14 studies, 4 were excluded, resulting in a final selection of 10 reports.

Figure 1.

PRISMA flow diagram illustrating the processes of search and screening.

The studies conducted by Yin et al.²² and Liu et al.²³ were excluded because they involved the same population and AI methods as those used by Sun et al.,¹⁵ which avoided duplication of data and potential bias in the analysis. Although the study by Zuo et al.¹⁴ drew on the same population source as Sun et al.,¹⁵ it was included in the meta-analysis because it used different diagnostic criteria for AD and different AI tools and analytical approaches, thereby providing independent, non-duplicative data.

Quality evaluation

Figure 2 illustrates our evaluation of the risk of bias and applicability across each domain of the included studies. In total, two studies were judged to have a high risk of bias, five had a low risk, and three raised some concerns. While patient selection, the index test, and the reference standard were generally rated as low risk, there were more significant concerns regarding flow and timing.

Figure 2.

Risk of bias summary.

Study and subjects’ characteristics

Table 1 presents the characteristics of the studies included in the analysis. The geographical distribution revealed that the majority were conducted in China (n = 4), followed by Canada (n = 2), Taiwan region (n = 1), Brazil (n = 1), Spain (n = 1), and another study from Argentina (n = 1). All the studies were case-control studies conducted with hospital-based populations. The studies were published between 2018 and 2024. The sample size for AD ranged from 19 to 166, while the sample size for healthy controls (HC) ranged from 29 to 107. Nine studies excluded participants with confounding factors such as a history of mental disorders, illicit drug abuse, acute or chronic liver and kidney dysfunction, malignant tumors, or other severe underlying health conditions. The mean age of AD cases ranged from 68 to 76 years, with the percentage of females varying from 13.00% to 66.67%. AD was diagnosed using the National Institute on Aging and Alzheimer's Association criteria in 4 studies, the National Institute of Neurological and Communicative Disorders and Stroke – AD and Related Disorders Association criteria in 2 studies, and the DSM-IV criteria in 2 studies, one of which also used the National Institute of Neurological and Communicative Disorders and Stroke – AD and Related Disorders Association criteria. However, the diagnostic criteria for AD were not provided in 3 studies.^9,11,13

Table 1.

Study characteristics and main results.

Study	Country	Study design/ Setting	Alzheimer disease (AD) cases/ healthy controls (HC)/Duration of AD (standard deviation (SD)) [Min – Max]	Gender / age (SD) [Minimum – Maximum]	Education years (SD) [Minimum – Maximum]	Data source	AD criteria	Validation/Main results
Li et al. (2024)¹⁶	China	Case-control Hospital based study	166 AD /107 HC/Duration: not available (NA)	Female: 61.20% Mean age: 70.46 years [50.00- 90.00 years]	9.00 [6.00–18.00 years]	Department of Geriatrics, Shanghai Sixth People's Hospital, the Department of Neurology, Shanghai Tong Ji Hospital, and the community between April 2021 to July 2023.	National Institute of Aging and Alzheimer's Association (NIA-AA criteria)	Internal validation The supervised ML performed well, showing a strong ability to differentiate between AD and HC, with improved results in the validation phase.
Sun et al (2022)¹⁵	China	Case-control Hospital based study	108 AD/ 102 HC Duration: NA	NA Age: [40.00–92.00 years]	NA	Cognitive impairment clinics, Tianjin HuanHu Hospital, Tianjin, China From September 2020 to September 2021	NIA-AA criteria	Internal validation Both machine learning and deep learning methods are capable of enabling intelligent diagnosis of AD. Furthermore, incorporating autoencoder networks along with weight-adaptive fusion layers for extracting and combining features notably enhances the performance of the proposed denoising sparse autoencoder
Zuo et al (2024)¹⁴	China	Case-control Hospital based study	38 AD/ 68 HC Duration: NA	Female: 60.50% Mean age: 68.00 (7.00) [54.00–80.00] years	NA	Cognitive impairment clinics, Tianjin Huanhu Hospital, Tianjin, China	NINCDS-ADRDA Alzheimer's Criteria	Internal validation The Multi-Scale (MC)-convolutional neural network (CNNs) model effectively captured distinctive eye-tracking features from both AD patients and HC, transforming them into a distance-based representation to achieve improved and reliable classification performance in identifying visual attention impairments associated with AD. Additionally, this study implemented a data augmentation technique by pairing heatmaps to expand the training dataset. Experimental results on recruited participants confirmed the strong performance of the proposed approach.
Heidarzadeh et al (2023)¹³	Canada	Case-control Hospital based study	30 AD / 30 HC Duration: NA	Female: 40.00% Mean age: 70.00 (15.00) years	12.00 (2.00) years	University of Pittsburgh School of Medicine	NA	Internal validation Unlike other costly or invasive techniques, this study presents a non-invasive, cost-effective machine learning-based approach that offers a promising avenue for clinicians and researchers in the early diagnosis of AD. Moreover, the proposed method can serve as a complementary tool alongside approaches based on linguistic feature analysis.
Song et al (2024)¹²	China	Case-control Hospital based study	111 AD/66 HC Duration: NA	Female: 59.50% Mean age: 76.00 (9.00) years	9.00 (7.00) years	Memory Clinic, Department of Geriatrics	NIA-AA criteria	Internal validation In this study, a machine learning-based three-classification model was developed to differentiate between AD, Mild Cognitive Impairment, and normal cognition using eye movement features. The model achieved a classification accuracy of 68.20%, outperforming most previous models. Additionally, this research highlights the relationships between eye movement parameters and various cognitive subdomains, including attention, episodic memory, immediate memory, language, and visuospatial abilities.
Lin et al (2024)¹¹	Taiwan region	Case-control Hospital based study	31 AD /64 HC Duration: NA	NA	NA	Dementia Center and Department of Neurology, Shuang-Ho Hospital, Taipei Medical University	NA	Internal validation This study presents a novel non-invasive approach using eye movement analysis, combined with advanced machine learning algorithms, to detect early signs of dementia, particularly AD. By analyzing eye movement patterns of participants during specific tasks, the study identifies distinct features that could serve as early biomarkers for dementia. The use of digital eye tracking technologies offers a rapid, cost-effective solution for early diagnosis, with potential to improve patient care and management.
Biondi et al (2018)¹⁰	Argentina	Case-control Hospital based study	26 AD/ 43 HC Duration: NA	Female: NA Mean age: 69.00 (7.30) years	NA	Hospital Municipal of Bahía Blanca, Buenos Aires, Argentina	DSM-IV	Internal validation The deep-learning model demonstrated promising results in distinguishing between eye movement behaviors of AD patients and healthy controls, with high accuracy.
Field et al (2020)⁹	Canada	Case-control Hospital based study	19 AD/ 39 HC Duration: NA	NA	NA	University of British Columbia	NA	Internal validation Machine-learning mediated analysis of eye-tracking data achieved promising AD diagnosis performance compared to controls with AUC of 0.71 (95% CI 0.59–0.82). *Increasing the size of the corpus with ongoing recruitment, and combining demographic and clinical data alongside multimodal feature fusion, may help to improve AD diagnosis
Lage et al (2020)⁸	Spain	Case-control Hospital based study	33 AD/ 29 HC Duration: 4.94 ± 1.73	Female: 66.67% Mean age: 68.17 (6.96) years	NA	Cognitive Disorders Unit of the Marqués de Valdecilla University Hospital (Santander, Spain)	NIA-AA criteria	Internal validation Oculomotor behavior can reflect the neuroanatomical distribution of pathology in Alzheimer's disease (AUC of 97.50% for the differentiation between Alzheimer's disease vs. controls) *Machine learning approaches can enhance the clinical applicability of eye-tracking data, providing high accuracy in distinguishing between AD and HC
Pereira et al (2020)⁷	Brazil	Case-control Hospital based study	33 AD/ 43 HC Duration: NA	Female: 13% Mean age: 72.97 (6.26)	10.42 (5.23)	university hospital in Sao Paulo, Brazil	NINCDS-ADRDA DSM-IV	Internal validation Eye movement analysis, combined with machine learning techniques, offers promising potential for early detection and differentiation of cognitive impairments such as mild cognitive impairment and AD * The use of visual attention and search tasks can provide important insights into the cognitive status of individuals and be used in clinical settings for diagnosis and monitoring. * RF was identified as the best model, with 13 features, achieving a good combination of accuracy, precision, and recall.

AD: Alzheimer disease; NIA-AA: National Institute of Aging and Alzheimer's Association; HC: healthy controls; NA: not available; SD: standard deviation; MC: multi-scale; CNN: convolutional neural networks.

Narrative synthesis

Ten studies were included for these outcomes.^7–16 Table 2 provides a summary of the specific techniques and key findings related to the use of AI tools and ET methods for diagnosing AD. Supplemental Table 3 summarizes patient preparation, measurement systems, tasks performed, and data collected across the studies. Four studies implemented structured preparation with precise physical control,^12,14–16 while the remaining studies applied minimal or unspecified preparation.^7–11,13 Structured preparation included measures such as head-pose calibration, chin or forehead rests to limit movement, fixed viewing distances, and testing in quiet, well-lit rooms with controlled lighting. Some studies excluded participants with uncorrected vision or significant ophthalmologic disease, while others allowed the use of usual corrective lenses but did not perform formal refractive correction. In certain cases, poor eye-tracker calibration was also an exclusion criterion. Measurement systems were based on tablet- or screen-based ET systems in two studies,^12,16 3D systems in two other studies,^14,15 audio-derived eye-movement estimation in two studies,^9,13 task-specific ET during reading or cognitive tests in two other studies,^7,9 and oculomotor tasks in one study.⁸ The measurement system was not described in one study.¹¹ Regarding the tasks used to evaluate ET, five studies employed visual paired comparison (VPC) or recognition tasks,^7,8,14–16 two studies focused on oculomotor or eye movement tasks,^11,12 two studies used picture description with speech-linked eye tracking,^9,13 and one study examined reading tasks.¹⁰ Regarding data collection, five studies focused on eye movement and oculomotor features,^8,10–12,16 including standard kinematic parameters such as saccades and anti-saccades; two studies used fixation maps and heatmaps^14,15; and three studies employed speech, picture description, and cognitive assessments.^7,9,13

Table 2.

AI results for diagnosing Alzheimer's disease via eye-tracking.

Study	Eye tracking tools/Tasks	Artificial intelligence (AI) type and model	Accuracy [95% confidence interval (CI)]	Sensitivity (Recall) [95% CI]	F1 score	Positive predictive value (precision)	Area under the curve [95% CI]	True positive	False positive	True negative	False negative
Li et al (2024)¹⁶	-Front-facing camera (Xiaomi Mi 5 pro, sample rate of 30 Hz)/ -Visual paired comparisons (VPC tasks) -Anti-saccade task	- Supervised machine learning (SML) (Classification: Logistic Regression (LR))	0.86 [0.85, 0.86]	0.82 ± 0.10	0.87 ± 0.15	0.93 [0.84, 1.02]	0.91 [0.83, 0.99]	136	10	97	30
Sun et al (2022)¹⁵	-Self-designed 3D eye-tracking system /-3D paired VPC tasks (3D stereo images) -Original scene vs modified scene (one object added/removed)	- SML (Feature Extraction: Fourier Coefficients (FOU), Karhunen-Loeve Coefficients (KAR); Classification: Support Vector Machine (SVM), K-Nearest Neighbor (KNN)) - Deep learning (DL) (Feature Extraction: Convolutional Neural Networks (CNNs) (Visual Geometry Group-16 network and Resnet-18 network) Denoising sparse autoencoder (DSAE) Classification: fully connected classification layer (FC))	SML: -FOU_SVM: 0.74 ± 0.04 -FOU_KNN: 0.63 ± 0.07 -KAR_SVM: 0.72 ± 0.03 -KAR_KNN: 0.69 ± 0.04 DL: -VGG16_FC: 0.80 ± 0.02 -Resnet18_FC: 0.81 ± 0.03 -DSAE: 0.85 ± 0.05	SML: -FOU_SVM: 0.78 ± 0.04 -FOU_KNN: 0.65 ± 0.03 -KAR_SVM: 0.80 ± 0.02 -KAR_KNN: 0.68 ± 0.05 DL: -VGG16_FC: 0.92 ± 0.02 -Resnet18_FC: 0.87 ± 0.01 -DSAE: 0.89 ± 0.04	SML: -FOU_SVM: 0.74 ± 0.03 -FOU_KNN: 0.64 ± 0.03 -KAR_SVM: 0.78 ± 0.04 -KAR_KNN: 0.66 ± 0.05 DL: -VGG16_FC: 0.87 ± 0.03 -Resnet18_FC: 0.86 ± 0.04 -DSAE: 0.88 ± 0.04	SML: -FOU_SVM: 0.71 ± 0.03 -FOU_KNN: 0.63 ± 0.02 -KAR_SVM: 0.76 ± 0.04 -KAR_KNN: 0.65 ± 0.04 DL: -VGG16_FC: 0.82 ± 0.05 -Resnet18_FC: 0.85 ± 0.03 -DSAE: 0.87 ± 0.04	SML: -FOU_SVM: 0.95 -FOU_KNN: 0.92 -KAR_SVM: 0.93 -KAR_KNN: 0.92 DL: -VGG16_FC: 0.95 -Resnet18_FC: 0.97 -DSAE: 0.98	SML: FOU_SVM: 84 FOU_KNN: 70 KAR_SVM: 86 KAR_KNN: 73 DL: VGG16_FC: 99 Resnet18_FC: 94 DSAE: 96	SML: FOU_SVM: 31 FOU_KNN: 41 KAR_SVM: 27 KAR_KNN: 40 DL: VGG16_FC: 22 Resnet18_FC: 17 DSAE: 14	SML: FOU_SVM: 68 FOU_KNN: 61 KAR_SVM: 75 KAR_KNN: 62 DL: VGG16_FC: 80 Resnet18_FC: 85 DSAE: 88	SML: FOU_SVM: 24 FOU_KNN: 38 KAR_SVM: 22 KAR_KNN: 35 DL: VGG16_FC: 9 Resnet18_FC: 14 DSAE: 12
Zuo et al (2024)¹⁴	-Self-designed 3D eye-tracking system /-3D paired VPC tasks	DL: Multi-Scale (MC)- CNNs	0.84 ± 0.05	0.86 ± 0.03	0.83 ± 0.06	0.82 ± 0.01	0.90 ± 0.06	33	7	61	5
Heidarzadeh et al (2023)¹³	-Simple camera/-Audio recordings -Text transcripts -Eye tracking combined to speech analysis -Cookie Theft picture description task	SML: Classification (LR/KNN/SVM/random forest (RF)) Decision trees classification (DTC)	LR: 0.78 KNN: 0.68 SVM: 0.80 RF: 0.72 DTC: 0.73	LR: 0.76 KNN: 0.40 SVM: 0.73 RF: 0.60 DTC: 0.63	LR: 0.77 KNN: 0.55 SVM: 0.78 RF: 0.68 DTC: 0.70	LR: 0.79 KNN: 0.93 SVM: 0.86 RF: 0.80 DTC: 0.80	NA	LR: 23 KNN: 12 SVM: 22 RF: 18 DTC: 19	LR: 6 KNN: 1 SVM: 4 RF: 4 DTC: 5	LR: 24 KNN: 29 SVM: 26 RF: 26 DTC: 25	LR: 7 KNN: 18 SVM: 8 RF: 12 DTC: 11
Song et al (2024)¹²	A desktop-mounted eye-tracker monitored eye movements using the 250 Hz pupil-corneal reflex mode/-Fixation task, prosaccade task, the anti-saccade task	-SML: Classification: Gradient Boosting Classifier (GBC) (Ensemble Learning (Boosting))/Light Gradient Boosting Machine (LightGBM) (Ensemble Learning (Boosting))/ RF Classifier/ Extra Trees Classifier / LR	GBC: 0.68 LightGBM: 0.65 RF Classifier: 0.61 Extra Trees Classifier: 0.66 LR: 0.61	GBC: 0.67 LightGBM: 0.64 RF Classifier: 0.60 Extra Trees Classifier: 0.67 LR: 0.60	GBC: 0.66 LightGBM: 0.63 RF Classifier: 0.60 Extra Trees Classifier: 0.64 LR: 0.58	GBC: 0.67 LightGBM: 0.65 RF Classifier: 0.62 Extra Trees Classifier: 0.66 LR: 0.60	GBC: 0.87 LightGBM: 0.86 RF Classifier: 0.83 Extra Trees Classifier: 0.88 LR: 0.84	GBC: 74 LightGBM: 71 RF Classifier: 67 Extra Trees Classifier: 74 LR: 67	GBC: 37 LightGBM: 38 RF Classifier: 41 Extra Trees Classifier: 38 LR: 44	GBC: 29 LightGBM: 28 RF Classifier: 25 Extra Trees Classifier: 28 LR: 22	GBC: 37 LightGBM: 40 RF Classifier: 44 Extra Trees Classifier: 37 LR: 44
Lin et al (2024)¹¹	-Simple camera/ -Oculomotor tasks: image description and article reading	-SML Classification: LR for article reading DTC for image description	-Logistic Regression: 0.94 -DTC: 0.82	-Logistic Regression: 1.00 -DTC: 1.00	NA	NA	-Logistic Regression: 1.00 -DTC: 0.91	-Logistic Regression: 31 -DTC: 31	-Logistic Regression: 6 -DTC: 17	-Logistic Regression: 58 -DTC: 47	-Logistic Regression: 0 -DTC: 0
Biondi et al (2018)¹⁰	-Simple camera /-Saccade test and fixation duration during the reading of each sentence	DL: DSAE	0.89	0.91	NA	NA	NA	24	6	37	2
Field et al (2020)⁹	-Simple camera /-Participants described the “Cookie Theft Picture” from the Boston Aphasia Battery (fixations and saccades from eye tracking)	SML: KNN LR *RF	NA	NA	NA	NA	Total: 0.71	NA	NA	NA	NA
Lage et al (2020)⁸	Eye movement recordings were carried out with OSCANN, an eye-tracking device based on video-oculography technology/ NA	SML: NA	0.97	NA	NA	NA	0.97	NA	NA	NA	NA
Pereira et al (2020)⁷	-Tobii TX300 Eye tracker/-Familiarization Phase (Participants see a single image on a screen) -Test Phase (After familiarization, four images are shown simultaneously)	SML: SVM RF	SVM: 0.74 ± 0.11 RF: 0.78 ± 0.12	SVM: 0.62 ± 0.22 RF: 0.71 ± 0.17	SVM: 0.66 ± 0.17 RF: 0.67 ± 0.23	SVM: 0.76 ± 0.18 RF: 0.83 ± 0.20	SVM: 0.80 ± 0.13 RF: 0.80 ± 0.18	SVM: 20 RF: 23	SVM: 6 RF: 5	SVM: 37 RF: 38	SVM: 13 RF: 10

AI: artificial intelligence; CI: confidence interval; FOU: Fourier Coefficients; KAR: Karhunen-Loeve Coefficients; SVM: Support Vector Machine; KNN: K-Nearest Neighbor; FC: fully connected classification layer; SML: supervised machine learning; DL: deep learning; LightGBM: Light Gradient Boosting Machine; GBC: Gradient Boosting Classifier; MC: Multi-Scale; CNNs: Convolutional Neural Networks; LR: Logistic Regression; RF: random forest; DTC: Decision trees classification; VPC: visual paired comparisons; DSAE: denoising sparse autoencoder.

The included studies used a variety of ET tools, ranging from manual data extraction methods in ET tasks to front-facing cameras and desktop-mounted devices, as well as more advanced systems such as the Tobii TX300 and OSCANN. Tasks varied across studies and included VPC, saccade and fixation tests, sentence reading, image description, and the Cookie Theft picture task. While some studies focused on standard 2D tasks, others implemented 3D ET setups. Regarding AI tools, supervised machine learning (SML) (82.14%) was more frequently employed than deep learning (DL) (17.86%) for the diagnosis of AD. Unsupervised machine learning (ML) was not used in any of the included studies. All SML models were classification models. Nonetheless, 4.35% of the models remained unclassified. Among the classification algorithms, Logistic Regression (LR) and support vector machine (SVM) each had the largest share (21.10%), followed by k-Nearest Neighbors (KNN), Random Forest (RF), and Ensemble Learning other than RF (EL-), each contributing 15.80%. Decision tree classification (DTC) had the smallest share at 10.50%. In the context of DL, two studies employed convolutional neural networks (CNNs), one of which also incorporated a denoising sparse autoencoder (DSAE), while another study used a DSAE (Table 2). AI-driven ET tools showed accuracy of 0.61–0.97, sensitivity of 0.55–0.92, precision of 0.61–0.93, F1 scores of 0.55–0.88, and AUC values of 0.60–0.93.^7–9,11,12,14–16 Four studies using SML reported an AUC of 0.90–1.00 and three reported an AUC of 0.70–0.89. All studies using DL had an AUC of 0.90–1.00. All studies used internal validation. Detailed main findings of each study are summarized in Tables 1 and 2.

Meta-analysis

Eight studies were included in this outcome.^7,10–16

Performance of the AI-driven ET tools in AD detection

The bivariate analysis showed a sensitivity of 0.75 [95% CI: 0.67; 0.79] and specificity of 0.75 [95% CI: 0.67; 0.81] with significant heterogeneity between studies (I² = 76.4%). The univariate analysis indicated a sensitivity of 0.74 [95% CI: 0.72; 0.76] and specificity of 0.72 [95% CI: 0.70; 0.74] with significant heterogeneity (Figure 3(a) and (b)). The AI-driven ET tools had an LR+ of 3.29 [95% CI: 2.36; 4.59], an LR- of 0.36 [95% CI: 0.27; 0.48], and a DOR of 10.40 [95% CI: 5.58; 19.39], all exhibiting significant heterogeneity (Figure 3(c)–(e), respectively).

Figure 3.

(a) Forest plot of the sensitivity; (b) forest plot of the specificity; (c) forest plot of the LR+; (d) forest plot of the LR-; (e) forest plot of the DOR. LR: logistic regression; FOU_SVM: Fourier Coefficients_Support Vector Machine; FOU_KNN: Fourier Coefficients_k-Nearest Neighbors; KAR_SVM: Karhunen-Loeve Coefficients_Support Vector Machine; KAR_KNN: Karhunen-Loeve Coefficients_k-Nearest Neighbors; CNNs: convolutional neural networks; DSAE: denoising sparse autoencoder; RF: random forest; DTC: decision tree classification; GBC: Gradient Boosting Classifier; LGBM: Light Gradient Boosting Machine; ETC: Extra Trees Classifier.

Leave-one-out sensitivity analysis showed that no single study substantially influenced the pooled diagnostic accuracy metrics. Sensitivity ranged narrowly from 0.73 to 0.75, specificity from 0.71 to 0.74, LR+ from 3.13 to 3.51, LR- from 0.34 to 0.38, and DOR from 9.41 to 11.69 (Supplemental Table 4).
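The leave-one-out procedure can be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually run: it pools the log DOR with simple inverse-variance (fixed-effect) weights and a 0.5 continuity correction, whereas the models fitted in Meta-Disc and R may differ. All function names and the 2×2 counts are hypothetical.

```python
import math


def log_dor(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its variance (0.5 added to every cell)."""
    tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    return math.log((tp * tn) / (fp * fn)), 1 / tp + 1 / fp + 1 / fn + 1 / tn


def pooled_dor(tables):
    """Inverse-variance pooled DOR across a list of (tp, fp, fn, tn) tables."""
    effects = [log_dor(*table) for table in tables]
    weights = [1 / variance for _, variance in effects]
    pooled_log = sum(w * y for w, (y, _) in zip(weights, effects)) / sum(weights)
    return math.exp(pooled_log)


def leave_one_out(tables):
    """Re-pool the DOR once per study, each time omitting that study."""
    return [pooled_dor(tables[:i] + tables[i + 1:]) for i in range(len(tables))]


# Hypothetical (tp, fp, fn, tn) counts, not data from the included studies.
tables = [(40, 10, 10, 40), (30, 12, 15, 35), (55, 20, 18, 60), (25, 8, 9, 30)]
overall = pooled_dor(tables)
loo = leave_one_out(tables)
```

A narrow spread of the values in `loo` around `overall` corresponds to the situation described above, in which no single study drives the pooled estimate.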

The HSROC curve in Figure 4 exhibits a smooth shape without a distinct “shoulder and arm” pattern, suggesting the absence of a threshold effect. The Spearman correlation coefficient also indicated no threshold effect (r = −0.37; p = 0.07). The AUC was 0.81, indicating good diagnostic accuracy, and the small standard error (0.03) supports the precision of this estimate.

Figure 4.

HSROC curve illustrating the diagnostic performance of AI-driven ET tools in Alzheimer's disease detection. The plot shows the summary point representing the pooled sensitivity and specificity across studies, along with the 95% confidence region and 95% prediction region. Each circle corresponds to an individual study's sensitivity and specificity estimates. The spread of these points shows variability among studies. The prediction region indicates the expected range of diagnostic accuracy for future studies conducted under similar conditions.

Subgroup analysis according to AI models

SML models showed a sensitivity of 0.69 [95% CI: 0.65; 0.73] (I² = 78.00%), specificity of 0.72 [95% CI: 0.62; 0.81] (I² = 78.00%), AUC of 0.73, LR+ of 2.73 [95% CI: 1.96; 3.80] (I² = 92.40%), LR- of 0.47 [95% CI: 0.37; 0.62] (I² = 85.30%), and DOR of 6.51 [95% CI: 3.55; 11.96] (I² = 89.80%). DL had a sensitivity of 0.89 [95% CI: 0.85; 0.92], specificity of 0.84 [95% CI: 0.79; 0.88], AUC of 0.92, LR+ of 5.43 [95% CI: 4.35; 6.78], LR- of 0.13 [95% CI: 0.10; 0.18], and DOR of 43.69 [95% CI: 28.69; 66.53], showing no heterogeneity (I² = 0.00%).
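For readers less familiar with the I² statistic reported for each subgroup, the sketch below shows the usual definition based on Cochran's Q, I² = max(0, (Q − df)/Q) × 100. It is a generic illustration with hypothetical effect sizes and variances, not a reproduction of the Meta-Disc or R output; the function name is an assumption.

```python
def i_squared(effects, variances):
    """Cochran's Q and the I² percentage for a set of study effects.

    effects: per-study effect sizes (for example, log DOR values).
    variances: the corresponding within-study variances.
    """
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return q, 0.0  # identical effects: no observed heterogeneity
    return q, max(0.0, (q - df) / q) * 100


# Hypothetical values: two studies with clearly different effects.
q_value, i2_value = i_squared([0.0, 2.0], [0.5, 0.5])
```

In this toy case Q is 4 with one degree of freedom, so I² is 75%; identical effects would give 0%, as observed for the DL subgroup.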

Subgroup analysis according to classification algorithms among SML

The DTC demonstrated the highest sensitivity, specificity, LR+, and DOR, and the lowest LR-. However, the large heterogeneity and wide 95% CI for DOR, LR-, and LR+ indicate that these results are imprecise, suggesting that its diagnostic performance may vary considerably across studies. SVM and LR demonstrated the highest values for sensitivity, specificity, LR+, DOR, and AUC along with the lowest LR-. While LR results exhibited significant heterogeneity, the SVM results did not show significant heterogeneity across different metrics. In contrast, KNN, RF, and EL- models showed lower sensitivity, specificity, DOR, and LR+, and higher LR-. LR and RF analyses showed wide 95% CI for the LR+ and the DOR, reflecting imprecision and suggesting uncertainty in the estimates. Supplemental Tables 5–8 summarize the subgroup analysis according to classification algorithms.

Subgroup analysis according to patient preparation

When comparing patient preparation strategies, studies with structured preparation showed slightly higher sensitivity than those with minimal or unspecified preparation, indicating a marginally better ability to detect true positives. However, minimal preparation demonstrated higher specificity, a greater LR+, DOR, and AUC (Supplemental Table 9). This unexpectedly high performance in the minimal preparation group may reflect an overestimation of accuracy due to less controlled testing conditions, rather than a true superiority.

Subgroup analysis according to measurement systems

Accuracy and robustness varied across subgroups. Task-specific ET during reading/cognitive tests showed the weakest performance (AUC 0.48), with wide 95% CI in DOR and marked heterogeneity. Tablet/screen-based systems performed slightly better (AUC 0.61) but also displayed wide 95% CI and high heterogeneity. 3D systems achieved moderate accuracy (AUC 0.80), but their DOR estimates were imprecise with wide 95% CI. Audio-derived estimation showed similar accuracy (AUC 0.84), though interpretation remains limited by reliance on a single dataset and wide 95% CI. Oculomotor tasks demonstrated the highest and most robust accuracy (AUC 0.95), with narrow 95% CI, high DOR, and lower heterogeneity (Supplemental Table 10).

Subgroup analysis according to task performed

VPC and recognition tasks demonstrated moderate sensitivity and specificity and good overall accuracy, making them well suited for detecting AD. In contrast, oculomotor and eye movement tasks showed moderate sensitivity, lower specificity, and a low DOR, indicating they are less reliable overall. Picture description or speech-linked ET tasks exhibited lower sensitivity but high specificity, along with moderate LR+, suggesting they are particularly effective for confirming the absence of AD and provide good overall discriminative ability (Supplemental Table 11). Results from reading tasks could not be pooled, as they originated from a single study using one AI tool (Biondi et al.).¹⁰ Biondi et al. found that this task had high accuracy (Tables 1 and 2).

Subgroup analysis according to data collection

Eye movement and oculomotor features showed moderate sensitivity and specificity with acceptable discriminative ability. Fixation maps and heatmaps performed better, with higher sensitivity, specificity, LR+, and DOR. Speech, picture description, and cognitive tasks demonstrated lower sensitivity but high specificity, combined with the highest AUC and a good LR+ (Supplemental Table 12).

Meta-regression

The meta-regression showed that the significant sources of heterogeneity in our meta-analysis were population size, patient preparation, measurement systems, AI techniques, and SML algorithms. While SML algorithms and measurement systems contribute to reducing the effect on the outcome (with RDORs of 0.66 and 2.65, respectively), population size (≥100 AD), patient preparation, and AI techniques contribute to elevating the outcome (with RDORs of 7.14, 5.14, and 9.06, respectively). However, study origin, year of publication, criteria of AD diagnosis, and SML models were not sources of heterogeneity (Supplemental Table 13).

Discussion

Despite the growing interest in AI in the medical field, few studies have focused on AI-based ET in cognitive disorders. To the best of our knowledge, this is the first systematic review and meta-analysis evaluating the performance of AI-driven ET tools specifically for AD detection. Only one previous meta-analysis has assessed this tool in patients with dementia, highlighting the robust performance of various ML and DL algorithms, with accuracy, sensitivity, and specificity of 88%, 85%, and 86%, respectively.¹⁸ Our systematic review, including 10 studies, found that most studies reported high accuracy, sensitivity, precision, F1 score, and AUC values near 1, indicating high effectiveness of AI-driven ET tools in distinguishing AD patients from HC. Our meta-analysis of 8 studies showed a moderate to good overall performance of AI-driven ET tools, with a DOR of 10.40 and an AUC of 0.81. The AI-driven ET tool was able to accurately detect 75% of AD cases and exclude 75% of non-AD cases. In practical terms, this indicates that out of 100 patients with AD, the test would fail to detect approximately 25, and out of 100 individuals without AD, it would incorrectly classify about 25 as having the disease. However, our findings should be interpreted with caution for several reasons. First, all included studies were conducted in hospital-based settings, where cases and controls are typically well-defined. These hospital-based populations are more uniform than the general population, which is characterized by a wider variety of symptoms and comorbidities. This discrepancy may have contributed to an overestimation of the pooled diagnostic performance. Second, some results in the subgroup analyses had wide 95% CI, indicating uncertainty about the true performance of the test. Third, the included studies exhibited considerable heterogeneity, with most I² estimates exceeding 80%. However, it is important to acknowledge that heterogeneity is an unavoidable challenge in meta-analyses.²⁴ While our results indicated the absence of a threshold effect as a source of heterogeneity, some confounding factors were identified in the meta-regression, including sample size, patient preparation, measurement system, AI techniques, and SML algorithms.
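The concern about applying case-control estimates to broader populations can be made concrete with Bayes' rule. The sketch below computes positive and negative predictive values from the pooled sensitivity and specificity of 0.75; the two prevalence values (50%, as in a balanced case-control design, and 10%, as a stand-in for a screening setting) are illustrative assumptions and not figures from the included studies.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Pooled sensitivity and specificity from the meta-analysis; prevalences are assumed.
for prevalence in (0.50, 0.10):
    ppv, npv = predictive_values(0.75, 0.75, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

Under these assumptions, the PPV falls from 0.75 at 50% prevalence to 0.25 at 10% prevalence, which illustrates why accuracy estimated in hospital-based case-control samples may not transfer to screening.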

The subgroup analysis showed higher accuracy and low heterogeneity in unprepared patients, likely reflecting overestimation rather than true diagnostic gain. In such cases, the models may have capitalized on artifacts introduced by uncontrolled conditions, rather than learning disease-specific features. In the absence of external validation, several uncontrolled factors may contribute to this apparent gain in accuracy. Variability in head positioning, such as tilts or inconsistent viewing distances, could alter gaze trajectories in a systematic way that artificially enhances class separation. Similarly, the lack of refractive correction may cause blurred or distorted visual input, leading participants to exhibit atypical gaze or response patterns unrelated to the underlying disease. By contrast, strict preparation protocols are designed precisely to reduce technical noise. These procedures ensure uniform alignment, stable viewing distance, and visual clarity, thereby improving data reliability and reproducibility across participants. Although this may reduce apparent accuracy, it provides a more realistic estimate of the true diagnostic capability of AI-driven ET tools, particularly when aiming for deployment in clinical or multi-center settings. Ultimately, the discrepancy between prepared and unprepared conditions underscores the importance of external validation and standardized acquisition protocols.

The subgroup analysis by measurement system suggested that oculomotor tasks and audio-derived eye-movement estimation may represent promising approaches for AI-driven diagnosis of AD. However, performance metrics for each modality were reported in only a single study, with limited methodological detail, and therefore these findings cannot be considered conclusive. Among the remaining ET modalities, 3D systems appeared the most reliable, achieving higher sensitivity and specificity than tablet- or screen-based systems and task-specific ET, with an AUC of 0.80, indicating moderate to good discriminative ability. Nevertheless, although 3D systems and oculomotor tasks demonstrated the most effective performance, the overall evidence remains limited and inconclusive due to wide 95% CI and small study numbers, underscoring the need for further research to validate their reliability and generalizability.

Although VPC tasks show the most consistent effectiveness across studies, highlighting their potential as screening tools, heterogeneity remains high, with wide 95% CI for the DOR. In contrast, oculomotor tasks are less reliable for detecting AD due to moderate sensitivity and low specificity and DOR, whereas picture or speech-linked ET tasks, despite lower sensitivity, demonstrate higher specificity and DOR, making them more effective for confirming the absence of AD. Indeed, image description and Cookie Theft can be influenced by examiner subjectivity and individual differences, affecting consistency. In contrast, 3D VPC uses stereoscopic stimuli, engaging depth perception and spatial reasoning, and has been shown to stimulate richer eye movements and brain activity than 2D tasks.²⁵ Boujelbane et al. highlighted the VPC task as a reliable alternative to conventional paper-based cognitive assessments.²⁶ The VPC task is simple, accessible, and culturally neutral, making it suitable for diverse populations and helping to overcome barriers linked to standard cognitive testing.²⁶ Furthermore, some studies in our meta-analysis used high-frequency eye trackers (Song et al. (2024)¹² and Pereira et al. (2020)⁷), while others relied on low-frequency static cameras (Li et al. (2024)¹⁶), contributing to variability in AI tool accuracy. Raynowska et al. showed that low-frequency and low-resolution ET systems had major drawbacks compared to high-frequency devices, including frequent signal interruptions, poor capture of saccadic main sequence patterns, fewer saccade detections, and greater variability in inter-saccadic intervals, ultimately reducing data reliability.²⁷ Future research should therefore explore the potential of AI combined with the EyeLink 1000 Plus, which captures at up to 2000 Hz with exceptional precision. To date, this is considered the best ET tool.

Subgroup analysis according to data collection revealed clear differences in diagnostic performance for AD using AI-driven ET tools. Eye movement and oculomotor tasks showed moderate discriminative ability, whereas fixation maps and heatmaps proved more reliable for identifying AD cases and may be especially valuable for screening purposes. In contrast, speech, picture description, and cognitive tasks, with lower sensitivity but higher specificity, appear better suited for confirming the absence of AD rather than initial detection. These results suggest that combining different task types could enhance both detection and confirmation. However, study heterogeneity, wide 95% CI, and small sample sizes highlight the need for further comparative research.

In addition, some uncertainty in our findings may arise from studies with a high risk of bias. In our meta-analysis, two studies were judged to have a high risk of bias and one had some concerns, primarily due to non-random or selective recruitment methods and bias related to flow and timing. Thus, clear reporting and careful selection of training datasets are essential to ensure that AI-based tools for AD detection are beneficial across diverse healthcare systems. The characteristics of the training dataset, including factors like healthcare setting, cognitive profile, age, education level, gender, and race, should be documented.

Subgroup analysis by AI technique indicated that DL appeared to perform better than SML, achieving notably higher sensitivity, specificity, and DOR with no significant heterogeneity. However, this result should be interpreted with caution, taking into account differences in participant characteristics, AD diagnostic criteria, and the ET tasks and systems used across studies. Similar to our results, Ding et al. reported that DL models based on speech analysis dominate in AD detection, outperforming traditional ML approaches.²⁸ These findings could be explained by the different ways in which SML and DL process data. ML relies on manually engineered features such as fixation duration and saccade patterns.²⁹ While effective, this approach may miss subtle and complex patterns inherent in ET data.²⁹ In contrast, DL automatically extracts features from raw data such as heatmaps or numeric data, capturing intricate temporal and spatial patterns. DL is considered the most widely used diagnostic method due to its capacity to manage complex decision-making processes.³⁰ Likewise, it has been reported to surpass ML in detecting AD and dementia based on brain imaging methods.^31,32

In our meta-analysis, two studies used CNNs, one of which also employed DSAE, while another study used DSAE alone. Sparse autoencoders introduce a sparsity constraint on the hidden units during training, encouraging the model to learn more efficient representations.²⁹ Denoising autoencoders are trained to reconstruct the original input from a corrupted version, enhancing the model's robustness to noise.²⁹ Given the limited number of studies investigating DL in ET-based AD detection, there is a clear need to encourage further research in this area. However, it is important to note that the small sample size of included studies can cause overfitting, particularly in AI algorithms of greater complexity such as DL.²⁹ Overfitting occurs when the algorithm memorizes the training data and cannot adapt to new cases.²⁹ DL excels in handling large, complex datasets and capturing intricate patterns, leading to higher accuracy in predictions.²⁹ The choice between ML and DL depends on the specific requirements of the application, including data availability, computational resources, and the need for model interpretability. Figure 5 illustrates a comparison between SML and DL techniques for diagnosing AD based on ET data.
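The two ingredients of a denoising sparse autoencoder described above can be sketched in a few lines of plain Python. This is a generic illustration of the loss only (no training loop), assuming masking noise for the corruption, a sigmoid encoder, a linear decoder, and a Kullback-Leibler sparsity penalty on the mean hidden activation. It is not the architecture used in the included studies, and every name, size, and hyperparameter below is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def corrupt(x, drop_prob, rng):
    """Masking noise: set each input value to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def encode(x, weights, biases):
    """Hidden activations h = sigmoid(W x + b)."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def decode(h, weights, biases):
    """Linear reconstruction x_hat = W' h + b'."""
    return [sum(w * v for w, v in zip(row, h)) + b
            for row, b in zip(weights, biases)]


def kl_sparsity(rho, rho_hat):
    """KL divergence between the target and observed mean activation of one unit."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def dsae_loss(batch, enc_w, enc_b, dec_w, dec_b,
              rho=0.05, beta=1.0, drop_prob=0.3, seed=0):
    """Mean squared reconstruction error plus beta times the sparsity penalty."""
    rng = random.Random(seed)
    hidden, recon_error = [], 0.0
    for x in batch:
        h = encode(corrupt(x, drop_prob, rng), enc_w, enc_b)
        x_hat = decode(h, dec_w, dec_b)
        # Denoising: the reconstruction is scored against the clean input.
        recon_error += sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        hidden.append(h)
    recon_error /= len(batch)
    mean_act = [sum(h[j] for h in hidden) / len(batch) for j in range(len(enc_b))]
    penalty = sum(kl_sparsity(rho, r) for r in mean_act)
    return recon_error + beta * penalty


# Hypothetical toy setup: 6 input features, 3 hidden units, 4 samples.
rng = random.Random(1)
n_in, n_hidden = 6, 3
enc_w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
enc_b = [0.0] * n_hidden
dec_w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
dec_b = [0.0] * n_in
batch = [[rng.random() for _ in range(n_in)] for _ in range(4)]
loss = dsae_loss(batch, enc_w, enc_b, dec_w, dec_b)
```

In a real model the weights would be optimized to minimize this loss; the sketch only shows how the noise and the sparsity term enter the objective.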

Figure 5.

Comparison between SML and DL for ET-based AD diagnosis.

Our analysis revealed that among the classification algorithms, DTC, SVM, and LR demonstrated moderate to good diagnostic performance for AD detection. However, DTC and LR exhibited high heterogeneity and wide CI, indicating variable and less precise estimates. Despite variations in ET tasks, AD diagnostic criteria, and study populations, the SVM models consistently demonstrated moderate to good performance with minimal heterogeneity. This suggests their potential as a reliable tool for AD detection across diverse settings. Consistent with our findings, Kaczorowska et al. reported that LR and SVM had the best performance in classifying cognitive workload levels based on ET data, achieving accuracy rates of 0.95 ± 0.05 and 0.94 ± 0.05, respectively.³³ A systematic review of studies on classification (cognitive states, intentions, actions, or events) using ET technology found that SVM performs substantially better in ET classification than other classifiers.³⁴ SVMs are particularly effective in early AD detection when the feature space is clear and well-defined.³⁵ This effectiveness stems from SVMs’ ability to identify optimal hyperplanes that separate data into distinct classes, making them suitable for early studies in detecting AD.³⁵
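The idea of a separating hyperplane can be illustrated with a minimal linear SVM trained by subgradient descent on the regularized hinge loss. This is a teaching sketch under stated assumptions, not the implementation used in any included study: the two features, the toy data, the labels (+1 for AD, −1 for HC), and the hyperparameters are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=50):
    """Fit weights w and bias b by subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: shrink w and push it toward yi * xi.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correctly classified with enough margin: only the regularizer acts.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Classify by the side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical standardized features (e.g., fixation duration, saccade amplitude).
X = [(-2.0, -2.0), (-1.5, -1.0), (-1.0, -1.5), (1.0, 1.5), (1.5, 1.0), (2.0, 2.0)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

On this well-separated toy data the learned hyperplane classifies every training point correctly; real ET features are rarely this clean, which is one reason external validation matters.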

Clinical implication of AI-driven ET tools for AD detection

In contrast to conventional diagnostic methods, which typically rely on subjective clinical assessments or invasive neuroimaging that may not be suitable for routine screenings, AI-driven ET tools are non-invasive and moderately cost-effective, allow personalized monitoring, and provide rapid, objective measures of visual attention and cognitive function, which can be useful for early detection. Moreover, AI-driven ET technology can be implemented using widely available devices, including smartphones, making it more accessible for widespread screening. Early identification through routine screenings would allow for early intervention, which is critical in slowing the progression of AD and improving patient outcomes. Although our meta-analysis indicates that DL integrated with high-frequency and high-resolution ET tools may offer the highest accuracy and performance for AD detection, further studies with larger datasets are needed. In addition, comparative studies with other non-invasive techniques are warranted in the future to establish which tool provides the greatest benefit for patients. In fact, while ET tools offer several advantages, their use requires specialized equipment and standardized tasks, which may limit broad implementation. Additionally, results can be affected by participant factors such as fatigue, visual impairments, or poor calibration, potentially reducing accuracy. Speech analysis is also non-invasive and can be conducted remotely at low cost, making it highly accessible. It effectively captures AD at an early stage through linguistic and vocal features, though its accuracy can be influenced by language, education, and cultural factors. Blood biomarkers, as a minimally invasive approach, can identify AD at early or even preclinical stages. Yet, they require laboratory infrastructure and are more expensive, which may restrict their use in some clinical settings.

Regarding the accuracy of these tools combined with AI, a recent narrative review reported that speech analysis models achieved 69.60–97.2% accuracy.²⁸ Serum metabolomics is especially promising; a 14-metabolite panel with SML algorithms reached 100% accuracy in the discovery cohort and 97% in validation.³⁶ Combining blood metabolites with DL, XGBoost, and random forest models yielded AUCs of 0.85–0.88.³⁷ Overall, while blood biomarkers are the most accurate and biologically grounded, ET and speech analysis are more practical and scalable, particularly for initial screening.

Limitations of the study

Our study presents some limitations that could be addressed in future meta-analyses. Although our search encompassed a wide range of databases and grey literature, we identified only a limited number of studies addressing the issue. Some of our pooled estimates exhibited substantial heterogeneity, notably I² > 80%, which may limit the robustness and generalizability of our findings. Although subgroup analyses were performed to explore potential sources, residual heterogeneity persisted, warranting cautious interpretation of the results. While most metrics exhibited narrow 95% CI, indicating good precision, certain estimates in the subgroup analyses displayed extremely wide intervals, reflecting considerable statistical uncertainty. These results should be interpreted with caution, as they may lead to over-interpretation and potential over-estimation of AI performance. All included studies were hospital-based, retrospective, and involved relatively small sample sizes. Thus, results could be biased toward patients who are sicker or have more severe conditions, limiting their applicability to broader groups.

Retrospective data may have drawbacks related to data quality and availability, as well as recall bias. Small sample sizes might decrease the reliability of model performance estimates. Additionally, all the studies included in our analysis relied on internal validation, which may limit the generalizability of the results. Moreover, the absence of data on confounding factors such as education level, age, and comorbidities may impact the robustness of the results. Variability in eye movement patterns related to age, comorbidities, and cultural differences should also be considered.

Conclusion

Based on this exploratory meta-analysis of hospital-based case-control studies, we cautiously conclude that AI-driven ET tools appear to have moderate to good accuracy in detecting AD. DL appears to outperform traditional SML approaches. Within SML models, regression methods tend to perform better than classification methods, with SVM appearing to be the most effective classification algorithm. Factors such as patient preparation, measurement methods, task type, and data collection procedures likely influence model performance, highlighting the need to consider these elements when interpreting diagnostic accuracy. Therefore, future studies should adopt standardized protocols, include participants from the general population, and conduct external validation. Transparent reporting is also essential to enhance the accuracy and reliability of AI-based ET tools for AD detection.

Supplemental Material

Supplemental material, sj-docx-1-alz-10.1177_13872877251389145 for Artificial intelligence-driven eye tracker models for Alzheimer's disease diagnosis: A systematic review and meta-analysis by Imen Ketata and Emna Ellouz in Journal of Alzheimer's Disease

Footnotes

Acknowledgements

The authors have no acknowledgments to report.

ORCID iDs

Imen Ketata

Emna Ellouz

Ethical considerations

The protocol was registered in PROSPERO (CRD420251020284).

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Author contribution(s)

Imen Ketata: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Software; Writing – original draft.

Emna Ellouz: Conceptualization; Data curation; Methodology; Supervision; Validation; Writing – review & editing.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Supplemental material

Supplemental material for this article is available online.

References

1. Chavan, Patil, Patel, et al. A comprehensive review on Alzheimer’s disease its pathogenesis, epidermiology, diagnostics and treatment. Int J Res Appl Sci Biotechnol 2023; 2: 66–67.

2. Zhou, Dong, et al. Noninvasive automatic detection of Alzheimer’s disease from spontaneous speech: a review. Front Aging Neurosci 2023; 15: 1224723.

3. Dokholyan, Mohs, Bateman. Challenges and progress in research, diagnostics, and therapeutics in Alzheimer’s disease and related dementias. Alzheimers Dement (N Y) 2022; 8: e12330.

4. Nestor, Scheltens, Hodges. Advances in the early detection of Alzheimer’s disease. Nat Med 2004; 10(Suppl): S34–S41.

5. Chimthanawala NMA, Haria, Sathaye. Non-invasive biomarkers for early detection of Alzheimer’s disease: a new-age perspective. Mol Neurobiol 2024; 61: 212–223.

6. Zhang, Wei, et al. Early candidate urine biomarkers for detecting Alzheimer’s disease before amyloid-β plaque deposition in an APP (swe)/PSEN1dE9 transgenic mouse model. J Alzheimers Dis 2018; 66: 613–637.

7. Pereira MLGF, Camargo MVZA, Bellan AFR, et al. Visual search efficiency in mild cognitive impairment and Alzheimer’s disease: an eye movement study. J Alzheimers Dis 2020; 75: 261–262.

8. Lage, López-García, Bejanin, et al. Distinctive oculomotor behaviors in Alzheimer’s disease and frontotemporal dementia. Front Aging Neurosci 2021; 12: 603790.

9. Field, Newton-Mason, Shajan, et al. Machine learning analysis of speech and eye tracking data to distinguish Alzheimer’s clinic patients from healthy controls: biomarkers: searching for Alzheimer’s through the eyes. Alzheimers Dement 2020; 16(Suppl 5): e046742.

10. Biondi, Fernandez, Castro, et al. Eye movement behavior identification for Alzheimer’s disease diagnosis. J Integr Neurosci 2018; 17: 349–354.

11. Lin, Huang, et al. Early detection of Alzheimer’s disease through eye movement analysis: a digital diagnostic approach. In: 2024 IEEE international workshop on electromagnetics: applications and student innovation competition, iWEM 2024 – Taoyuan, Taiwan. Piscataway, NJ: Institute of Electrical and Electronics Engineers Inc., 2024.

12. Song, Huang, Liu, et al. Diagnostic potential of eye movements in Alzheimer’s disease via a multiclass machine learning model. Cogn Comput 2024; 16: 3364–3378.

13. Heidarzadeh, Ratté. ‘Eye-tracking’ with words for Alzheimer’s disease detection: time alignment of words enunciation with image regions during image description tasks. J Alzheimers Dis 2023; 95: 855–868.

14. Zuo, Jing, Sun, et al. Deep learning-based eye-tracking analysis for diagnosis of Alzheimer’s disease using 3D comprehensive visual stimuli. IEEE J Biomed Health Inform 2024; 28: 2781–2793.

15. Sun, Liu, et al. A novel deep learning approach for diagnosing Alzheimer’s disease based on eye-tracking data. Front Hum Neurosci 2022; 16: 972773.

16. Yan, et al. Construction of a prediction model for Alzheimer’s disease using an AI-driven eye-tracking task on mobile devices. Aging Clin Exp Res 2024; 37: 9.

17. Liu, Yang, et al. The effectiveness of eye tracking in the diagnosis of cognitive disorders: a systematic review and meta-analysis. PLoS One 2021; 16: e0254059.

18. Norouzi, Kafieh, Chazot, et al. Insights from the eyes: a systematic review and meta-analysis of the intersection between eye-tracking and artificial intelligence in dementia. Aging Ment Health 2025; 29: 1367–1375.

19. Monaghan, Rahman, Agudelo, et al. Foundational statistical principles in medical research: sensitivity, specificity, positive predictive value, and negative predictive value. Medicina (Kaunas) 2021; 57: 503.

20. Balduzzi, Rücker, Schwarzer. How to perform a meta-analysis with R: a practical tutorial. Evid Based Ment Heal 2019; 22: 153–160.

21. Deeks, Higgins JPT, Altman DG. Analysing data and undertaking meta-analyses. In: Higgins JPT, Thomas J, Chandler J, et al. (eds) Cochrane handbook for systematic reviews of interventions. 2nd ed. Chichester, UK: John Wiley & Sons, 2019, pp.241–284.

22. Yin, Wang, Liu, et al. Internet of things for diagnosis of Alzheimer’s disease: a multimodal machine learning approach based on eye movement features. IEEE Internet of Things J 2023; 10: 11476–11485.

23. Liu, Zhang, Wang, et al. Depth-induced saliency comparison network for diagnosis of Alzheimer’s disease via jointly analysis of visual stimuli and eye movements. arXiv 2024. DOI: 10.48550/arXiv.2403.10124 [Preprint]. Submitted 15 Mar 2024.

24. Stogiannis, Siannis, Androulakis. Heterogeneity in meta-analysis: a comprehensive overview. Int J Biostat 2023; 20: 169–199.

25. Chelnokova, Laeng. Three-dimensional information in face recognition: an eye-tracking study. J Vis 2011; 11: 27.

26. Boujelbane, Trabelsi, Salem, et al. Eye tracking during visual paired comparison tasks: a systematic review and meta-analysis of the diagnostic test accuracy for detecting cognitive decline. J Alzheimers Dis 2024; 99: 207–221.

27. Raynowska, Rizzo, Rucker, et al. Validity of low-resolution eye-tracking to assess eye movements during a rapid number naming task: performance of the EyeTribe eye tracker. Brain Inj 2018; 32: 200–208.

28. Ding, Chetty, Noori, et al. Speech based detection of Alzheimer’s disease: a survey of AI techniques, datasets and challenges. Artif Intell Rev 2024; 57: 325.

29. Quek, Heikkonen, Lau. Use of artificial intelligence techniques for detection of mild cognitive impairment: a systematic scoping review. J Clin Nurs 2023; 32: 5752–5762.

30. Kaur, Singla, Nkenyereye, et al. Medical diagnostic systems using artificial intelligence (AI) algorithms: principles and perspectives. IEEE Access 2020; 8: 228049–228069.

31. Arya, Verma, Chakarabarti, et al. A systematic review on machine learning and deep learning techniques in the effective diagnosis of Alzheimer’s disease. Brain Inform 2023; 10: 17.

32. Tsironis. Regression and classification. In: Tsironis (ed.) Artificial intelligence and complex dynamical systems. Understanding complex systems. Cham: Springer, 2025, pp.21–38.

33. Kaczorowska, Plechawska-Wójcik, et al. Interpretable machine learning models for three-way classification of cognitive workload levels for eye-tracking features. Brain Sci 2021; 11: 210.

34. Lim, Mountstephens, Teo. Eye-Tracking feature extraction for biometric machine learning. Front Neurorobot 2022; 15: 796895.

35. Jahan, Abu Taher, Kaiser, et al. Explainable AI-based Alzheimer’s prediction and management using multimodal data. PLoS One 2023; 18: e0294253.

36. Zhao, Villasante-Tezanos, Miranda-Morales, et al. Discovery of novel metabolic biomarkers in blood serum for diagnosis of Alzheimer’s disease. J Alzheimers Dis 2024; 102: 237–253.

37. Stamate, Kim, Proitsi, et al. A metabolite-based machine learning approach to diagnose Alzheimer-type dementia in blood: results from the European medical information framework for Alzheimer disease biomarker discovery cohort. Alzheimers Dement 2019; 5: 933–938.
