Abstract
Atypical gaze patterns are consistently reported in autism, reflecting differences in social attention and interest. Gaze-tracking paradigms provide an objective way to quantify these differences and may serve as early indicators of autism. This diagnostic test accuracy systematic review and meta-analysis evaluated the performance of eye-tracking-based gaze measures in children. Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy (PRISMA-DTA) guidance, studies published between 2015 and 2025 that compared gaze-tracking paradigms with standardized autism diagnoses were synthesized. Pooled diagnostic odds ratio (DOR), sensitivity, and specificity were estimated using random-effects and hierarchical summary receiver operating characteristic models. Risk of bias was assessed with QUADAS-2 and funnel plots. Seventeen studies (n = 4,256) from six countries met the inclusion criteria. Tasks included social-geometric preference, motherese-nonsocial speech, and visual-orienting paradigms analyzed with rule-based or machine-learning methods. The pooled area under the hierarchical summary receiver operating characteristic curve (HSROC AUC) was 0.845; DOR 15.03 (95% CI 8.00–28.50); sensitivity 0.77 (95% CI 0.65–0.85); and specificity 0.80 (95% CI 0.75–0.84). Although heterogeneity was high (I² = 87.78%), effect directions were consistent. Dynamic social stimuli and higher-frequency tracking systems achieved the best performance. Gaze-tracking tests distinguished autistic and nonautistic children across diverse settings, supporting their potential role as a quantitative, observer-independent adjunct for early identification and clinical decision support.
Lay abstract
Autism is a form of neurodiversity characterized by differences in social communication, sensory processing, and patterns of attention and interest, which often shape how autistic people look at and interpret the world around them. Eye-tracking technology records where a person looks on a screen and how long their gaze remains on elements such as people, faces, or objects. Because it is objective and does not rely on language or complex instructions, eye-tracking may support earlier identification of autism. This study reviewed 17 research papers published between 2015 and 2025 that explored how well eye-tracking distinguishes autistic from nonautistic children. Together, these studies included over 4,000 participants and compared attention to social scenes, like people talking or playing, with attention to nonsocial or geometric patterns. On average, eye-tracking correctly identified about 77% of autistic children and about 80% of nonautistic children, with the best results achieved with dynamic social videos and high-quality tracking cameras. These findings suggest that gaze-based measures capture meaningful differences in social attention and could complement existing diagnostic approaches through earlier, more objective assessment.
Introduction
Autism is a neurodevelopmental condition, defined by persistent difficulties in social communication and interaction across multiple contexts, alongside restricted, repetitive patterns of behaviors, interests, or activities, and atypical sensory responsivity. These are present from early developmental stages, although functional impairment may only become evident if social demands surpass adaptive capacity (American Psychiatric Association, 2022).
Clinical presentation is highly heterogeneous, complicating early recognition and delaying diagnosis (Masi et al., 2017). Reliable identification is possible from 18 to 24 months (Dawson et al., 2023), yet most children are diagnosed near age 4, with wide cross-national and sociodemographic variation (Fombonne et al., 2016; Maenner et al., 2023). Such delays are associated with reduced access to early intervention and poorer developmental outcomes (Brian et al., 2019; Lord et al., 2018).
From a public health perspective, the global prevalence of autism is estimated at around 0.7% to 1% of children, though estimates vary by region and methodological approach (Baxter et al., 2015; Zeidan et al., 2022).
Efforts to reduce diagnostic delay have driven extensive biomarker research. A systematic review identified over 900 candidate biomarkers across molecular, neurophysiological, and behavioral domains; however, none have yet demonstrated adequate clinical validity for inclusion in diagnostic algorithms (Parellada et al., 2023; Zhuang et al., 2024). Among these, gaze behavior has emerged as a promising candidate because it is noninvasive and directly linked to social attention, a hallmark of autism.
Atypical visual attention is among the earliest and most consistent features of autism. Studies show reduced fixation on eyes and faces, diminished preference for biological motion, and altered attention to social scenes, alongside heightened interest in geometric or repetitive stimuli. These gaze differences emerge within the first year of life and can predict later social and language outcomes, supporting their potential as early behavioral markers of neurodevelopmental divergence (Elsabbagh et al., 2013, 2014; Tönsing et al., 2025).
Recent studies employing dual eye-tracking during live interactions have revealed atypical behavior in autism, characterized by less frequent initiation and greater avoidance of eye contact (Tönsing et al., 2025). Large-scale consortia have reported good diagnostic performance of gaze-based metrics for clinical trial applications, due to the feasibility of valid data acquisition, verification of construct performance, and stability over 6 weeks (Shic et al., 2022). Gaze-based measures offer practical advantages over molecular or neuroimaging biomarkers: eye tracking is noninvasive, relatively brief, and well tolerated by young children (Falck-Ytter et al., 2013).
Technological advances have increased the translational potential of gaze-tracking. Traditional research-grade systems provide high spatial and temporal precision but are costly and limited to laboratory settings (Falck-Ytter et al., 2013). Gaze-tracking paradigms record visual attention to structured social and nonsocial stimuli, such as faces, scenes, or biological motion, and analyze fixation duration, gaze preference, or scanpath patterns (Papagiannopoulou et al., 2014). Derived metrics, whether rule-based or machine-learned, capture atypical social attention patterns characteristic of autism (Chita-Tegmark, 2016).
This line of research is particularly valuable given that gaze-tracking is objective, noninvasive, scalable (requiring minimal operator training), and potentially automatable (Klin et al., 2015). In contrast to conventional screening tools that rely on caregiver reports or clinical observation, gaze-tracking provides objective, quantifiable indices of social attention, thereby minimizing subjectivity and cultural bias. Understanding the classification accuracy of these tasks across contexts, technologies, and developmental stages is essential before integration into primary care workflows or utilization as a triage instrument.
Despite its promise, the literature remains fragmented, with studies differing substantially in paradigm design, measurement approaches, and analytic strategies. Reported diagnostic performance varies considerably in primary (author-selected) classification approaches, and the relative merits of distinct paradigms (e.g., faces vs. geometric shapes, static vs. dynamic stimuli) remain unclear. This heterogeneity underscores the need for a systematic synthesis of diagnostic test accuracy evidence. The present review aims to evaluate the current available evidence on gaze-tracking paradigms for classifying autism, to clarify screening and diagnostic performance across tasks, devices, and settings.
This diagnostic test accuracy systematic review and meta-analysis followed PRISMA-DTA to address the review question: among individuals aged 12 months to 18 years, how accurately do gaze-tracking paradigms classify autism compared with validated diagnostic standards? The primary objective was to synthesize diagnostic performance across tasks and settings by summarizing paradigm characteristics and estimating pooled sensitivity and specificity using hierarchical models, while assessing risk of bias (ROB) with QUADAS-2.
Methodology
The protocol was approved by the Research, Ethics, and Biosafety Committees of our institution prior to study initiation. External registration (e.g., PROSPERO) was not pursued.
Eligible studies included children and adolescents (12 months to 18 years) who completed gaze-tracking paradigms and were evaluated for diagnostic classification of autism against a validated reference standard. Accordingly, all included studies enrolled both reference-standard positive (autism) and reference-standard negative (nonautistic) participants, permitting construction of 2 × 2 classification outcomes (Table 1). Index tests comprised gaze-tracking measures derived from either handcrafted metrics (e.g., fixation duration, gaze preference, scanpath measures) or algorithmic classifiers (e.g., machine-learning models). Reference standards included validated diagnostic assessments (e.g., Autism Diagnostic Observation Schedule–2nd Edition [ADOS-2], Autism Diagnostic Interview–Revised [ADI-R]) and clinical consensus diagnosis based on Diagnostic and Statistical Manual of Mental Disorders (5th edition; DSM-5) criteria. Studies were required to report diagnostic accuracy outcomes (e.g., sensitivity/specificity, area under the curve [AUC], likelihood ratios) or provide sufficient data to derive these metrics. We included peer-reviewed English-language articles published 2015–2025 and excluded case reports, reviews, editorials, conference abstracts, and other nonpeer-reviewed records. We restricted the search to studies published from January 2015 onward to align with the DSM-5 diagnostic criteria (American Psychiatric Association, 2013), the updated generation of autism diagnostic tests, and current hardware and analytical approaches, supporting a synthesis oriented toward contemporary and future clinical translation.
Study-level characteristics and primary diagnostic-accuracy metrics of the 17 included gaze-tracking studies for autism in children.
Note. n: number; TD: typical development; y: years; mo: months.
Sensitivity and specificity inferred from Figure 4 at Youden’s J point.
A comprehensive search was conducted across the PubMed (MEDLINE), Scopus, Web of Science, and APA PsycINFO databases. The strategy combined controlled vocabulary and free-text terms related to autism, child populations, gaze or eye tracking, diagnostic standards, and accuracy metrics. Search filters were restricted to English-language studies published since January 2015. The complete list of search queries is provided in Supplement 1.
All retrieved records were imported into ASReview (version 2.1.1), an open-source machine-learning framework for systematic reviews (Van De Schoot et al., 2021). The software was used as a collaborative screening interface and to randomize the initial screening order. After duplicate entries were removed, all records were screened manually through a sequential review of titles and abstracts. Records that passed this stage then underwent full-text eligibility assessment. Discrepancies between reviewers were resolved through discussion and consensus.
Data extraction was independently conducted by three reviewers using a standardized form, with verification by a fourth investigator to ensure accuracy and consistency. Extracted variables included study characteristics (design, sample size, age range, and setting), index test details (task type, stimulus, device model, sampling rate, calibration method, and analysis software), algorithmic approach (rule-based or machine-learning), and diagnostic performance measures (AUC, sensitivity, specificity, predictive values, and likelihood ratios). When studies reported multiple thresholds or subgroup-specific results (e.g., by sex or stimulus type), the metric representing the study’s primary or overall classification performance, typically the one designated or optimized by the authors, was selected. Each study was therefore treated as a single data point reflecting its most representative diagnostic accuracy estimate.
ROB and applicability were independently evaluated by two reviewers using the QUADAS-2 tool (Whiting et al., 2011), adapted to the context of gaze-tracking tasks for autism screening. Each study was assessed across the four domains of the tool: patient selection, index test, reference standard, and flow and timing. Discrepancies were resolved by consensus or, when necessary, through arbitration by a third reviewer. To minimize potential bias, the two reviewers responsible for this assessment did not participate in the data extraction process. Visualization of QUADAS-2 results was performed using the RobVis tool (McGuinness & Higgins, 2021).
Statistical Analysis
All data processing and descriptive analyses were conducted in Python 3.13 using pandas (v2.3.3) (The Pandas Development Team, 2025) and SciPy (v1.16.1) (Virtanen et al., 2020). When reconstructing the confusion matrices, the modified Haldane-Anscombe correction was applied by adding 0.5 to matrices that contained zero values, following the recommendation of Weber et al. (2020). For the random-effects meta-analysis of log diagnostic odds ratios (logDOR), study-level standard errors were derived from the reconstructed 2 × 2 tables and inverse-variance weights were applied (weights shown in Figure 4).
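The reconstruction step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, the example values are invented, and it assumes that the correction adds 0.5 to every cell of a table only when that table contains a zero.

```python
import math

def reconstruct_2x2(sens, spec, n_autistic, n_nonautistic):
    """Rebuild TP, FN, FP, TN counts from reported accuracy and group sizes."""
    tp = round(sens * n_autistic)
    fn = n_autistic - tp
    tn = round(spec * n_nonautistic)
    fp = n_nonautistic - tn
    return tp, fn, fp, tn

def log_dor(tp, fn, fp, tn):
    """Return (logDOR, standard error, inverse-variance weight) for one study."""
    cells = [tp, fn, fp, tn]
    # Assumed reading of the correction: add 0.5 to all four cells,
    # but only for tables that contain at least one zero.
    if any(c == 0 for c in cells):
        cells = [c + 0.5 for c in cells]
    tp, fn, fp, tn = cells
    estimate = math.log((tp * tn) / (fp * fn))
    se = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    return estimate, se, 1 / se ** 2

# Hypothetical study: sensitivity 0.80, specificity 0.90, 50 children per group
table = reconstruct_2x2(0.80, 0.90, 50, 50)
estimate, se, weight = log_dor(*table)
print(table, round(estimate, 3), round(se, 3), round(weight, 3))
```

Rounding to whole counts is itself an approximation when a study reports only percentages, so reconstructed tables may differ by a participant or two from the original data.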
Heterogeneity statistics, forest plots, and summary statistics were produced in JASP (version 0.95.3). The hierarchical summary receiver operating characteristic (HSROC) model was estimated in R (version 4.4.1) using the mada package (version 0.5.12) (Doebler, 2012), which employs the Reitsma bivariate random-effects framework to model sensitivity and specificity jointly. Between-study variability was summarized using τ and I² statistics, and 95% confidence and prediction intervals were calculated to describe uncertainty around pooled estimates.
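To make the pooling step concrete, the sketch below implements a random-effects meta-analysis with the DerSimonian-Laird estimator. The text does not state which between-study variance estimator was used in JASP, so DerSimonian-Laird is swapped in here purely for illustration; the function name and input values are hypothetical. An I² computed this way from Q may differ from a value obtained with another estimator (for example, restricted maximum likelihood).

```python
import math

def dersimonian_laird(estimates, std_errors):
    """Pool study-level effects (e.g., logDORs) under a random-effects model."""
    k = len(estimates)
    w = [1 / se ** 2 for se in std_errors]          # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                   # between-study variance
    w_star = [1 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    pooled_se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "se": pooled_se,
        "ci": (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se),
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }

# Hypothetical logDORs and standard errors for four studies
result = dersimonian_laird([1.9, 2.4, 3.6, 2.8], [0.40, 0.35, 0.60, 0.30])
print({key: value for key, value in result.items() if key != "ci"})
```

The weights 1 / (SE² + τ²) flatten as τ² grows, which is why large studies dominate a random-effects forest plot less than they would a fixed-effect one.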
Results
A total of 146 records were identified through database searches. After removal of duplicates, 105 papers were screened, resulting in 21 studies selected for full-text assessment (84/105 [80%] were excluded at title and abstract screening). Following a detailed assessment, four studies were excluded because they neither reported diagnostic accuracy metrics (sensitivity, specificity, or AUC) nor provided enough data to derive them. Consequently, 17 peer-reviewed studies were included in the final qualitative and quantitative synthesis (Figure 1). We did not code mutually exclusive exclusion reasons at title/abstract screening; reasons are therefore summarized qualitatively in Figure 1.

Study Selection. PRISMA 2020 flow diagram summarizing study selection (Page et al., 2021).
The combined sample across included studies comprised 2,083 autistic children and 2,173 nonautistic controls, yielding a total of 4,256 participants and a near-balanced case:control ratio of 0.96:1, across six countries (United States, China, Qatar, Peru, Australia, and France). Participant age varied substantially across cohorts: most studies enrolled toddlers or preschoolers (mean age < 6 years), whereas Frazier et al. (2016, 2018) and Al-Shaban et al. (2023) recruited broader age ranges extending into school age and early adolescence. Most studies were published between 2021 and 2024.
Included index tests spanned recurrent gaze-tracking task families, including social-versus-nonsocial preferential-looking paradigms (including GeoPref-type designs), motherese versus nonsocial speech preference, social-scene allocation/orienting tasks, and broader or multimodal batteries (e.g., gap-overlap, pupillary responses, EEG-integrated approaches). Across paradigms, gaze-derived measures included fixation proportion/dwell-time preference indices, area-of-interest transitions, saccade features, disengagement latency, and related composite metrics. Most studies used commercial remote eye trackers (predominantly Tobii, SMI/RED, or EyeLink systems; typically 60–300 Hz) with standardized five- or nine-point calibration, although lower-cost/community-oriented implementations have also been explored; Jensen et al. (2021) was the only study without explicit child-by-child calibration. Analytic strategies ranged from rule-based thresholds and ROC-derived cutoffs (including thresholds selected to maximize overall accuracy or minimize false positives) to multivariable and machine-learning classifiers (e.g., logistic regression, random forests, CNN-based scanpath/saliency models), with some studies using hold-out or cross-validation and multimodal pipelines integrating EEG or broader clinical measures (e.g., ADOS-2; Modified Checklist for Autism in Toddlers, Revised [M-CHAT-R]; Vineland Adaptive Behavior Scales [VABS]). Detailed task characteristics, dependent-variable operationalization, hardware specifications, calibration procedures, and analytic features are provided in Supplementary Tables 2 and 4.
Table 1 presents the study-level characteristics and primary diagnostic-accuracy metrics for the 17 included studies, including location, design, sample sizes (autistic/nonautistic), age range, device/task summary, reference standard, and group details. Primary (author-selected) diagnostic performance varied widely across studies, with sensitivity ranging from 0.33 to 0.96, specificity from 0.63 to 0.95, and AUC from 0.73 to 0.93.
ROB and Applicability
As shown in Figure 2 and summarized in Figure 3, across the 17 included studies, the ROB was most frequently rated as high in the domains of patient selection and index test, reflecting the widespread use of two-gate case-control designs and nonblinded interpretation of index test results. Fewer than half of the studies employed single-cohort or population-based sampling strategies (Frazier et al., 2016, 2018; Jones et al., 2023; Keehn et al., 2024; Moore et al., 2018; Pierce et al., 2023; Wang et al., 2024; Wen et al., 2022).

Risk of bias visualization. Judgments across QUADAS-2 domains: D1, patient selection; D2, index test; D3, reference standard; D4, flow and timing. Green (+) = low, yellow (−) = some concerns, red (x) = high.

Risk of bias summary. Summary of risk of bias across included studies according to QUADAS-2 domains. Green (first) indicates low risk, yellow (second) some concerns, and red (third) high risk.
Flow and timing was the second most common source of high ROB. Seven of the 17 studies were rated at high ROB in this domain, primarily due to differential verification or nonuniform application of the reference standard, most often in two-gate case-control designs or where diagnostic status was established outside the study procedures (Al-Shaban et al., 2023; Cilia et al., 2021; de Belen et al., 2024; He et al., 2021; Meng et al., 2023; Sun et al., 2023; Wen et al., 2022).
Three cohort studies were rated as having some concerns in this domain, due to incomplete reporting of participant exclusions and/or test sequencing, including pooled sequential recruitment periods (Frazier et al., 2016), unclear exclusions (Pierce et al., 2023), and limited reporting on the interval and ordering of index and reference assessments (Wang et al., 2024). The remaining 7/17 studies were judged at low risk for this domain.
The study by Wen et al. (2022) was assigned a high ROB in the flow and timing domain because 444 participants in its total sample came from previous studies rather than from its population-based sampling (Pierce et al., 2011, 2016); moreover, a substantial portion came from a 2011 study that predated the publication of the ADOS-2 (Lord et al., 2012).
Overall applicability concerns were low, as most paradigms directly addressed early detection of autism through eye-tracking–based visual engagement tasks, using clinically confirmed diagnoses (ADOS-2, DSM-5) as reference standards.
Meta-Analysis
Figure 4 presents the results of the random-effects meta-analysis, yielding a pooled log diagnostic odds ratio (logDOR) of 2.71 (95% CI: 2.08–3.35, p < 0.001), corresponding to a diagnostic odds ratio (DOR) of 15.03 (95% CI 8.00–28.50). Thus, the odds of a positive gaze-tracking result were, on average, about 15 times higher among autistic participants than among nonautistic controls, indicating strong discriminative ability across paradigms.
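The back-transformation from the log scale can be checked directly. The short sketch below exponentiates the reported pooled logDOR and its confidence limits; the helper name is hypothetical, and the inputs are the values quoted in the text.

```python
import math

def back_transform(log_estimate, log_ci_low, log_ci_high):
    """Convert a log-scale estimate and its CI limits to the ratio scale."""
    return tuple(math.exp(v) for v in (log_estimate, log_ci_low, log_ci_high))

# The DOR is the odds of a positive test among autistic participants,
# TP / FN, divided by the odds among nonautistic participants, FP / TN.
dor, dor_low, dor_high = back_transform(2.71, 2.08, 3.35)
print(f"DOR = {dor:.2f} (95% CI {dor_low:.2f}-{dor_high:.2f})")
# prints: DOR = 15.03 (95% CI 8.00-28.50)
```

Because the interval is symmetric on the log scale, it becomes asymmetric around 15.03 once exponentiated.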

Forest plot of the meta-analysis of gaze-tracking tasks for autism identification, grouped by overall risk of bias (high, orange; low, blue).
Studies were grouped according to their ROB assessment. Despite wider confidence intervals in the higher ROB subgroup, the pooled subgroup logDORs were broadly similar, suggesting limited differences in summary accuracy by ROB in this analysis.
Heterogeneity was high, Q(16) = 147.05, p < 0.001; I² = 87.78%, reflecting variability in sample size, stimulus type (social vs. geometric; dynamic vs. static), and analytic strategies (rule-based vs. machine learning). Despite this variability, all estimated effects pointed in the same direction.
Figure 5 displays the funnel plot evaluating publication bias and small-study effects. The distribution was largely symmetrical around the pooled effect, suggesting no substantial asymmetry or selective reporting. One study (Sun et al., 2024) appeared as a minor outlier, consistent with its visual deviation in the forest plot and its comparatively small sample size and unclear sampling strategy (Figure 4).

Residual funnel plot.
A hierarchical summary receiver operating characteristic (HSROC, Figure 6) meta-analysis was conducted using the Reitsma bivariate random-effects model. As seen in Figure 6, the pooled sensitivity was 0.77 (95% CI 0.65–0.85) and the pooled specificity was 0.80 (95% CI 0.75–0.84), indicating balanced diagnostic accuracy across studies. The summary AUC was 0.845, and the normalized partial AUC was 0.717, reflecting good overall discrimination between autistic children and typically developing children. Between-study variability was substantial (τ = 1.116 for sensitivity and 0.492 for specificity), suggesting methodological and paradigm-related heterogeneity. The positive correlation between sensitivity and false-positive rate (ρ = 0.519) denotes threshold variability among studies.
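For readers who prefer likelihood ratios, the pooled summary point can be converted as in the sketch below. The function name is hypothetical, and the calculation treats the pooled sensitivity and specificity as a single fixed operating point, which ignores their joint uncertainty.

```python
def summary_point_metrics(sensitivity, specificity):
    """Likelihood ratios and DOR implied by one sensitivity/specificity pair."""
    lr_pos = sensitivity / (1 - specificity)      # how much a positive raises the odds
    lr_neg = (1 - sensitivity) / specificity      # how much a negative lowers the odds
    return {"lr_pos": lr_pos, "lr_neg": lr_neg, "dor": lr_pos / lr_neg}

# Pooled summary point reported in the text
metrics = summary_point_metrics(0.77, 0.80)
print({name: round(value, 2) for name, value in metrics.items()})
```

The DOR implied by this summary point (roughly 13.4) need not equal the pooled DOR of 15.03 from the univariate random-effects analysis, because the two models pool different quantities.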

Hierarchical summary receiver operating characteristic (HSROC) curve.
Finally, a sensitivity analysis (Supplement 3) excluding Wen et al. (2022), which accounted for approximately 43% of the total cohort and was identified as having potential risks of bias, yielded a pooled logDOR of 2.83 (95% CI: 2.22–3.44; p < 0.001), corresponding to a DOR of 16.95. This represents a minor increase in the DOR relative to the primary analysis, with largely overlapping confidence intervals. The pooled sensitivity (0.79) and specificity (0.80) also remained highly consistent, supporting the robustness of the primary meta-analysis finding.
Among the 17 included studies, only two (11.8%) reported preregistration or trial registration (Jones et al., 2023, NCT03469986; Kou et al., 2019, NCT03286621). Only Jones et al. (2023) reported a prespecified diagnostic decision threshold, corresponding to 5.9% of the included studies.
Discussion
Gaze behavior has emerged as a promising and objective biomarker for autism. Over the past two decades, advances in eye-tracking technology have enabled precise, real-time measurement of gaze direction, pupillary responses, saccades, and scanpath dynamics. When combined with carefully designed visual or auditory paradigms, these technologies generate quantitative indices of social attention and information processing that can assist clinicians in screening and diagnostic assessment (Chita-Tegmark, 2016; Papagiannopoulou et al., 2014).
This diagnostic test accuracy review included a large and culturally diverse sample (n = 4,256), encompassing studies from North and South America, Europe, Oceania, and Asia, and found that gaze-tracking paradigms discriminated autistic from nonautistic children across diverse tasks, devices, and settings. Despite substantial methodological heterogeneity, the hierarchical summary ROC indicated strong overall performance (pooled HSROC AUC = 0.845), supporting gaze behavior as a promising objective marker for autism. Consistency of findings across cohorts suggests that gaze-based indices capture attentional features relevant to autism across multiple languages and cultural contexts, although the available evidence remains uneven across regions and settings.
Nine of the 17 included studies were rated as high ROB. The most frequent source was study design, particularly two-gate case-control sampling, which can inflate apparent diagnostic performance through spectrum and selection effects (Reitsma et al., 2023; Whiting et al., 2011). Additional concerns included incomplete or unclear blinding, partial or delayed application of reference standards, and insufficient reporting of index–reference timing, with risk of differential verification bias (Reitsma et al., 2023; Whiting et al., 2011).
Because only two studies reported preregistration and only one used a prespecified threshold, the large analytic degrees of freedom typical of eye-tracking pipelines may have contributed to optimistic accuracy estimates through post hoc operating-point selection.
These limitations do not negate the observed signal, but they reduce direct transportability of reported accuracy to routine clinical pathways; accordingly, pooled performance should be interpreted with greater weight on single-gate cohorts with uniform reference-standard application and prespecified thresholds (Reitsma et al., 2023).
Wen et al. (2022) reported a prospective single-gate cohort of 1,863 toddlers (12–48 months) identified through universal primary-care screening and assessed with the GeoPref paradigm. Using the final model proposed by the authors, classification performance was AUC = 0.76, specificity = 96%, sensitivity = 33%, and overall accuracy = 71%. The authors selected a threshold that prioritized a low rate of false positives, thus explaining the low sensitivity in light of the achieved AUC. Approximately 444 participants (~24%) were drawn from prior datasets by the same group, which may increase selection bias and raises potential duplication concerns for pooled analyses. A sensitivity analysis excluding this study confirmed consistent pooled results (Supplement 3).
Beyond ROB, heterogeneity reflected genuine variation in index-test architecture, so gaze-tracking should be interpreted as a measurement modality applied to multiple partially distinct constructs rather than a single ‘social attention’ test. Paradigms varied by stimulus domain and format (static vs. dynamic), feature construction (single handcrafted metrics vs. composite or algorithmic classifiers), device and data-quality constraints, and threshold strategy. Despite this diversity, index tests clustered into recurring families: paired-stream preferential-looking competition (including social vs. geometric preference and biological-motion contrasts), gaze allocation within socially informative scenes (ROI dwell time, transitions, anticipatory looking), speech-directed attention (percent fixation preference for motherese vs. nonsocial controls), nonsocial interest capture, and domain-general oculomotor or arousal measures (for example disengagement latency, saccade dynamics, pupil reactivity) used alone or in multifeature batteries. Eligibility was defined by diagnostic test accuracy design with extractable 2 × 2 outcomes rather than restricting paradigms a priori; paradigm-level tasks and dependent-variable operationalizations are summarized in Supplementary Table 4 (see also Table 1).
Across these architectures, reported accuracy was sensitive to operating-point selection. GeoPref-style tasks prioritize specificity over sensitivity in unselected cohorts, with sensitivity recovering when saccade features are added (Wen et al., 2022). Motherese paradigms can be parameterized similarly by applying low fixation thresholds, producing very high specificity at the expense of sensitivity (Pierce et al., 2023). In elevated-likelihood clinical cohorts, multifactorial indices that integrate dwell time, switching, and vacancy patterns can recover sensitivity without substantial loss of specificity (Wang et al., 2024). Machine-learning approaches based on scanpaths or saliency-derived features often perform well within-study, but frequent reliance on enriched sampling and internal validation motivates external validation in single-gate cohorts to establish generalizability (Cilia et al., 2021; de Belen et al., 2024).
One approach that appears closest to clinical viability is the use of gaze-based paradigms as adjuncts to established screening workflows rather than as stand-alone diagnostics. The clearest translational example is the combination of gaze preference with M-CHAT-R in community settings, which supports an incremental-value model in which questionnaire-based pretest risk is refined with an objective behavioral signal (Jensen et al., 2021). This positioning is more realistic for routine care than replacing validated screeners and is consistent with threshold strategies that prioritize specificity in constrained clinical systems (Moore et al., 2018; Pierce et al., 2023; Wen et al., 2022).
A key interpretive limitation is incomplete control for developmental and language confounding. Across studies, IQ/DQ and language were inconsistently measured and were typically reported as cohort descriptors rather than incorporated into the diagnostic decision rule, which was often a fixed gaze threshold or an eye-tracking–only classifier (Al-Shaban et al., 2023; Jensen et al., 2021; Keehn et al., 2024; Moore et al., 2018; Pierce et al., 2023; Wang et al., 2024; Wen et al., 2022; Supplement 5). Consequently, part of the observed discrimination may reflect developmental level effects on attention and task engagement rather than autism-specific signal.
Real-world differential diagnosis remains incompletely established. Most studies compared autism with typically developing controls or heterogeneous nonautistic clinical samples, and only some included developmental-delay comparators (Al-Shaban et al., 2023; Frazier et al., 2016, 2018; Moore et al., 2018; Pierce et al., 2023; Wang et al., 2024). The included studies were not designed around prespecified, clinically relevant nonautistic neurodevelopmental comparator cohorts, despite evidence that gaze differences during socioemotional processing, including reduced attention to the eye region, can be transdiagnostic and vary across conditions (Martinez-Cedillo et al., 2026).
In addition, applicability to early pediatric screening should be interpreted cautiously because the evidence base is heterogeneous across pediatric age groups, with most cohorts restricted to toddlers or preschoolers, whereas Frazier et al. (2016, 2018) and Al-Shaban et al. (2023) extended recruitment into school-age and early adolescence. Developmental context and baseline gaze ecology differ substantially across these age bands. Although this heterogeneity supports robustness of the construct across settings, it weakens direct transportability to a single clinical pathway unless age-specific models and thresholds are externally validated.
Rather than pursuing a single universal paradigm, the most plausible route to clinical readiness is a workflow-integrated marker strategy for screening. In practice, this means embedding gaze-based outputs into existing pathways (e.g., M-CHAT-centered triage) to provide incremental risk stratification, not replacing standardized instruments (Jensen et al., 2021). Within that framework, operating-point selection is central: the high-specificity/low-sensitivity configurations used in GeoPref-style models are not merely a limitation but a deliberate rule-in strategy that can reduce false positives, unnecessary referrals, and downstream diagnostic burden in low-prevalence screening settings (Moore et al., 2018; Wen et al., 2022). A second requirement for equitable deployment is hardware translation. Much of the field still depends on laboratory-grade eye trackers and tightly controlled acquisition conditions, whereas broader implementation will require robust pipelines on consumer-grade platforms (tablet or webcam-based systems) with explicit quality-control, calibration, and failure-rate reporting (Jensen et al., 2021; Vargas-Cuentas et al., 2017). Under this model, high-cost physiological extensions (e.g., pupillometry-derived indices) remain valuable for mechanistic enrichment, while scalable low-cost gaze tools drive real-world screening utility. Importantly, these tools should be regarded as adjunctive to, not replacements for, standardized instruments such as the ADOS-2 or the M-CHAT-R.
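The effect of operating-point selection in low-prevalence settings can be illustrated with Bayes' rule. The sketch below compares the high-specificity GeoPref operating point reported for Wen et al. (2022) (sensitivity 0.33, specificity 0.96) with the pooled summary point (0.77, 0.80). The 1% prevalence is an assumption taken from the upper end of the population estimate cited in the Introduction, and the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

PREVALENCE = 0.01  # assumed general-population screening prevalence

operating_points = {
    "high-specificity (0.33 / 0.96)": (0.33, 0.96),
    "pooled summary (0.77 / 0.80)": (0.77, 0.80),
}
for label, (sens, spec) in operating_points.items():
    ppv, npv = predictive_values(sens, spec, PREVALENCE)
    # Referrals generated per 1,000 children screened (all positive results)
    referrals = 1000 * (sens * PREVALENCE + (1 - spec) * (1 - PREVALENCE))
    print(f"{label}: PPV={ppv:.3f}, NPV={npv:.3f}, referrals/1000={referrals:.0f}")
```

Under these assumptions the high-specificity point roughly doubles the positive predictive value and generates far fewer referrals, at the cost of missing most autistic children, which is why it suits a rule-in adjunct rather than a stand-alone screener.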
Beyond binary classification, several studies report clinically interpretable phenotype correlations. Reduced fixation to social-affective regions was associated with greater Autism Diagnostic Observation Schedule social-affect impairment and lower adaptive functioning (Frazier et al., 2016, 2018; Kou et al., 2019; Moore et al., 2018). In contrast, higher geometric preference and restricted-interest orienting were associated with greater restricted and repetitive behavior burden (Sun et al., 2023, 2024), while child-directed speech paradigms were influenced by language level and joint-attention development (Pierce et al., 2023; Wang et al., 2024). Together with the broad distribution of percent fixation to dynamic geometric images reported by Wen et al. (2022), these findings are compatible with dimensional heterogeneity and possible behavioral subgroups within autism. Clinically, this supports using gaze measures as phenotypic enrichers within screening and assessment workflows; however, claims about treatment-response prediction remain preliminary and require prospective longitudinal validation with adjustment for IQ/DQ and language.
One important limitation of this work is the restrictiveness of the search queries, which may have missed relevant studies emphasizing biomarker validation or subtype identification. One noteworthy case is the study by Wen et al. (2022), which was not initially captured by automated database queries but was later identified through manual search and included in the final analysis.
The most important limitation of this diagnostic test accuracy meta-analysis, however, is potential model-selection and outcome-extraction bias. Several studies reported multiple models or operating points, and we extracted the primary or final model emphasized by the authors, which may over-represent optimized configurations and contribute to the predominance of AUC values above 0.70. Nevertheless, extracting the study-defined primary model provides a pragmatic summary of intended use, and the stability of pooled estimates in sensitivity analyses supports a reproducible discriminative signal. Absolute accuracy should still be interpreted cautiously pending prospective single-gate validation.
The present findings support gaze behavior as a clinically relevant, objective marker of attentional allocation in autism, with plausible utility as an adjunct within tiered screening and diagnostic pathways rather than a stand-alone diagnostic. Translation to routine care will depend on prospective single-gate validation in representative clinical populations, prespecified thresholds and analysis plans, transparent reporting of failure rates and data quality, and demonstration of incremental value over established screening instruments. Parallel work is needed to adapt paradigms across languages and cultural contexts and to validate performance on scalable, low-cost eye-tracking platforms, to support equitable implementation for early autism identification and timely intervention.
Conclusion
Gaze-tracking paradigms demonstrated consistent diagnostic accuracy for autism across a wide range of experimental tasks, analytic methods, and populations, including infants under 2 years of age. Translation to clinical practice, however, requires explicit measurement of, and adjustment for, developmental level (DQ/IQ), language, and other clinically relevant nonautistic neurodevelopmental conditions that can influence gaze behavior. Quantitative gaze metrics can capture atypical social attention and restricted-interest patterns, supporting their promise as objective behavioral biomarkers when key methodological and clinical-translation requirements are met. Despite ongoing methodological heterogeneity, the pooled evidence indicates robust discriminative performance and substantial translational potential for early detection and clinical decision support, particularly as an adjunct within established screening and assessment workflows rather than as a stand-alone diagnostic test. Future research should emphasize prospective, single-gate validation studies with prespecified and preregistered thresholds and analytic pipelines, protocol standardization, explicit reporting of data-quality constraints and failure rates on scalable hardware, and multimodal integration to advance reproducible, scalable, and clinically implementable applications of gaze-tracking in autism assessment.
Supplemental Material
Supplemental material, sj-docx-1-aut-10.1177_13623613261451896 for Gaze-Tracking-Based Tests for Autism in Children: A Diagnostic Test Accuracy Systematic Review and Meta-Analysis by Delaflor-Wagner Christian Alejandro, Suárez-Cuenca Juan Antonio, Alcaraz-Estrada Sofía Lizeth, Téllez-González Mario Antonio, Coral-Vázquez Ramón Mauricio, Toledo-Lozano Christian Gabriel and García Silvia in Autism
Footnotes
Acknowledgements
No professional writing or editorial assistance was received, and the submission was prepared and submitted by the authors themselves.
Ethical Considerations
The study was reviewed and approved by the Ethics and Research Committees of the National Medical Center “20 de Noviembre” (RPI code: RPI.CMN.138.2025).
Consent for Publication
As this work is a systematic review and meta-analysis of previously published studies, informed consent was not required.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Participatory Statement
This study was a systematic review and meta-analysis of published diagnostic test accuracy studies and did not involve recruitment, data collection, or direct interaction with participants. We did not formally involve autistic people, family members, advocates, clinicians, or other autism community representatives in formulating the research question, selecting outcomes, designing eligibility criteria, extracting data, conducting analyses, interpreting results, or drafting the manuscript. Accordingly, methodological choices, analytic decisions, and reporting were made by the author team, guided by established standards for diagnostic accuracy reviews and by the information available in the included publications. We present this transparently to clarify the scope of input informing the review and to support appropriate interpretation of the findings.
Supplemental Material
Supplemental material for this article is available online.
References
