Abstract
Machine learning (ML), a core component of artificial intelligence (AI), is increasingly being used to assess children’s emotions and attention, with potential applications in developmental monitoring and early identification of neurodevelopmental conditions such as autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD). This narrative review synthesizes studies published between 2012 and 2025 from PubMed, IEEE Xplore, and Web of Science. We examine multimodal data sources (including facial, speech, physiological, eye movement, and behavioral features) and computational approaches such as convolutional neural networks (CNNs), support vector machines (SVMs), and long short-term memory (LSTM) networks. These methods can capture behavioral and physiological signals and provide complementary information for assessing children’s emotional and attentional states, particularly in controlled settings. However, the current evidence remains heterogeneous, with many studies relying on limited or laboratory-based datasets, which may constrain real-world applicability. Key challenges include data bias, cross-cultural variability, ethical concerns, and the need for robust privacy protection and external validation. Recent work has explored integrating AI with virtual reality (VR), augmented reality (AR), and Internet of Things (IoT) technologies to support more adaptive monitoring systems. Nevertheless, these applications remain largely exploratory. Future research should prioritize real-world validation, pediatric-specific datasets, and interdisciplinary collaboration to better define the role of AI in children’s mental health and education.
Keywords
1. Introduction
1.1. The rise of artificial intelligence (AI) technology in children’s mental health
The rapid development of artificial intelligence (AI) has transformed daily life and is increasingly influencing mental health research involving children younger than twelve—a boundary chosen to approximate the end of middle childhood and reduce heterogeneity related to adolescence. 1 AI applications extend beyond an analysis of emotions and have shown promise in identifying psychological risk patterns, although their clinical applicability remains under active investigation.2–5 AI technologies can extract critical information from diverse sources, potentially supporting more objective assessments of children’s mental health, particularly in research or controlled settings. For example, methods such as voice recognition, sentiment analysis, and facial expression recognition can effectively capture emotional fluctuations in children, assisting mental health professionals in identifying potential early indicators of psychological risk in a timely manner, particularly with structured or multimodal data. 6
The most commonly used AI technologies today include support vector machine (SVM), 7 random forest, 8 and deep learning models. These techniques process large-scale data to support the prediction of behavioral patterns associated with psychological conditions, while their role in formal clinical diagnosis remains limited.9,10 AI technologies can support real-time emotional monitoring and may assist in detecting subtle emotional fluctuations, thereby contributing to early identification and potential intervention planning, although their effectiveness in real-world settings remains under investigation. 11 For example, AI algorithms can help identify early signs of anxiety, depression, or other psychological disorders by processing multidimensional signals such as children’s language patterns, vocal characteristics, and social interactions.12–14 Based on the results of these analyses, AI systems may support the generation of data-informed suggestions for intervention strategies, although such recommendations typically require validation and oversight by clinical professionals. The application of such AI technologies is profoundly important for children’s mental health, particularly as children often struggle to express their inner emotions and psychological states accurately through traditional methods such as questionnaires or face-to-face interviews. By integrating multimodal data, AI systems can continuously capture and analyze patterns in children’s daily behaviors, verbal communication, and physical gestures, helping professionals assess children’s mental health from multiple perspectives. 15 This approach may facilitate the early identification of individuals at potential risk of psychological disorders 16 and provides valuable data support for designing personalized treatment plans, making psychological interventions more targeted and effective.
Moreover, the application of these technologies to identify psychological disorders has been explored to support more personalized treatment planning. Using AI systems, professionals can track individual emotional fluctuation trends and develop dynamic treatment plans based on these changes, thereby potentially contributing to improved therapeutic planning. 3 For example, when an AI system detects abnormalities in a child’s emotional state, it may provide alerts to relevant professionals, supporting timely follow-up to mitigate potential deterioration.
In summary, while AI shows considerable potential, its accuracy can be influenced by dataset quality and cultural factors.17,18 With further advancements, these technologies are expected to support more personalized and potentially more accurate assessment and treatment support, 3 contributing to improved mental health outcomes for children. In the future, integrating AI technologies into child mental health management systems may lead to more precise and efficient models for the identification and treatment of disorders, which could ultimately improve mental health support. 16
1.2. The critical role of emotion and attention management in children’s mental health
Childhood is a critical stage for personality formation and psychological development, as well as the foundational period for building various cognitive abilities and emotional regulation skills. During this phase, managing emotions and attention is not only essential for children’s overall development but also a key means of early identification of potential psychological issues and developmental delays.19,20 First, effective emotional regulation helps children maintain a positive attitude when facing challenges and setbacks, enhancing their performance in learning, social interactions, and self-regulation. 19 If children exhibit excessive anxiety, depression, or irritability when dealing with stress or social interactions, these may be early signs of underlying psychological disorders such as autism spectrum disorder (ASD) or attention deficit hyperactivity disorder (ADHD). 21 By identifying these signs of emotional dysregulation early, educators, psychologists, and parents can intervene in a timely manner, providing appropriate psychological support and interventions to prevent these issues from worsening and affecting children’s long-term mental health. For emotional problems, timely identification and intervention can reduce the risk of psychological disorders such as anxiety and depression 22 and mitigate their profound impacts on children’s learning and daily life.
In addition, attention deficits are often associated with underlying psychological or developmental issues. A lack of attention not only affects children’s academic performance but also may be a manifestation of ADHD. If children frequently struggle to concentrate during learning, are easily distracted, or exhibit significant difficulty in completing tasks, these signs may warrant further evaluation. Unresolved attention issues can hinder children’s academic development and lead to secondary emotional problems, such as academic anxiety or decreased self-confidence. Therefore, early identification and assessment of emotional and attention-related issues in children are crucial for preventing the development of potential psychological disorders and developmental delays. 23 By fostering collaboration among professionals in the psychology, education, and healthcare sectors, regular assessments of children’s emotions and attention can help identify underlying psychological issues early and enable appropriate interventions to prevent further deterioration. These early interventions not only support children’s healthy development during childhood but also lay a solid foundation for their future mental health and social adaptability.
In practice, artificial intelligence does not provide a definitive diagnosis independently; rather, it functions by employing advanced machine learning algorithms, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to extract “digital biomarkers” from the multimodal inputs shown in Figure 1, including facial expressions, voice, eye-tracking patterns, and physiological signals (e.g., EEG and ECG). In addition to conventional CNN architectures, advanced models such as 3D convolutional neural networks have been applied to capture spatiotemporal features in neuroimaging data.24,25 These biomarkers often encompass subtle features imperceptible to the human eye, such as micro-expressions or atypical gaze scanning trajectories. As illustrated in the feature fusion stage of Figure 1, the system integrates these unstructured visual, auditory, and biological data points, transforming them into quantifiable features that may be associated with clinically relevant behavioral patterns, although direct mapping to standardized diagnostic criteria (e.g., DSM-5 or ICD-11) is not yet established. Ultimately, the processed outputs, which are presented as classification results and prediction outputs, serve as an analytical reference that provides clinicians with quantitative information to inform clinical decision-making.
Figure 1. A conceptual schematic illustrating a potential multimodal AI framework for recognizing children’s emotions and attention. Rather than representing a standardized or fully validated pipeline, this figure integrates approaches reported in the literature to highlight how different data modalities may be combined. The framework begins with multimodal data inputs, which may include: (1) spatial data (e.g., images of facial expressions), (2) temporal data (e.g., speech signals and physiological signals such as heart rate variability), (3) behavioral data (e.g., gaze tracking), and (4) motion data (e.g., body movement patterns).
These heterogeneous data types typically require modality-specific preprocessing, such as normalization, filtering, synchronization, and feature standardization. In the feature extraction stage, different computational techniques may be applied depending on the data modality. For example, convolutional neural networks (CNNs) are commonly used for visual feature extraction, while recurrent models such as long short-term memory (LSTM) networks may be used to capture temporal dependencies. Traditional machine learning methods may also be incorporated for supplementary feature processing. At the integration stage, multimodal features may be combined through feature-level or decision-level fusion strategies, with mechanisms such as cross-attention potentially enhancing interactions between modalities. These approaches aim to improve the robustness of inference, although their effectiveness remains dependent on the data quality and experimental conditions. The output layer may include estimates of emotional states, attention patterns, and related behavioral indicators. However, these outputs should be interpreted with caution, as the current systems have been largely developed and evaluated in controlled settings, and their generalizability to real-world pediatric contexts remains an active area of research. Overall, this schematic highlights the potential structure of multimodal AI systems in this domain and is intended to provide a conceptual overview rather than a definitive or clinically validated workflow.
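The feature-level fusion strategy mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a method taken from any reviewed study: the function names, the two modalities, and all numeric values are hypothetical, and a real pipeline would additionally require temporal synchronization and learned feature extractors.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Z-score each feature column across samples (zero mean, unit variance)."""
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [
        [(v - mu) / sd if sd > 0 else 0.0 for v, (mu, sd) in zip(row, stats)]
        for row in rows
    ]

def fuse_feature_level(modalities):
    """Feature-level fusion: standardize each modality, then concatenate per sample."""
    scaled = [standardize_columns(rows) for rows in modalities.values()]
    # zip(*scaled) groups the per-modality feature lists belonging to the same sample
    return [sum(parts, []) for parts in zip(*scaled)]

# Hypothetical data: three children, two facial features and one HRV feature each
face_features = [[0.1, 0.9], [0.4, 0.5], [0.7, 0.1]]
hrv_features = [[40.0], [50.0], [60.0]]

fused = fuse_feature_level({"face": face_features, "hrv": hrv_features})
```

Standardizing before concatenation keeps a modality with large raw values (here, HRV in milliseconds) from dominating the fused vector, which is one reason modality-specific preprocessing is usually applied before fusion.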
Therefore, the early identification and assessment of, and intervention for, emotional and attention-related issues in children are highly important for promoting their long-term healthy development.12,23 The effective management of emotions and attention, early detection of psychological problems, and timely intervention may contribute to improved psychological development and a potential reduction in long-term mental health risks. This approach may support long-term developmental monitoring and well-being. Specifically, the digital biomarkers extracted through this pipeline correspond to distinct clinically relevant dimensions: facial micro-expressions may reflect emotional dysregulation relevant to anxiety- or depression-related screening; atypical gaze-scanning trajectories may indicate altered social attention patterns associated with ASD; heart rate variability (HRV) may index autonomic regulation under affective stress; and motor pattern irregularities have been explored as potential motor signatures of neurodevelopmental conditions. These correspondences remain under active validation and should be interpreted as exploratory rather than diagnostic.
1.3. The potential of AI technology in supporting children’s emotions and attention recognition
The application of AI technology in managing children’s emotions and attention has advantages that surpass those of traditional methods. First, these technologies may support more personalized approaches to emotion and attention management. 16 Each child exhibits unique emotional responses, attention levels, and psychological challenges. Traditional psychological interventions often rely on experience and generalized strategies, and precisely tailoring approaches to individual needs is difficult. 12 In contrast, AI technologies can be applied to analyze large volumes of behavioral data to dynamically adjust intervention plans. 26
Second, the continuous monitoring capabilities of AI technologies offer increased temporal resolution and responsiveness compared to traditional methods for emotion and attention management. Traditional methods for assessing emotions and attention often rely on periodic evaluations or observations, which may overlook subtle emotional changes or fail to capture issues as they arise. In contrast, AI systems, which are equipped with various sensory devices, can be applied to continuously monitor children’s facial expressions, vocal tones, and even physiological data, thereby providing real-time data that may provide insights into their emotional fluctuations and attention changes. 26 When abnormalities are detected, such as a sudden emotional downturn or difficulty concentrating, the AI system may enable the generation of alerts under predefined conditions. This instant responsiveness may help facilitate earlier responses to emotional changes and help children promptly regain focus.
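One simple way to read "alerts under predefined conditions" is as a threshold rule applied to a smoothed score. The sketch below assumes a hypothetical attention score in the range 0 to 1 produced by some upstream model. The threshold, window length, and persistence requirement are illustrative placeholders and are not clinically derived values.

```python
from collections import deque

def alert_indices(scores, threshold=0.3, window=3, min_consecutive=2):
    """Return the time steps at which an alert would be raised.

    An alert fires once per episode, when the rolling mean of the last
    `window` scores has stayed below `threshold` for `min_consecutive` steps.
    """
    recent = deque(maxlen=window)  # sliding window of the latest scores
    below = 0                      # consecutive steps with a low rolling mean
    alerts = []
    for i, score in enumerate(scores):
        recent.append(score)
        rolling = sum(recent) / len(recent)
        below = below + 1 if rolling < threshold else 0
        if below == min_consecutive:
            alerts.append(i)
    return alerts

# Hypothetical attention scores sampled at regular intervals
attention_scores = [0.8, 0.7, 0.2, 0.1, 0.1, 0.2, 0.9, 0.8]
alerts = alert_indices(attention_scores)
```

Smoothing and a persistence requirement reduce alerts triggered by a single noisy reading, at the cost of a short detection delay. Any such rule would still need review by a professional before action is taken.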
Third, the data-driven nature of AI systems makes emotion and attention management more scientific and precise. These systems can collect data not only from individual children but also from large-scale populations to extract valuable patterns and trends. Through long-term data accumulation, AI systems can identify patterns potentially associated with emotional fluctuations or attention deficits in children and inform the development of response strategies based on these findings. AI systems may contribute to the refinement of intervention strategies based on accumulated data, although their effectiveness in real-world clinical settings remains to be validated. This data-driven decision-making approach not only may support the optimization of intervention strategies but also provides data-informed insights for parents and educators, enabling them to better understand children’s psychological needs and adopt more appropriate guidance measures in daily life.
The specific implementation process of AI technology in identifying children’s emotions and attention is illustrated in Figure 1, which outlines the complete data processing workflow of a multimodal AI model applied to emotion and attention recognition. The process begins with data input, which is categorized into four main types: spatial data (e.g., facial expression images used to analyze children’s visual emotional features); 27 temporal data (e.g., speech signals and physiological signals such as heart rate variability (HRV), which capture vocal tone and bodily responses); 28 behavioral data (e.g., gaze tracking, reflecting attention focus); 29 and motion data (e.g., hand or body movement patterns for assessing emotional and attentional states). 30 In the data preprocessing stage, specific strategies are applied to each data type. Spatial data undergo image normalization and noise reduction, temporal data are filtered and synchronized to ensure temporal consistency across multimodal data, and behavioral data are standardized and smoothed to reduce interference and enhance analysis accuracy. During feature extraction, different techniques are employed on the basis of the data characteristics. Convolutional neural networks (CNNs) are used to extract hierarchical visual features from images. 31 Long short-term memory (LSTM) networks capture the temporal dependencies in speech and physiological signals, 32 and traditional machine learning methods (e.g., support vector machines (SVMs) and random forests) are utilized to process supplementary features, such as behavioral patterns. In the fusion layer, multimodal features are integrated through feature fusion, leveraging cross-attention mechanisms to enhance correlations between different data sources. Furthermore, decision fusion strategies, such as weighted voting and ensemble learning, are employed to improve prediction stability and ensure more accurate emotion and attention analyses. 33 The final output of this conceptual framework may include three types of results: emotion recognition, attention detection, and psychological health assessments. The models can automatically identify children’s emotional states (e.g., happiness, sadness, and anger), track attention changes (e.g., focus or distraction), and provide quantitative assessments of psychological health. These outputs support psychological interventions and the development of personalized behavior improvement plans. This workflow demonstrates the potential and technical advantages of multimodal AI models in research on children’s psychology and behavior.
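The weighted-voting form of decision fusion described in this workflow can be sketched as follows. The example assumes that each modality-specific model has already produced class probabilities. The modality names, probabilities, and weights are hypothetical and would in practice be estimated from validation data.

```python
def weighted_vote(predictions, weights):
    """Decision-level fusion: weighted average of per-modality class probabilities.

    predictions: {modality: {label: probability}}
    weights:     {modality: non-negative weight}
    Returns the winning label and the fused probability for every label.
    """
    total = sum(weights[m] for m in predictions)
    labels = next(iter(predictions.values())).keys()
    fused = {
        label: sum(weights[m] * probs[label] for m, probs in predictions.items()) / total
        for label in labels
    }
    return max(fused, key=fused.get), fused

# Hypothetical outputs of three modality-specific classifiers for one time window
predictions = {
    "face": {"happy": 0.6, "sad": 0.4},
    "speech": {"happy": 0.3, "sad": 0.7},
    "hrv": {"happy": 0.5, "sad": 0.5},
}
weights = {"face": 0.5, "speech": 0.3, "hrv": 0.2}

label, fused_probs = weighted_vote(predictions, weights)
```

In this toy case the facial model alone would predict "happy", but the fused estimate narrowly favors "sad" (0.51 versus 0.49). This illustrates how decision fusion can change the outcome when modalities disagree, and why the choice of weights matters.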
Table 1 presents a summary of AI applications in detecting children’s mental health and learning disorders, covering four major areas: ASD, ADHD, anxiety/depression, and dyslexia.
The technological functions include speech analysis, facial expression recognition, physiological signal monitoring, eye movement analysis, behavior analysis, and gaze direction analysis. Speech analysis uses technologies such as LENA, CNN, and MFCC for ASD and ADHD detection. Facial expression recognition employs CNNs and other techniques to analyze emotions, aiding in ASD, ADHD, and dyslexia detection. Physiological signal monitoring focuses on the GSR, HRV, and EEG signals, using methods such as SVM to detect anxiety, depression, and dyslexia symptoms. Eye movement analysis applies RF and SVM methods to study pupil diameter and eye behavior, particularly in ASD and ADHD research. Behavior analysis integrates CNNs and similar technologies for ASD and ADHD diagnosis. Gaze direction analysis leverages LSTM networks and other techniques to support ASD and ADHD research. These AI technologies provide diverse tools for diagnosing mental health and learning disorders, laying a strong foundation for future research.
2. Applications of AI technologies in children's emotion analysis
We defined the age cutoff at 12 years to align with the developmental boundary of middle childhood. Research indicates that the age of 12 years serves as a critical transition point; while it concludes a period of consistent cognitive development, the subsequent onset of adolescence (typically age 13 years and older) introduces significant neurobiological and behavioral variances. 1 By limiting our inclusion criteria to children under 12 years of age, we aimed to minimize these confounding variables and focus on the efficacy of AI tools within a stable developmental window.
Table 2 presents a summary of research achievements in detecting ASD from 2012 to 2025, covering various age groups, data sources, and methodologies.
Studies have utilized data from behavioral features, eye tracking, facial expressions, and speech analysis combined with AI technologies to increase diagnostic efficiency. Early studies, such as that by D. P. Wall et al. in 2012, employed the ADOS tool and achieved an accuracy of 99.7%, highlighting the importance of early diagnosis. In 2015, M. Duda et al. analyzed behavioral features with 97% accuracy, whereas Alessandro Crippa et al. achieved 96.7% accuracy by analyzing motion features. 68 After 2016, new technologies emerged. For example, Anna Anzulewicz et al. analyzed hand movements through tablet games, achieving 93% accuracy, 70 and Mahiye Uluyagmur-Ozturk et al. used facial expressions and the Relief-F algorithm, reaching 90% accuracy.71,78 Deep learning and multimodal analysis dominated later research. In 2019, Luke E. K. Achenie et al. employed M-CHAT-R and neural networks, achieving 99.92% accuracy. In 2020, Jingying Chen et al. combined gaze tracking and facial expression analysis to improve diagnostic efficiency. In 2023, Hasan Alkahtani et al. used the ASQ-10 scale with random forest and SVM, achieving 92% accuracy. Despite significant technological advancements, challenges such as limited sample sizes and insufficient cultural diversity remain. With increasing data availability, more efficient and widely applicable diagnostic technologies are expected in the future.
Table 3 presents a systematic comparison of five types of AI models used in recognizing children’s emotions and attention, highlighting their advantages, limitations, applicable data types, computational complexity, and typical application scenarios.
SVM models are known for their low computational complexity and stability, making them particularly suitable for small datasets and simple signal processing, 79 such as HRV and emotional text classification. These models are commonly applied in early emotion screening (e.g., binary classification of happiness and sadness) and autism screening on the basis of EEG patterns. However, their ability to handle high-dimensional and nonlinear data is limited. CNN models excel in processing spatial data (e.g., images and videos) because of their deep structures, which effectively extract hierarchical visual features. They are widely used in facial expression analysis, eye-tracking, and video-based emotion dynamics studies. 62 For example, CNNs have been applied in emotion recognition and analyzing the micro-expressions of autistic children, as well as in attention assessments based on eye-tracking data. 80 However, these models require extensive labeled data and computational resources, which may restrict their use in resource-constrained environments. LSTM models are particularly effective for handling time series data because of their ability to capture temporal dependencies. They are suitable for analyzing speech and physiological signals 81 and are often used for modeling temporal patterns to support emotion detection and behavioral pattern analyses, such as identifying emotional fluctuations (e.g., frustration or stress) from speech intonation or EEG signals. 82 Despite their excellent performance in sequence modeling, LSTM models face challenges such as high dependency on data quality and consistency, along with significant training costs. GAN models have unique advantages in data generation, particularly for addressing data imbalance issues. 83 For example, GANs can generate synthetic samples of rare emotions (e.g., anxiety or fear) to increase the generalizability of emotion recognition models. 33 While they are effective at increasing data diversity and improving model robustness, GANs face challenges such as instability during training and high computational resource requirements. Multimodal fusion models focus on integrating multiple signal data types, such as facial expressions, speech, and physiological signals, to support emotion and attention recognition. 84 These models are applied to integrate multimodal data sources (e.g., speech, video, and physiological signals) for emotion recognition and mental health-related analyses, and have been explored in contexts such as autism-related emotion recognition. 85 Despite their advantages in terms of performance and accuracy, multimodal fusion models rely heavily on synchronized data processing and require substantial computational resources, limiting their scalability in real-world scenarios. These application scenarios represent commonly explored directions in the literature, though the specific implementations may vary across studies.
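To make the "simple signal processing" setting concrete, the following sketch computes three standard time-domain HRV descriptors from a sequence of inter-beat (RR) intervals. Low-dimensional features of this kind are the sort of input that a lightweight classifier such as an SVM might receive. The RR values and the function name are hypothetical, and the classifier itself is not shown.

```python
from math import sqrt
from statistics import mean, stdev

def hrv_time_domain(rr_ms):
    """Compute basic time-domain HRV features from RR intervals in milliseconds.

    SDNN  : sample standard deviation of the RR intervals
    RMSSD : root mean square of successive RR differences
    """
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return {
        "mean_hr_bpm": 60000.0 / mean(rr_ms),
        "sdnn_ms": stdev(rr_ms),
        "rmssd_ms": sqrt(mean(d * d for d in diffs)),
    }

# Hypothetical RR intervals (ms) from a short, artifact-free recording
rr_intervals = [800, 810, 790, 805, 795]
features = hrv_time_domain(rr_intervals)
```

Real recordings would first need artifact and ectopic-beat removal, and HRV norms differ with age, so features computed from adult-oriented pipelines should not be assumed to transfer directly to children.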
This narrative review has several limitations. First, the literature selection was not exhaustive and may be subject to selection bias. Second, the heterogeneity of included studies limits direct comparisons across methods. Third, most reported findings are derived from controlled experimental settings, which may restrict generalizability to real-world clinical environments.
2.1. Developmental progress of AI-assisted diagnosis for children’s mental health
Over the past decade, AI technology has made significant advancements in diagnosing children’s mental health. As shown in Table 2, the progression of research from 2012 to 2025 highlights a clear trajectory of evolution, transitioning from single-method to multimethod approaches and from static to dynamic techniques. This section provides a comprehensive review of key research developments in this field, laying the foundation for subsequent in-depth technical discussions.
Early studies focused primarily on applying traditional machine learning methods. Wall et al. (2012) pioneered the use of the Weka analysis tool for analyzing Autism Diagnostic Observation Schedule (ADOS) data, achieving a diagnostic accuracy of 99.7%. This work opened new avenues for automated diagnosis. 67 While this groundbreaking study reported exceptional accuracy, its primary value lies in reducing diagnostic timelines and enabling early intervention. Duda et al. (2015) subsequently introduced an observation-based classifier (OBC), which achieved an accuracy of 97%. 4 Crippa et al. (2015) achieved 96.7% recognition accuracy through motion feature analysis, 68 highlighting the advent of a new paradigm in multimodal feature analysis.
Researchers focused on the integrated analysis of multisource data between 2016 and 2018. Liu et al. (2016) introduced eye-tracking technology into the diagnostic process, improving diagnostic objectivity despite a reduction in accuracy to 88.51%. 69 Anzulewicz et al. (2016) achieved a 93% accuracy rate by analyzing hand motion patterns in tablet-based games, highlighting the value of behavioral data in the diagnosis. 70 Levy et al. (2017) further refined this approach by applying machine learning to identify stable subsets of behavioral features for automated ASD detection, achieving 93% accuracy. 72 This phase of research was characterized by the diversification of data sources and the naturalization of data collection methods, laying a solid foundation for the application of deep learning technologies.
After 2019, deep learning techniques became a dominant focus in research. 75 Achenie et al. (2019) developed a feedforward neural network model based on the M-CHAT-R scale, achieving an impressive accuracy of 99.92% and significantly improving the efficiency of automated assessments. 26 Chen et al. (2020) integrated gaze fixation and facial expression analysis into a multimodal diagnostic framework, achieving high accuracy while optimizing computational efficiency. 74
Recent studies (2023–2025) have described higher levels of technical maturity. Alkahtani et al. (2023) reported 92% classification accuracy under specific experimental conditions using the ASQ-10 scale but noted potential limitations in cross-cultural applications. 76 Moreover, Alzakari et al. (2025) reported 95% accuracy and emphasized the importance of validation across geographically and culturally diverse populations, suggesting a path for future research. 77
This development trajectory reflects the continuous advancements in AI-assisted assessment and screening technologies in terms of accuracy, practicality, and universality. However, current studies still face challenges such as limited sample sizes and insufficient cultural adaptability. As the data scale expands and algorithms improve, these technologies may have the potential to contribute to clinical practice as further validation becomes available.
Based on the synthesis of current research and the high accuracy rates presented in Table 2 (frequently exceeding 90% in controlled experimental settings), one of the main potential applications of AI is large-scale preliminary screening. While these accuracy rates are promising, they should be interpreted with caution, as they are predominantly derived from controlled experimental settings and may not fully generalize to real-world clinical complexity. Nonetheless, they provide a quantitative basis for evaluating the potential of AI as a supplementary screening tool. This capability may support the identification of children at potential risk within the general population and may help inform subsequent referral decisions, although its effectiveness in real-world clinical workflows remains to be validated. Furthermore, the quantitative indicators that AI provides may be influenced by data quality and model design. Beyond the initial assessment, AI-driven identification extends to long-term monitoring of progress. By maintaining continuous data tracking, these systems may enable the observation of longitudinal trends in children’s emotional and attentional behaviors following therapeutic interventions, supporting a more dynamic and personalized management approach.
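One reason for this caution can be shown with a short calculation. The positive predictive value (the share of flagged children who truly have the condition) depends on prevalence as well as on sensitivity and specificity. The sketch below uses purely illustrative numbers that are not taken from any reviewed study or epidemiological source.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' rule: P(condition present | positive screen)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier with 95% sensitivity and 95% specificity
ppv_balanced = positive_predictive_value(0.95, 0.95, prevalence=0.50)  # balanced study sample
ppv_screening = positive_predictive_value(0.95, 0.95, prevalence=0.02)  # assumed population rate
```

With these assumed inputs, the same classifier yields a positive predictive value of 0.95 in a balanced sample but only about 0.28 when 2% of screened children have the condition. High accuracy on a balanced experimental dataset therefore does not by itself indicate how useful a positive result would be in population screening.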
2.2. Facial expression-based applications in children’s emotion recognition
Facial expression recognition (FER) serves as a commonly used approach for deriving quantitative behavioral indicators, supporting the automated analysis of affective states in children. By translating subtle facial cues into quantifiable data, FER may assist clinicians in identifying patterns of emotional dysregulation that might be overlooked during traditional subjective observations.
Facial expressions provide a direct insight into emotional expression, particularly in children, whose facial changes are highly dynamic and illustrative. From smiles to frowns, these subtle facial movements may reflect aspects of emotional states. Traditionally, the interpretation of these expressions has relied heavily on the subjective judgment of parents, teachers, and other caregivers. However, this approach is prone to bias and lacks the ability to provide continuous and comprehensive monitoring. With the advancement of AI technologies, particularly the application of deep learning algorithms, facial expression recognition has gradually transitioned from subjective interpretation to more structured and automated processing. 42 This technological progress supports the use of AI systems to capture children’s facial expressions in real time and analyze patterns of facial muscle movement. Thus, these systems can be used to classify different emotional states, such as happiness, sadness, anger, and surprise. 86
Facial expression recognition techniques are currently categorized into two main types: traditional methods and neural network-based methods. Traditional methods typically rely on handcrafted features, such as image processing techniques and pattern recognition. These approaches rely on predefined features established by experts and use extracted facial features to recognize expressions.78,87 While these methods can achieve good accuracy in specific scenarios and have advantages when working with smaller datasets, their primary limitation lies in their lack of generalizability. They struggle to handle variations in facial expressions across different contexts and environments.42,87,88 In contrast, neural network-based methods, particularly CNNs, exhibit robust self-learning capabilities,41,42,88 enabling the recognition of psychological disorders such as ASD.39,40,89 Neural networks can automatically learn and extract features from large datasets, making them significantly more powerful in analyzing complex and diverse emotional expressions. This advantage becomes especially pronounced with larger datasets.39,88 Additionally, CNNs are often combined with the facial action coding system (FACS) to perform more precise analyses of facial expressions. FACS is a technique that involves decomposing facial movements into distinct “action units,” which AI systems use to identify specific emotions. 41 Thus, FACS serves as a fundamental tool in the field of emotion recognition. Furthermore, generative adversarial networks (GANs) are increasingly being applied to facial emotion recognition.90,91 GANs can generate images of facial expressions corresponding to different emotional states, thereby providing more diverse data for training emotion recognition models.83,92 Overall, these advanced models further enhance feature representation and improve recognition performance, particularly when large-scale datasets are available.
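The idea of mapping FACS action units (AUs) to emotions can be illustrated with a deliberately simplified rule-based lookup. The AU combinations below are prototypes commonly cited for adult expressions and are treated here as assumptions. The function name and matching rule are hypothetical, and the systems discussed in this review typically learn the mapping from data rather than from fixed rules.

```python
# Simplified, commonly cited prototype AU sets (assumed, adult-derived):
# AU1 inner brow raiser, AU2 outer brow raiser, AU4 brow lowerer,
# AU5 upper lid raiser, AU6 cheek raiser, AU7 lid tightener,
# AU12 lip corner puller, AU15 lip corner depressor,
# AU23 lip tightener, AU26 jaw drop
PROTOTYPES = {
    "happiness": {6, 12},
    "sadness": {1, 4, 15},
    "surprise": {1, 2, 5, 26},
    "anger": {4, 5, 7, 23},
}

def match_emotion(active_aus):
    """Return the prototype with the highest Jaccard overlap with the active AUs."""
    active = set(active_aus)
    best, best_score = None, 0.0
    for emotion, proto in PROTOTYPES.items():
        score = len(active & proto) / len(active | proto)
        if score > best_score:
            best, best_score = emotion, score
    return best, best_score

# Hypothetical AU detections for two video frames
frame_a = match_emotion([6, 12])
frame_b = match_emotion([1, 2, 5, 26, 12])
```

A lookup of this kind ignores AU intensity and timing, and children’s expressions may not follow adult prototypes. It is intended only to show how decomposing a face into action units yields an interpretable intermediate representation.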
Despite the promising potential of facial expression recognition technology in practical applications, several challenges remain. One major issue is the variability in human expressions. Owing to differences in individual facial muscle structures, there is significant variation in expression across individuals. Moreover, expressions are influenced by cultural, social, and biological factors, making accurate emotion recognition even more complex. 42 Another issue is that most existing emotion recognition datasets lack authenticity. Many datasets are derived from laboratory settings or artificially synthesized scenarios,93,94 and thus they may not accurately reflect emotional expressions in real-life contexts. Therefore, future studies should focus on developing more realistic datasets and improving algorithms to address real-world variability and uncertainties. The effectiveness of facial expression recognition systems is highly dependent on data quality and the robustness of feature representation. 42 Such systems proceed through image preprocessing, feature extraction, and classification, and the choices made at each step affect overall performance. The preprocessing and feature extraction stages are particularly critical because they determine the system’s sensitivity to subtle emotional changes and therefore the accuracy of the subsequent classification. The choice of classifier is likewise central to the effectiveness of emotion recognition.
Looking ahead, FER holds potential for deployment across diverse domains—from mental health management and educational assessments to continuous emotional monitoring in rehabilitation contexts.38,43,86 Nevertheless, its clinical utility as an objective tool for obtaining behavioral evidence currently remains constrained by the limited ecological validity of laboratory-derived datasets and the substantial cross-cultural and developmental variability in facial expressions, underscoring the need for more naturalistic, pediatric-specific training data before broader real-world deployment can be realized.
2.3. Speech-based applications in children’s emotion analysis
Speech and language analyses primarily function as a vehicle for developmental screening and behavioral quantification. By extracting acoustic features and linguistic patterns, these techniques offer reproducible metrics to detect early markers of neurodevelopmental conditions, such as ASD or language delays, which are essential for early-stage screening protocols.
Speech is one of the key pathways for emotional expression. In children, vocal features such as pitch, volume, and speech rate change with emotional state. These changes are often more subtle and concealed than facial expressions are, making it difficult for untrained listeners to accurately discern a child’s emotional state. The application of AI technologies in speech analysis focuses primarily on two areas: monitoring children’s language development and recognizing emotions. First, AI systems have displayed significant potential in tracking language development, particularly in predicting language acquisition delays and language disorders.35,95 AI technologies have also been widely used for the early diagnosis of ASD.32,78 Researchers extract features from speech behaviors collected via smart devices in everyday home environments to assess children’s language development and aid in determining a diagnosis. 73 These studies rely on rich sources of speech data gathered through various methods, including online games, social media posts, and recordings of infant cries. 96 Among these, the analysis of infant cries has gained particular attention. Through applications or professional recording equipment, AI systems can examine infant cries in detail to distinguish between different needs and emotional states, such as identifying whether the infant is in distress.96,97 These technologies not only help parents or caregivers better understand an infant’s needs but also predict certain developmental abnormalities.
Speech emotion recognition technology relies on the ability of AI systems to extract and analyze vocal features such as pitch, rhythm, volume, and tone in detail. Among common techniques, LSTM networks, a type of recurrent neural network (RNN) specifically designed for sequential data, are widely used. 98 One of the core methods for extracting speech features is the Mel-frequency cepstral coefficient (MFCC), 99 which approximates the human ear’s perception of sound by converting speech signals into cepstral coefficients that can convey emotional information. Using these extracted features, machine learning models such as SVMs 7 and random forests 8 can classify speaker emotions, which made them well suited to early work in speech emotion recognition. With advancements in deep learning, methods such as deep belief networks (DBNs) 100 and autoencoders 101 have been widely adopted for extracting high-level speech features. These methods are particularly powerful when handling large volumes of speech data. Another significant technology is the emotional acoustic model, which specifically targets the modeling of emotional acoustic features. By analyzing aspects such as pitch, speed, and tone in speech, AI systems can infer the speaker’s emotional state. 102 In addition to these techniques, automated Language ENvironment Analysis (LENA) systems play a critical role in monitoring infants’ language development.103–105 LENA systems assess infant vocal behaviors, track language development progress, and provide essential data for early intervention. The integration of these technologies not only enhances the accuracy of speech emotion recognition but also enables AI systems to identify children’s emotional states, including joy, anxiety, anger, and fatigue, providing valuable data for screening mental health and monitoring emotional development.37,43
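To make the perceptual basis of MFCCs more concrete, the sketch below implements one commonly used form of the mel scale conversion and uses it to place filterbank center frequencies. It covers only this one step of MFCC extraction; framing, the Fourier transform, log compression, and the discrete cosine transform are omitted. The function names and the example frequency range are assumptions chosen for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (a common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_filters filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Hypothetical setup: 10 filters between 0 Hz and 8 kHz.
centers = mel_center_frequencies(10, 0.0, 8000.0)
```

Because the filters are evenly spaced in mels, they are densely packed at low frequencies and sparse at high frequencies in Hz, which mirrors the finer frequency resolution of human hearing at lower pitches.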
Currently, AI-driven speech analysis has achieved remarkable results in areas such as language delays, dyslexia, and ASD,78,87 with high levels of accuracy. In particular, AI technology has been able to distinguish different needs by analyzing subtle variations in infant cries, providing caregivers with real-time assistance. However, the diversity of speech features presents challenges for this technology. Factors such as regional differences in languages and the influence of family environments can cause variations in speech patterns, imposing greater demands on the cross-linguistic applicability of AI technologies. While some studies have demonstrated the feasibility of AI technologies in cross-linguistic applications for languages such as German and Spanish, 106 validation is needed to determine whether these technologies can be effectively extended to more languages and diverse cultural contexts.
Overall, the application of AI technology in speech analysis not only demonstrates its potential in monitoring language development and supporting the identification of patterns associated with ASD but also highlights its value in analyzing emotions. By analyzing vocal features in detail, AI systems may detect patterns not easily perceived by untrained observers, which is highly important for enhancing the monitoring of children’s emotional development and enabling early interventions. 14 While its promise as a developmental screening instrument is well-supported, the transition to scalable clinical tools depends on resolving persistent challenges in cross-linguistic generalizability, speaker variability, and the ecological validity of data collected outside controlled recording environments.
2.4. Physiological signal-based applications in children’s emotion monitoring
Physiological sensing (e.g., photoplethysmography [PPG], galvanic skin response [GSR], and electroencephalography [EEG]) is specifically positioned for longitudinal monitoring and the estimation of latent internal states. Unlike overt behavioral cues, physiological signals provide a continuous stream of data that reflects autonomic nervous system responses, allowing the tracking of cumulative stress and emotional fluctuations over extended periods.
In research on and applications of children’s emotion management, physiological signals are regarded as key indicators of emotion, alongside facial expression and speech analysis. These signals provide valuable emotional information because they are closely linked to the autonomic nervous system’s responses. By measuring physiological signals such as electrocardiograms (ECGs), electroencephalograms (EEGs), galvanic skin responses (GSRs), and respiration rates, researchers can indirectly assess children’s emotional states.24,25,45,48,78 For example, when children feel anxious, stressed, or tense, their heart rate often increases significantly, and their GSR intensifies, providing clear indicators of emotional fluctuations. Traditionally, measuring these physiological signals required professional clinical equipment and settings. However, with rapid technological advancements, particularly the emergence of wearable devices and AI technologies, the monitoring and analysis of these data are no longer limited to clinical environments and can now be integrated into everyday life.
AI technologies have shown significant potential in analyzing physiological signals to assess patterns associated with psychological disorders in children, such as ASD, ADHD, anxiety, and depression. Physiological signals, including heart rate variability (HRV), GSR, EEG, ECG, and respiration rates, reflect children’s emotional and psychological states and may reveal potential mental health issues.45,48–50,78 AI systems leverage various techniques to process these data and identify emotional and psychological abnormalities in children. Traditional machine learning algorithms, such as SVMs 7 and k-nearest neighbors (KNNs), 107 are often employed for the classification and analysis of physiological data, aiding in the identification of changes in emotional states, such as anxiety or depression. With the advancement of deep learning technologies, more sophisticated models, such as CNNs and RNNs, have been widely adopted to process temporal data from physiological signals. These models can automatically learn key features, with some studies reporting improved classification performance under specific conditions.51,108 Moreover, multimodal fusion techniques integrate data from various physiological signals, which may support a more comprehensive evaluation of children’s mental health.109,110 As AI systems become increasingly adaptive, they can construct personalized models on the basis of each child’s unique characteristics and long-term physiological responses. This capability may allow for more precise prediction of psychological disorders and more accurate emotion recognition, supporting early intervention and personalized treatment.
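As an example of the kind of hand-crafted features that classifiers such as SVMs or KNNs might receive, the sketch below computes two standard time-domain heart rate variability measures, SDNN and RMSSD, from a list of inter-beat (RR) intervals. The function names and the sample intervals are assumptions for illustration; artifact removal and beat detection are assumed to have been done beforehand.

```python
import math


def sdnn(rr_ms):
    """Sample standard deviation of RR intervals (ms)."""
    mean = sum(rr_ms) / len(rr_ms)
    return math.sqrt(sum((x - mean) ** 2 for x in rr_ms) / (len(rr_ms) - 1))


def rmssd(rr_ms):
    """Root mean square of successive RR-interval differences (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_heart_rate_bpm(rr_ms):
    """Average heart rate implied by the mean RR interval."""
    return 60000.0 / (sum(rr_ms) / len(rr_ms))


# Hypothetical RR intervals in milliseconds.
rr = [800, 810, 790, 800]
features = {
    "sdnn": sdnn(rr),
    "rmssd": rmssd(rr),
    "hr_bpm": mean_heart_rate_bpm(rr),
}
```

Both measures require at least two intervals. In practice they would be computed over fixed windows so that each window yields one feature vector for classification.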
The integration of AI technologies with wearable devices has led to revolutionary advancements in children’s emotion monitoring. These smart devices can continuously track children’s physiological signals and transmit data in real time to cloud-based AI systems for analysis. When irregularities such as significantly elevated heart rates or accelerated breathing patterns that persist over a specific period are identified, the AI system automatically generates reports and sends alerts to caregivers, enabling them to promptly recognize and address potential anxiety or stress. 78 Furthermore, AI systems can analyze long-term data to predict future emotional fluctuations, allowing for the proactive development of intervention plans to effectively mitigate the negative impacts of emotional issues. 46 These technologies not only increase the accuracy of emotion monitoring but also provide valuable data support for long-term emotional management. In summary, monitoring physiological signals is best understood as a longitudinal tracking modality, providing continuous objective data streams that reflect children’s internal emotional states over extended periods in ways that overt behavioral measures cannot. While its integration with wearable devices and AI-driven analysis shows considerable promise, its broader clinical deployment requires larger pediatric-specific datasets, standardized signal processing pipelines, and prospective validation in ecological settings such as classrooms and homes.
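The alerting behavior described above, in which an elevated heart rate must persist over a specific period before caregivers are notified, can be sketched as a simple run-length check. The threshold ratio, the minimum run length, and the function name are hypothetical parameters chosen for illustration; real systems would need individually calibrated baselines and clinically justified thresholds.

```python
def find_sustained_elevation(samples_bpm, baseline_bpm, ratio=1.2, min_run=3):
    """Return (start, end) index pairs where heart rate stays above
    ratio * baseline_bpm for at least min_run consecutive samples."""
    threshold = baseline_bpm * ratio
    episodes = []
    start = None
    for i, bpm in enumerate(samples_bpm):
        if bpm > threshold:
            if start is None:
                start = i  # a new elevated run begins
        else:
            if start is not None and i - start >= min_run:
                episodes.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the recording.
    if start is not None and len(samples_bpm) - start >= min_run:
        episodes.append((start, len(samples_bpm) - 1))
    return episodes
```

Requiring a minimum run length suppresses alerts for brief spikes, for example from physical activity, at the cost of a short detection delay.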
3. Applications of AI systems in identifying children's attention
3.1. Eye-tracking technology for children’s attention recognition
Eye-tracking technology serves as the primary data acquisition layer for objective attention measurement. This section describes how AI systems transform continuous eye movement streams into quantifiable indicators of children’s visual attention and concentration by capturing raw ocular signals, including fixation points, gaze duration, and blink frequency, through hardware-based methods such as pupil center-corneal reflection (PCCR).
Eye-tracking technology is an essential tool for studying children’s visual attention, and its integration with AI technology enables more precise data analysis. 20 Currently, eye-tracking technology relies primarily on the pupil center-corneal reflection (PCCR) method, an optical technique that uses cameras to capture images of the eyes and determines the point of gaze by analyzing corneal and pupil reflections. 20 Through deep learning analysis of these eye movement data, AI systems can accurately track children’s eye movement trajectories, providing insights into their attention focus.52,55 Additionally, methods such as random forests,8,53 CNNs, 56 and VGG-Net 111 are widely applied in processing eye-tracking images. By analyzing the position and dynamics of the eyes, 112 AI systems can precisely predict children’s gaze and focus when they view different images or scenes, aiding researchers in better understanding children’s visual attention patterns.
In eye-tracking image processing, random forest is a commonly used machine learning method. By aggregating decisions from multiple decision trees, this technique effectively handles multidimensional features from eye-tracking data, such as eye position, movement speed, and changes in gaze points. The random forest model can predict children’s gaze focus when they view different scenes and assist researchers in analyzing their visual preferences and attention patterns, providing a stable approach for attention analysis. CNNs and VGG-Net, a deeper CNN architecture, play critical roles in the deep analysis of eye-tracking images. CNNs are typically used to extract spatial features from images, such as eye movement direction and gaze point location. VGG-Net, a well-established deep learning architecture, can be applied to explore detailed features in high-resolution images. Through these models, AI systems can predict children’s visual focus on different stimuli and analyze their attention distribution patterns, offering a more in-depth evaluation of attention. By integrating deep learning models (e.g., CNN and VGG-Net), machine learning methods (e.g., random forest), high-precision hardware devices, and advanced algorithms, AI technology has significantly enhanced the application of eye tracking in assessing attention and social skills.
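The aggregation step of a random forest can be illustrated with a minimal majority vote over a few hand-written decision stumps. This sketch shows only how tree decisions are combined; an actual random forest would learn each tree from bootstrap samples with random feature subsets. The feature names, thresholds, and labels below are hypothetical.

```python
from collections import Counter


def make_stump(feature, threshold, label_if_above, label_if_below):
    """Build a one-split decision tree over a single named feature."""
    def stump(sample):
        return label_if_above if sample[feature] > threshold else label_if_below
    return stump


def forest_predict(trees, sample):
    """Majority vote across trees; returns (label, share of votes)."""
    votes = Counter(tree(sample) for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)


# Hypothetical stumps over eye-tracking features.
EXAMPLE_TREES = [
    make_stump("fixation_ms", 300, "attentive", "distracted"),
    make_stump("gaze_shifts_per_s", 2.0, "distracted", "attentive"),
    make_stump("blink_rate_hz", 0.5, "distracted", "attentive"),
]
```

The vote share returned alongside the label gives a crude confidence estimate, which can be useful when a downstream decision should be deferred for borderline cases.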
Eye-tracking technology measures children’s attention levels by capturing and analyzing their eye movements. High-precision cameras record eye movement trajectories, whereas specialized algorithms process patterns such as the gaze duration, blink frequency, and changes in fixation points. 113 This technology provides an objective and accurate method for identifying and managing children’s attention, aiding in improving their learning efficiency. 113 Moreover, the integration of AI and eye-tracking technologies has extensive applications in analyzing children’s social interaction abilities, particularly in the early detection of ASD.53,114 By monitoring children’s visual focus, gaze duration, and eye movements, these technologies can identify abnormal behaviors in social interactions.115,116 This is significant for detecting potential issues in children’s social skills or identifying atypical visual attention patterns.
Specifically, AI technology can be applied to analyze eye-tracking data, revealing that children with ASD often pay less attention to facial features, such as the eyes and mouth, during interactions—one of the common traits of ASD.24,45,109,117 Additionally, this technology can be applied to examine children’s eye movement responses to visual stimuli and detect abnormal visual processing patterns, which may also serve as potential indicators of ASD. 118 These technologies provide powerful tools for early ASD diagnosis, assisting medical professionals in implementing timely interventions. In addition to ASD, eye-tracking technology has been applied to detect ADHD.44,56 By analyzing eye movement patterns, researchers can identify issues such as inattention or visual distraction, which are typical symptoms of ADHD. 21 Furthermore, some studies have integrated facial expressions, 3D body posture, and other information to conduct more in-depth analyses of children’s attention fluctuations during specific visual tasks. This approach provides more detailed data regarding their attention problems.57–59,119,120 The combined application of these technologies has significantly enhanced the ability to detect and analyze attention-related issues in children.
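Metrics such as gaze duration and changes in fixation points first require that raw gaze samples be grouped into fixations. One widely used approach, which is not necessarily the one adopted in the studies cited here, is dispersion-threshold identification (I-DT). The sketch below is a minimal version; the parameter names, units, and values are assumptions.

```python
def dispersion(points):
    """Horizontal plus vertical extent of a set of (x, y) gaze samples."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(gaze, sample_rate_hz, max_dispersion, min_duration_s):
    """Group consecutive gaze samples into fixations using I-DT."""
    min_len = max(1, int(round(min_duration_s * sample_rate_hz)))
    fixations = []
    i, n = 0, len(gaze)
    while i + min_len <= n:
        j = i + min_len
        if dispersion(gaze[i:j]) <= max_dispersion:
            # Grow the window while the samples stay tightly clustered.
            while j < n and dispersion(gaze[i:j + 1]) <= max_dispersion:
                j += 1
            window = gaze[i:j]
            fixations.append({
                "start": i,
                "end": j - 1,
                "duration_s": len(window) / sample_rate_hz,
                "centroid": (
                    sum(p[0] for p in window) / len(window),
                    sum(p[1] for p in window) / len(window),
                ),
            })
            i = j
        else:
            i += 1  # slide past a sample that belongs to a saccade
    return fixations
```

From the returned list, total gaze duration on a region of interest or the number of fixation changes per minute can be derived by simple aggregation.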
As the foundational data acquisition layer, eye-tracking technology provides the objective ocular signal streams upon which all subsequent gaze analyses depend; its current maturity in pediatric research is promising, though consistent accuracy across naturalistic settings with younger or less cooperative children remains an ongoing technical challenge. Once the eye-tracking data are accurately captured through these technological modalities, the subsequent challenge lies in interpreting these raw ocular movements through a gaze analysis to understand a child’s specific attentional focus.
3.2. Behavioral analysis techniques for monitoring children’s attention
This section examines how AI-driven behavioral analysis techniques leverage computer vision and deep learning to quantify children’s attention through observable non-verbal cues, including body movements, postural dynamics, and fine-grained gestural patterns. By translating these behavioral signals into objective metrics, AI systems may support the monitoring of attentional states across both general educational and clinical contexts.
The application of AI technologies in behavioral analyses is becoming increasingly sophisticated, transitioning from simple movement detection to a high-dimensional behavioral understanding. These technologies primarily leverage computer vision and deep learning to capture a wide array of nonverbal cues, including body movements, postural transitions, facial expressions, and micro-gestures. 118 Central to this process are behavior recognition models that analyze both spatial features, such as body orientation and joint positions, and temporal dynamics, such as the duration and frequency of specific movements, to infer underlying behavioral states. 62 Traditional machine learning algorithms, such as random forests 8 and SVMs, 7 continue to play a vital role by processing hand-crafted features extracted from video streams, such as movement velocity and gaze direction, to classify fundamental behavioral patterns. 121 Furthermore, the integration of advanced deep learning architectures has significantly expanded the analytical scope. CNNs are employed to extract complex spatial hierarchies from image frames, effectively identifying patterns in posture and movement. In parallel, RNNs, particularly LSTM networks, are adept at capturing the sequential dependencies of behavior over time, allowing for the dynamic and continuous monitoring of children during various activities.122,123
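The temporal dynamics mentioned above, namely the duration and frequency of specific movements, can be derived from per-frame behavior labels once a vision model has produced them. The sketch below assumes such labels already exist; the label names, frame rate, and function name are hypothetical.

```python
from itertools import groupby


def summarize_behaviors(frame_labels, fps):
    """Summarize per-frame labels into episode counts and durations (seconds)."""
    summary = {}
    for label, run in groupby(frame_labels):
        seconds = sum(1 for _ in run) / fps  # length of this uninterrupted run
        entry = summary.setdefault(
            label, {"episodes": 0, "total_s": 0.0, "longest_s": 0.0}
        )
        entry["episodes"] += 1
        entry["total_s"] += seconds
        entry["longest_s"] = max(entry["longest_s"], seconds)
    return summary


# Hypothetical classifier output at 2 frames per second.
labels = ["still"] * 4 + ["fidget"] * 2 + ["still"] * 2 + ["fidget"] * 4
stats = summarize_behaviors(labels, fps=2)
```

Summaries of this kind are one possible form of the hand-crafted features that traditional classifiers consume, whereas LSTM networks operate on the frame sequence directly.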
Modern AI systems facilitate comprehensive analyses by integrating multimodal data from various sensors to achieve higher precision. By synthesizing facial micro-expressions with postural stability, speech patterns, and even physiological signals, these approaches provide a holistic assessment of a child’s emotional and behavioral trajectory.54,82 For instance, some advanced frameworks combine a behavioral analysis with eye-tracking technologies, such as electrooculography (EOG), to conduct more detailed analyses of changes in attention. These systems analyze specific eye-tracking metrics, such as gaze movement paths, fixation duration, and blink frequency, to assess visual attention states in real time. This multimodal integration allows behavior recognition models to move beyond binary “attentive vs. inattentive” classifications to a more nuanced quantification. These models can distinguish whether a child is genuinely focused, immersed in play, experiencing cognitive overload, or suffering from physical fatigue.62,124 This high-precision analysis transforms subtle physical changes into objective, quantitative behavioral indicators, providing a more profound understanding of a child’s mental and cognitive state.60,117
In practical educational environments, these quantitative metrics are invaluable for creating a responsive feedback system to improve learning efficiency. An AI-driven behavioral analysis enables the automated detection of subtle shifts in concentration during classroom activities or social interactions. When the system identifies a decline in a student’s level of focus, as indicated by increased head-tilting, decreased gaze fixity, or unintentional unrelated hand movements, it can provide timely feedback to educators.113,124 This data-driven insight may inform individualized strategies, such as adjusting instructional methods or task difficulty according to the child’s context. Moreover, these AI-driven methods are noninvasive and capable of long-term monitoring, providing objective data that complement traditional subjective observations. This continuous tracking helps in understanding how a child’s attention fluctuates throughout the day, enabling a more personalized approach to education and mental health support.63,125,126
While primarily focused on general attention monitoring, these behavioral analysis techniques also provide critical insights in clinical and neurodevelopmental contexts. For children with developmental disorders such as ASD or ADHD, AI systems can identify specific atypical behavioral patterns that may be difficult for human observers to quantify consistently. For example, individuals with ADHD may display frequent physical restlessness and difficulty maintaining visual attention.62,127 AI technologies can detect these common symptoms by monitoring attention fluctuations over extended periods, thereby improving the diagnostic accuracy of clinical assessments.48,63,121,128 By identifying differences in facial interactions and behavioral expressions, these systems assist medical professionals in designing personalized intervention plans and early treatment strategies.58,117 In this capacity, AI serves as an auxiliary tool that complements professional judgment, ensuring that the early detection of and interventions for psychological disorders are grounded in objective behavioral data.60,124
In summary, the integration of traditional machine learning, deep learning models, and multimodal data enables automated, high-precision monitoring of children’s behavior and attention. These technologies not only accurately identify children’s attention levels but also detect abnormal behavior patterns, demonstrating significant practical value in both general education and specialized clinical support. By fostering a better understanding of children’s behavioral performance, these AI-driven approaches promote overall mental health development and improve learning outcomes. Collectively, a behavioral analysis currently functions most reliably as a quantitative monitoring instrument in structured educational and clinical settings, with its translation to uncontrolled naturalistic environments remaining the primary barrier to scalable deployment.
3.3. Gaze direction-based applications in children’s visual attention recognition
Gaze direction estimation operates at the computational inference layer, building upon the raw ocular data captured by eye-tracking hardware. Rather than recording where the eye physically moves, the techniques described in this section reconstruct where a child is directing their attention in three-dimensional space by integrating head pose, pupil geometry, and multimodal sensor fusion, thereby enabling higher-precision assessments of attentional allocation and social engagement patterns.
Head pose-based gaze estimation techniques are applied to analyze the position and angle of a child’s head and infer their gaze direction. AI systems capture head movements through cameras and combine these data with eye-tracking information to accurately determine gaze points.20,123 Additionally, pupil localization techniques estimate a child’s gaze focus by precisely identifying the pupil’s position and analyzing its movement direction. This approach is often integrated with CNNs or other deep learning algorithms, such as LSTM networks, to achieve more accurate gaze direction estimation.54,59,128
In higher-precision applications, 3D gaze estimation technology integrates AI methods with 3D modeling to calculate children’s gaze points by analyzing the three-dimensional spatial relationships between the head and eyes. This approach enables AI systems to more accurately predict gaze direction in three-dimensional environments. 129 Moreover, multimodal gaze analysis technology combines eye-tracking data, head posture, and facial expressions to comprehensively determine children’s gaze direction. 74 By integrating data from multiple sensors, this approach provides more accurate visual attention analysis in various environments. 128 The use of AI technologies for gaze direction detection has significant potential in the identification of psychological disorders. These technologies utilize high-resolution cameras and gaze-tracking algorithms to precisely capture and analyze individuals’ gaze directions, allowing for assessments of their psychological state. Abnormal changes in gaze behavior, such as difficulty maintaining focus, frequent shifts in gaze, or prolonged fixation on irrelevant objects, may be associated with certain psychological conditions, including ADHD52,53,56,62,123 or ASD.48,55,61,64
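The geometric core of 3D gaze estimation can be sketched as a ray-plane intersection: a gaze direction is built from yaw and pitch angles and projected from the eye position onto a screen plane. The coordinate convention (screen at z = 0, viewer at positive z), the function names, and the example distances are assumptions made for illustration; estimating the angles themselves is the task of the learned models described above.

```python
import math


def gaze_direction(yaw_rad, pitch_rad):
    """Unit gaze vector; zero yaw and pitch point straight at the screen (-z).
    Positive yaw turns right (+x), positive pitch turns up (+y)."""
    return (
        math.sin(yaw_rad) * math.cos(pitch_rad),
        math.sin(pitch_rad),
        -math.cos(yaw_rad) * math.cos(pitch_rad),
    )


def gaze_point_on_screen(eye_pos, yaw_rad, pitch_rad, eps=1e-9):
    """Intersect the gaze ray with the plane z = 0.
    Returns (x, y) in the same units as eye_pos, or None if the gaze
    does not point toward the screen."""
    dx, dy, dz = gaze_direction(yaw_rad, pitch_rad)
    if dz > -eps:
        return None  # looking parallel to or away from the screen
    t = -eye_pos[2] / dz
    return (eye_pos[0] + t * dx, eye_pos[1] + t * dy)


# Hypothetical viewer 600 mm in front of the screen origin.
point = gaze_point_on_screen((0.0, 0.0, 600.0), math.atan(0.5), 0.0)
```

Small angular errors grow with viewing distance in this projection, which is one reason why accuracy in unconstrained settings is harder to maintain than in a fixed laboratory setup.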
In clinical settings, AI technologies can assist professionals in tracking patients’ gaze patterns and comparing them with symptoms of specific psychological conditions.52,130 For example, in social interaction scenarios, if a child is expected to maintain eye contact with their conversation partner but the system detects persistent gaze aversion or difficulty sustaining eye contact, this may indicate early signs of ASD. Similarly, for individuals with anxiety disorders, frequent changes in gaze direction or abnormal gaze patterns may prompt professionals to conduct additional psychological evaluations. This technology provides a noninvasive and continuous monitoring method that can identify potential psychological issues at an early stage and support professionals in making diagnoses. By automatically analyzing gaze data, AI systems can deliver critical behavioral indicators in real time, facilitating early intervention and treatment of psychological disorders. These technologies not only offer powerful tools for mental health management but also help parents and educators better understand children’s emotions and behavioral responses, enabling personalized support and guidance.
Despite these advances, the translation of gaze direction estimates from controlled laboratory settings to real-world pediatric environments remains a critical challenge. In unstructured settings such as classrooms or therapy rooms, the accuracy of the estimate decreases substantially due to environmental variability, including inconsistent illumination, the occlusion of facial landmarks, and the wide range of spontaneous head movements characteristic of young children. 65 Children with ASD or ADHD are particularly prone to rapid, unpredictable head rotations and reduced cooperation during calibration procedures, which introduces systematic noise into both head-pose-based and pupil-localization-based estimation pipelines. 56 Furthermore, most existing gaze estimation models have been trained on adult datasets or highly constrained child-specific paradigms, limiting their generalizability to the naturalistic, dynamic interaction scenarios most relevant to clinical and educational assessment. 66
Addressing these limitations requires more ecologically valid training datasets that reflect the behavioral diversity of pediatric populations and the development of robust, calibration-free estimation frameworks suitable for deployment outside of laboratory contexts. As gaze direction estimation matures from a precision instrument in controlled research to a scalable tool in applied settings, its capacity to serve as a reliable computational inference layer—translating raw ocular data into clinically actionable indicators of attentional allocation—will ultimately determine its utility in supporting the early identification of and interventions for children with developmental disorders.56,65 Recent advances in self-supervised learning and cross-domain adaptation may offer promising pathways to mitigate the adult-to-pediatric domain gap, enabling pretraining on large-scale adult datasets followed by fine-tuning on limited pediatric samples. These approaches, alongside federated learning frameworks for privacy-preserving multi-site data aggregation, represent active research directions for advancing the ecological validity of pediatric gaze-based assessments.
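The aggregation step at the center of many federated learning schemes can be sketched as a sample-weighted average of model parameters contributed by each site, often called federated averaging. The flat list representation of weights and the function name are simplifying assumptions; the sketch omits local training, secure aggregation, and communication.

```python
def federated_average(client_updates):
    """Average parameter vectors weighted by each site's sample count.

    client_updates: list of (weights, n_samples) pairs, where weights is a
    list of floats of equal length across sites.
    """
    total = sum(n for _, n in client_updates)
    if total <= 0:
        raise ValueError("at least one site must contribute samples")
    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for weights, n in client_updates:
        if len(weights) != dim:
            raise ValueError("all sites must share the same model shape")
        for k, w in enumerate(weights):
            averaged[k] += w * n / total
    return averaged


# Hypothetical updates from two sites with 10 and 30 recordings.
global_weights = federated_average([([1.0, 2.0], 10), ([3.0, 6.0], 30)])
```

Only parameters leave each site in this scheme, so raw recordings of children remain local, although parameter sharing alone does not guarantee privacy.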
4. Challenges and future developments of AI technology
4.1. Challenges in emotion and attention recognition for children
Although AI technology has demonstrated significant potential in recognizing children’s emotions and attention, it still faces several critical challenges. The foremost issue is data privacy and security. AI systems require vast amounts of personal data for learning and analysis, including highly sensitive information such as children’s facial expressions, speech, behavioral patterns, and physiological signals. 131 Ensuring the functionality of these systems while maximizing the protection of these data poses a significant challenge. Unauthorized access to or leakage of such data could have severe consequences for children and their families. To address this problem, stringent data protection measures must be implemented, including data encryption, access control, and anonymization, to ensure data security and privacy.
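One of the measures listed above can be illustrated with a keyed hash that replaces a child’s identifier with a stable pseudonym before data are stored or shared. Strictly speaking this is pseudonymization rather than full anonymization, because anyone holding the key can re-link records. The record fields, the key handling, and the function names are hypothetical; the sketch uses only Python’s standard hmac and hashlib modules.

```python
import hashlib
import hmac


def pseudonymize(child_id, secret_key):
    """Return a stable, keyed SHA-256 pseudonym for an identifier."""
    return hmac.new(secret_key, child_id.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identifiers(record, secret_key,
                      direct_identifiers=("name", "date_of_birth")):
    """Drop direct identifiers and replace child_id with a pseudonym."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in direct_identifiers and key != "child_id"
    }
    cleaned["pseudonym"] = pseudonymize(record["child_id"], secret_key)
    return cleaned


# Hypothetical record; in practice the key would come from a secure key store.
record = {"child_id": "child-001", "name": "A", "date_of_birth": "2017-01-01",
          "heart_rate": 92}
safe = strip_identifiers(record, b"example-key")
```

Using a keyed hash rather than a plain hash makes it harder to recover identifiers by guessing, provided the key is protected by the access controls discussed above.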
Ethical concerns represent another significant challenge. The application of AI technology in recognizing children’s emotions and attention must be approached with caution to avoid the risks of excessive surveillance or data misuse. 132 Overmonitoring may impose stress on children, affect their sense of autonomy and privacy, and even distort their natural behavior. To mitigate these issues, the design and implementation of AI systems must carefully consider ethical implications, ensuring that the primary goal is to promote children’s healthy development rather than excessive control or commercial gain. 16
Accuracy is another critical challenge that needs to be addressed. The effectiveness and reliability of AI systems in emotion and attention recognition depend directly on their accuracy. If the error rate of an AI system is too high, it could lead to incorrect judgments and interventions, potentially harming children’s learning and mental health.131,133 Therefore, continuously improving the accuracy of AI technologies, reducing error rates, and ensuring stable performance across various contexts are key objectives for developers.
Table 3 presents a summary of the comparative characteristics of different AI models used in children’s emotion and attention recognition, including their advantages, disadvantages, applicable data types, computational complexity, and typical application scenarios. Comparative studies evaluating various classification models have been increasingly reported in the recent literature,25,62,79–81 while other works have focused on benchmarking performance across datasets and feature representations.33,82–85 Owing to their low computational complexity and stability, SVMs are particularly suitable for small datasets and simple signal processing, such as HRV analysis and emotional text classification. These models are commonly applied in early emotion screening (e.g., binary classification of happiness and sadness) and ASD screening on the basis of EEG patterns. However, they are limited in handling high-dimensional and nonlinear data. CNN models excel at processing spatial data (e.g., images and videos), and their deep structures can effectively extract hierarchical visual features. Thus, these models are widely applied in facial expression analysis, eye tracking, and dynamic video emotion studies. LSTM networks are designed for time series data and capture temporal dependencies, which makes them well suited to analyzing speech and physiological signals.
GAN models have unique advantages in data generation, particularly in addressing data imbalance issues, making them valuable in creating diverse and balanced datasets. Multimodal fusion models, which focus on integrating multiple types of signal data, effectively combine facial expressions, speech, and physiological signals to provide comprehensive solutions for emotion and attention recognition.
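As a minimal sketch of how multimodal fusion can combine modality-specific outputs, the following Python example performs decision-level (late) fusion by taking a weighted average of per-modality class probabilities. It is an illustration under stated assumptions, not the method of any cited study: the function name `late_fusion`, the modality names, the class labels, and all probabilities and weights are hypothetical.

```python
from typing import Dict


def late_fusion(modality_probs: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted average of per-modality class probabilities (late fusion).

    Only the modalities present in `modality_probs` are used, and their
    weights are renormalized so that a missing modality does not bias the result.
    """
    total_weight = sum(weights[m] for m in modality_probs)
    if total_weight <= 0:
        raise ValueError("weights of the available modalities must be positive")
    fused: Dict[str, float] = {}
    for modality, probs in modality_probs.items():
        share = weights[modality] / total_weight
        for label, p in probs.items():
            fused[label] = fused.get(label, 0.0) + share * p
    return fused


# Hypothetical outputs of three separately trained classifiers.
example_probs = {
    "face": {"engaged": 0.70, "distracted": 0.30},
    "speech": {"engaged": 0.40, "distracted": 0.60},
    "physiology": {"engaged": 0.60, "distracted": 0.40},
}
# Hypothetical modality weights (e.g., chosen from validation performance).
example_weights = {"face": 0.5, "speech": 0.2, "physiology": 0.3}

fused = late_fusion(example_probs, example_weights)
predicted = max(fused, key=fused.get)
print(fused, predicted)
```

Late fusion is only one option; feature-level (early) fusion concatenates modality features before classification and may capture cross-modal interactions that a weighted average cannot.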
4.2. Future development and prospects of AI technology in mental health identification
In the future, AI technology may play an increasing role in recognizing children’s emotions and attention. Its deeper integration with other cutting-edge technologies is anticipated to enhance its functionality and applicability, and multidisciplinary integration is likely to become a key development trend. 3
New deep learning algorithms may have exploratory relevance for detecting children’s emotions and attention levels and supporting screening for potential psychological disorders, although such applications require rigorous validation in pediatric populations. Generative AI and multimodal learning models are among the technologies contributing to these developments. Models such as GPT-4 134 and Llama 2 135 have shown general multimodal and language-processing capabilities in non-pediatric or task-specific contexts. To date, the pediatric-specific validation of these general-purpose large language models remains limited; rigorous clinical evaluation in pediatric emotion and attention assessments is still forthcoming, and their direct deployment in child-focused clinical workflows should be regarded as exploratory. In exploratory ADHD-related research, LLM-integrated robotic platforms may generate interaction-based behavioral signals (e.g., attention patterns and emotional responses), although such approaches remain hypothetical for deployment in pediatric populations. 136 Similarly, graph neural networks (GNNs) and wearable-integrated deep learning systems have been proposed for modeling behavioral and physiological data, but their application in pediatric psychological assessments remains largely unvalidated.
AI recognition technology can be integrated with virtual reality (VR) and augmented reality (AR) technologies.62,113,137 This integration has been associated not only with enhanced learning experiences but also with potential improvements in user engagement, although the evidence for therapeutic outcomes remains limited. For example, AI systems may be used to estimate users’ learning progress, attention levels, or emotional states, allowing the difficulty or content of virtual courses to be adjusted accordingly. AR technology can make abstract learning materials more concrete, for example by visualizing mathematical concepts or simulating historical scenes, to create immersive experiences for students. In language learning, the combination of AI and AR technologies can create realistic conversational environments, allowing users to practice dialogs in simulated everyday scenarios, which may contribute to improved learning engagement. 138
Similarly, in psychotherapy, these technologies have been explored for applications in mental health, rehabilitation, and behavioral therapy. These approaches show potential in addressing emotional conditions such as anxiety, stress, and depression, although the evidence remains heterogeneous. For example, AI systems may be used to monitor physiological responses (e.g., HRV, eye movements, or GSR) during immersive therapy sessions, enabling adaptive adjustments to intervention content based on user states. AR/VR technologies can create simulated environments that allow users to engage in structured interaction scenarios. A previous study has demonstrated the use of VR systems to support social interaction training in children with autism. 139 More broadly, these environments may be extended to other situations, although these applications remain less well established. In addition, VR-based interventions have been associated with potential improvements in emotional well-being in populations with anxiety and depression, although the current findings are preliminary and vary across study designs. 140
By analyzing patients’ emotional expressions and linguistic features, AI systems may support therapists in developing more tailored psychological intervention strategies. This technological integration opens new possibilities and may foster innovation across domains such as education and healthcare. As the accuracy of AI-based recognition improves alongside advancements in VR/AR technologies, the potential applications of these systems may continue to expand.
In addition, the establishment of policies and regulations will play a critical role in the future application of AI technologies. As these technologies become widely used in recognizing children’s emotions and attention, it is essential to develop relevant legal and policy frameworks to ensure their safe and lawful use. These policies should include regulations for data privacy protection, ethical guidelines for technology usage, and requirements for accuracy and reliability. Policymakers must work closely with technology developers, educators, and psychology experts to ensure that the application of AI technologies promotes advancements in children’s education and mental health while safeguarding their rights and development from potential adverse impacts.
The application of the Internet of Things (IoT) in recognizing children’s emotions and attention is emerging as a driver of more intelligent and collaborative monitoring systems. With the continuous advancement of IoT devices and the integration of multimodal sensing technologies, future systems may be able to capture children’s physiological, behavioral, and environmental data, such as HRV, GSR, speech features, and EEG signals, 141 more comprehensively, providing a richer data foundation for the dynamic monitoring of emotions and attention.
Future IoT systems may leverage advancements in edge computing and AI technologies to support real-time feedback, although their implementation in pediatric contexts remains an open research challenge. The technical feasibility of such architectures has been demonstrated in diverse IoT applications outside the healthcare domain that require near-real-time response, as illustrated in previous studies of safety monitoring and environmental sensing systems.142,143 The synergy among IoT sensors is expected to evolve from simple unidirectional data transmission to more interactive networks capable of adaptively adjusting parameters based on situational needs. For example, when emotional fluctuations are detected, these systems may generate context-aware feedback, such as guiding a child through simple breathing exercises or adjusting ambient lighting to enhance focus.
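The context-aware feedback described above can be illustrated with a simple rule-based sketch in Python. It flags a reading that deviates strongly from a short rolling baseline and returns a feedback suggestion. Everything here is an assumption made for illustration: the class name `FluctuationMonitor`, the window length, the z-score threshold, the sample values, and the `"breathing_exercise"` label are hypothetical, and a deployed system would require validated thresholds and clinical oversight.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, Optional


class FluctuationMonitor:
    """Flags readings that deviate strongly from a rolling baseline."""

    def __init__(self, window: int = 5, z_threshold: float = 2.0) -> None:
        self.history: Deque[float] = deque(maxlen=window)
        self.z_threshold = z_threshold

    def update(self, value: float) -> Optional[str]:
        """Add a reading; return a feedback suggestion or None."""
        suggestion = None
        # Only evaluate once the baseline window is full.
        if len(self.history) == self.history.maxlen:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            if spread > 0 and abs(value - baseline) / spread > self.z_threshold:
                suggestion = "breathing_exercise"  # hypothetical feedback label
        self.history.append(value)
        return suggestion


# Hypothetical stream of a physiological index (e.g., an HRV-derived value).
monitor = FluctuationMonitor(window=5, z_threshold=2.0)
readings = [50.0, 52.0, 51.0, 49.0, 50.0, 30.0]
suggestions = [monitor.update(r) for r in readings]
print(suggestions)
```

Running such a rule on the device itself, rather than on a remote server, is one way an edge-computing design could keep feedback latency low and limit how much raw data leaves the sensor.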
Furthermore, with the widespread adoption of 5G and future communication technologies, the connectivity of IoT devices is expected to improve, reducing data transmission latency and supporting closer to real-time processing. Moreover, IoT-based data platforms may integrate more efficient encryption technologies to protect data privacy and security, which could increase parents’ and educational institutions’ confidence in these systems. New IoT sensors, such as wearable devices capable of directly monitoring brainwave activity, may also provide more diverse data sources for emotion and attention recognition. As these technologies continue to evolve, the integration of the IoT and emotional AI technologies may extend beyond detection and feedback toward long-term data insights and predictive analytics, potentially contributing to technological support for children’s health and well-being.
In the future, the role of AI may evolve beyond its current function as an auxiliary “second opinion” tool toward a more integrated clinical support tool, particularly as the evidence and validation continue to develop. By leveraging multimodal feature fusion (Figure 1), the current systems provide quantitatively derived indicators that assist clinicians in identifying risks. Future developments may focus on integrating longitudinal behavioral tracking to help clinicians better understand a child’s progress over time. This data-driven approach aims to complement traditional screening methods, helping to bridge the gap in pediatric mental health resources.
5. Conclusions
In conclusion, artificial intelligence has shown promising potential as an auxiliary decision-support tool in the assessment of children’s emotions and attention. The primary value of AI modalities, ranging from facial and speech analysis to physiological sensing, lies in their ability to provide objective behavioral indicators and facilitate longitudinal monitoring. These technologies do not aim to replace clinical judgment but rather to augment it by offering quantifiable, reproducible metrics that may help mitigate certain limitations associated with the subjectivity of traditional rating scales and intermittent clinical observations.
Despite these advancements, several critical challenges hinder the widespread clinical adoption of AI systems. The field currently grapples with significant dataset heterogeneity and a lack of ecological validity, as many high-accuracy models are trained in controlled laboratory settings that do not generalize to real-world environments like homes or classrooms. Furthermore, concerns regarding cross-cultural generalizability, data privacy, and the absence of large-scale external validation remain substantial barriers that must be addressed to ensure the ethical and robust application of these tools.
In the future, the evolution of AI in pediatric care must shift from mere model scaling to a focus on multimodal integration and clinically interpretable outputs. Future research should prioritize real-world validation and the development of privacy-preserving deployment strategies, such as federated learning. Emerging integrations with technologies such as virtual reality (VR), augmented reality (AR), and Internet of Things (IoT) systems may further support adaptive and context-aware monitoring environments, although these approaches remain largely exploratory. Moreover, the emergence of interactive technologies, including large language models (LLMs) and social robotics, suggests a transition from passive monitoring toward active systems that may inform individualized intervention strategies in research or exploratory settings. Ultimately, establishing a standardized framework for evidence-based AI will likely be important for bridging the gap between technological innovation and sustainable clinical practice.
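To make the federated learning idea concrete, the following Python sketch shows only the server-side aggregation step of federated averaging (FedAvg): each site trains locally and shares model parameters, which are averaged in proportion to local sample counts, so raw recordings of children never leave the site. The function name `federated_average` and the example parameter values and sample counts are hypothetical; a real deployment would also need secure communication and, typically, additional privacy safeguards.

```python
from typing import List, Sequence


def federated_average(client_params: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local sample counts."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    n_params = len(client_params[0])
    if any(len(p) != n_params for p in client_params):
        raise ValueError("all clients must share the same parameter layout")

    averaged = [0.0] * n_params
    for params, size in zip(client_params, client_sizes):
        share = size / total  # this client's contribution to the global model
        for i, value in enumerate(params):
            averaged[i] += share * value
    return averaged


# Hypothetical parameters from two sites with 1 and 3 local samples.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_params)
```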
Footnotes
Acknowledgments
The authors thank the National Science and Technology Council of Taiwan and the National Health Research Institutes and collaborating institutions for their support and contributions to this study.
Author contributions
Conceptualization: Yi-Ling Fan and Lun-De Liao. Methodology: Yi-Ling Fan. Investigation: Yi-Ling Fan. Data curation: Yi-Ling Fan and Guan-Lin Wu. Writing – original draft preparation: Yi-Ling Fan and Ying-Ying Tsai. Writing – review and editing: all authors. Supervision: Ching-Han Hsu, Hui-Ju Chen, Fang-Rong Hsu, Hung-Yi Chiou, and Lun-De Liao. All authors have read and agreed to the published version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by the National Science and Technology Council of Taiwan under grant numbers 110-2221-E-400-003-MY3, 111-3114-8-400-001, 111-2314-B-075-006, 111-2221-E-035-015, and 111-2218-E-007-019; by the National Health Research Institutes of Taiwan under grant numbers NHRI-EX108-10829EI, NHRI-EX111-11111EI, and NHRI-EX111-11129EI; and by the Ministry of Economic Affairs of Taiwan under grant numbers MOHW 112-0324-01-30-06, MOHW 113-0324-01-30-06, and MOHW 113-0324-01-30-11.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
