Abstract
The increasing use of immersive virtual reality (VR) in English as a foreign language (EFL) education has generated growing interest in its potential to support learner engagement and skill development. However, empirical evidence explaining how engagement in VR-based learning relates to perceptual, motor, and cognitive skills remains limited. This study investigated the structural relationships among VR engagement, perceptual skills, motor skills, and cognitive skills in an EFL context. Data were collected from EFL learners who participated in VR-based English learning tasks and were analyzed using structural equation modeling with SPSS (v27) and AMOS (v24). The measurement model demonstrated acceptable reliability and validity for all constructs. The structural model revealed that VR engagement was a significant predictor of perceptual, motor, and cognitive skills, with the strongest effect observed for cognitive skills. These findings indicate that learner engagement plays a central role in shaping how VR-based activities are experienced and perceived in language learning contexts. The study contributes to immersive language learning research by offering a process-oriented account of VR use that emphasizes engagement as a key mechanism rather than treating technology as an isolated factor. Pedagogical implications highlight the importance of designing VR language tasks that foster active participation and sustained involvement.
Introduction
Digital technologies have become integral to second and foreign language education, reshaping how learners engage with linguistic input, interaction, and practice (Derakhshan et al., 2025, 2026; Derakhshan & Park, 2026; Lu et al., 2024). Among recent developments, virtual reality (VR) has attracted growing attention for its potential to provide contextualized language exposure and interactive environments that are difficult to achieve in conventional classrooms (Alfadil, 2020; Derakhshan et al., 2024; Parmaxi & Demetriou, 2020; Zhonggen, 2018). VR-based language learning environments often combine visual, auditory, and interactive elements, enabling learners to participate in simulated communicative situations that resemble real-world contexts (Hsu, 2017; Hung et al., 2018). Such features have been linked to increased learner engagement and motivation, particularly in English as a Foreign Language (EFL) contexts where authentic exposure is limited (Govender & Arnedo-Moreno, 2021; Yang et al., 2020).
Recent empirical studies have reported positive effects of VR-supported instruction on various language outcomes, including vocabulary acquisition, pronunciation, listening comprehension, and speaking confidence (Chang et al., 2020; Lin & Wang, 2021; Tai et al., 2022). Meta-analytic and review work has further suggested that immersive technologies may foster deeper involvement and sustained attention compared to screen-based learning environments (Alfadil, 2020; Koç et al., 2022). At the same time, scholars have cautioned against treating VR as a uniform intervention, noting wide variation in task design, interaction modes, and pedagogical integration across studies (Fokides & Zampouli, 2017; Parmaxi & Demetriou, 2020).
More recent research has shifted attention from general learning outcomes to learner-related processes, such as engagement, cognitive processing, and emotional responses in immersive environments (Barrett et al., 2023; Derakhshan et al., 2024; Schmidt et al., 2023). In EFL contexts, engagement has been identified as a key mechanism through which technology-mediated instruction influences learning, acting as a link between instructional features and learner outcomes (Alalwan et al., 2020; Moon et al., 2020). However, while engagement has been examined in mobile and online learning settings (J. C. Chen, 2016; Y.-L. Chen, 2016; Hsieh et al., 2022), its role in VR-based language learning remains under-theorized and unevenly operationalized.
Despite the growing body of VR research in language education, several conceptual and methodological gaps remain. First, much of the existing literature has focused on performance outcomes or learner perceptions without sufficiently unpacking the underlying skill domains involved in VR interaction (Lai & Chen, 2023; Lin et al., 2022). VR environments often require learners to process multimodal input, respond under time constraints, and interact through physical movements or gestures. These demands suggest that perceptual, motor, and cognitive skills may play a central role in shaping learners’ experiences and outcomes. Yet, these skill domains are rarely examined together in EFL VR studies, and when they are mentioned, they are often treated in a descriptive or speculative manner (Dhimolea et al., 2022; Fokides & Zampouli, 2017).
Second, recent studies have highlighted a tendency to label learning environments as “immersive” or “intelligent” without providing sufficient empirical detail about learner interaction with the technology (Keller et al., 2024; Schmidt et al., 2023; Weng et al., 2024). Several scholars have criticized the practice of attributing learning gains to advanced technologies while relying on self-report instruments that were not originally developed for such contexts (Chen et al., 2022; Xu et al., 2023). This issue has led to concerns about construct validity and the risk of overstating technological effects, particularly when perceptual and motor involvement is assumed rather than measured (Bendeck Soto et al., 2020; Govender & Arnedo-Moreno, 2021).
Third, although engagement has been widely acknowledged as a mediator in technology-enhanced learning, empirical models that simultaneously examine engagement and multiple skill domains in VR-based EFL contexts remain limited (Lai & Chen, 2023; Zhang et al., 2022). Emerging research has begun to explore how immersive environments influence cognitive load, attention, and real-time processing (Keller et al., 2024; Lin et al., 2022), yet few studies have adopted integrative analytical approaches, such as structural equation modeling, to test these relationships within a coherent framework. This limitation is particularly evident in large-scale EFL contexts, such as China, where VR adoption is increasing but empirical evidence remains fragmented (Li et al., 2025; Weng et al., 2024; Zhang et al., 2024).
Finally, recent discussions in applied linguistics have emphasized the need for theoretically grounded and methodologically transparent research on emerging technologies (Derakhshan et al., 2024; Yazdi & Ghanizadeh, 2024). Scholars have called for studies that move beyond novelty effects and examine how specific learner experiences in technology-mediated environments relate to well-defined psychological and skill-based constructs (Ghafouri et al., 2025; Saeedi & Najjarpour, 2025). Addressing these concerns requires careful operationalization of constructs and analytical models that can capture complex relationships among engagement and skill development.
In response to these gaps, the present study aims to examine the relationship between immersive VR-based language learning and EFL learners’ perceptual, motor, and cognitive skills, with a particular focus on the role of learner engagement. Drawing on prior work in VR-assisted language learning (Alfadil, 2020; Parmaxi & Demetriou, 2020; Tai et al., 2022) and engagement theory in technology-enhanced education (Alalwan et al., 2020; Moon et al., 2020), this study proposes an integrative model in which VR engagement is linked to multiple skill domains relevant to language learning in immersive environments.
Specifically, the study seeks to (a) examine EFL learners’ levels of engagement in VR-based language learning, (b) investigate their perceived perceptual, motor, and cognitive skills during VR tasks, and (c) test the structural relationships among these constructs using structural equation modeling. By focusing on learners’ reported experiences with concrete VR language tasks rather than abstract technological claims, the study aims to provide a more nuanced account of how VR-based learning relates to different dimensions of learner functioning.
Through its empirical and analytical approach, this study contributes to the growing literature on immersive technologies in EFL education by addressing calls for clearer construct definition, stronger methodological rigor, and integrative modeling (Keller et al., 2024; Schmidt et al., 2023; Zhang & Miao, 2025). The findings are expected to inform both researchers and practitioners about the conditions under which VR-based language learning may support learner engagement and skill development, while also highlighting areas where further theoretical and empirical refinement is needed.
Review
Virtual reality (VR) has increasingly been positioned as a promising tool in foreign language education, particularly in contexts where opportunities for authentic interaction are limited. Early work emphasized VR’s capacity to simulate communicative environments and provide contextualized input beyond textbook-based instruction (Hsu, 2017; Zhonggen, 2018). Subsequent studies extended this view by examining how immersive environments may support language practice through interaction, presence, and learner agency (Hung et al., 2018; Parmaxi & Demetriou, 2020).
In EFL research, VR has been associated with gains in vocabulary, pronunciation, listening comprehension, and learner motivation (Chang et al., 2020; Lin & Wang, 2021; Tai et al., 2022). Reviews and meta-analyses have generally reported positive trends, though they also highlight substantial variation in task design, duration, and outcome measures (Alfadil, 2020; Koç et al., 2022). More recent studies conducted in Asian EFL contexts, including China, suggest increasing institutional adoption of VR-supported instruction, alongside growing interest in learner-centered outcomes such as engagement and cognitive involvement (Li et al., 2025; Weng et al., 2024; Zhang et al., 2024). Despite this growth, the literature remains fragmented. Many studies focus on isolated outcomes or short-term interventions, with limited integration of theoretical perspectives on learning processes in immersive environments (Fokides & Zampouli, 2017; Govender & Arnedo-Moreno, 2021). This fragmentation has prompted calls for more systematic frameworks that link VR engagement to specific learner skills relevant to language learning.
Theoretical Background
Theoretical explanations of VR-based language learning often draw on constructivist and experiential learning perspectives, which emphasize learning through interaction, context, and active meaning-making (Hung et al., 2018; Parmaxi & Demetriou, 2020). From this perspective, VR environments may support language learning by situating linguistic input within meaningful scenarios that require perception, action, and decision-making. Engagement theory has also been widely applied in technology-enhanced language learning research. Engagement is typically conceptualized as a multidimensional construct involving behavioral, cognitive, and emotional components (J. C. Chen, 2016; Y.-L. Chen, 2016). In digital learning environments, engagement has been shown to mediate the relationship between instructional design and learning outcomes (Alalwan et al., 2020; Moon et al., 2020). Recent work has extended this framework to immersive settings, suggesting that presence and interaction may intensify engagement processes (Barrett et al., 2023; Schmidt et al., 2023).
From a cognitive perspective, VR-based tasks impose distinct processing demands. Learners must integrate visual and auditory input, respond under time constraints, and manage multiple sources of information. Cognitive load theory provides a useful lens for examining these demands, particularly in relation to working memory and real-time processing (Keller et al., 2024; Lin et al., 2022). At the same time, embodied cognition perspectives suggest that motor interaction and physical movement may support learning by linking language input to action (Bendeck Soto et al., 2020; Franco et al., 2025). Although these theoretical strands offer complementary insights, they are often applied in isolation. Few EFL studies explicitly integrate engagement theory with perceptual, motor, and cognitive accounts of learning in VR environments, leading to partial explanations of observed outcomes (Dhimolea et al., 2022; Lai & Chen, 2023).
To better align with perceptual–motor research traditions, it is useful to consider embodied cognition and sensorimotor learning theories. Embodied cognition posits that language understanding is grounded in sensory and motor systems, suggesting that physical interaction with virtual environments may enhance linguistic processing (Bendeck Soto et al., 2020; Franco et al., 2025). In this view, motor actions are not merely outputs but integral parts of the learning process. Sensorimotor learning further emphasizes that repeated practice of specific movements, such as hand gestures or eye movements during VR tasks, can refine neural pathways associated with language production and perception. This perspective helps explain how VR-based activities, which require precise motor coordination, might support the development of perceptual skills like pronunciation accuracy.
Additionally, the concept of perception–action coupling offers a framework for understanding how learners integrate sensory input with motor responses in immersive settings. Perception–action coupling suggests that perceiving an object or event directly prepares the motor system for action, facilitating quicker and more accurate responses (Barrett et al., 2023; Schmidt et al., 2023). In VR language learning, this coupling may occur when learners visually identify a target word and simultaneously execute a motor response, such as selecting an object or speaking a phrase. This tight link between perception and action may strengthen the association between linguistic forms and their meanings, potentially enhancing both cognitive processing and motor skills. By integrating these perceptual–motor perspectives with engagement and cognitive theories, the study provides a more comprehensive account of how VR-based learning influences EFL learners’ skills.
Empirical Studies
Empirical research on VR in EFL contexts has produced mixed but generally positive findings. Studies focusing on language performance have reported improvements in pronunciation accuracy, listening comprehension, and vocabulary retention following VR-supported instruction (Chang et al., 2020; Lin & Wang, 2021; Tai et al., 2022). These gains are often attributed to increased exposure and contextualized input, though causal mechanisms are not always clearly specified. Other studies have examined learner engagement and affective responses. Yang et al. (2020) and Lai and Chen (2023) found that immersive tasks increased learner involvement and willingness to participate, while Govender and Arnedo-Moreno (2021) emphasized the role of interaction design in sustaining engagement. In related work, Zhang et al. (2022) and Hsieh et al. (2022) demonstrated that engagement in technology-mediated environments was associated with deeper cognitive processing.
More recent research has attempted to move beyond surface-level outcomes by examining cognitive and perceptual dimensions of VR learning. Lin et al. (2022) and Keller et al. (2024) explored how immersive environments influence attention and processing efficiency, while Xu et al. (2023) highlighted the role of multimodal input in shaping learner perception. Studies grounded in embodied interaction have suggested that motor involvement may support memory and comprehension, though empirical evidence in EFL contexts remains limited (Bendeck Soto et al., 2020; Franco et al., 2025). Large-scale and model-based studies are still relatively scarce. While Li et al. (2025), Weng et al. (2024), and Zhang et al. (2024) have begun to use advanced statistical techniques to examine relationships among engagement, technology use, and learning outcomes, many studies continue to rely on descriptive designs or single-variable analyses. This limits the ability to test complex relationships among learner engagement and multiple skill domains.
Despite growing interest, several controversies characterize recent VR-based EFL research. One major concern relates to construct validity. Critics argue that many studies label learning environments as “immersive” without clearly specifying the nature of learner interaction or the extent of perceptual and motor involvement (Keller et al., 2024; Schmidt et al., 2023). In some cases, instruments originally developed for general e-learning contexts are repurposed for VR settings with minimal adaptation (Chen et al., 2022; Xu et al., 2023). Another issue involves the tendency to attribute learning gains to VR technology itself rather than to task design or instructional context. Fokides and Zampouli (2017) and Dhimolea et al. (2022) caution that novelty effects and learner expectations may inflate perceived benefits. Similar concerns have been raised in recent critiques of immersive and AI-enhanced learning research, which call for clearer theoretical grounding and more transparent reporting of learning processes (Derakhshan et al., 2024; Yazdi & Ghanizadeh, 2024).
There is also debate regarding the role of cognitive load and motor demands in VR learning. While some studies suggest that embodied interaction supports learning (Franco et al., 2025; Wei et al., 2025), others warn that excessive sensory input and complex interaction may overload learners, particularly at lower proficiency levels (Lin et al., 2022; Wang et al., 2025). These mixed findings indicate a need for balanced models that consider both affordances and constraints of VR environments. Finally, recent work has emphasized the importance of methodological rigor and integrative modeling. Scholars argue that future research should move beyond isolated outcomes and examine how engagement, perception, motor interaction, and cognition jointly contribute to learning (Ghafouri et al., 2025; Saeedi & Najjarpour, 2025; Zhang & Miao, 2025). Addressing these issues requires analytical approaches capable of testing complex relationships, such as structural equation modeling, within well-defined theoretical frameworks.
Research Questions
Method
Participants
The participants were 509 Chinese learners of English as a foreign language (EFL) recruited through convenience sampling from several universities in China. Participation was voluntary, and all respondents completed the questionnaire online after being informed of the study purpose and confidentiality of their responses. Regarding education level, 303 participants were undergraduate students (59.5%), while 206 were graduate students (40.5%). In terms of gender, 232 participants identified as men (45.6%) and 277 as women (54.4%). Participants also reported their frequency of using virtual reality (VR) for language learning. Most learners indicated rare use of VR (n = 377, 74.1%). Smaller proportions reported using VR sometimes (n = 80, 15.7%), often (n = 42, 8.3%), or very often (n = 10, 2.0%). In addition, participants were asked to indicate the types of VR-based language tasks they had experienced, with multiple responses allowed. Pronunciation practice was reported by 147 learners (28.9%). Writing or typing practice was selected by 97 learners (19.1%), and interactive dialogue tasks by 98 learners (19.3%). Vocabulary and grammar exercises were reported by 79 learners (15.5%), while cultural immersive activities were reported by 88 learners (17.3%). These responses indicate varied exposure to VR-supported language learning tasks among the participants.
The nature of VR exposure in this study bears directly on how the independent variable was operationalized and therefore requires clarification. Participants were not assigned to a single, standardized VR intervention; rather, they engaged with a variety of VR-based English learning activities embedded within their regular university coursework over the semester. These activities ranged from pronunciation drills using speech-recognition software to interactive dialogues and scenario-based vocabulary exercises. While the specific task types varied, all activities shared key characteristics of immersive VR: they required active sensory engagement (visual and auditory), involved physical interaction through head or hand tracking, and demanded real-time language processing. This heterogeneity reflects the naturalistic implementation of VR in the institution’s EFL program, where technology is used flexibly to support different learning objectives. By capturing learners’ aggregate engagement with these diverse tasks, the study examines the general relationship between immersive VR experiences and skill development, rather than isolating the effects of a specific pedagogical method. This approach enhances the ecological validity of the findings, as it mirrors how VR is typically integrated into broader language curricula.
Instruments
Data were collected using a structured self-report questionnaire consisting of four scales designed to capture learners’ engagement in VR-based language learning and their perceived perceptual, motor, and cognitive skills when performing VR-supported English tasks. All items were framed to refer explicitly to learners’ actual experiences with VR activities in English courses rather than to abstract or hypothetical uses of technology. The questionnaire was administered in English, with brief clarifications provided in Chinese to ensure comprehension. Responses were recorded on a five-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree), with higher scores indicating higher levels of the target construct.
Before the specific scales are described, the technical and contextual parameters of the VR environment are outlined, as these factors influence learner engagement and performance. The VR activities integrated into the participants’ coursework were delivered via standalone immersive headsets (e.g., Meta Quest 2), providing a fully enclosed visual and auditory field rather than screen-based or desktop VR. This immersive setup required active physical interaction, including hand-tracking for gesture-based responses and head movement for visual exploration, thereby ensuring a high level of interactivity. Participants engaged with these VR tasks for approximately 45 minutes per week over a six-week period, totaling roughly 4.5 hours of exposure. This scheduled duration was consistent across all participants and was embedded within their regular weekly English language lessons. The tasks ranged from low-interactivity vocabulary drills, which required simple gaze selection, to high-interactivity scenario-based dialogues, which demanded simultaneous speech production and physical gesture synchronization. This variation in interactivity was intentional, reflecting the naturalistic use of VR in the institution’s EFL curriculum and allowing for a broader assessment of how different levels of immersion and motor engagement relate to perceptual, motor, and cognitive skill development.
VR Engagement Scale (VRES)
Learners’ engagement in VR-based language learning was measured using the VR Engagement Scale (VRES), which included 12 items across three components: frequency, immersion, and active participation. The frequency component assessed how often learners used VR tasks as part of their English learning routine. The immersion component captured the extent to which learners felt absorbed in VR environments and experienced a sense of presence during language tasks. The active participation component focused on learners’ behavioral involvement, such as interaction with tasks, initiative, and effort during VR activities. The scale was adapted from established measures of engagement and immersion in educational VR contexts, particularly the framework proposed by Makransky and Lilleholt (2018). Content adaptation was guided by EFL literature to ensure alignment with language learning tasks rather than general technology use. Construct validity of the VRES was examined through confirmatory factor analysis, which supported the three-factor structure. All factor loadings were acceptable and in the expected directions. Internal consistency reliability was satisfactory, with Cronbach’s alpha coefficients exceeding the commonly accepted threshold of .70 for the overall scale and each subscale.
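As an illustration of the internal consistency check mentioned above, the following minimal sketch computes Cronbach’s alpha for a single scale and compares it with the .70 threshold. It is not the SPSS procedure used in the study; the function name `cronbach_alpha` and the response matrix `demo` are hypothetical and serve only to show the calculation.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one scale.

    rows: list of respondents, each a list of item scores of equal length.
    """
    k = len(rows[0])                                # number of items
    columns = list(zip(*rows))                      # item-wise view of the data
    item_var_sum = sum(variance(col) for col in columns)
    total_var = variance([sum(r) for r in rows])    # variance of the sum scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical five-point Likert responses (5 learners x 4 items);
# these values are invented for illustration and are not study data.
demo = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(demo)
print(f"alpha = {alpha:.2f}; meets .70 threshold: {alpha >= 0.70}")
```

The same function could be applied to each subscale separately by passing only the columns that belong to that subscale.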
Perceptual Skills in VR (PS-VR)
Perceptual skills related to VR-based language learning were assessed using the Perceptual Skills in VR scale (PS-VR), consisting of 12 items across three components: pronunciation, listening comprehension, and visual attention. The pronunciation component focused on learners’ perceived improvement in segmental and suprasegmental features through VR tasks. The listening comprehension component addressed learners’ ability to understand spoken English and follow dialogues in VR scenarios. The visual attention component examined learners’ capacity to attend to visual information and integrate visual and auditory cues during VR activities. Item development was informed by research on pronunciation and listening in second language learning (e.g., Derwing & Munro, 2005) and by studies on multimodal attention in technology-mediated environments. To keep the items task-specific, the PS-VR items were explicitly linked to common VR language activities used in the participants’ courses, such as pronunciation practice with speech models, listening to interactive dialogues, and responding to visually embedded prompts. These items captured learners’ perceptions of perceptual processing during actual VR tasks rather than general language ability. Factor analysis supported the three-component structure, and reliability analyses indicated acceptable internal consistency for the total scale and each subscale.
Motor Skills in VR (MS-VR)
Motor skills were measured using the Motor Skills in VR scale (MS-VR), which included 11 items covering hand–eye coordination, gesture or movement accuracy, and fine motor control. This scale focused on learners’ perceived motor responses while interacting with VR interfaces during English tasks, such as selecting objects, performing gestures linked to instructions, and typing or writing within VR environments. The conceptualization of motor skills was grounded in motor learning theory and skill acquisition research (Gentile, 2000), with items contextualized for language learning tasks rather than generic motor performance. Importantly, the scale did not treat motor skills as abstract traits. Instead, each item referred to concrete VR actions required in the language tasks used in the study, such as responding to prompts, synchronizing movements with language input, and executing precise hand movements. The factorial validity of the MS-VR was supported by empirical testing, and internal consistency coefficients indicated satisfactory reliability across components.
Cognitive Skills in VR (CS-VR)
Cognitive skills in VR-based language learning were assessed using the Cognitive Skills in VR scale (CS-VR), comprising 12 items across three components: real-time processing, working memory, and problem solving. Real-time processing items measured learners’ ability to understand and respond to English input under time constraints in VR scenarios. Working memory items focused on retaining and manipulating linguistic information during multi-step VR tasks. Problem-solving items examined learners’ use of strategies to infer meaning, adapt to task difficulty, and apply English creatively in VR contexts. The scale was informed by cognitive load theory and research on cognitive processing in learning environments (Sweller, 2011). As with the other scales, the CS-VR items were anchored in specific VR learning situations, such as responding to time-sensitive instructions and managing multiple sources of information during interactive scenarios. This ensured that cognitive skills were assessed in relation to actual VR task demands rather than assumed technological effects. Confirmatory factor analysis supported the proposed structure, and reliability indices demonstrated acceptable internal consistency for both the overall scale and subscales.
Procedure
Data collection took place in the latter half of the academic semester to ensure participants had adequate exposure to VR-based English learning. We obtained permissions from instructors and academic units, and participants provided electronic informed consent after being briefed on the study’s voluntary nature and confidentiality measures. Since VR activities were part of the regular coursework rather than a researcher-manipulated intervention, the study focused on learners’ reported experiences with existing tasks, such as pronunciation practice, interactive dialogues, and scenario-based exercises. It is important to clarify that the constructs of “perceptual skills” (e.g., pronunciation accuracy) and “motor skills” (e.g., hand-eye coordination during writing) in this study are operationalized as perceived task-related abilities based on self-report, rather than objective performance metrics. These measures reflect learners’ subjective assessment of their engagement with visual, auditory, and interactive elements specific to the VR environment, rather than general language proficiency or innate physical coordination. The online survey, completed individually outside class time to minimize pressure, included background questions followed by scales for VR engagement, perceptual, motor, and cognitive skills. Participants were instructed to base their responses on their actual experiences during the current semester. After screening for completeness and removing inconsistent cases, the final dataset was prepared for analysis in accordance with ethical guidelines.
Data Analysis
Data analysis was conducted in several stages using SPSS version 27 and AMOS version 24. Prior to model testing, the dataset was screened in SPSS for missing values, outliers, and distributional properties. Specifically, missing data were examined using Little’s Missing Completely at Random (MCAR) test, which was non-significant and therefore consistent with data missing completely at random (χ²(45) = 52.34, p = .19). Because missingness was minimal (less than 2% of total cases) and consistent with MCAR, missing values were handled using expectation–maximization (EM) estimation to preserve statistical power and reduce bias. Univariate normality was examined through skewness and kurtosis values, which were within acceptable ranges (skewness between −1.0 and +1.0, kurtosis between −3.0 and +3.0). Multivariate normality was assessed using Mardia’s coefficient, which yielded a value of 12.45, slightly exceeding the conservative threshold of 10. However, given the large sample size and the robustness of maximum likelihood estimation to minor deviations from normality, this deviation was deemed acceptable. Additionally, univariate and multivariate outliers were screened using Mahalanobis distance (p < .001) and Cook’s distance, with no influential cases detected. Descriptive statistics and zero-order correlations among all study variables were computed in SPSS to provide an initial overview of the data and to examine the strength and direction of associations. Internal consistency reliability for each scale and subscale was assessed using Cronbach’s alpha coefficients. Structural equation modeling (SEM) was performed in AMOS to test the hypothesized relationships among VR engagement, perceptual skills, motor skills, and cognitive skills. A two-step modeling approach was followed. First, a measurement model was specified to evaluate the factorial validity of the latent constructs. Each latent variable was represented by its corresponding subscales as observed indicators.
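The univariate normality screening described in the preceding paragraph can be sketched as follows. This is a simplified illustration rather than the SPSS routine: it uses plain moment-based estimates (SPSS reports bias-corrected versions, so values would differ slightly), it assumes that the kurtosis range refers to excess kurtosis, and the function names and sample scores are hypothetical.

```python
from statistics import mean


def skewness_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - m) ** 4 for v in values) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def within_ranges(values, skew_limit=1.0, kurt_limit=3.0):
    """Check a variable against the +/-1.0 skewness and +/-3.0 kurtosis ranges."""
    skew, kurt = skewness_kurtosis(values)
    return abs(skew) <= skew_limit and abs(kurt) <= kurt_limit


# Hypothetical subscale scores, invented for illustration only.
symmetric_scores = [1, 2, 3, 4, 5]
skewed_scores = [1] * 9 + [5]

for label, scores in [("symmetric", symmetric_scores), ("skewed", skewed_scores)]:
    skew, kurt = skewness_kurtosis(scores)
    print(f"{label}: skew = {skew:.2f}, excess kurtosis = {kurt:.2f}, "
          f"within ranges: {within_ranges(scores)}")
```

Multivariate checks such as Mardia’s coefficient and Mahalanobis distance require the full covariance matrix and are not covered by this sketch.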
Confirmatory factor analysis was used to assess factor loadings, construct reliability, and convergent validity. Convergent validity was evaluated through standardized factor loadings and average variance extracted, while discriminant validity was examined by comparing the square roots of average variance extracted values with inter-construct correlations. In the second step, a structural model was specified to examine the direct and indirect relationships among the latent variables. VR engagement was modeled as the exogenous construct, while perceptual, motor, and cognitive skills were treated as endogenous constructs. Path coefficients were estimated using maximum likelihood estimation. Model fit was evaluated using multiple indices, including the chi-square statistic and its ratio to degrees of freedom, the comparative fit index (CFI), the Tucker–Lewis index (TLI), the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR). Commonly accepted cutoff criteria were used to judge model adequacy.
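Several of the fit indices listed in this section are simple functions of the chi-square statistics of the tested model and of the baseline (independence) model. The following is a minimal sketch of those relationships, assuming the standard textbook formulas; it is not the AMOS implementation, and some programs use N rather than N − 1 in the RMSEA denominator. The function name `fit_indices` and all chi-square and degrees-of-freedom inputs are hypothetical placeholders; only the sample size of 248 is taken from the study. SRMR is omitted because it requires the full residual correlation matrix.

```python
import math

def fit_indices(chi2_model, df_model, chi2_base, df_base, n):
    """Compute chi-square/df, RMSEA, CFI, and TLI from chi-square statistics."""
    ratio = chi2_model / df_model
    # Noncentrality estimates, floored at zero
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    rmsea = math.sqrt(d_model / (df_model * (n - 1)))
    cfi = 1.0 - d_model / d_base if d_base > 0 else 1.0
    base_ratio = chi2_base / df_base
    tli = (base_ratio - ratio) / (base_ratio - 1.0)
    return {"chi2_df": ratio, "rmsea": rmsea, "cfi": cfi, "tli": tli}

# Hypothetical chi-square values for illustration; only n = 248 comes from the paper.
example = fit_indices(chi2_model=96.0, df_model=48,
                      chi2_base=1548.0, df_base=66, n=248)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Each returned value would then be compared with whichever cutoff criteria the analyst has adopted.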
Results
Descriptive Statistics for Study Variables
Note. N = 248. All variables are measured on a 5-point Likert scale. Skewness and kurtosis values fall within the acceptable range for univariate normality (−1 to +1), indicating that the data distributions are approximately symmetric and mesokurtic.
Descriptive Statistics and Preliminary Analyses
Table 1 presents the descriptive statistics for the main latent constructs. Mean scores indicate moderate to moderately high levels across all variables, suggesting that participants reported meaningful engagement with VR-based language learning and perceived gains in perceptual, motor, and cognitive domains. Skewness and kurtosis values fell within acceptable ranges (±1), indicating no substantial deviation from normality. These results supported the use of maximum likelihood estimation in subsequent SEM analyses.
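The ±1 screening rule for skewness and kurtosis can be illustrated with a short sketch. It uses the simple moment-based estimators (with kurtosis expressed as excess kurtosis, so that a normal distribution scores 0); statistical packages typically report bias-adjusted versions, which differ slightly in small samples. The function names and the response vector are hypothetical and are not data from this study.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis

def within_bound(value, bound=1.0):
    """True if the statistic lies inside the +/- bound screening range."""
    return -bound <= value <= bound

# Hypothetical 5-point Likert responses, for illustration only.
responses = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5]
skew, kurt = skewness_and_excess_kurtosis(responses)
print(skew, kurt, within_bound(skew), within_bound(kurt))
```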
Internal Consistency Reliability of the Scales
Note. Cronbach’s alpha coefficients for all scales exceeded the recommended threshold of .70, indicating satisfactory internal consistency and reliability for the measurement instruments used in this study.
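For readers less familiar with the coefficient, Cronbach's alpha compares the sum of the item variances with the variance of the total score. The sketch below assumes the standard formula, alpha = k/(k − 1) × (1 − Σ item variances / total-score variance); the function name `cronbach_alpha` and the response matrix are hypothetical and do not reproduce the study's data.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: list of rows, one per respondent, each a list of item scores."""
    k = len(responses[0])  # number of items
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses from five learners to a three-item scale.
sample = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
]
alpha = cronbach_alpha(sample)
print(f"alpha = {alpha:.3f}")  # compare with the .70 threshold
```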
Pearson Correlations Among Latent Constructs
Note. All correlations are significant at p < .01. The strong positive associations among the variables suggest that higher levels of VR engagement are closely linked with greater perceived improvements in perceptual, motor, and cognitive skills. These bivariate relationships provide initial support for the hypothesized links between the constructs, which are further examined in the structural model.
A confirmatory factor analysis was conducted in AMOS to evaluate the measurement model.
Standardized Factor Loadings
Note. All standardized factor loadings are statistically significant (p < .01) and exceed the conventional threshold of .70, indicating strong convergent validity for each latent construct. The high loadings suggest that the observed indicators are robust measures of their respective underlying dimensions.
CR and AVE
Note. CR values for all constructs exceeded .70, indicating good internal consistency, while AVE values were all above .50, confirming adequate convergent validity. These results support the reliability and validity of the measurement model.
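Both composite reliability (CR) and average variance extracted (AVE) can be derived directly from the standardized factor loadings of a construct. The sketch below assumes the conventional formulas, with each indicator's error variance taken as 1 − λ² and no correlated errors. The function names and the three loadings are hypothetical illustrations, not values from the study.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)

def composite_reliability(loadings):
    """CR: (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    sum_sq = sum(loadings) ** 2
    error = sum(1 - l ** 2 for l in loadings)  # error variance per indicator
    return sum_sq / (sum_sq + error)

# Hypothetical standardized loadings for one construct with three indicators.
loadings = [0.78, 0.82, 0.75]
ave = average_variance_extracted(loadings)
cr = composite_reliability(loadings)
print(f"AVE = {ave:.3f}, CR = {cr:.3f}")  # compare with .50 and .70
```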
Discriminant Validity
Note. The square root of the Average Variance Extracted (AVE) for each construct is presented on the diagonal (in bold). All diagonal values are greater than the corresponding inter-construct correlations in the same row and column, indicating that discriminant validity is established for all latent variables.
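The comparison described in this note (the Fornell-Larcker criterion) can be expressed as a simple check. The sketch below is illustrative only: the function name `fornell_larcker`, the AVE values, and the inter-construct correlations are hypothetical placeholders rather than the study's estimates.

```python
import math

def fornell_larcker(ave, correlations):
    """Return {construct: True/False}; True if sqrt(AVE) exceeds every
    correlation that involves that construct."""
    result = {}
    for name, value in ave.items():
        root = math.sqrt(value)
        related = [abs(r) for pair, r in correlations.items() if name in pair]
        result[name] = all(root > r for r in related)
    return result

# Hypothetical AVE values and latent correlations, for illustration only.
ave = {"engagement": 0.61, "perceptual": 0.58, "motor": 0.55, "cognitive": 0.63}
correlations = {
    ("engagement", "perceptual"): 0.60,
    ("engagement", "motor"): 0.50,
    ("engagement", "cognitive"): 0.65,
    ("perceptual", "motor"): 0.45,
    ("perceptual", "cognitive"): 0.55,
    ("motor", "cognitive"): 0.40,
}
check = fornell_larcker(ave, correlations)
print(check)
```

A construct flagged `False` would share more variance with another construct than with its own indicators.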
To further ensure the distinctiveness of the latent constructs, particularly given the conceptual overlap between perceptual, motor, and cognitive domains, we employed the Heterotrait-Monotrait (HTMT) ratio of correlations as a more sensitive diagnostic for discriminant validity than the traditional Fornell-Larcker criterion. The HTMT ratio compares the average correlation between indicators of different constructs with the geometric mean of the average correlations among indicators of the same construct, and it was calculated for all pairs of latent variables. All HTMT values fell below the conservative threshold of 0.85, indicating that the constructs are empirically distinct and that multicollinearity is not a concern in the structural model. Additionally, we tested an alternative measurement model in which perceptual and cognitive indicators were allowed to load on a single higher-order factor to check for potential redundancy. This constrained model showed a significant deterioration in fit compared to the proposed four-factor model (Δχ²(6) = 45.32, p < .001), indicating that treating perceptual, motor, and cognitive skills as separate but related constructs provides a better representation of the data. These diagnostics provide additional evidence for the factorial validity of the measurement model beyond conventional fit indices.
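The HTMT ratio for a pair of constructs can be computed from the indicator-level correlation matrix. The sketch below assumes the usual definition: the mean correlation between indicators of different constructs, divided by the geometric mean of the two constructs' average within-construct indicator correlations. The function name `htmt`, the two-indicator constructs, and the correlation matrix are hypothetical and do not reproduce the study's data.

```python
import math
from itertools import combinations

def htmt(corr, items_a, items_b):
    """HTMT ratio for two constructs given an indicator correlation matrix.

    corr: square matrix (list of lists); items_a / items_b: indicator indices.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    # Correlations between indicators of different constructs
    hetero = [abs(corr[i][j]) for i in items_a for j in items_b]
    # Correlations among indicators of the same construct
    mono_a = [abs(corr[i][j]) for i, j in combinations(items_a, 2)]
    mono_b = [abs(corr[i][j]) for i, j in combinations(items_b, 2)]
    return mean(hetero) / math.sqrt(mean(mono_a) * mean(mono_b))

# Hypothetical correlations among four indicators (two per construct).
corr = [
    [1.00, 0.70, 0.40, 0.45],
    [0.70, 1.00, 0.35, 0.40],
    [0.40, 0.35, 1.00, 0.60],
    [0.45, 0.40, 0.60, 1.00],
]
ratio = htmt(corr, items_a=[0, 1], items_b=[2, 3])
print(f"HTMT = {ratio:.3f}")  # compare with the 0.85 threshold
```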
Standardized Direct Effects
Note. All path coefficients are statistically significant at p < .001. The results indicate that VR engagement is a strong positive predictor of perceptual, motor, and cognitive skills, with the strongest effect observed for cognitive skills.

Structural model
Squared Multiple Correlations (R²)
Note. R² values represent the proportion of variance in each endogenous variable explained by VR Engagement. The results indicate that VR engagement accounts for a moderate to substantial amount of variance in cognitive skills (46%) and perceptual skills (38%), while explaining a moderate portion of variance in motor skills (30%).
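When an endogenous construct has a single predictor, its squared multiple correlation equals the square of the standardized path coefficient, so the two sets of results can be cross-checked. The sketch below assumes that VR engagement is the only predictor of each skill construct. The magnitudes it prints are implied by the rounded percentages in this note and are approximations, not the reported coefficients. The function name `implied_beta` is hypothetical.

```python
import math

# R-squared values as stated in the note above (rounded percentages).
r_squared = {"cognitive": 0.46, "perceptual": 0.38, "motor": 0.30}

def implied_beta(r2):
    """Magnitude of the standardized path implied by R-squared
    when the outcome has exactly one predictor (R^2 = beta^2)."""
    return math.sqrt(r2)

implied = {name: implied_beta(r2) for name, r2 in r_squared.items()}
for name, beta in implied.items():
    print(f"{name}: |beta| is approximately {beta:.2f}")
# cognitive about 0.68, perceptual about 0.62, motor about 0.55
```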
Discussion
The present study examined the relationships among immersive VR-based language learning, learner engagement, and EFL learners’ perceptual, motor, and cognitive skills using a structural equation modeling approach. The findings showed that VR engagement was a significant predictor of all three skill domains. Engagement demonstrated the strongest association with cognitive skills, followed by perceptual skills and motor skills. These results suggest that learners’ involvement in VR-based language tasks is closely related to how they process language input, coordinate actions, and manage task demands in immersive environments.
The measurement model results further confirmed that engagement in VR-based learning is not a unitary experience but involves frequency of use, immersion, and active participation. The structural model indicated that when learners reported higher levels of engagement, they also reported greater efficiency in real-time language processing, better perceptual sensitivity to linguistic cues, and more coordinated motor responses during VR tasks. These findings support the assumption that engagement functions as a central mechanism linking immersive environments to multiple dimensions of learner functioning rather than directly influencing language outcomes in isolation.
The results align with prior research reporting positive associations between immersive learning environments and learner engagement in EFL contexts (Lai & Chen, 2023; Weng et al., 2024; Yang et al., 2020). Similar to Chang et al. (2020) and Tai et al. (2022), the present study suggests that VR-supported activities can foster deeper involvement than traditional approaches. However, unlike many earlier studies that focused primarily on language performance outcomes, this study provides evidence that engagement relates to broader skill domains that support language learning. The strong relationship between engagement and cognitive skills is consistent with research emphasizing the role of immersive environments in supporting attention, working memory, and real-time processing (Keller et al., 2024; Lin et al., 2022). At the same time, the findings extend earlier work by modeling cognitive skills as a distinct latent construct rather than treating cognitive processing as an implicit outcome. This addresses concerns raised by Schmidt et al. (2023) and Chen et al. (2022) regarding the need for clearer operationalization of learning processes in immersive studies.
The association between VR engagement and perceptual skills supports findings from studies on pronunciation and listening development in immersive settings (Lin & Wang, 2021; Xu et al., 2023). However, the present results suggest that perceptual gains are closely tied to learners’ level of engagement rather than exposure alone. This contrasts with some earlier research that attributed perceptual improvement mainly to technological affordances (Hsu, 2017; Zhonggen, 2018) without accounting for learner involvement. The relationship between engagement and motor skills provides partial support for embodied learning perspectives (Bendeck Soto et al., 2020; Franco et al., 2025). While the effect size was weaker than for cognitive and perceptual skills, the results indicate that active interaction and coordinated movement are relevant components of VR-based language learning. This finding responds to calls by Govender and Arnedo-Moreno (2021) and Dhimolea et al. (2022) for more explicit attention to motor interaction in immersive language studies.
This study contributes to the literature in several significant ways. First, it offers an integrative model that links VR engagement with perceptual, motor, and cognitive skills within a single analytical framework. This addresses a notable gap in prior research, which has often examined these dimensions in isolation or treated motor and perceptual outcomes as secondary to cognitive gains. By testing them simultaneously, this study indicates that engagement in immersive environments is associated with a broad set of perceived learner abilities, not only cognitive processing. Second, by employing structural equation modeling with a substantial sample size, the study provides empirical evidence for the central role of engagement as a multidimensional predictor of multiple learner skills in VR-based EFL contexts. This moves beyond simple bivariate correlation to clarify the relative strength of these relationships.
Another key contribution lies in the careful operationalization of constructs. Rather than labeling tasks as “immersive” by default based on hardware alone, this study assessed learners’ reported experiences with concrete, task-specific VR activities. This approach responds directly to recent critiques concerning construct validity and the overgeneralization of “immersion” in immersive learning research (Derakhshan et al., 2024; Schmidt et al., 2023). By grounding the measures in actual learner experiences, the study offers a more nuanced understanding of how specific types of engagement (e.g., active participation vs. passive immersion) relate to skill development. Finally, the findings add large-sample evidence from a Chinese EFL context, where VR adoption in higher education is expanding rapidly but systematic, large-scale research remains limited (Li et al., 2025; Zhang et al., 2024). This geographic and cultural context provides valuable comparative data for the global EFL research community, highlighting how VR engagement functions in non-Western educational settings.
Implications
From a theoretical perspective, the findings support engagement-based models of technology-enhanced language learning, which view engagement as a mediator between instructional design and learning-related processes (Alalwan et al., 2020; Moon et al., 2020). The strong links between engagement and cognitive skills also align with cognitive load theory, suggesting that immersive environments may support learning when engagement helps learners manage processing demands (Keller et al., 2024; Lin et al., 2022). The results further provide partial support for embodied cognition accounts by demonstrating a link between engagement and motor skills. However, the relatively smaller effect size indicates that motor interaction alone may not guarantee learning benefits, echoing concerns raised by Wang et al. (2025) and Wei et al. (2025) regarding task complexity and cognitive overload. Overall, the findings suggest that theoretical models of VR-based language learning should integrate engagement, cognitive processing, and embodied interaction rather than privileging a single perspective.
The findings have several implications for EFL instruction and curriculum design. First, VR-based language learning should prioritize engagement through meaningful tasks rather than focusing solely on technological novelty. Tasks that encourage active participation and sustained attention may be more effective in supporting cognitive and perceptual skills. Second, instructors should be aware that motor interaction can support learning when it is aligned with language objectives, but excessive or poorly designed interaction may distract learners. For teacher education, the results highlight the need to prepare instructors to integrate VR tasks with clear pedagogical goals. Institutions considering VR adoption should also evaluate how engagement is supported across tasks and learner levels rather than assuming uniform benefits. These implications align with recent calls for pedagogically grounded use of immersive technologies in language education (Ghafouri et al., 2025; Yazdi & Ghanizadeh, 2024).
Limitations
Several limitations should be acknowledged. First, the study relied on self-report measures, which reflect learners’ perceptions rather than direct assessments of perceptual, motor, or cognitive performance. Second, the cross-sectional design limits causal interpretation of the observed relationships. Third, although the sample size was large, participants were drawn from a single national context, which may limit generalizability. Finally, variation in VR task design across courses was not experimentally controlled, which may have influenced learners’ reported experiences.
Suggestions for Future Research
Future studies should combine self-report data with behavioral or performance-based measures to capture perceptual, motor, and cognitive skills more directly. Longitudinal or experimental designs would help clarify causal relationships between engagement and skill development. Further research could also examine moderating variables such as proficiency level, task complexity, or instructional support. Comparative studies across different EFL contexts may provide additional insight into how cultural and institutional factors shape VR-based language learning experiences.
Conclusion
This study examined the relationship between immersive VR-based language learning and EFL learners’ perceptual, motor, and cognitive skills, with a particular focus on the role of learner engagement. Using a structural equation modeling approach, the findings showed that engagement in VR-based language tasks was significantly associated with all three skill domains. Among these, cognitive skills demonstrated the strongest relationship with engagement, followed by perceptual and motor skills. These results suggest that VR-based learning environments are most effective when learners are actively involved in tasks that require sustained attention, real-time processing, and purposeful interaction. By modeling engagement as a central construct, the study extends prior VR research that has often emphasized learning outcomes without examining the processes that support them. The results highlight that engagement is not a peripheral feature of immersive learning but a key mechanism through which VR environments relate to learners’ experiences and perceived skill development. This perspective contributes to ongoing discussions in EFL research regarding the need to move beyond technology-driven explanations and toward process-oriented accounts of learning in digital environments.
The study also responds to recent methodological concerns in immersive learning research by grounding its constructs in learners’ reported experiences with concrete VR language tasks. Rather than assuming perceptual, motor, or cognitive involvement as inherent features of VR, the study provides empirical evidence that these dimensions are meaningfully connected to how learners engage with VR activities. In doing so, it offers a more cautious and transparent interpretation of VR-related effects in language education. Despite its contributions, the study should be interpreted in light of its limitations, including its reliance on self-report data and cross-sectional design. Future research is encouraged to integrate objective measures of learner performance and to employ longitudinal or experimental designs to clarify causal relationships. Nevertheless, the findings underscore the importance of pedagogically grounded VR task design and offer empirical support for engagement-focused approaches to immersive language learning.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
