Abstract
Pedestrian intention prediction is critical for safe and socially intelligent autonomous vehicle (AV) operation in urban environments. While many existing models assume that pedestrian actions directly reflect intentions, the temporal dynamics between inferred intentions and observable actions remain underexplored. This study investigates the temporal relationship between the inferred intentions and the actions of pedestrians. We used the Pedestrian Situated Intention (PSI) dataset, which includes frame-wise annotations of pedestrian behavior and aggregated human inferences of crossing intention from 24 raters. We analyzed the time lag and the influence of preparatory actions such as looking across five behavioral categories, for example pedestrians crossing from standing or not crossing by slowing down. Findings suggest that the timing of intention prediction and looking behavior vary by scenario and that critical actions such as greeting can help predict intentions. The goal is to develop better models that allow autonomous vehicles more time to react to pedestrian movements.
Introduction
Background
With the rise of AI technologies, AVs can perform well in simple conditions such as highways. In urban environments, however, where mixed traffic and pedestrian interactions are common, it remains difficult for AVs to navigate safely and predictably because human behavior is complex and dynamic (Boggs et al., 2020). To handle pedestrian interactions, AVs rely on trajectory and behavior prediction models, which are typically based on either physics-driven estimation or learning-based prediction (Chen & Tian, 2021; Li et al., 2021). Although these prediction algorithms are advancing quickly, most models focus on observable actions such as trajectories and postures and do not model the pedestrian’s underlying intention, which may precede visible actions and offer a longer reaction window for AV decisions.
Related Studies
In human driving, understanding the intention of others is essential for smooth social coordination (Pelikan, 2021). Prior research has proposed viewing pedestrian behavior prediction through the lens of intention recognition, defining intention as a pedestrian’s desire or goal to cross (Rasouli et al., 2019). However, most datasets and models study intention using coarse behavioral surrogates (e.g., walking = intending to cross), and few consider fine-grained dynamics or the shifts of intention across time.
Gaps in Intention Prediction and Crossing Action
In the field, the temporal relationship between inferred pedestrian intentions and their future actions remains underexplored. Most pedestrian intention prediction models assume that actions reflect current intention or treat them as occurring simultaneously. However, this simplification overlooks the subtle preparatory behaviors (e.g., gaze, hesitation) that often precede physical movement, limiting the ability to anticipate behavior in advance.
This study builds on the assumption that intention typically precedes action, a view supported by multiple cognitive theories. Motor Preparation Theory proposes that motor intentions are formed in the brain before any physical movement occurs (Jeannerod, 2006). According to the Theory of Mind (ToM), intention can be inferred from visible cues and helps predict future actions (Premack & Woodruff, 1978). Event Segmentation Theory posits that humans identify event boundaries when prediction errors occur during action observation (Zacks et al., 2007).
Building on these cognitive theories, this study aligns the timing of intention predictions with subsequent actions. The aim is to show that modeling human-like intention inference may allow future actions to be anticipated earlier, giving AV decision-making more time to prepare a response. Specifically, this paper explores the following questions:
How early can human-labeled pedestrian intention reliably predict actual crossing behavior?
How do preparatory behaviors (e.g., gaze, hesitation, nodding) influence the alignment between intention predictions and crossing actions?
How does the temporal lag between intention and action vary across different pedestrian behavioral scenarios?
Approach
Research Data
We use the Pedestrian Situated Intention (PSI) dataset, which provides high-quality intention estimates along with preparatory actions such as looking, speeding up, walking, and nodding (Chen et al., 2022). The dataset records how the inferred intention of each pedestrian to cross in front of the ego-vehicle changes over time. The intention data were collected from 24 subjects with diverse backgrounds, which supports the representativeness of the intention estimates. The dataset also labels pedestrians’ actual actions during interactions with the ego-vehicle, such as “Walking Direction Change,” “Slow Down,” and “Speed Up.” In this research, we use the common actions “Walking/Not Walking,” “Slow Down/Speed Up,” and “Looking/Not Looking” to conduct the experiments.
Behavioral Classes and Critical Actions
To test our hypotheses, we examined the temporal lag between intention labels and actual pedestrian actions across five behavioral classes, shown in Table 1.
Five Pedestrian Behavior Classifications Based on Crossing Outcome and Movement Dynamic.
Calculation Methods
For intention analysis, we used the averaged inferred intentions from 24 annotators in the PSI dataset to represent predicted pedestrian intention. We assigned numerical values so that transitions of intention could be measured systematically: “Crossing” (1), “Not Sure” (0.5), and “Not Crossing” (0). For each frame, the aggregated intention score was computed as the average across the annotators. For example, if 10 chose “Crossing”, 10 chose “Not Sure”, and 4 chose “Not Crossing”, the resulting score would be (10 × 1 + 10 × 0.5 + 4 × 0) / 24 = 0.625. We defined two thresholds to detect significant intention shifts: a crossing threshold (>0.67) and a non-crossing threshold (<0.33).
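The scoring scheme above can be illustrated with a minimal Python sketch. The label strings, function names, and the "undetermined" state for scores between the two thresholds are hypothetical names chosen for illustration; they are not taken from the PSI dataset or its tooling.

```python
# Numeric coding of the three annotator responses, as described in the text.
# The dictionary keys are hypothetical label strings chosen for this sketch.
INTENTION_VALUES = {"crossing": 1.0, "not_sure": 0.5, "not_crossing": 0.0}

CROSSING_THRESHOLD = 0.67      # score above this: clear crossing intention
NOT_CROSSING_THRESHOLD = 0.33  # score below this: clear non-crossing intention


def aggregated_intention_score(labels):
    """Average the numeric values of one frame's annotator labels."""
    if not labels:
        raise ValueError("at least one annotator label is required")
    return sum(INTENTION_VALUES[label] for label in labels) / len(labels)


def intention_state(score):
    """Map an aggregated score to a coarse state using the two thresholds."""
    if score > CROSSING_THRESHOLD:
        return "crossing"
    if score < NOT_CROSSING_THRESHOLD:
        return "not_crossing"
    return "undetermined"  # hypothetical name for the band between thresholds


# Worked example from the text: 10 "Crossing", 10 "Not Sure", 4 "Not Crossing".
frame_labels = ["crossing"] * 10 + ["not_sure"] * 10 + ["not_crossing"] * 4
score = aggregated_intention_score(frame_labels)
print(score, intention_state(score))  # 0.625 undetermined
```

In this example the score of 0.625 falls between the two thresholds, so the frame would not yet count as a clear intention shift in either direction.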
We conducted two types of analysis. First, a time lag analysis assessed the timing of walking behaviors relative to inferred intention. The time lag is defined as the frame difference between the moment when pedestrian intention becomes clearly inferred and the moment a critical action occurs. Under the sign convention implied by the reported results, a negative lag means that intention was inferred before the action.
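The time lag definition can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the sign convention (intention frame minus action frame) is inferred from the reported negative lags, "clearly inferred" is taken to mean the first frame whose aggregated score exceeds the crossing threshold, and the frame rate and example scores are hypothetical.

```python
def first_frame_where(values, predicate):
    """Return the index of the first value satisfying predicate, or None."""
    for index, value in enumerate(values):
        if predicate(value):
            return index
    return None


def time_lag_seconds(intention_scores, action_frame, fps, threshold=0.67):
    """Lag between clear crossing intention and a critical action, in seconds.

    Assumed sign convention: intention frame minus action frame, so a
    negative lag means the intention was inferred before the action.
    """
    intention_frame = first_frame_where(intention_scores, lambda s: s > threshold)
    if intention_frame is None:
        return None  # intention never became clear in this clip
    return (intention_frame - action_frame) / fps


# Hypothetical clip: the score first exceeds 0.67 at frame 3,
# and the critical action starts at frame 6.
scores = [0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.90, 1.00]
lag = time_lag_seconds(scores, action_frame=6, fps=30)
print(lag)  # -0.1
```

For non-crossing cases the same function could be called with a predicate on the lower threshold instead; that variant is omitted here to keep the sketch short.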
Outcome
We analyzed 111 cases from the five classes, with results reported below.
Time lag analysis
Table 2 and Figure 1 summarize the results of the time lag analysis. The mean time lag varies across classes. Notably, Class 3 (mean = –1.01 s) and Class 4 (mean = –0.98 s) exhibited negative time lags, suggesting that crossing intentions could be inferred before clear movement occurred. In contrast, Classes 1, 2, and 5 had positive mean time lags, implying that crossing or non-crossing decisions were generally interpreted after initial body movement was observed.
Time Lag and Total Looking Time Across Five Pedestrian Behavior Classes.
Note. Means, standard deviations, and 95% confidence intervals are reported. ANOVA shows significant differences in looking time (p = .0039), but not in time lag (p = .1372).

Distribution of time lags across five classes.
While the overall ANOVA for time lag was not statistically significant (F = 1.785, p = .137), examination of the 95% confidence intervals (CIs) reveals class-specific trends. Class 3 (Walking-to-Crossing) showed a CI of [–1.95 s, –0.06 s], suggesting that inferred intention preceded the observable crossing action and that walking toward the ego-vehicle allows early detection of intention. Similarly, Class 4 (Walking-to-Speeding-Up) had a CI of [–1.75 s, –0.22 s], showing that the intention to cross was inferred before the observable act of acceleration.
In contrast, Classes 1, 2, and 5 had CIs that included zero, such as Class 1 [–1.09 s, 1.85 s] and Class 5 [–1.24 s, 2.32 s], indicating greater variability and weaker temporal alignment in scenarios involving hesitation or delayed crossing. These findings highlight that while not all pedestrian behaviors allow reliable anticipation, crossing patterns involving steady or accelerating motion are more temporally aligned with prior intention, making them more predictable for proactive AV response systems.
Total Looking Time Analysis
To investigate how gaze behavior varies with pedestrian crossing decisions, we analyzed total looking time, defined as the cumulative duration that pedestrians looked toward the ego-vehicle, across the five behavioral classes. A one-way ANOVA (Table 2) revealed a significant main effect of behavior class on total looking time (F(4, 108) = 4.104, p = .0039), indicating that gaze duration differs systematically with pedestrian action patterns.
Post-hoc Tukey HSD tests further clarified these differences, as shown in Figure 2. Class 1 (Walking to Slow Down, Not Crossing) showed significantly longer looking times than Class 3 (Walking to Crossing, Crossing) (p = .0117). Similarly, Class 4 (Walking to Speed Up, Crossing) exhibited longer gaze durations than Class 3 (p = .0128). No other class pairings showed statistically significant differences after multiple comparison correction. These results suggest that pedestrians who hesitate (Class 1) or accelerate into a crossing (Class 4) engage in more frequent or sustained visual monitoring of the approaching vehicle, likely reflecting uncertainty or an effort to coordinate with the driver. In contrast, Class 3 pedestrians, who walk steadily across without modifying pace, show significantly shorter gaze durations, possibly indicating more automatic or committed crossing behavior.
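For readers less familiar with the test, the F statistic of a one-way ANOVA can be computed as in the following minimal Python sketch. The function name and the looking-time values are hypothetical and serve only to show the arithmetic; they are not data from the study.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of per-class samples."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    grand_mean = fmean(all_values)
    group_means = [fmean(g) for g in groups]

    # Variation of class means around the grand mean, weighted by class size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own class mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical total looking times (seconds) for three classes.
looking_times = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(looking_times)
print(f_stat, df_b, df_w)
```

The sketch stops at the F statistic because the p-value requires the F distribution; a library routine such as `scipy.stats.f_oneway` returns both the statistic and the p-value.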

Tukey HSD test results comparing total looking time among pedestrian behavior classes.
Intention Change Around Critical Actions
To explore how pedestrian actions influence perceived crossing intention, we conducted a windowed shift analysis. For each critical action (e.g., greeting, slowing down), we calculated the mean aggregated intention score within a 0.8-second window before and after the action. This window size is informed by Oxley et al. (2005), who reported pedestrians typically require ~1 s to perceive a safe gap and initiate crossing, capturing both decision and motor components.
We applied paired t-tests to compare average intention values before and after each action type. As shown in Figure 3, “greet,” “slow down,” and “speed up” are associated with significant increases in inferred intention (p < .05), indicating that these actions often follow reliable intention inferences. In contrast, “walking direction changes” had no significant effect, suggesting they may not serve as reliable intention cues.
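The windowed comparison can be sketched in Python as below. Several details are assumptions made for illustration: the 0.8-second window is applied separately on each side of the action, the action frame itself is counted in the "after" window, and the frame rate and sample values are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev


def window_means(scores, action_frame, fps, window_s=0.8):
    """Mean intention score in the window before and after an action frame."""
    n = round(window_s * fps)  # window length in frames
    before = scores[max(0, action_frame - n):action_frame]
    after = scores[action_frame:action_frame + n]  # includes the action frame
    if not before or not after:
        raise ValueError("action is too close to the clip boundary")
    return fmean(before), fmean(after)


def paired_t_statistic(before_values, after_values):
    """Paired t statistic for (after - before) and its degrees of freedom."""
    diffs = [a - b for b, a in zip(before_values, after_values)]
    n = len(diffs)
    t_stat = fmean(diffs) / (stdev(diffs) / sqrt(n))
    return t_stat, n - 1


# Hypothetical clip at 10 fps: the score rises from 0.2 to 0.6 at frame 8.
clip = [0.2] * 8 + [0.6] * 8
mean_before, mean_after = window_means(clip, action_frame=8, fps=10)

# Hypothetical before/after window means for three instances of one action.
t_stat, dof = paired_t_statistic([0.2, 0.3, 0.4], [0.5, 0.5, 0.8])
print(mean_before, mean_after, t_stat, dof)
```

Pairing the before and after means of the same action instance is what makes a paired test appropriate here, since each instance acts as its own baseline.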

Aggregated crossing intention scores before and after key pedestrian actions, with p-values from paired t-tests included.
Discussion
While our confidence interval analysis revealed class-specific trends, the overall ANOVA on time lag did not reach statistical significance. One likely reason is that the overall sample size is limited and the data are unevenly distributed across the behavioral classes. Additionally, during annotation we observed that pedestrians rarely remain fully stationary when deciding whether to cross. Most individuals continue subtle leg or body movements even when appearing to “stand,” resulting in a low number of true standing-to-walking transitions.
For more robust conclusions, future work should expand the dataset, refine behavior categorization, and incorporate additional preparatory cues beyond those currently labeled. Introducing nuanced behaviors—such as shifting weight, pausing, or head turns—may better capture the subtle cues that precede crossing decisions and improve the temporal alignment between intention and action in real-world contexts.
Conclusion
This study highlights the importance of modeling the temporal dynamics between pedestrian intentions and actions. While intention inference precedes crossing in consistent movement scenarios, variability across behaviors and limited data constrained overall significance in other behavior classes. Preparatory actions like greeting and looking showed clear value for early intention recognition. Future work should incorporate larger datasets and richer behavior cues to improve the timing and reliability of pedestrian intention prediction in autonomous systems.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Science Foundation under Grant No. 2516587.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
