When Do They Cross? Temporal Dynamics of Pedestrian Intention Prediction and Crossing Actions

Abstract

Pedestrian intention prediction is critical for safe and socially intelligent autonomous vehicle (AV) operation in urban environments. While many existing models assume that pedestrian actions directly reflect intentions, the temporal dynamics between inferred intentions and observable actions remain underexplored. This study investigates the temporal relationship between pedestrians’ inferred intentions and their actions. We used the Pedestrian Situated Intention (PSI) dataset, which includes frame-wise annotations of pedestrian behavior and aggregated human inferences of crossing intention from 24 raters. We analyzed the time lag between inferred intention and critical actions, and the influence of preparatory actions such as looking, across five behavioral categories (for example, crossing from a standing position, or slowing down and not crossing). The findings suggest that the timing of intention inference and looking behavior vary by scenario, and that critical actions such as greeting can help predict intentions. The goal is to support models that give autonomous vehicles more time to react to pedestrian movements.

Keywords

pedestrian intention prediction, human behavior modeling, event segmentation

Introduction

Background

With the rise of AI technologies, AVs can perform well in simple conditions such as highways. However, when it comes to urban environments where mixed traffic and pedestrian interactions are common, it is still difficult for AVs to navigate safely and predictably due to the complex and dynamic nature of human behavior (Boggs et al., 2020). To handle pedestrian interaction challenges, AVs rely on trajectory and behavior prediction models, which are typically based on either physics-driven estimations or learning-based future predictions (Chen & Tian, 2021; Li et al., 2021). While those prediction algorithms are developing fast, most models focus on observable actions such as trajectories and postures, without understanding the pedestrian’s underlying intention—which may precede visible actions and offer a longer reaction window for AV decisions.

Related Studies

In human driving, understanding the intention of others is essential for smooth social coordination (Pelikan, 2021). Prior research has proposed viewing pedestrian behavior prediction through the lens of intention recognition, defining intention as a pedestrian’s desire or goal to cross (Rasouli et al., 2019). However, most datasets and models study intention using coarse behavioral surrogates (e.g., walking = intending to cross), and few consider fine-grained dynamics or the shifts of intention across time.

Gaps in Intention Prediction and Crossing Action

In the field, the temporal relationship between inferred pedestrian intentions and their future actions remains underexplored. Most pedestrian intention prediction models assume that actions reflect current intention or treat them as occurring simultaneously. However, this simplification overlooks the subtle preparatory behaviors (e.g., gaze, hesitation) that often precede physical movement, limiting the ability to anticipate behavior in advance.

This study builds on the assumption that intention typically precedes action, a view supported by multiple cognitive theories. Motor Preparation Theory proposes that motor intentions are formed in the brain before any physical movement occurs (Jeannerod, 2006). According to the Theory of Mind (ToM), intention can be inferred from visible cues and helps predict future actions (Premack & Woodruff, 1978). Event Segmentation Theory posits that humans identify event boundaries when prediction errors occur during action observation (Zacks et al., 2007).

Building on these cognitive theories, this study aligns the timing of inferred intention with subsequent actions. We aim to show that modeling intention inference can anticipate pedestrian actions earlier than observing the actions themselves, giving AV decision-making more time to prepare. Specifically, we explore the following questions:

How early can human-labeled pedestrian intention reliably predict actual crossing behavior?

How do preparatory behaviors (e.g., gaze, hesitation, nodding) influence the alignment between intention predictions and crossing actions?

How does the temporal lag between intention and action vary across different pedestrian behavioral scenarios?

Approach

Research Data

We use the Pedestrian Situated Intention (PSI) dataset, which provides high-quality intention estimates along with labels for preparatory actions such as looking, speeding up, walking, and nodding (Chen et al., 2022). The dataset records how each pedestrian’s inferred intention to cross in front of the ego-vehicle changes over time. The intention data were collected from 24 subjects with diverse backgrounds, which supports the representativeness of the intention estimates. The dataset also labels pedestrians’ actual actions during interactions with the ego-vehicle, such as “Walking Direction Change,” “Slow Down,” and “Speed Up.” In this research, we use the common actions “Walking/Not Walking,” “Slow Down/Speed Up,” and “Looking/Not Looking” to conduct the experiments.

Behavioral Classes and Critical Actions

To address these questions, we examined the temporal lag between intention labels and actual pedestrian actions across five behavioral classes, shown in Table 1.

Table 1.

Five Pedestrian Behavior Classifications Based on Crossing Outcome and Movement Dynamics.

Behavioral class	Crossing intention	Behavior change	Description
Class 1	Not Crossing	Walking-to-Slowing Down	The pedestrian does not cross in front of the ego vehicle. The pedestrian slows down and decides not to cross. The critical action is the first transition at which the pedestrian begins to slow down.
Class 2	Not Crossing	Walking-to-Standing	The pedestrian does not cross in front of the ego vehicle. The pedestrian changes from walking to standing and decides not to cross. The critical action is the first transition at which the pedestrian stops walking.
Class 3	Crossing	Walking-to-Crossing	The pedestrian crosses in front of the ego vehicle. The pedestrian is walking before deciding to cross. The critical action is the first transition at which the pedestrian walks into the ego vehicle’s corridor.
Class 4	Crossing	Walking-to-Speed Up	The pedestrian crosses in front of the ego vehicle. The pedestrian is walking before deciding to cross. The critical action is the first transition at which the pedestrian begins to speed up.
Class 5	Crossing	Standing-to-Crossing	The pedestrian crosses in front of the ego vehicle. The pedestrian is standing before deciding to cross. The critical action is the first transition at which the pedestrian begins walking from standing.

Calculation Methods

For intention analysis, we used the averaged inferred intentions from 24 annotators in the PSI dataset to represent predicted pedestrian intention. To measure intention transitions systematically, we coded the three intention labels numerically: “Crossing” (1), “Not Sure” (0.5), and “Not Crossing” (0). For each frame, the aggregated intention score was computed as the average across the annotators. For example, if 10 annotators chose “Crossing”, 10 chose “Not Sure”, and 4 chose “Not Crossing”, the resulting score would be (10 × 1 + 10 × 0.5 + 4 × 0) / 24 = 0.625. We defined two thresholds to detect significant intention shifts: a crossing threshold (score > 0.67) and a non-crossing threshold (score < 0.33).
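The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the label strings and the function names `aggregate_intention` and `classify_score` are hypothetical, while the numeric codes and the two thresholds come from the text.

```python
# Numeric coding from the text; the label strings themselves are hypothetical.
LABEL_VALUES = {"crossing": 1.0, "not_sure": 0.5, "not_crossing": 0.0}

CROSS_THRESHOLD = 0.67      # score above this: clear crossing intention
NOT_CROSS_THRESHOLD = 0.33  # score below this: clear non-crossing intention


def aggregate_intention(labels):
    """Average the numeric codes of all annotator labels for one frame."""
    return sum(LABEL_VALUES[label] for label in labels) / len(labels)


def classify_score(score):
    """Map an aggregated score to one of three states using the two thresholds."""
    if score > CROSS_THRESHOLD:
        return "crossing"
    if score < NOT_CROSS_THRESHOLD:
        return "not_crossing"
    return "uncertain"


# The worked example from the text: 10 "Crossing", 10 "Not Sure", 4 "Not Crossing".
frame_labels = ["crossing"] * 10 + ["not_sure"] * 10 + ["not_crossing"] * 4
score = aggregate_intention(frame_labels)
print(score)                  # 0.625
print(classify_score(score))  # uncertain
```

A score of 0.625 falls between the two thresholds, so this frame would not yet count as a clear intention shift in either direction.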

We conducted the following analyses. First, we performed a time lag analysis to assess when critical walking behaviors occur relative to inferred intention. The time lag is defined as the frame difference between the moment when pedestrian intention becomes clearly inferred and the moment a critical action occurs:

$$\mathrm{TimeLag} = T_{\mathrm{intent}} - T_{\mathrm{action}} \tag{1}$$

$T_{\mathrm{intent}}$ is the first frame at which the aggregated intention score crosses a defined threshold. $T_{\mathrm{action}}$ is the critical action frame (e.g., start of slowing down, stopping, or crossing). A negative time lag indicates that intention precedes action, while a positive time lag suggests that action occurs before intention is inferred. Second, we analyzed gaze behavior as a potential moderator of time lag. Lastly, we compared intention scores before and after key action onsets to determine whether observable pedestrian behaviors (such as greeting, slowing down, speeding up, or changing walking direction) are associated with a significant change in inferred crossing intention. For each behavior, we calculated the mean intention score within a short window before and after the action onset and performed paired t-tests to assess whether these changes were statistically significant.
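As an illustration of Equation 1, the sketch below finds the first frame at which a per-frame score series passes a threshold and subtracts the critical action frame. The function names are hypothetical, and the conversion to seconds assumes a frame rate of 30 frames per second, which is an assumption not stated in this paper.

```python
FPS = 30.0  # assumed frame rate; not stated in the text


def first_threshold_frame(scores, threshold, direction):
    """Return the index of the first frame whose score passes the threshold.

    direction="above" tests score > threshold (crossing threshold);
    direction="below" tests score < threshold (non-crossing threshold).
    Returns None if the threshold is never passed.
    """
    for frame, score in enumerate(scores):
        if direction == "above" and score > threshold:
            return frame
        if direction == "below" and score < threshold:
            return frame
    return None


def time_lag_seconds(scores, action_frame, threshold=0.67,
                     direction="above", fps=FPS):
    """Time lag = T_intent - T_action, converted from frames to seconds."""
    intent_frame = first_threshold_frame(scores, threshold, direction)
    if intent_frame is None:
        return None  # intention never became clear in this clip
    return (intent_frame - action_frame) / fps


# Made-up score series: the crossing threshold is first passed at frame 3,
# and the critical action is annotated at frame 6.
scores = [0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.90, 0.90]
lag = time_lag_seconds(scores, action_frame=6)
print(lag)  # negative: inferred intention precedes the action
```

For the non-crossing classes, the same function would be called with `threshold=0.33` and `direction="below"`.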

Outcome

We analyzed 111 cases from the five classes, with results reported below.

Time lag analysis

Table 2 and Figure 1 summarize the results of the time lag analysis. The mean time lag varied across classes. Notably, Class 3 (mean = –1.01 s) and Class 4 (mean = –0.98 s) exhibited negative time lags, suggesting that crossing intentions could be inferred before clear movement occurred. In contrast, Classes 1, 2, and 5 had positive mean time lags, implying that crossing or non-crossing decisions were generally inferred only after initial body movement was observed.

Table 2.

Time Lag and Total Looking Time Across Five Pedestrian Behavior Classes.

Class	Time lag mean (s)	Time lag SD (s)	Time lag 95% CI lower (s)	Time lag 95% CI upper (s)	Total looking time mean	Total looking time SD	Count
Class 1: Walking-to-slowing down	0.380	2.656	−1.085	1.845	3.069	2.013	15
Class 2: Walking-to-standing	0.456	1.766	−0.902	1.813	2.489	1.020	9
Class 3: Walking-to-crossing	−1.005	3.061	−1.948	−0.063	1.230	1.798	43
Class 4: Walking-to-speed up	−0.984	2.199	−1.751	−0.217	2.659	2.072	34
Class 5: Standing-to-crossing	0.540	2.483	−1.236	2.316	2.310	2.206	10
ANOVA, time lag: F = 1.785, p = .1372
ANOVA, total looking time: F = 4.104, p = .0039

Note. Means, standard deviations, and 95% confidence intervals are reported. ANOVA shows significant differences in looking time (p = .0039), but not in time lag (p = .1372).
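A one-way ANOVA F statistic can be recovered approximately from the summary rows of Table 2 alone. The sketch below is not the authors' analysis: it assumes the reported standard deviations are sample standard deviations, and because the table values are rounded, the result is only expected to land near the reported statistic. The function name `anova_from_summary` is hypothetical.

```python
def anova_from_summary(groups):
    """One-way ANOVA from per-group (mean, sample SD, count) tuples.

    Returns (F, df_between, df_within).
    """
    total_n = sum(n for _, _, n in groups)
    grand_mean = sum(mean * n for mean, _, n in groups) / total_n

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups)
    # Within-group sum of squares, rebuilt from each group's sample variance.
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups)

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Time lag rows of Table 2: (mean, SD, count) for Classes 1 to 5.
time_lag_summary = [
    (0.380, 2.656, 15),
    (0.456, 1.766, 9),
    (-1.005, 3.061, 43),
    (-0.984, 2.199, 34),
    (0.540, 2.483, 10),
]

f_stat, df_between, df_within = anova_from_summary(time_lag_summary)
print(round(f_stat, 2))  # 1.78
```

The recomputed value is close to the F = 1.785 reported for time lag; the small gap is consistent with rounding in the tabulated means and standard deviations.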

Figure 1.

Distribution of time lags across five classes.

While the overall ANOVA for time lag was not statistically significant (F = 1.785, p = .137), examination of the 95% confidence intervals (CIs) reveals class-specific trends. Class 3 (Walking-to-Crossing) showed a CI of [–1.95 s, –0.06 s], suggesting that inferred intention preceded the observable crossing action and that walking toward the ego-vehicle allows early detection of intention. Similarly, Class 4 (Walking-to-Speeding-Up) had a CI of [–1.75 s, –0.22 s], indicating that crossing intention was inferred before the observable acceleration.

In contrast, Classes 1, 2, and 5 had CIs that included zero, such as Class 1 [–1.09 s, 1.85 s] and Class 5 [–1.24 s, 2.32 s], indicating greater variability and weaker temporal alignment in scenarios involving hesitation or delayed crossing. These findings highlight that while not all pedestrian behaviors allow reliable anticipation, crossing patterns involving steady or accelerating motion are more temporally aligned with prior intention, making them more predictable for proactive AV response systems.
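The text does not state how the intervals were computed. Assuming a standard t-based interval for the mean, the Class 3 row of Table 2 (mean −1.005, SD 3.061, n = 43) can be checked as follows, where $t_{0.975,\,42} \approx 2.018$ is the two-sided 95% critical value:

$$
\begin{aligned}
\mathrm{CI}_{95\%} &= \bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}} \\
&= -1.005 \pm 2.018 \times \frac{3.061}{\sqrt{43}} \\
&\approx -1.005 \pm 0.942 \\
&= [-1.947,\ -0.063]
\end{aligned}
$$

Under this assumption the result agrees with the reported interval of [−1.948, −0.063] up to rounding, and the interval excludes zero because the half-width (about 0.94 s) is smaller than the magnitude of the mean lag.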

Total Looking Time Analysis

To investigate how gaze behavior varies with pedestrian crossing decisions, we analyzed total looking time, defined as the cumulative duration for which a pedestrian looked toward the ego-vehicle, across the five behavioral classes. A one-way ANOVA (Table 2) revealed a significant main effect of behavioral class on total looking time (F(4, 108) = 4.104, p = .0039), indicating that gaze duration differs systematically with pedestrian action patterns.

Post-hoc Tukey HSD tests further clarified these differences, as is shown in Figure 2. Class 1 (Walking to Slow Down—Not Crossing) showed significantly longer looking times than Class 3 (Walking to Crossing—Crossing) (p = .0117). Similarly, Class 4 (Walking to Speed Up—Crossing) also exhibited longer gaze durations than Class 3 (p = .0128). No other class pairings showed statistically significant differences after multiple comparison correction. These results suggest that pedestrians who hesitate (Class 1) or accelerate into a crossing (Class 4) engage in more frequent or sustained visual monitoring of the approaching vehicle—likely reflecting uncertainty or an effort to coordinate with the driver. In contrast, Class 3 pedestrians, who walk steadily across without modifying pace, show significantly shorter gaze durations, possibly indicating more automatic or committed crossing behavior.
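For reference, total looking time as defined in this section can be derived from frame-wise “Looking/Not Looking” labels by counting the frames flagged as looking. The sketch below is illustrative only: the function name is hypothetical, the flags are made up, and the 30 frames-per-second rate is an assumption.

```python
FPS = 30.0  # assumed frame rate; not stated in the text


def total_looking_time(looking_flags, fps=FPS):
    """Cumulative seconds in which the pedestrian is labeled as looking.

    Looking episodes do not need to be contiguous; every flagged frame counts.
    """
    looking_frames = sum(1 for flag in looking_flags if flag)
    return looking_frames / fps


# Made-up clip: two separate looking episodes of 45 and 15 frames.
flags = [False] * 15 + [True] * 45 + [False] * 30 + [True] * 15
print(total_looking_time(flags))  # 2.0
```

Because the measure is cumulative, it does not distinguish one sustained look from several brief glances of the same total duration.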

Figure 2.

Tukey HSD test results comparing total looking time among pedestrian behavior classes.

Intention Change Around Critical Actions

To explore how pedestrian actions influence perceived crossing intention, we conducted a windowed shift analysis. For each critical action (e.g., greeting, slowing down), we calculated the mean aggregated intention score within a 0.8-second window before and after the action. This window size is informed by Oxley et al. (2005), who reported pedestrians typically require ~1 s to perceive a safe gap and initiate crossing, capturing both decision and motor components.

We applied paired t-tests to compare average intention values before and after each action type. As shown in Figure 3, “greet,” “slow down,” and “speed up” are associated with significant increases in inferred intention (p < .05), indicating that these actions often follow reliable intention inferences. In contrast, “walking direction changes” had no significant effect, suggesting they may not serve as reliable intention cues.
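The windowed comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the frame rate of 30 frames per second is assumed, and the per-case values are made up. Only the paired t statistic is computed here; a p-value would additionally require the t distribution with n − 1 degrees of freedom (for example via `scipy.stats.ttest_rel`).

```python
import math
import statistics

FPS = 30.0      # assumed frame rate; not stated in the text
WINDOW_S = 0.8  # window length from the text


def window_means(scores, onset_frame, window_s=WINDOW_S, fps=FPS):
    """Mean intention score in the window before and after an action onset.

    Assumes at least one frame exists on each side of the onset.
    """
    width = int(round(window_s * fps))
    before = scores[max(0, onset_frame - width):onset_frame]
    after = scores[onset_frame:onset_frame + width]
    return statistics.mean(before), statistics.mean(after)


def paired_t(before_means, after_means):
    """Paired t statistic and degrees of freedom for after-minus-before differences."""
    diffs = [after - before for before, after in zip(before_means, after_means)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Made-up per-case window means for one action type (five cases).
before_means = [0.40, 0.55, 0.50, 0.35, 0.60]
after_means = [0.60, 0.70, 0.65, 0.60, 0.70]

t_stat, dof = paired_t(before_means, after_means)
print(round(t_stat, 2), dof)
```

A positive t statistic here means the inferred crossing intention was higher after the action onset than before it, which is the direction of the effect reported for “greet,” “slow down,” and “speed up.”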

Figure 3.

Aggregated crossing intention scores before and after key pedestrian actions, with p-values from paired t-tests included.

Discussion

While our confidence interval analysis revealed class-specific trends, the overall ANOVA on time lag did not reach statistical significance. One likely reason is that the overall sample is small (111 cases) and unevenly distributed across the behavioral classes, ranging from 9 cases in Class 2 to 43 in Class 3. Additionally, during annotation we observed that pedestrians rarely remain fully stationary when deciding whether to cross. Most individuals continue subtle leg or body movements even when appearing to “stand,” resulting in a low number of true standing-to-walking transitions.

For more robust conclusions, future work should expand the dataset, refine behavior categorization, and incorporate additional preparatory cues beyond those currently labeled. Introducing nuanced behaviors—such as shifting weight, pausing, or head turns—may better capture the subtle cues that precede crossing decisions and improve the temporal alignment between intention and action in real-world contexts.

Conclusion

This study highlights the importance of modeling the temporal dynamics between pedestrian intentions and actions. While intention inference precedes crossing in consistent movement scenarios, variability across behaviors and limited data constrained overall significance in other behavior classes. Preparatory actions like greeting and looking showed clear value for early intention recognition. Future work should incorporate larger datasets and richer behavior cues to improve the timing and reliability of pedestrian intention prediction in autonomous systems.

Footnotes

ORCID iD

Renran Tian

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Science Foundation under Grant No. 2516587.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Boggs, A. M., Wali, & Khattak, A. J. (2020). Exploratory analysis of automated vehicle crashes in California: A text analytics & hierarchical Bayesian heterogeneity-based approach. Accident Analysis & Prevention, 135, 105354. https://doi.org/10.1016/j.aap.2019.105354

Chen, Jing, Tian, Chen, Domeyer, Toyoda, Sherony, & Ding (2022). PSI: A pedestrian behavior dataset for socially intelligent autonomous car. arXiv. https://arxiv.org/abs/2112.02604

Chen & Tian (2021). A survey on deep-learning methods for pedestrian behavior prediction from the egocentric view. 2021 IEEE International Conference on Intelligent Transportation Systems (ITSC), 1898–1905. https://doi.org/10.1109/ITSC48978.2021.9565041

Jeannerod (2006). Motor cognition: What actions tell the self. Oxford University Press.

Li, Eiffert, Shan, Gomez-Donoso, Worrall, & Nebot (2021). Attentional-GCNN: Adaptive pedestrian trajectory prediction towards generic autonomous vehicle use cases. In 2021 IEEE International Conference on Robotics and Automation (ICRA) (pp. 14241–14247). IEEE. https://doi.org/10.1109/ICRA48506.2021.9561480

Oxley, Ihsen, Fildes, Charlton, & Day (2005). Crossing roads safely: An experimental study of age differences in gap selection by pedestrians. Accident Analysis & Prevention, 37(5), 962–971. https://doi.org/10.1016/j.aap.2005.04.017

Pelikan, H. R. M. (2021). Why autonomous driving is so hard: The social dimension of traffic. In Companion of the 2021 ACM/IEEE international conference on human-robot interaction (pp. 81–85). Association for Computing Machinery. https://doi.org/10.1145/3434074.34471333

Premack & Woodruff (1978). Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4), 515–526. https://doi.org/10.1017/S0140525X00076512

Rasouli, Kotseruba, Kunic, & Tsotsos, J. K. (2019). PIE: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6262–6271).

Zacks, J. M., Speer, N. K., Swallow, K. M., Braver, T. S., & Reynolds, J. R. (2007). Event perception: A mind-brain perspective. Psychological Bulletin, 133(2), 273–293. https://doi.org/10.1037/0033-2909.133.2.273