Enhancing Taxi Demand Prediction with Limited Data using a Spatial-Temporal Large Language Model

Abstract

Taxi demand prediction is essential for intelligent transportation systems. Accurate predictions help address supply–demand imbalances and enable more efficient traffic management. Significant advances have been made in traffic demand prediction, particularly through the use of deep learning models. However, these models rely heavily on large amounts of data. Data scarcity remains a significant challenge because of high acquisition and storage costs, as well as data sparsity in certain locations and times. Thus, this study proposes a novel taxi demand prediction model that leverages the large language model GPT-2 to capture complex spatio-temporal dependencies. By integrating spatial correlations through a graph attention network and incorporating temporal dependencies at multiple scales, the proposed spatio-temporal taxi demand prediction large model (STTDP-LM) achieves accurate prediction with limited training data. Extensive experiments validate its effectiveness across two districts in Xi’an. Compared to the baseline method, the STTDP-LM reduces the root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) by an average of 12.25%, 12.55%, and 18.33%, respectively, across the two districts. When trained with only 1% of the data, the model still shows significant improvement, with average reductions of 33.83%, 34.12%, and 17.03% in the RMSE, MAE, and MAPE, respectively. The advantage of the model is more pronounced in multi-step prediction over a total horizon of 60 min. In summary, this study offers a promising solution for taxi demand prediction with limited historical data, providing valuable insight for real-world applications in intelligent transportation systems.

Keywords

taxi demand prediction; spatio-temporal modeling; large language models; graph attention networks; intelligent transportation

Taxi transportation plays a vital role in intelligent transportation systems, offering accessible and efficient travel options for passengers. However, taxi services suffer from supply–demand imbalance, which has causes on both the supply and the demand side ( 1 ). On the supply side, drivers often rely on intuition and experience, leading to inefficient searching behaviors. On the demand side, passenger travel demand fluctuates and is partly random. This imbalance results in longer cruising times, increased fuel consumption, and longer wait times for passengers. Consequently, establishing an accurate and reliable model to predict taxi demand is crucial for mitigating the imbalance, optimizing resource allocation, and improving traffic management ( 2 ).

The primary objective of taxi demand prediction is to forecast the number of taxi requests in a specific area for a future time period using historical demand data. This challenge has attracted considerable research interest. Various prediction methods have been proposed, with traditional studies predominantly relying on empirical statistics and classical machine learning techniques. More recently, deep learning methods, capable of capturing nonlinear spatial and temporal correlations in taxi demand data, have shown significant success ( 3 , 4 ).

Nevertheless, deep learning models typically require substantial amounts of data. The accuracy of deep learning-based approaches generally improves with an increase in the available training data. However, data scarcity remains a critical challenge, particularly in less developed urban areas. In some cases, collecting sufficient historical data is challenging because of high costs and data sparsity in certain locations and times. This limitation makes it difficult to build a comprehensive taxi demand dataset in many regions, hindering the performance of deep learning models ( 5 , 6 ). As a result, it is essential to develop solutions that both address data scarcity and improve the accuracy of prediction models trained on limited historical demand data.

The rapid development of large language models (LLMs) has significantly expanded their potential applications beyond natural language processing, offering a promising approach to a wide range of spatio-temporal traffic prediction tasks ( 7 , 8 ). LLMs are deep learning models that have been trained on large, high-quality generalized datasets to capture universal patterns across diverse domains. With their powerful few-shot learning capability and proficiency in cross-modality knowledge transfer, they can be adapted to specific tasks in data-scarce scenarios with only a small amount of fine-tuning data. One key reason LLMs can be effectively applied to taxi demand prediction is the structural similarity between time-series demand data and the data on which LLMs are trained ( 5 ). Both data types can be represented as consistent vectors, enabling LLMs to process them using the same model architecture. This commonality allows LLMs to recognize patterns in sequential data, such as taxi demand fluctuations, and predict future trends.

Thus, this study aims to accurately predict taxi demand using LLMs with limited historical data. Research on LLMs in traffic demand prediction is still limited, and existing approaches often overlook the road network topology in spatial modeling with LLMs. This omission matters because demand in adjacent areas or regions with similar characteristics tends to follow comparable patterns.

To address this, we propose a novel spatio-temporal taxi demand prediction model based on LLMs that integrates the road network topological structure. Firstly, the spatial correlations, including proximity and connectivity between different areas in the road network, are modeled using a graph attention network (GAT). Temporal correlations are modeled at both daily and weekly scales to capture the patterns and trends in taxi demand. Secondly, the powerful Generative Pre-trained Transformer 2 (GPT-2) is employed for the task. A small amount of historical taxi demand data is used to fine-tune this pre-trained model. To make the time-series demand data comprehensible to LLMs, it is transformed into a token embedding format by GPT-2. GPT-2’s inherent reasoning capabilities are then leveraged to uncover complex spatio-temporal dependencies embedded within the tokenized demand data. Finally, to optimize the fine-tuning process and ensure efficient model adaptation, we dynamically adjust the frozen and tunable layers. This approach enhances prediction accuracy with minimal historical data. In summary, this study aims to improve the effectiveness and accuracy of taxi demand prediction, providing a more robust solution for urban transportation management. The main contributions of this study are as follows.

(1) The study proposes an innovative deep learning model that uses LLMs to capture complex spatio-temporal dependencies, enabling accurate taxi demand prediction with minimal historical data.

(2) The study integrates the road network topology into the LLM-based prediction model for spatial correlation modeling using a GAT. Temporal correlations are captured from multiple perspectives, including the raw demand data, daily patterns, and weekly trends.

(3) The fine-tuning process of LLMs adjusts frozen and trainable layers, optimizing the parameters and ensuring high prediction accuracy.

(4) The study conducts experiments on real datasets to evaluate the model’s prediction performance with both large and small amounts of training data. Multi-step experiments are also performed to verify the effectiveness of long-term predictions. The results demonstrate superior performance of the proposed method.

Literature Review

Taxi Demand Prediction

Various advanced technologies have been developed to model spatio-temporal correlations and predict taxi demand. Traditional time-series forecasting methods, such as the autoregressive integrated moving average (ARIMA) and its variants, have been widely employed in the early stages of taxi demand prediction ( 9 ). However, taxi demand data are typically nonlinear, and the linear assumptions of these models limit their prediction accuracy.

To capture nonlinear dependencies in taxi demand data, traditional machine learning techniques such as multilinear regression (MLR) ( 10 ), support vector regression (SVR) ( 11 ), logistic regression, support vector machines (SVMs) ( 12 ), decision trees, hidden Markov models (HMMs) ( 13 ), and back propagation neural networks (BPNNs) ( 14 ) have been explored. While these methods can model some nonlinearities, they struggle to capture deep, long-term patterns and complex relationships in taxi demand. This limitation is especially evident with large-scale, high-dimensional data, requiring more advanced techniques to model both short-term fluctuations and long-term trends.

To overcome these limitations, advanced deep learning techniques have been increasingly employed. These models are effective at capturing complex, nonlinear, and long-term dependencies in time-series data, providing a more powerful framework for taxi demand prediction in dynamic urban environments. For instance, recurrent neural networks (RNNs) ( 15 ) are commonly used for modeling temporal dependencies. Nevertheless, they suffer from gradient vanishing and explosion in long-term predictions. Long short-term memory (LSTM) ( 1 ) networks and gated recurrent units (GRUs) ( 16 ) address these issues, effectively capturing long-term dependencies, with GRUs offering faster training. Moreover, several advanced deep learning approaches, such as attention mechanism-based methods ( 17 , 18 ) and graph neural networks ( 19 ), have been proposed to better capture spatio-temporal dependencies and improve prediction performance.

These existing methods rely heavily on large amounts of data. For example, Kim et al. ( 10 ) employed an MLR model on NYC taxi data from 2015 to 2017, achieving a root mean square error (RMSE) of 0.1129 and a mean absolute percentage error (MAPE) of 15.04% for single-step prediction. Zhe and Su ( 14 ) proposed a Spark-based optimized BPNN model, reaching a prediction accuracy of 90.2% using 2 years of data (2017–2018) for training and one year (2019) for validation. Liu et al. ( 1 ) applied a modified LSTM framework on NYC taxi data from 2014 and obtained an origin-destination MAPE (OD-MAPE) of 24.93% and an origin MAPE (O-MAPE) of 12.92%. Xu et al. ( 15 ) designed a sequential learning model based on RNNs and mixture density networks, achieving 83% prediction accuracy using NYC taxi data collected between January 2013 and June 2016. Although these models achieved promising results, they depend on large-scale, long-span historical data for training. Data scarcity is a significant challenge in urban sensing applications, particularly in less developed areas with limited or no data collection infrastructure. In many cases, gathering the required data is difficult because of high acquisition, storage, and maintenance costs. Moreover, in dynamic environments with constantly changing demand patterns, the lack of comprehensive and up-to-date data makes it even harder to maintain reliable predictions over time. Therefore, accurately predicting taxi demand with limited historical data is essential.

LLMs for Traffic Data Prediction

LLMs have strong learning capabilities, allowing them to learn effectively from a small number of samples, and they have recently been employed in traffic prediction. For instance, Huang ( 7 ) used an LLM to process textual traffic information and generate embeddings. These embeddings were combined with historical traffic data and then input into traditional spatio-temporal prediction models to explore the potential impact of non-numerical context information, such as special situations or weather, on traffic flow. Guo et al. ( 8 ) proposed a traffic flow prediction model based on LLMs to generate explainable predictions. They designed a structured textual prompt that incorporates multi-modal traffic flow information, helping LLMs to capture traffic patterns better. It was the first study to apply LLMs for explainable traffic flow prediction, and it demonstrated effective generalization across different traffic flow prediction scenarios without additional training. Ren et al. ( 5 ) proposed a framework named traffic prediction large language model (TPLLM) for traffic prediction based on pre-trained LLMs, to cope with full-sample and few-shot traffic prediction tasks. They designed an embedding module to enable LLMs to understand time-series data and to fuse spatio-temporal features implied within the traffic data. In addition, to reduce training costs and maintain high fine-tuning quality, they applied a cost-effective fine-tuning method, LoRA, to the TPLLM. Their approach effectively supported the development of intelligent transportation systems in areas with limited historical traffic data.

Liu et al. ( 20 ) proposed a spatio-temporal LLM for traffic prediction, which defined the timesteps at a location as a token and embedded each token with a spatio-temporal embedding layer. Meanwhile, they partially froze the multi-head attention (MHA) to capture global spatio-temporal dependencies between tokens for different traffic prediction tasks. Their framework demonstrated strong performance in both few-shot and zero-shot prediction scenarios. Rong et al. ( 21 ) proposed a lightweight spatio-temporal generative LLM for traffic flow forecasting. Their framework used a spatio-temporal module to capture the spatio-temporal correlations. In addition, they used a parameter transfer strategy to implement the “inference while train” mode, accelerating the training. The framework not only demonstrated strong prediction performance but also effectively reduced the computational burden of training on large-scale data. De Zarzà et al. ( 22 ) introduced real-time interventions with lightweight LLMs in autonomous driving. By combining a language-and-vision assistant with deep probabilistic reasoning, they improved the system’s real-time responsiveness. Peng et al. ( 23 ) proposed a lane-change LLM that treated the task as a language modeling problem. It predicted lane-change intentions and trajectories and offered explanations to improve interpretability, marking the first use of LLMs for lane-change behavior prediction.

According to the aforementioned literature, LLMs have been widely applied in traffic prediction tasks and have demonstrated strong performance, especially with limited data. However, few studies focus on taxi demand prediction using LLMs. Furthermore, current research on taxi demand prediction using LLMs neglects the adjacency relationships between spatial regions, which are crucial for spatial modeling. As a result, developing a novel LLM-based approach that incorporates road network topology is a significant challenge, requiring solutions for spatial modeling in LLMs and addressing data scarcity.

Methodology

Problem Statement

The purpose of this study is to predict taxi demand from historical demand data, which is obtained by aggregating taxi order records into fixed time intervals. The historical taxi demand is computed by dividing the daily taxi orders into time intervals based on a specified time granularity. For example, when the time granularity is set to 10 min, starting from 7:00 a.m., the first time interval would be from 7:00 to 7:10 a.m., followed by 7:10 to 7:20 a.m., and so on. Since taxi demand is highly correlated across neighboring regions, modeling the spatial dependencies among regions is crucial. The graph structure provides a natural way to represent spatial relationships between regions, and GAT can further capture the dynamic importance of each neighboring region through its attention mechanism.
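As a rough illustration of the interval assignment described above, the following Python sketch maps order timestamps to 10-min interval indices and counts the orders per interval. The function names, the half-open interval convention (an order at exactly 7:10 falls into the second interval), and the sample timestamps are assumptions made for illustration and are not taken from the study's preprocessing code.

```python
from collections import Counter
from datetime import datetime


def interval_index(ts, day_start_hour=7, granularity_min=10):
    """Index of the interval containing `ts`, counted from the day's start time.

    Intervals are treated as half-open, e.g. [7:00, 7:10), [7:10, 7:20), ...
    (an assumption). Timestamps before the start time yield negative indices.
    """
    start = ts.replace(hour=day_start_hour, minute=0, second=0, microsecond=0)
    return int((ts - start).total_seconds() // (granularity_min * 60))


def demand_counts(order_times, **kwargs):
    """Count orders per interval index for a list of order timestamps."""
    return Counter(interval_index(ts, **kwargs) for ts in order_times)


# Hypothetical order timestamps for one grid cell on one day.
orders = [
    datetime(2019, 3, 1, 7, 3),
    datetime(2019, 3, 1, 7, 9, 59),
    datetime(2019, 3, 1, 7, 10),
    datetime(2019, 3, 1, 7, 25),
]
counts = demand_counts(orders)
print(counts[0], counts[1], counts[2])  # prints "2 1 1"
```

Repeating this count for every grid cell gives one demand vector per time interval.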

Definition 1: Traffic network $G$ . The traffic network is represented as a graph $G = (V, E, A)$ , where $V = \{v_{1}, v_{2}, \dots, v_{N}\}$ is the finite set of nodes (i.e., regions) in the graph. In this study, each node represents a grid cell of fixed size. Here, $N$ is the number of nodes, $E$ is the finite set of edges in the graph, and $A \in R^{N \times N}$ is the adjacency matrix, which represents the graph structure of the traffic network. The elements of the adjacency matrix $A$ are denoted as $a_{ij}$ , where $a_{ij} = 0$ indicates there is no edge between nodes $v_{i}$ and $v_{j}$ , and $a_{ij} = 1$ denotes the presence of an edge between these nodes.

Definition 2: Taxi demand prediction problem. The historical taxi demand data of all grids at the $t$ th time interval can be expressed as $X_{t} \in R^{N}$ . Given the traffic network $G$ and the taxi demand data of the $T$ most recent time intervals $X = \{X_{t - T + 1}, \dots, X_{t - 1}, X_{t}\} \in R^{N \times T}$ , the goal is to learn a mapping function $f$ that outputs the taxi demand for the next $T^{'}$ time intervals across all nodes, denoted as $Y^{'} = \{Y_{t + 1}^{'}, Y_{t + 2}^{'}, \dots, Y_{t + T^{'}}^{'}\} \in R^{N \times T^{'}}$ . Figure 1 illustrates the problem schematically. The mapping can be written as:

[Y_{t + 1}^{'}, Y_{t + 2}^{'}, \dots, Y_{t + T^{'}}^{'}] = f (G; X_{t - T + 1}, \dots, X_{t - 1}, X_{t})

(1)

Figure 1.

Schematic diagram of the taxi demand prediction problem.
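To make the input and output shapes in Definition 2 concrete, the sketch below slices a demand matrix into training pairs with a sliding window: each input holds the $T$ most recent intervals for all nodes, and each target holds the following $T^{'}$ intervals. The function name `make_windows`, the nested-list layout, and the toy numbers are assumptions for illustration only.

```python
def make_windows(demand, T, T_out):
    """Build (X, Y) pairs from a demand matrix.

    demand: list of N rows (one per node), each holding L time intervals.
    X has shape N x T (history), Y has shape N x T_out (future).
    """
    length = len(demand[0])
    samples = []
    # t is the first future interval; it needs T intervals of history before
    # it and T_out intervals from it onward.
    for t in range(T, length - T_out + 1):
        X = [row[t - T:t] for row in demand]
        Y = [row[t:t + T_out] for row in demand]
        samples.append((X, Y))
    return samples


# Toy demand for N = 2 nodes over L = 6 intervals (hypothetical values).
demand = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
]
samples = make_windows(demand, T=3, T_out=2)
first_X, first_Y = samples[0]
print(len(samples), first_X, first_Y)
```

With six intervals, a history of three, and a horizon of two, the window can start at two positions, so two samples are produced.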

Overall Framework

To accurately predict taxi demand with limited data, we propose a spatio-temporal taxi demand prediction large model (STTDP-LM) through fine-tuning the LLM GPT-2, as illustrated in Figure 2.

Figure 2.

Framework of the proposed spatio-temporal taxi demand prediction large model (STTDP-LM).

The framework takes as input the taxi demand of all nodes over the past $T$ timesteps, together with an adjacency matrix representing the spatial relationships between nodes. The sequence of timesteps at each grid is treated as one token, that is, each token corresponds to the historical taxi demand data of one grid. The spatial and temporal modeling components transform the tokens into corresponding embeddings. Specifically, spatial embeddings are obtained using the GAT, which leverages the regional adjacency relationships to model spatial dependencies. Temporal embeddings are constructed by applying a linear transformation to the historical demand values and adding absolute positional encodings at multiple time scales, including hour-of-day and day-of-week patterns. These embeddings are then fused through convolution layers to produce input representations compatible with GPT-2. The fused embeddings are fed into the partially frozen GPT-2 model, which is fine-tuned to learn intricate spatio-temporal patterns for the taxi demand prediction task. Finally, a regression convolution layer predicts taxi demand for each grid over the next $T^{'}$ timesteps.

Graph Attention Network for Spatial Modeling

In real-world traffic networks, there exist hidden spatial correlations between different nodes. These spatial correlations primarily stem from road connectivity, which influences how demand in one area might affect surrounding areas. Nodes that are connected through the road network are likely to have more similar demand patterns. Therefore, capturing these spatial correlations is crucial for enhancing the predictive performance of models.

Because of the complex and dynamic nature of traffic networks, graph-based models provide a natural and flexible framework to represent and model these spatial correlations. To better capture the spatial correlations in the demand data between nodes, this study utilizes the GAT for spatial modeling of the node demand data. The GAT allows the model to dynamically assign different attention weights to neighboring nodes based on their importance in predicting demand. This enables the model to incorporate spatial dependencies in a flexible and adaptive manner.

Firstly, the historical demand data $X$ of each node is linearly transformed to map the original data into a new space, as shown in Equation 2, where $W$ is a learnable parameter matrix and $matmul (X, W)$ denotes matrix multiplication:

WX = matmul (X, W)

(2)

Then, to capture the relationship between the demand data of node $i$ and node $j$ , the attention coefficient $b_{ij}$ is computed by scoring the concatenated transformed features of the two nodes and applying the LeakyReLU activation function, which adds nonlinearity to the attention mechanism, as shown in Equation 3, where $∥$ denotes concatenation and $α^{T}$ is the transpose of a learnable weight vector:

b_{ij} = LeakyReLU (α^{T} [W X_{i} ∥ W X_{j}])

(3)

After obtaining the attention coefficients, the adjacency matrix is used to perform the masking operation to ensure that attention is only computed between nodes that are connected in the graph (i.e., where there is an edge between them). The coefficients are then normalized using the softmax function. The equations are as follows:

b_{ij}^{'} = \begin{cases} b_{ij}, & a_{ij} = 1 \\ - \infty, & a_{ij} = 0 \end{cases}

(4)

c_{ij} = softmax (b_{ij}^{'})

(5)

where $a_{ij}$ is the entry of the adjacency matrix $A$ for nodes $i$ and $j$ , and $- \infty$ marks a disconnected node pair. Because the exponential of $- \infty$ is zero, the masked pairs receive zero weight and do not influence the softmax normalization.

Finally, the spatial embedding $M_{S}$ for each node is computed by aggregating the features of its neighbors, weighted by the normalized attention coefficients, as shown in Equation 6, where $σ$ is the activation function and $c_{ij}$ are the normalized attention coefficients:

M_{S} = σ (\sum_{j \in N (i)} c_{ij} W X_{j}) \in R^{N \times D}

(6)
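The spatial modeling steps in Equations 2 to 6 can be sketched for a single attention head in plain Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the activation $σ$ is taken to be ReLU, the LeakyReLU slope is set to 0.2, the adjacency matrix is assumed to contain self-loops so that every node has at least one neighbor, and all names and toy values are hypothetical.

```python
import math


def matmul(X, W):
    """Multiply an (N x T) matrix by a (T x D) matrix given as nested lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def gat_layer(X, A, W, alpha):
    """Single-head graph attention following Eq. 2-6 (sigma assumed to be ReLU)."""
    H = matmul(X, W)  # Eq. 2: linear transformation of node features
    n, d = len(H), len(H[0])
    M_S = []
    for i in range(n):
        if not any(A[i]):
            raise ValueError(f"node {i} has no neighbors; softmax is undefined")
        # Eq. 3: score the concatenation [W X_i || W X_j] (list + concatenates)
        b = [leaky_relu(sum(a * h for a, h in zip(alpha, H[i] + H[j])))
             for j in range(n)]
        # Eq. 4: mask node pairs that are not connected
        b_masked = [b[j] if A[i][j] == 1 else float("-inf") for j in range(n)]
        # Eq. 5: normalize over the neighbors of node i
        c = softmax(b_masked)
        # Eq. 6: attention-weighted sum of neighbor features, then activation
        agg = [sum(c[j] * H[j][k] for j in range(n)) for k in range(d)]
        M_S.append([max(0.0, v) for v in agg])
    return M_S


# Toy example: 3 nodes on a chain with self-loops, T = D = 2.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
W = [[1.0, 0.0], [0.0, 1.0]]      # identity, so H equals X
alpha = [0.0, 0.0, 0.0, 0.0]      # zero scores give uniform attention
M_S = gat_layer(X, A, W, alpha)
print(M_S)
```

With zero attention weights every neighbor receives the same coefficient, so each output row is the plain average of the node's neighbors (including itself); node 0, for instance, averages the features of nodes 0 and 1.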

Multi-Scale Temporal Modeling

Taxi demand data is influenced by both spatial distribution and temporal variations, such as daily fluctuations and weekly trends. These temporal dynamics reflect how demand evolves under different conditions, so capturing them is crucial to improving model accuracy.

To model these temporal variations effectively, linear layers encode the input into separate temporal embeddings at different scales. Embeddings are learned feature representations that map the raw data into fixed-size vectors, capturing underlying patterns and trends over time. In addition to the linear transformation of the raw demand series, absolute positional encodings at the “daily” and “weekly” scales are attached to each demand record. This enables the model to capture temporal correlations and patterns across different time intervals. The equations for the multi-scale temporal embeddings are as follows:

M_{T}^{r} = W_{raw} X

(7)

M_{T}^{d} = W_{day} X (day)

(8)

M_{T}^{w} = W_{week} X (week)

(9)

where $X (day) \in R^{N \times T_{d}}$ and $X (week) \in R^{N \times T_{w}}$ are the absolute positional encodings for hour-of-day and day-of-week, respectively, and $W_{raw} \in R^{T \times D}$ , $W_{day} \in R^{T_{d} \times D}$ , and $W_{week} \in R^{T_{w} \times D}$ are learnable parameters. By combining these three embeddings, the temporal embedding $M_{T} \in R^{N \times D}$ is obtained:

M_{T} = M_{T}^{r} ∥ M_{T}^{d} ∥ M_{T}^{w}

(10)
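A small sketch may help to show how Equations 7 to 10 fit together for a single node. It rests on several assumptions that the text does not fix: the positional encodings $X (day)$ and $X (week)$ are taken to be one-hot vectors over 144 ten-minute slots per day and 7 days per week, the three embeddings are concatenated as written in Equation 10 (giving a vector of length $3D$), and the weight values are arbitrary placeholders rather than learned parameters.

```python
from datetime import datetime


def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v


def vecmat(x, W):
    """Row vector x (length K) times matrix W (K x D)."""
    return [sum(xk * W[k][d] for k, xk in enumerate(x)) for d in range(len(W[0]))]


def temporal_embedding(x_raw, ts, W_raw, W_day, W_week, granularity_min=10):
    """Temporal embedding of one node whose history window ends at time `ts`."""
    slot = (ts.hour * 60 + ts.minute) // granularity_min  # time-of-day slot
    dow = ts.weekday()                                    # 0 = Monday
    m_r = vecmat(x_raw, W_raw)                            # Eq. 7
    m_d = vecmat(one_hot(slot, len(W_day)), W_day)        # Eq. 8
    m_w = vecmat(one_hot(dow, len(W_week)), W_week)       # Eq. 9
    return m_r + m_d + m_w                                # Eq. 10 (list + concatenates)


# Placeholder weights with T = 2 history steps and D = 2 (hypothetical values).
W_raw = [[1.0, 0.0], [0.0, 1.0]]
W_day = [[float(k), float(-k)] for k in range(144)]   # T_d = 144 slots of 10 min
W_week = [[10.0 * k, 0.0] for k in range(7)]          # T_w = 7 days

emb = temporal_embedding([1.0, 2.0], datetime(2019, 3, 1, 7, 10), W_raw, W_day, W_week)
print(emb)
```

Because the positional inputs are one-hot in this sketch, multiplying by $W_{day}$ or $W_{week}$ simply selects one row of the weight matrix, which is how an embedding lookup table behaves.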

Spatio-Temporal Feature Fusion

While both spatial and temporal models extract meaningful features, they each focus on one dimension—spatial or temporal. After spatial modeling and temporal modeling, the extracted features from both domains capture important but distinct aspects of the taxi demand prediction task. Spatial modeling is designed to capture the underlying relationships between different nodes in the traffic network. This modeling provides spatial embeddings that represent how demand is influenced by nearby nodes. However, these spatial embeddings alone are insufficient to fully account for the temporal fluctuations in taxi demand. On the other hand, temporal modeling focuses on capturing the time-varying patterns in taxi demand. By utilizing historical demand data and patterns such as daily fluctuations and weekly trends, temporal modeling generates temporal embeddings. These embeddings reflect how demand changes over time but do not incorporate spatial relationships.

To capture both spatial and temporal dependencies simultaneously, a fusion convolution layer is employed. The fusion convolution layer integrates the spatial and temporal embeddings into a unified representation that accounts for the interactions between spatial nodes and time-varying demand patterns, enabling the model to learn the complex spatio-temporal relationships in taxi demand. It also projects the fused feature representation to the dimensions required for further processing by the LLM. The equation for this fusion operation is as follows:

P_{F} = FusionConv (M_{S} ∥ M_{T}; θ)

(11)

where $P_{F} \in R^{N \times 2 D}$ and $θ$ is the learnable parameter of the fusion convolution.

LLM Fine-Tuning and Model Prediction

After spatio-temporal feature fusion, the fused vector is input into the fine-tuning model for further tuning. The base model is GPT-2. In the fine-tuning model, the first $F$ MHA layers are frozen to retain the knowledge already learned in the LLM, and the last $U$ MHA layers are unfrozen to effectively handle the spatio-temporal dependencies in the data.

In the first $F$ layers, freezing the MHA layers ensures that the learned knowledge in the initial layers is preserved. The equations are as follows:

\bar{P^{i}} = MHA (LN (P^{i})) + P^{i}

(12)

P^{i + 1} = FFN (LN (\bar{P^{i}})) + \bar{P^{i}}

(13)

where $i$ ranges from 1 to $F - 1$ and $P^{1} = P_{F}$ . $\bar{P^{i}}$ represents the intermediate representation of the layer after applying the frozen MHA, while $P^{i + 1}$ represents the output of the layer after the second layer normalization (LN) and the feed-forward network (FFN).

Details of the internal operation of GPT-2 are shown in Equations 14–18, which are designed to further capture the spatio-temporal dependencies within the tensor output from the fusion convolution layer. The core components include LN, MHA, and the FFN. LN is used to stabilize the learning process by normalizing the activations of each layer. The MHA captures long-range temporal and spatial dependencies through a self-attention mechanism that enables each element in the sequence to attend to all previous elements. The FFN performs nonlinear transformations independently at each position, enhancing the model’s expressive capacity and enabling it to learn complex feature interactions:

LN (P^{i}) = γ ⊙ \frac{P^{i} - μ}{ω} + β

(14)

MHA ({\tilde{P}}^{i}) = W^{O} (head_{1} ∥ \dots ∥ head_{h})

(15)

head_{i} = Attention (W_{i}^{Q} {\tilde{P}}^{i}, W_{i}^{K} {\tilde{P}}^{i}, W_{i}^{V} {\tilde{P}}^{i})

(16)

Attention ({\tilde{P}}^{i}) = softmax (\frac{{\tilde{P}}^{i} {\tilde{P}}^{iT}}{\sqrt{d_{k}}}) {\tilde{P}}^{i}

(17)

FFN ({\hat{P}}^{i}) = \max (0, W_{1} {\hat{P}}^{i} + b_{1}) W_{2} + b_{2}

(18)

where ${\tilde{P}}^{i}$ is the output of $P^{i}$ after passing through the first LN, ${\hat{P}}^{i}$ is the output of ${\bar{P}}^{i}$ after the second LN, $γ$ and $β$ are the learnable parameters, and $μ$ and $ω$ represent the mean and standard deviation, respectively.
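As a worked illustration of Equations 14 and 18, the sketch below applies layer normalization and the feed-forward network to a single token vector. It is written under assumptions: a small constant `eps` is added inside the square root for numerical safety (it does not appear in Equation 14), the FFN follows the max(0, ·) form of Equation 18 literally even though the released GPT-2 uses a GELU activation, and all weights are toy values.

```python
import math


def layer_norm(p, gamma, beta, eps=1e-5):
    """Eq. 14: normalize one token vector, then scale by gamma and shift by beta."""
    mu = sum(p) / len(p)
    var = sum((v - mu) ** 2 for v in p) / len(p)
    omega = math.sqrt(var + eps)  # standard deviation; eps is an added safeguard
    return [g * (v - mu) / omega + b for v, g, b in zip(p, gamma, beta)]


def ffn(p, W1, b1, W2, b2):
    """Eq. 18: two linear maps with a max(0, .) nonlinearity in between."""
    hidden = [
        max(0.0, sum(pk * W1[k][j] for k, pk in enumerate(p)) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(hj * W2[j][d] for j, hj in enumerate(hidden)) + b2[d]
        for d in range(len(b2))
    ]


# Layer normalization of a toy vector with gamma = 1 and beta = 0.
normed = layer_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4, beta=[0.0] * 4)
mean_out = sum(normed) / len(normed)
var_out = sum((v - mean_out) ** 2 for v in normed) / len(normed)

# FFN with identity weights: the negative input is clipped by max(0, .).
identity = [[1.0, 0.0], [0.0, 1.0]]
ffn_out = ffn([1.0, -1.0], identity, [0.0, 0.0], identity, [0.5, 0.5])
print(normed, ffn_out)
```

After normalization the vector has a mean close to zero and a variance close to one, which is the stabilizing effect attributed to LN in the text.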

In the final $U$ layers, the MHA is unfrozen to capture the spatio-temporal dependencies of the demand data, as shown in Equations 19 and 20, where ${\bar{P}}^{F + U - 1}$ represents the intermediate representation of the last layer after applying the unfrozen MHA, while $P^{F + U}$ represents the final representation after LN and the FFN:

{\bar{P}}^{F + U - 1} = MHA (LN (P^{F + U - 1})) + P^{F + U - 1}

(19)

P^{F + U} = FFN (LN ({\bar{P}}^{F + U - 1})) + {\bar{P}}^{F + U - 1}

(20)
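The partial freezing scheme can be expressed as a simple rule over parameter names. The sketch below assumes the parameter naming used by the Hugging Face implementation of GPT-2 (for example `h.0.attn.c_attn.weight`, with zero-based layer indices), and it assumes that every parameter outside the attention modules stays frozen; the text only specifies how the MHA layers are treated, so that second choice is a guess. The values of $F$ and $U$ in the example are hypothetical.

```python
import re

# Matches transformer-block parameters such as "h.3.attn.c_attn.weight".
_BLOCK_PARAM = re.compile(r"^h\.(\d+)\.(.+)$")


def is_trainable(param_name, num_frozen_layers):
    """Return True if the parameter should be updated during fine-tuning.

    Attention parameters in blocks with index >= num_frozen_layers (the last
    U blocks) are trainable. Everything else is kept frozen, which is an
    assumption of this sketch.
    """
    match = _BLOCK_PARAM.match(param_name)
    if match is None:
        return False  # token/position embeddings and the final norm
    layer_index = int(match.group(1))
    is_attention = match.group(2).startswith("attn.")
    return is_attention and layer_index >= num_frozen_layers


# Hypothetical configuration: F = 4 frozen blocks followed by U = 2 unfrozen.
F = 4
names = [
    "wpe.weight",
    "h.0.attn.c_attn.weight",
    "h.3.attn.c_proj.bias",
    "h.4.attn.c_attn.weight",
    "h.5.attn.c_proj.weight",
    "h.5.mlp.c_fc.weight",
    "ln_f.weight",
]
trainable = [name for name in names if is_trainable(name, F)]
print(trainable)
```

In a PyTorch training script, such a rule would typically be applied by looping over `model.named_parameters()` and setting `requires_grad` on each parameter accordingly.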

After fine-tuning the LLM, the regression convolution (RConv) is used to predict the taxi demand for the following $T^{'}$ timesteps. The equation is as follows, where $Y^{'} \in R^{N \times T^{'}}$ and $δ$ is the learnable parameter of the regression convolution:

Y^{'} = RegressionConv (P^{F + U}; δ)

(21)

The loss function of the STTDP-LM is defined as follows:

Loss = ∥ Y^{'} - Y ∥ + λ \cdot L

(22)

where $Y^{'}$ is the predicted taxi demand, $Y$ is the ground truth, $L$ represents the L2 regularization term, which helps control overfitting, and $λ$ is a hyperparameter that weights the regularization term. The whole process of the STTDP-LM is shown in Algorithm 1.

Algorithm 1. The STTDP-LM Framework
Algorithm 1. The STTDP-LM Framework

Input: Historical taxi demand $X \in R^{N \times T}$, adjacency matrix $A \in R^{N \times N}$, all hyperparameters
Output: Trained STTDP-LM
1: for each epoch do
2:   Shuffle training data
3:   for each batch $X$ in training data do
4:     // Spatial embedding via GAT (Eq. 2–6):
5:     $MX \leftarrow matmul (X, M)$
6:     for each node $i$ do
7:       for each node $j \in N (i)$ do
8:         $b_{ij} \leftarrow LeakyReLU (α^{T} [W X_{i} ∥ W X_{j}])$
9:         if $a_{ij} = 1$ then
10:           $b_{ij}^{'} \leftarrow b_{ij}$
11:         else
12:           $b_{ij}^{'} \leftarrow - \infty$
13:         end if
14:       end for
15:       $c_{ij} \leftarrow softmax (b_{ij}^{'})$
16:       $M_{S} [i] \leftarrow σ (\sum_{j \in N (i)} c_{ij} W X_{j})$
17:     end for
18:     // Temporal embedding (Eq. 7–10):
19:     $M_{T}^{r} \leftarrow W_{raw} X$
20:     $M_{T}^{d} \leftarrow W_{day} X (day)$
21:     $M_{T}^{w} \leftarrow W_{week} X (week)$
22:     $M_{T} \leftarrow M_{T}^{r} ∥ M_{T}^{d} ∥ M_{T}^{w}$
23:     // Feature fusion (Eq. 11):
24:     $P^{F} \leftarrow FusionConv (M_{S} ∥ M_{T}; θ)$
25:     // LLM fine-tuning (Eq. 12–20):
26:     Initialize $P^{1} \leftarrow P^{F}$
27:     for $i = 1$ to $F + U$ do
28:       if $i \leq F$ then ▹ Frozen GPT-2 layers
29:         ${\bar{P}}^{i} \leftarrow MHA (LN (P^{i})) + P^{i}$
30:         $P^{i + 1} \leftarrow FFN (LN ({\bar{P}}^{i})) + {\bar{P}}^{i}$
31:       else ▹ Unfrozen GPT-2 layers
32:         ${\bar{P}}^{F + U - 1} \leftarrow MHA (LN (P^{F + U - 1})) + P^{F + U - 1}$
33:         $P^{F + U} \leftarrow FFN (LN ({\bar{P}}^{F + U - 1})) + {\bar{P}}^{F + U - 1}$
34:       end if
35:     end for
36:     // Output and optimization (Eq. 21 and 22):
37:     ${\hat{Y}}^{'} \leftarrow RegressionConv (P^{F + U + 1}; δ)$
38:     Compute loss $Loss = ∥ {\hat{Y}}^{'} - Y ∥ + λ \cdot L$
39:     Update parameters via Ranger21 optimizer
40:   end for
41: end for
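Steps 8–16 of Algorithm 1 amount to a softmax over attention scores that is restricted to each grid's neighbors. The following is a minimal pure-Python sketch of that masking step only. The function name `masked_attention` and the toy scores are illustrative assumptions, and the scores are taken as already computed, so the learned projections $W$ and $α$ are omitted.

```python
import math


def masked_attention(scores, adjacency):
    """Row-wise softmax over attention scores, restricted to graph neighbors.

    scores[i][j]    -- raw score b_ij (assumed already passed through LeakyReLU)
    adjacency[i][j] -- 1 if grids i and j are connected, otherwise 0
    Scores of non-neighbors are replaced by -inf, so their weight is exactly 0.
    """
    weights = []
    for score_row, adj_row in zip(scores, adjacency):
        masked = [s if a == 1 else -math.inf for s, a in zip(score_row, adj_row)]
        finite = [m for m in masked if m != -math.inf]
        if not finite:
            # Isolated node: no neighbors, so every weight stays at zero.
            weights.append([0.0] * len(masked))
            continue
        peak = max(finite)  # subtract the maximum for numerical stability
        exps = [math.exp(m - peak) for m in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example with three grids (values are made up for illustration).
scores = [[0.5, 1.0, -0.2],
          [0.3, 0.3, 0.3],
          [0.0, 2.0, 1.0]]
adjacency = [[1, 1, 0],
             [1, 1, 1],
             [0, 1, 1]]
attention = masked_attention(scores, adjacency)
for row in attention:
    print([round(w, 3) for w in row])
```

Each row of `attention` sums to one over the neighbors of that grid, and the entries for unconnected grid pairs are exactly zero.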

Experiments

Dataset

This study uses taxi trajectory data from Xi’an, covering taxi operations from February 28, 2019, to March 30, 2019. We first converted the coordinate system of the collected trajectory data. We then cleaned the data by removing abnormal GPS points and performing map matching with a hidden Markov model (HMM) (24) to correct GPS drift. After that, we extracted the origin and destination of each trip. We used TransBigData, an open-source Python toolkit for processing and analyzing spatio-temporal transportation data such as taxi trajectories, floating car data, and shared bike data, to aggregate the trajectory data at a 10-min time interval. The toolkit provides tools for map matching, trajectory cleaning, grid-based spatial division, and origin–destination matrix generation. Each trip record was mapped to a predefined 1 km × 1 km spatial grid based on its longitude and latitude. The 1-km grid strikes a balance between spatial resolution and computational efficiency: a smaller grid would increase complexity and result in sparse data, while a larger grid would simplify calculations but may not meet the precision requirements for accurate positioning (1, 25, 26).
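The aggregation described above can be illustrated with a short standard-library sketch. It does not use the TransBigData API. The grid origin, the flat-earth conversion from degrees to kilometers, the helper names, and the choice to count trip origins as demand are all assumptions made for illustration.

```python
import math
from collections import Counter
from datetime import datetime

GRID_KM = 1.0             # grid edge length used in the study
SLOT_MINUTES = 10         # aggregation interval used in the study
KM_PER_DEG_LAT = 111.32   # approximate value; an assumption of this sketch


def grid_cell(lon, lat, origin_lon, origin_lat):
    """Return the (row, col) of the 1 km cell containing a point."""
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(origin_lat))
    col = math.floor((lon - origin_lon) * km_per_deg_lon / GRID_KM)
    row = math.floor((lat - origin_lat) * KM_PER_DEG_LAT / GRID_KM)
    return row, col


def time_slot(ts):
    """Index of the 10-min slot within the day (0 to 143)."""
    return (ts.hour * 60 + ts.minute) // SLOT_MINUTES


def count_demand(pickups, origin_lon, origin_lat):
    """Count trip origins per (grid cell, date, time slot)."""
    counts = Counter()
    for lon, lat, ts in pickups:
        cell = grid_cell(lon, lat, origin_lon, origin_lat)
        counts[(cell, ts.date(), time_slot(ts))] += 1
    return counts


# Hypothetical pickups and a hypothetical south-west grid corner.
pickups = [
    (108.905, 34.205, datetime(2019, 3, 1, 7, 3)),
    (108.906, 34.204, datetime(2019, 3, 1, 7, 8)),
    (108.930, 34.250, datetime(2019, 3, 1, 7, 12)),
]
demand = count_demand(pickups, origin_lon=108.90, origin_lat=34.20)
for key, value in sorted(demand.items()):
    print(key, value)
```

The first two pickups fall in the same cell and the same 10-min slot, so they are counted together; the third lands in a different cell and the next slot.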

Figure 3 shows a map of the main urban area of Xi’an before and after grid division. Figure 4 presents the spatial distribution of taxi demand at three time periods on both weekdays and non-working days. In general, the demand hotspots during all three time periods are mostly concentrated in the central area of the map. The color depth of each grid represents the demand level. Darker colors indicate higher demand in that grid. From 07:00 to 08:00, high-demand areas are more concentrated. This suggests that travel patterns during the morning peak are more fixed. Taxi demand mainly appears near residential areas, transport hubs, and commuting routes. On non-working days, the hotspot areas are fewer compared to weekdays. From 12:00 to 13:00, demand is still focused in the central area, but the high-demand zones become more spread out. This may be caused by people going out for lunch or errands, which leads to more scattered activity. From 19:00 to 20:00, the demand distribution is even more dispersed, especially in the central area. This may be related to people engaging in various activities after work.

Figure 3.

Study area: (a) the main urban area of Xi’an; (b) grid division result.

Figure 4.

Spatial distribution of taxi demand in different periods: (a) weekday (07:00–08:00); (b) weekend (07:00–08:00); (c) weekday (12:00–13:00); (d) weekend (12:00–13:00); (e) weekday (19:00–20:00); (f) weekend (19:00–20:00).

Xincheng and Lianhu districts are selected as the primary research areas because of their size and demand density. In the following sections, Xincheng district will be referred to as district 1, and Lianhu district will be referred to as district 2. The total demand for all days in the dataset for districts 1 and 2 is 1,111,164 and 1,183,826, respectively. The average daily demand per grid in each district is 535 and 530, respectively. Taking district 1 as an example, after grid partitioning, there are 67 grids in total. The historical demand values of these 67 grids are fed into the model for training. The model uses the demand values from the past 12 timesteps of each grid to predict the demand values for the next timestep.
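The input–output pairing described above (12 past timesteps in, the next timestep out) is a sliding window over the demand series. A minimal sketch follows; the function name `make_windows` and the synthetic demand values are assumptions for illustration.

```python
def make_windows(series, history=12, horizon=1):
    """Split a demand series into (past, future) training samples.

    series  -- list of timesteps, each a list with one demand value per grid
    history -- number of past timesteps fed to the model (12 in the study)
    horizon -- number of future timesteps to predict (1 for next-step prediction)
    """
    samples = []
    for start in range(len(series) - history - horizon + 1):
        past = series[start:start + history]
        future = series[start + history:start + history + horizon]
        samples.append((past, future))
    return samples


# Synthetic stand-in for one day (144 slots of 10 min) over the 67 grids of district 1.
series = [[(t + g) % 5 for g in range(67)] for t in range(144)]
samples = make_windows(series)
print(len(samples), len(samples[0][0]), len(samples[0][0][0]))
```

With one day of data the sketch yields 144 − 12 − 1 + 1 = 132 samples, each holding a 12 × 67 history and a 1 × 67 target.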

Experiment Settings and Evaluation Metric

To evaluate the effectiveness of the proposed STTDP-LM, a series of experiments is conducted. The baseline methods include the autoregressive integrated moving average (ARIMA), support vector regression (SVR), LSTM, GRU, GAT, graph convolutional network (GCN), and spatial-temporal graph convolutional network (STGCN). To verify the effectiveness of spatio-temporal correlation modeling, two ablation models are derived from the proposed model: the STTDP-LM-Spatial, which uses only spatial embeddings, and the STTDP-LM-Temporal, which uses only temporal embeddings. Because both are variants of the proposed approach designed for ablation experiments, they are not counted among the baseline methods.

The dataset is divided into training, validation, and test sets with a 6:2:2 ratio. Both the LSTM and GRU models consist of 64 neurons. We set the number of historical timesteps $T$ to 12. The number of future timesteps $T^{'}$ ranges from 1 to 6, which enables multi-step prediction. Here, $T_{W}$ is set to 7, representing the seven days of a week, while $T_{d}$ is set to 144, the number of 10-min timesteps in a day. For training LLM-based models, we used the Ranger21 optimizer with a learning rate of 0.001. The LLM used is GPT-2 with six layers. For the frozen/unfrozen layer setting, we tested different numbers of unfrozen layers and selected the configuration that resulted in the best performance. The embedding dimension is set according to the input dimension required by the LLM: since GPT-2 requires 768-dimensional inputs, the output dimension of the fusion convolution layer is set to 768. The training process is configured to run for 100 iterations, with a batch size of 64. The experiments are conducted on a system equipped with a single NVIDIA RTX 4080 GPU with 16 GB of dedicated VRAM. All experiments are carried out using Python 3.7 and PyTorch 1.7.1.
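To make the 6:2:2 split and the two temporal scales concrete, the sketch below derives a time-of-day index (out of 144) and a day-of-week index (out of 7) for each 10-min slot and splits the slots in time order. Whether the authors split chronologically or at random is not stated, so the chronological split is an assumption, as are the helper names.

```python
from datetime import datetime, timedelta

SLOTS_PER_DAY = 144   # T_d: number of 10-min slots in a day
DAYS_PER_WEEK = 7     # T_W: number of days in a week


def temporal_indices(ts):
    """Return (slot of day in 0..143, day of week in 0..6 with Monday = 0)."""
    slot_of_day = (ts.hour * 60 + ts.minute) // 10
    return slot_of_day, ts.weekday()


def chronological_split(items, ratios=(6, 2, 2)):
    """Split a time-ordered list into train/validation/test parts."""
    total = sum(ratios)
    n = len(items)
    train_end = n * ratios[0] // total
    val_end = n * (ratios[0] + ratios[1]) // total
    return items[:train_end], items[train_end:val_end], items[val_end:]


# All 10-min slots from February 28 to March 30, 2019 (31 days).
start = datetime(2019, 2, 28)
slots = [start + timedelta(minutes=10 * k) for k in range(31 * SLOTS_PER_DAY)]
train, val, test = chronological_split(slots)

print(len(train), len(val), len(test))
print(temporal_indices(datetime(2019, 2, 28, 7, 30)))
```

Splitting in time order keeps every test slot later than every training slot, which avoids leaking future demand into training; a random split would not give that guarantee.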

The evaluation criteria include the RMSE, MAPE, and mean absolute error (MAE), which are defined by the following equations. In these equations, $y_{i}^{'}$ and $y_{i}$ denote the predicted value and actual value, respectively, and $K$ represents the total number of samples:

RMSE = \sqrt{\frac{1}{K} \sum_{i = 1}^{K} {(y_{i} - y_{i}^{'})}^{2}}

(23)

MAE = \frac{1}{K} \sum_{i = 1}^{K} | y_{i} - y_{i}^{'} |

(24)

MAPE = \frac{1}{K} \sum_{i = 1}^{K} | \frac{y_{i} - y_{i}^{'}}{y_{i}} |

(25)
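Equations 23 to 25 translate directly into code. The sketch below is a plain-Python version with made-up sample values. Two choices are assumptions rather than something stated in the text: MAPE is multiplied by 100 so that it matches the percentages reported in the tables, and samples with zero actual demand are skipped because Equation 25 is undefined for them.

```python
import math


def rmse(actual, predicted):
    """Root mean square error (Eq. 23)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error (Eq. 24)."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error (Eq. 25), returned in percent.

    Samples with zero actual demand are skipped (an assumption of this sketch).
    """
    pairs = [(y, p) for y, p in zip(actual, predicted) if y != 0]
    return 100.0 * sum(abs((y - p) / y) for y, p in pairs) / len(pairs)


# Made-up demand values for four samples.
actual = [4, 2, 5, 1]
predicted = [3, 2, 7, 2]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

Because taxi demand per grid is often small, a single unit of error on a low-demand sample contributes heavily to MAPE, which is one reason MAPE values can be large even when RMSE and MAE are modest.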

Overall Prediction Performance

We conduct taxi demand prediction experiments using our proposed methods (STTDP-LM, STTDP-LM-Spatial, STTDP-LM-Temporal) along with several baseline models. The STTDP-LM-Spatial and STTDP-LM-Temporal are variations of the STTDP-LM that use only spatial or temporal components, respectively. Baseline methods include traditional models (ARIMA, SVR) and deep learning models (LSTM, GRU, GAT, GCN, STGCN). The overall prediction results for the next timestep in districts 1 and 2 are summarized in Table 1, with detailed results presented in Figure 5. The values are computed for all test observations by comparing the actual and predicted values.

Table 1.

Prediction Results of Traditional Methods (ARIMA, SVR), Deep Learning Methods (LSTM, GRU, GAT, GCN, STGCN), their Average, and the STTDP-LM

Method	District 1			District 2
Method	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)
Traditional	2.4018	1.8432	68.53	2.2184	1.6403	57.22
Deep learning	2.6144	1.7447	51.87	2.2283	1.5800	61.53
Average	2.5537	1.7728	56.63	2.2255	1.5972	60.30
STTDP-LM	2.1695	1.4856	47.76	2.0244	1.4498	47.74

Note: ARIMA = autoregressive integrated moving average; SVR = support vector regression; LSTM = long short-term memory; GRU = gated recurrent unit; GAT = graph attention network; GCN = graph convolutional network; STGCN = spatial-temporal graph convolutional network; STTDP-LM = spatio-temporal taxi demand prediction large model; RMSE = root mean square error; MAPE = mean absolute percentage error; MAE = mean absolute error.

Figure 5.

Comparison of prediction results for districts 1 and 2 using different methods: (a) root mean square error (RMSE) results of district 1; (b) RMSE results of district 2; (c) mean absolute error (MAE) results of district 1; (d) MAE results of district 2; (e) mean absolute percentage error (MAPE) results of district 1; (f) MAPE results of district 2.

According to Table 1, the STTDP-LM outperforms the baseline methods across the three evaluation metrics in both districts. Specifically, in district 1, the RMSE, MAE, and MAPE are 2.1695, 1.4856, and 47.76%, respectively. The average RMSE, MAE, and MAPE values of all compared methods are 2.5537, 1.7728, and 56.63%, respectively. The STTDP-LM demonstrates an overall reduction of 15.04%, 16.20%, and 15.66% in these metrics compared to the average values of all baseline methods. In district 2, the RMSE, MAE, and MAPE are 2.0244, 1.4498, and 47.74%, respectively, with the average values for baseline methods being 2.2255, 1.5972, and 60.30%, respectively. The STTDP-LM achieves an overall reduction of 9.04%, 9.23%, and 20.83% in these three metrics. The results of the STTDP-LM-Spatial and STTDP-LM-Temporal indicate that combining spatial and temporal embeddings in the STTDP-LM leads to superior prediction performance. These results demonstrate that the STTDP-LM can accurately predict future demand using only 2 h of historical data. This capability makes it highly applicable to ride-hailing or taxi dispatching systems. By leveraging short-term historical data for high-accuracy forecasting, the model enables more efficient vehicle allocation, reduces passenger wait times, and minimizes idle driving rates.
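The percentage reductions quoted above follow from the Table 1 values as (baseline average − STTDP-LM) / baseline average. The short check below recomputes them; the function name `relative_reduction` is an illustrative choice.

```python
def relative_reduction(baseline, model):
    """Percentage by which `model` lowers an error metric relative to `baseline`."""
    return (baseline - model) / baseline * 100.0


# (baseline average, STTDP-LM) pairs copied from Table 1.
table1 = {
    "district 1": {"RMSE": (2.5537, 2.1695), "MAE": (1.7728, 1.4856), "MAPE": (56.63, 47.76)},
    "district 2": {"RMSE": (2.2255, 2.0244), "MAE": (1.5972, 1.4498), "MAPE": (60.30, 47.74)},
}

reductions = {
    district: {name: round(relative_reduction(*pair), 2) for name, pair in metrics.items()}
    for district, metrics in table1.items()
}
for district, values in reductions.items():
    print(district, values)
```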

Figure 6 shows the effectiveness of incorporating spatial and temporal modeling into the LLM. As depicted, the model that includes both components (STTDP-LM) consistently outperforms its variants, the STTDP-LM-Spatial and STTDP-LM-Temporal. Specifically, when both spatial and temporal modeling are incorporated into GPT-2, the RMSE, MAE, and MAPE values are the lowest, indicating better prediction accuracy. In contrast, using only one of the components results in slightly lower performance. This highlights that simultaneously accounting for both spatial and temporal modeling enhances prediction accuracy.

Figure 6.

Comparison of prediction results for districts 1 and 2 with consideration of spatial and temporal modeling: (a) root mean square error (RMSE); (b) mean absolute error (MAE); (c) mean absolute percentage error (MAPE).

Prediction Performance with Limited Data

To evaluate the prediction performance of the STTDP-LM with limited data, two experiments are conducted. In the first experiment, the STTDP-LM is trained with datasets having different training ratios (1%, 5%, 10%, 15%, 20%, 25%, 30%); the results are presented in Figure 7. In the second experiment, all methods are compared using only 1% of the training data; the results are displayed in Figure 8. Here, limited data refers to both the temporal dimension and the data volume. With respect to the temporal dimension, the proposed model predicts future demand from the actual demand values of the past 12 time slices, meaning it uses only the past 2 h of data. With respect to data volume, we fine-tuned the model using between 1% and 30% of the training data.

Figure 7.

Spatio-temporal taxi demand prediction large model prediction results for districts 1 and 2 with datasets having different training ratios (1%, 5%, 10%, 15%, 20%, 25%, 30%): (a) root mean square error (RMSE); (b) mean absolute error (MAE); (c) mean absolute percentage error (MAPE).

Figure 8.

Comparison of prediction results for districts 1 and 2 using different methods with 1% of the training data: (a) root mean square error (RMSE); (b) mean absolute error (MAE); (c) mean absolute percentage error (MAPE).

Figure 7 shows a clear trend: as the proportion of training data increases, the RMSE, MAE, and MAPE values generally decrease, indicating improved prediction accuracy. For instance, with 1% of the training data, the RMSE, MAE, and MAPE results in district 1 are 3.0384, 1.9043, and 55.19%, respectively. In district 2, the corresponding values are 2.7186, 1.8277, and 58.80%, respectively. These results outperform the baselines in both districts, as shown in Figure 8. The compared methods are evaluated using training, validation, and testing sets with a 6:2:2 split ratio. When the training data increases to 20%, the prediction results in district 1 improve, with RMSE, MAE, and MAPE values of 2.2901, 1.5471, and 50.22%, respectively. These results surpass the performance of baselines that use all training data, including the ARIMA, LSTM, GRU, GAT, GCN, and STGCN. In district 2, the corresponding results are 2.1271, 1.5047, and 51.10%, outperforming baseline methods including the ARIMA, LSTM, GRU, GAT, and GCN.

As shown in Figure 8, the proposed method achieves significantly higher prediction accuracy than all baseline methods, even with a very small proportion of training data (1%). When the training samples are significantly reduced, the comparison methods exhibit a much larger increase in RMSE, MAE, and MAPE values than the STTDP-LM. Notably, the MAPE value of the ARIMA exceeds 1000, which is an invalid result. Excluding the ARIMA, the other comparison methods show average increases of 43.17%, 42.88%, and 21.11% in district 1, and 47.89%, 41.78%, and 4.98% in district 2. In contrast, the STTDP-LM shows smaller increases in all indicators except the district 2 MAPE: 28.60%, 21.99%, and 13.46% in district 1, and 25.54%, 20.68%, and 18.81% in district 2.
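As a worked check, the STTDP-LM figures above can be reproduced from the values reported in this section when the change is normalized by the error obtained with 1% of the training data. This reading is inferred from the numbers and is not stated explicitly by the authors. For the district 1 RMSE (3.0384 with 1% of the data, 2.1695 with all of it):

```latex
\frac{RMSE_{1\%} - RMSE_{full}}{RMSE_{1\%}} = \frac{3.0384 - 2.1695}{3.0384} \approx 28.60\%
```

Normalizing by the full-data error instead would give $(3.0384 - 2.1695) / 2.1695 \approx 40.05\%$, so the choice of denominator matters when comparing these percentages with other studies.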

In summary, the findings highlight the STTDP-LM’s ability to recognize complex patterns from limited data, achieving prediction results comparable to those obtained with sufficient training data. The reason is that traditional and deep learning models rely heavily on the available training data, making them more prone to overfitting when data is sparse. Taxi demand data is often sparse and noisy, especially in certain regions or times. The STTDP-LM uses pre-trained LLMs that have learned broad patterns from large datasets. This allows it to generalize effectively, even with limited taxi demand data. In early deployment scenarios of newly developed areas, where only minimal historical data is available, supervised learning methods often struggle to build effective predictors. In contrast, the STTDP-LM can deliver relatively stable prediction results even under limited data conditions. This makes it particularly valuable for supporting intelligent transportation planning and vehicle dispatch during the early operational stages of urban expansion or redevelopment.

Multi-Step Prediction Performance

Because of the increased complexity of long-term temporal dynamics and the challenges associated with extrapolating historical patterns, multi-step prediction is inherently more uncertain and difficult than short-term prediction. To evaluate the performance of long-term predictions, this study compares the prediction results of the STTDP-LM with several comparable models over the next six time steps. Specifically, the SVR and STGCN are selected as representative models for traditional methods and graph neural networks, respectively. The LSTM and GRU are chosen as representative RNN methods for multi-step prediction. The comparison results for multi-step prediction are presented in Tables 2 and 3.

Table 2.

Multi-Step Prediction Results of District 1

Method	10 min			20 min			30 min
Method	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)
SVR	2.1864	1.5019	55.53	2.2137	1.5163	55.99	2.2368	1.5294	56.43
LSTM	2.9048	1.9504	54.40	2.9032	1.9345	52.51	2.9235	1.9506	53.08
GRU	2.9116	1.9399	52.36	2.9147	1.9401	52.79	2.8784	1.9405	53.13
STGCN	2.2765	1.5679	53.69	2.2934	1.5797	54.64	2.3255	1.5822	52.73
Average	2.5698	1.7400	54.00	2.5813	1.7427	53.98	2.5911	1.7507	53.84
STTDP-LM	2.1695	1.4856	47.76	2.1640	1.4898	48.58	2.1628	1.4916	48.97
Method	40 min			50 min			60 min
Method	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)
SVR	2.2553	1.5406	56.81	2.2727	1.5508	57.17	2.2882	1.5609	57.55
LSTM	2.9692	1.9835	53.56	2.9473	1.9786	53.48	3.0354	2.0210	54.40
GRU	2.9147	1.9685	53.28	2.9459	1.9948	54.89	3.0244	2.0252	54.39
STGCN	2.3516	1.5905	51.61	2.3538	1.6207	55.43	2.4161	1.6935	62.19
Average	2.6227	1.7708	53.82	2.6299	1.7862	55.24	2.6910	1.8252	57.13
STTDP-LM	2.1898	1.4904	47.44	2.1481	1.4761	47.59	2.1929	1.4986	46.19

Note: SVR = support vector regression; LSTM = long short-term memory; GRU = gated recurrent unit; STGCN = spatial-temporal graph convolutional network; STTDP-LM = spatio-temporal taxi demand prediction large model; RMSE = root mean square error; MAPE = mean absolute percentage error; MAE = mean absolute error.

Table 3.

Multi-Step Prediction Results of District 2

Method	10 min			20 min			30 min
Method	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)
SVR	2.0714	1.4641	56.27	2.0942	1.4768	56.78	2.1110	1.4869	57.18
LSTM	2.1863	1.5610	51.70	2.1728	1.5564	50.99	2.1875	1.5684	51.45
GRU	2.1876	1.5707	51.11	2.2161	1.5882	51.67	2.2419	1.6033	51.77
STGCN	2.0299	1.5016	56.34	2.1899	1.5432	53.56	2.2243	1.5471	51.74
Average	2.1188	1.5244	53.86	2.1683	1.5412	53.25	2.1912	1.5514	53.04
STTDP-LM	2.0244	1.4498	47.74	2.0268	1.4485	47.98	2.0296	1.4485	47.52
Method	40 min			50 min			60 min
Method	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)
SVR	2.1285	1.4976	57.60	2.1445	1.5077	58.04	2.1596	1.5177	58.48
LSTM	2.2023	1.5816	51.35	2.1968	1.5820	52.10	2.2610	1.6245	52.64
GRU	2.2045	1.5937	52.52	2.2006	1.5899	52.12	2.2802	1.6419	53.18
STGCN	2.2265	1.5540	52.87	2.2284	1.5507	51.65	2.2543	1.5863	54.99
Average	2.1905	1.5567	53.59	2.1926	1.5576	53.48	2.2388	1.5926	54.82
STTDP-LM	2.0353	1.4564	48.13	2.0270	1.4498	47.71	2.0333	1.4605	49.45

The results show that the STTDP-LM outperforms other methods in long-term prediction. In district 1, for a two-step prediction (20 min), the STTDP-LM achieves average reductions of 16.17%, 14.51%, and 10.00% in the RMSE, MAE, and MAPE, respectively. For a six-step prediction (60 min), it achieves reductions of 18.51%, 17.89%, and 19.15%. Similarly, in district 2, for a two-step prediction, the STTDP-LM shows average reductions of 6.53%, 6.02%, and 9.90% in the RMSE, MAE, and MAPE, respectively. For a six-step prediction, it achieves reductions of 9.18%, 8.29%, and 9.80%, respectively.

As the prediction horizon increases, prediction errors generally grow: the error accumulates with each prediction step, and the nonlinearity and fluctuations in the demand data become more complex. As shown in Tables 2 and 3, the STTDP-LM’s error grows more slowly than that of the comparison methods as the prediction step increases. The difference becomes especially noticeable after the fifth step. This improved performance is attributable to the prior knowledge learned from extensive LLM pre-training. Furthermore, the STTDP-LM is better able to capture long-term dependencies in taxi demand data.
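The slower error growth can be checked against the district 1 RMSE values in Table 2. The sketch below computes, for each method, the relative increase in RMSE between the 10-min and 60-min horizons; the variable names are illustrative, and the same calculation could be repeated for MAE, MAPE, or district 2.

```python
# District 1 RMSE at the 10-min and 60-min horizons, copied from Table 2.
rmse_10 = {"SVR": 2.1864, "LSTM": 2.9048, "GRU": 2.9116, "STGCN": 2.2765, "STTDP-LM": 2.1695}
rmse_60 = {"SVR": 2.2882, "LSTM": 3.0354, "GRU": 3.0244, "STGCN": 2.4161, "STTDP-LM": 2.1929}

# Relative growth of the error from the first to the sixth prediction step, in percent.
growth = {
    method: (rmse_60[method] - rmse_10[method]) / rmse_10[method] * 100.0
    for method in rmse_10
}

for method, value in sorted(growth.items(), key=lambda item: item[1]):
    print(f"{method:9s} {value:5.2f}%")
```

On these values the STTDP-LM's RMSE rises by roughly 1% over the six steps, while the four comparison methods rise by roughly 4% to 6%.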

Model Parameter Analysis

For the proposed model, the parameter $U$ determines the number of unfrozen layers during the fine-tuning phase of the LLM. LLMs have strong few-shot learning ability because they were pre-trained on large-scale data. We freeze some layers to keep the model’s general knowledge and unfreeze the last $U$ layers so that the model can learn new knowledge, which makes it easier to adapt the LLM to our specific task. In the taxi demand prediction task, the $U$ unfrozen layers help the model learn the patterns of demand changes over time and space, improving the prediction accuracy even when the training data is limited. Table 4 presents the prediction results with varying values of $U$. The results fluctuate with the number of unfrozen layers $U$, but they remain more accurate overall than those of the baselines. In district 2, the RMSE and MAE generally improve as $U$ increases, reaching their lowest values at $U = 4$ and $U = 5$, respectively, which suggests that unfreezing more layers up to a certain limit can enhance performance. For district 1, the overall results of the three indicators are better when the number of unfrozen layers is 4.

Table 4.

Prediction Results with Varying Values of Unfrozen Layers U

Unfrozen layers	District 1			District 2
Unfrozen layers	RMSE	MAE	MAPE (%)	RMSE	MAE	MAPE (%)
U = 0	2.2015	1.4960	46.53	2.0580	1.4662	48.24
U = 1	2.1823	1.4931	48.04	2.0525	1.4614	47.75
U = 2	2.1777	1.5016	49.66	2.0427	1.4630	49.65
U = 3	2.1982	1.4964	46.78	2.0339	1.4592	48.82
U = 4	2.1695	1.4856	47.76	2.0205	1.4579	50.45
U = 5	2.1692	1.4882	49.40	2.0244	1.4498	47.74
U = 6	2.1731	1.4935	49.82	2.0400	1.4551	48.46

Note: RMSE = root mean square error; MAPE = mean absolute percentage error; MAE = mean absolute error.

The reason for this is that the pre-trained weights of the model are typically trained on large amounts of data. Generally, freezing the lower layers of the model (such as the first few layers of a transformer) is preferred, as these layers tend to learn more general features. The unfrozen layers, on the other hand, update their parameters during fine-tuning, allowing the model to learn features that are more specific to the task of demand prediction. The frozen layers remain unchanged to preserve the pre-trained knowledge and avoid its loss during fine-tuning, while also reducing computational overhead and training time. In this experiment, the STTDP-LM has six layers in total. The number of unfrozen layers is set to 4 for district 1 and 5 for district 2 to adjust the model’s parameters and improve its adaptation to the demand prediction task.
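The freezing scheme described above can be summarized as a per-layer flag: the first $F$ layers stay fixed and the last $U$ layers are trainable. The sketch below is framework-free; the function name `freeze_plan` is an assumption. In PyTorch, such a flag would typically be applied by setting `requires_grad` on the parameters of each layer. Whether components outside the transformer layers, such as layer normalization or positional embeddings, are also trained is not covered here.

```python
def freeze_plan(num_layers, num_unfrozen):
    """Return one flag per layer: True if the layer is trainable, False if frozen.

    The first (num_layers - num_unfrozen) layers are frozen and the
    last num_unfrozen layers are left trainable.
    """
    if not 0 <= num_unfrozen <= num_layers:
        raise ValueError("num_unfrozen must lie between 0 and num_layers")
    num_frozen = num_layers - num_unfrozen
    return [index >= num_frozen for index in range(num_layers)]


# Six GPT-2 layers, with the settings reported for district 1 (U = 4) and district 2 (U = 5).
for district, unfrozen in (("district 1", 4), ("district 2", 5)):
    plan = freeze_plan(6, unfrozen)
    print(district, ["trainable" if flag else "frozen" for flag in plan])
```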

Conclusions

This study presents a novel taxi demand prediction approach using GPT-2 to capture complex spatio-temporal dependencies, even with limited historical data. The model integrates road network topology and employs fine-tuning techniques to improve prediction accuracy. Experimental results show that the STTDP-LM outperforms traditional and deep learning models, achieving significant reductions in key metrics. The importance of both spatial and temporal modeling is also verified. The STTDP-LM performs well even with very limited training data (1%). It also shows strong capabilities in multi-step prediction. This makes it especially useful in real-world applications where data is scarce. In conclusion, the STTDP-LM provides an effective and robust solution for taxi demand prediction, particularly in areas with limited historical data.

Spatial and temporal dependencies are effectively captured using GATs and multi-scale modeling. However, adding factors such as weather, holidays, and other contextual influences could improve the model further. Future work will focus on integrating these contextual features into a spatio-temporal large model, particularly for non-typical demand patterns and more dynamic environments. We also plan to explore and compare the performance impact of integrating other large models into our framework.

Footnotes

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: J. Chen, C. Yuan; data collection: Y. Wang, R. Li, C. Zong; analysis and interpretation of results: M. Zhang, Z. Chen; draft manuscript preparation: J. Chen, Y. Wang. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research was supported by the Scientific Innovation Practice Project of Postgraduates of Chang’an University (Grant No. 300103725056) and the Natural Science Basic Research Plan in Shaanxi Province of China (Grant No. 2021JC-27).

ORCID iDs

Jing Chen

Ruimin Li

Changming Zong

References

1. Liu, Qiu, Wang, Ouyang, Lin. Contextualized Spatial–Temporal Network for Taxi Origin-Destination Demand Prediction. IEEE Transactions on Intelligent Transportation Systems, Vol. 20, No. 10, 2019, pp. 3875–3887.

2. Yao, Tang, Jia, Gong, Li. Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, No. 316, 2018, pp. 2588–2595.

3. Zhang, Zhu, Wang F.-Y. MLRNN: Taxi Demand Prediction Based on Multi-Level Deep Learning and Regional Heterogeneity Analysis. IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 7, 2021, pp. 8412–8422.

4. Saxena, Cao. Multimodal Spatio-Temporal Prediction with Stochastic Adversarial Networks. ACM Transactions on Intelligent Systems and Technology (TIST), Vol. 13, No. 2, 2022, pp. 1–23.

5. Ren, Chen, Liu, Wang, Cui. TPLLM: A Traffic Prediction Framework Based on Pretrained Large Language Models. arXiv Preprint arXiv:2403.02221, 2024.

6. Xia, Tang, Shi, Xia, Yin, Huang. UrbanGPT: Spatio-Temporal Large Language Models. Proc., 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, Association for Computing Machinery, New York, 2024, pp. 5351–5362.

7. Huang. Enhancing Traffic Prediction with Textual Data Using Large Language Models. arXiv Preprint arXiv:2405.06719, 2024.

8. Guo, Zhang, Jiang, Peng, Zhu, Yang H. F. Towards Explainable Traffic Flow Prediction with Large Language Models. Communications in Transportation Research, Vol. 4, 2024, p. 100150.

9. Moreira-Matias, Gama, Ferreira, Mendes-Moreira, Damas. Predicting Taxi–Passenger Demand Using Streaming Data. IEEE Transactions on Intelligent Transportation Systems, Vol. 14, No. 3, 2013, pp. 1393–1402.

10. Kim, Sharda, Zhou, Pendyala R. M. A Stepwise Interpretable Machine Learning Framework Using Linear Regression (LR) and Long Short-Term Memory (LSTM): City-Wide Demand-Side Prediction of Yellow Taxi and For-Hire Vehicle (FHV) Service. Transportation Research Part C: Emerging Technologies, Vol. 120, 2020, p. 102786.

11. Zhang. Short-Term Traffic Flow Prediction Based on Incremental Support Vector Regression. Proc., Third International Conference on Natural Computation (ICNC 2007), Vol. 1, IEEE, New York, 2007, pp. 640–645.

12. Castro-Neto, Jeong Y.-S., Jeong M.-K., Han L. D. Online-SVR for Short-Term Traffic Flow Prediction Under Typical and Atypical Traffic Conditions. Expert Systems with Applications, Vol. 36, No. 3, 2009, pp. 6164–6173.

13. Alvarez-Garcia J. A., Ortega J. A., Gonzalez-Abril, Velasco. Trip Destination Prediction Based on Past GPS Log Using a Hidden Markov Model. Expert Systems with Applications, Vol. 37, No. 12, 2010, pp. 8166–8171.

14. Zhe. Taxi Demand Prediction Model Based on Spark and Improved BP Neural Network. Frontiers of Data and Computing, Vol. 5, No. 4, 2023, pp. 112–126.

15. Rahmatizadeh, Bölöni, Turgut. Real-Time Prediction of Taxi Demand Using Recurrent Neural Networks. IEEE Transactions on Intelligent Transportation Systems, Vol. 19, No. 8, 2017, pp. 2572–2581.

16. Fathi, Balali. A Ride-Hailing Company Supply Demand Prediction Using Recurrent Neural Networks, GRU and LSTM. Proc., Science and Information Conference, Springer, Cham, 2024, pp. 123–133.

17. Zhou, Chen. A Spatiotemporal Attention Mechanism-Based Model for Multi-Step Citywide Passenger Demand Prediction. Information Sciences, Vol. 513, 2020, pp. 372–385.

18. Fang, Liu. Efficient Multi-Step Prediction Model That Considers the Influence of Spatial and Temporal Factors on Ride-Hailing Demand. Transportation Research Record: Journal of the Transportation Research Board, 2025. 2679: 03611981241287192.

19. Wang, Xie, Zhao. Quick Taxi Route Assignment via Real-Time Intersection State Prediction with a Spatial-Temporal Graph Neural Network. Transportation Research Part C: Emerging Technologies, Vol. 158, 2024, p. 104414.

20. Liu, Yang, Long, Zhao. Spatial-Temporal Large Language Model for Traffic Prediction. arXiv Preprint arXiv:2401.10134, 2024.

21. Rong, Mao, Chen. Large-Scale Traffic Flow Forecast with Lightweight LLM in Edge Intelligence. IEEE Internet of Things Magazine. https://doi.org/10.1109/IOTM.001.2400047

22. de Zarzà i Cubero, de Curtò i Díaz, Roig, Calafate C. T. LLM Multimodal Traffic Accident Forecasting. https://doi.org/10.3390/s23229225

23. Peng, Guo, Chen, Zhu, Chen, Wang, et al. LC-LLM: Explainable Lane-Change Intention and Trajectory Predictions with Large Language Models. arXiv Preprint arXiv:2403.18344, 2025.

24. Eddy S. R. Hidden Markov Models. Current Opinion in Structural Biology, Vol. 6, No. 3, 1996, pp. 361–365.

25. Zhang, Wang, Shan, Zhou, Wang. CMT-Net: A Mutual Transition Aware Framework for Taxicab Pick-Ups and Drop-Offs Co-prediction. Proc., Fifteenth ACM International Conference on Web Search and Data Mining, Association for Computing Machinery, New York, 2022, pp. 1406–1414.

26. Goto, Matsumoto, Rizk, Yanai, Yamaguchi. Privacy-Preserving Taxi-Demand Prediction Using Federated Learning. Proc., 2023 IEEE International Conference on Smart Computing (SMARTCOMP), IEEE, New York, 2023, pp. 297–302.