SOFTS++: Fast and accurate linear model for multivariate long-term time series forecasting

Abstract

In recent years, Transformer-based models have dominated the field of long-term time series forecasting. However, the quadratic complexity of attention mechanisms makes both training and inference computationally expensive. The SOFTS model has emerged as an efficient alternative, replacing attention mechanisms with the STAR module to preserve linear complexity while achieving performance comparable to or better than competing approaches. The SOFTS model builds on the iTransformer architecture, which marked a significant advancement in long-term time series forecasting. Although neither iTransformer nor SOFTS incorporates positional embeddings, our analysis revealed a clear opportunity to improve forecasting accuracy by introducing them. However, the straightforward inclusion of positional embeddings leads to convergence and generalization issues. To address this, we propose a simple yet effective technique: during training, positional embeddings are randomly omitted in certain forward passes, which reduces instability and helps the model generalize better. We refer to this novel form of using positional embeddings as Learnable Stochastic Positional Embedding. Additionally, we incorporate multiple dropout layers to mitigate overfitting and improve accuracy. These modifications result in SOFTS++, a fast and accurate model that achieves the best performance on at least 10 out of 12 standard benchmark datasets. By maintaining linear complexity and requiring minimal computational resources, SOFTS++ stands out as a capable and resource-efficient method for multivariate long-term forecasting tasks.

Keywords

Time series forecasting, linear complexity, positional embeddings, dropout layers, regularization techniques

1. Introduction

Multivariate time series forecasting has traditionally relied on statistical methods such as ARIMA and Exponential Smoothing.^1–3 However, with the exponential growth of data and advances in computational power, deep learning models, particularly those based on the Transformer architecture, have begun to dominate this field.^4–6 Transformers have proven particularly effective due to their ability to capture long-range temporal dependencies and have found applications not only in forecasting, but also in classification, imputation, and anomaly detection tasks.^7–9

Despite their success, the quadratic complexity of the attention mechanism poses a significant computational bottleneck,¹⁰ especially for long input sequences. While early Transformer models such as Informer and Autoformer brought notable improvements,^11,12 studies have shown that even simple linear models can achieve comparable or better performance with significantly lower computational costs.¹³ This has led to a re-evaluation of architectural choices and a renewed interest in designing lightweight models that do not rely on self-attention.

One such advancement is the SOFTS model,¹⁴ which replaces the attention mechanism with the STAR (STar Aggregate-Redistribute) module, reducing complexity to a linear scale while maintaining strong performance. STAR aggregates information into a central representation and applies stochastic pooling,¹⁵ striking a balance between efficiency and accuracy.

In this paper, we build upon the SOFTS architecture and propose SOFTS++, a fast and accurate model designed for multivariate long-term forecasting. Although the original SOFTS and its predecessor iTransformer omit positional embeddings, we identify that their careful inclusion can lead to further performance gains. However, naively adding positional embeddings introduces convergence issues.

To address this, SOFTS++ employs a simple yet effective technique where positional embeddings are selectively applied during training, improving generalization and stability. Furthermore, we introduce multiple dropout layers to mitigate overfitting. With these modifications, SOFTS++ achieves state-of-the-art performance on 10 out of 12 benchmark datasets, while preserving linear complexity and requiring minimal computational overhead. This work complements our previous study on the HASPFormer model, in which we investigated attention-based architectures for time series forecasting. In contrast, the current study focuses on lightweight linear models, aiming to assess their efficiency, robustness, and performance in long-term forecasting tasks.

In this paper, we first review the most relevant papers in the field of multivariate time series forecasting in Section 2. In Section 3, we provide the theoretical background by outlining the fundamental components and techniques underlying the models and experiments in this study. In Section 4, we detail the structure and innovations of the original STAR module and the proposed STAR++ module, emphasizing improvements in robustness and generalization. Section 5 presents extensive experimental results on 12 benchmark datasets, comparing SOFTS++ with competitive baselines in terms of accuracy and robustness. Finally, Section 6 concludes the paper and outlines directions for future research.

2. Related work

Time series forecasting has evolved significantly in recent years. It began with traditional statistical models such as ARIMA and Exponential Smoothing, followed by various deep learning approaches, especially recurrent neural networks (RNNs) such as LSTM¹⁶ and GRU,¹⁷ and, more recently, Transformer-based models. Statistical models offer efficiency and interpretability,¹⁸ but their performance on complex patterns is limited by strong assumptions about the data generation process, a dependence on linear formulations, and a limited capacity to learn from the data.^19,20 Machine learning approaches, by contrast, can capture non-linear dynamics more effectively. Transformer models initially generated enthusiasm due to their success in natural language processing,^21,22 but early versions, such as the vanilla Transformer, were later found to be inefficient for time series tasks due to high computational cost and poor scalability.

To address these limitations, numerous specialized Transformer variants have been proposed. Informer¹² introduced ProbSparse self-attention to reduce complexity and improve long-sequence forecasting. Autoformer¹¹ tackled periodicity via a decomposition block and auto-correlation mechanism. PatchTST⁶ leveraged subseries-level patching and channel independence to extend receptive fields without increasing computation. Crossformer²³ extended forecasting capabilities by modelling both temporal and inter-variable dependencies using DSW embeddings and Two-Stage Attention. A different approach came with TSMixer,²⁴ which used simple MLPs to outperform attention-based models in some cases, reviving interest in linear architectures.

Further innovation came with iTransformer,⁵ which restructured the attention mechanism by embedding entire time series into tokens. It demonstrated improved efficiency and generalization in multivariate forecasting. Similarly, SOFTS¹⁴ introduced the STAR module, centralizing information aggregation across channels to reduce complexity while maintaining performance.

FEDformer²⁵ combines seasonal-trend decomposition with attention in the frequency domain, using a compact set of Fourier components to capture key patterns efficiently. This design improves accuracy and scalability for long-sequence forecasting. DLinear and NLinear¹³ renewed interest in linear modelling by demonstrating that simple channel-independent linear projections can match or even outperform attention-based architectures in long-horizon time series forecasting. In addition, KAN-based approaches²⁶ have proven effective for multivariate time series forecasting,²⁷ as demonstrated through extensive experiments and visualizations.

3. Theoretical background

In this section, we outline the mathematical and theoretical foundations of the main components of SOFTS++, focusing on stochastic pooling, dropout layers, and various positional embedding mechanisms (Figures 1 and 2). The embeddings include the classical fixed sinusoidal formulation, the proposed learnable stochastic variant, as well as rotary, relative, and hybrid approaches, which are considered for comparative evaluation in our experiments.

Figure 1.

Original STAR module architecture. Image re-drawn from original paper.¹⁴

Figure 2.

Architecture of the proposed STAR++ module. If the learnable threshold ( $uniform (0, 1) > τ$ ) condition is satisfied, the input sequence is enhanced with positional embedding, after which it undergoes aggregation, stochastic pooling, and fusion with the original features. The output is redistributed through MLP layers and projected to generate the final forecast. Red arrows indicate conditional data flow, while solid and dashed lines denote different processing stages. The bottom black arrow represents the residual connection.

3.1. Stochastic pooling

Stochastic pooling¹⁵ is employed as an alternative to the attention mechanism whenever the learnable threshold guides the model towards local generalization. The implementation here follows the approach described in Han et al.¹⁴ Unlike pooling strategies that combine maximum and average values, stochastic pooling selects a single activation from each pooling region at random, with a bias towards higher values. The selection probabilities are obtained by applying a softmax over the activations, introducing controlled randomness that makes larger activations more likely to be chosen.

Formally, for a pooling region $j$, the probability of selecting channel $i$ is given by:

$$p_{ij} = \frac{e^{A_{ij}}}{\sum_{k=1}^{C} e^{A_{kj}}} \tag{1}$$

where $A_{ij}$ is the activation of channel $i$ at temporal step $j$. During training, once the probabilities $p$ are computed, the pooled output for region $j$ is:

$$o_{j} = A_{cj}, \quad \text{where } c \sim P(p_{1j}, p_{2j}, \dots, p_{Cj}) \tag{2}$$

This probabilistic selection mechanism adds beneficial variability and encourages the model to learn more generalizable representations.
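Equations (1) and (2) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the list-of-lists layout (`C` channels by `T` positions) and the toy values are assumptions, and the inference branch follows the weighted-average behaviour described for the STAR module in Section 4.1.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax over a list of floats (Eq. 1)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_pool(A, training=True, rng=random):
    """Pool a C x T activation matrix over channels, one output per column j.

    Training: sample a channel c with probabilities p_1j..p_Cj and return A[c][j] (Eq. 2).
    Inference: return the probability-weighted average of column j.
    """
    C, T = len(A), len(A[0])
    pooled = []
    for j in range(T):
        column = [A[i][j] for i in range(C)]
        p = softmax(column)
        if training:
            c = rng.choices(range(C), weights=p, k=1)[0]
            pooled.append(column[c])
        else:
            pooled.append(sum(pi * a for pi, a in zip(p, column)))
    return pooled

# Toy input (hypothetical values): 3 channels, 2 positions.
A = [[0.1, 2.0], [1.5, -0.3], [0.2, 0.4]]
train_out = stochastic_pool(A, training=True, rng=random.Random(0))
eval_out = stochastic_pool(A, training=False)
```

In training mode every pooled value is one of the original activations in its column, whereas the inference output is a smooth average that always lies between the column's minimum and maximum.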

3.2. Dropout layers

Dropout (presented in Srivastava et al.²⁸) is applied throughout the architecture as a regularization technique to reduce overfitting in high-capacity models such as attention-based networks. It operates by randomly disabling a proportion of neurons during each forward pass in training, preventing the network from relying too heavily on specific features. The dropout operation is defined as:

$$\text{Dropout}(x) = x \odot \text{mask} \tag{3}$$

where $\odot$ denotes element-wise multiplication, and $\text{mask}$ is a binary tensor with elements set to zero with probability $p$ and to one with probability $1 - p$.
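A literal reading of Eq. (3) can be sketched as follows. The function name and the toy vector are assumptions made for illustration; the sketch only mirrors the equation as written.

```python
import random

def dropout(x, p, training=True, rng=random):
    """Apply Eq. (3): multiply x element-wise by a binary mask.

    Each mask element is 0 with probability p and 1 with probability 1 - p.
    At inference (training=False) the input passes through unchanged.
    """
    if not training or p == 0.0:
        return list(x)
    mask = [0.0 if rng.random() < p else 1.0 for _ in x]
    return [xi * mi for xi, mi in zip(x, mask)]

x = [1.0, 2.0, 3.0, 4.0]                       # toy feature vector
dropped = dropout(x, p=0.5, rng=random.Random(0))
kept = dropout(x, p=0.5, training=False)
```

Many frameworks (for example PyTorch's `nn.Dropout`) additionally rescale the surviving elements by $1 / (1 - p)$ during training so that the expected activation is unchanged. Eq. (3) omits that factor, and so does this sketch.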

3.3. Positional embedding

Transformers, unlike recurrent architectures, do not inherently encode sequence order. To introduce positional information, Positional Embeddings (PE) are added to the input feature vectors.²⁹ The encoding is defined as:

$$PE_{(pos, k)} = \begin{cases} \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right), & k = 2i \\ \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right), & k = 2i + 1 \end{cases} \tag{4}$$

where:

$pos$ is the position index in the sequence,

$k$ is the embedding dimension index, with $i = \lfloor k / 2 \rfloor$ indexing the sine-cosine pair,

$d_{model}$ is the hidden dimension size.

This positional encoding enables the model to represent both absolute and relative positions, which is crucial for capturing temporal dependencies in time series forecasting.
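The table defined by Eq. (4) can be generated directly. The sketch below is illustrative only; the function name and the small sizes are assumptions.

```python
import math

def sinusoidal_pe(num_positions, d_model):
    """Return a num_positions x d_model table following Eq. (4)."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for k in range(d_model):
            i = k // 2                                  # index of the sine-cosine pair
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe

pe = sinusoidal_pe(num_positions=4, d_model=6)
```

At position 0 every sine entry is 0 and every cosine entry is 1, so the first row alternates between the two values; later rows oscillate at frequencies that decrease with the pair index $i$.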

3.4. Positional embedding variants

Relative positional embedding (RPE) encodes the relative distances between positions rather than their absolute indices, enabling better generalization to sequences longer than those seen during training. In its original form, this technique modifies the attention score computation by adding distance-dependent biases, reducing the reliance on a fixed sequence length and improving performance in tasks requiring variable-length context modelling. In our experiments, we adapt relative positional embedding by adding distance-dependent representations directly to the input features, without modifying attention scores, which provides a simplified mechanism for incorporating relative position information.

Rotary positional embedding (RoPE) rotates each pair of feature dimensions in the complex plane by a position-dependent angle, so that positional information enters the attention dot product multiplicatively.³⁰ Because our architecture does not employ attention mechanisms, we reformulate RoPE as a direct rotation of feature dimension pairs using sinusoidal functions.

Learnable positional embedding treats position indices as regular embedding vectors that are learned during training, without relying on predefined functional forms.³¹ Its advantage is flexibility and potential adaptation to specific data distributions.
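To make the attention-free rotary variant concrete, the sketch below rotates consecutive feature pairs by a position-dependent angle. The pairing of adjacent dimensions, the base of 10000 and the function name are assumptions borrowed from the usual RoPE formulation; the exact adaptation used in the experiments may differ.

```python
import math

def rotate_pairs(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * base**(-2i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)     # assumed frequency schedule
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s             # standard 2-D rotation
        out[2 * i + 1] = a * s + b * c
    return out

x = [1.0, 0.0, 0.5, -2.0]                      # toy feature vector
y = rotate_pairs(x, pos=3)
```

Because each pair is only rotated, the norm of every pair is preserved and position 0 leaves the features unchanged; positional information is carried purely by the rotation angle.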

4. Presented module

In this section, we build upon the SOFTS model,¹⁴ which introduced the STAR module as a lightweight and effective alternative to traditional attention mechanisms for multivariate time series forecasting. The STAR module replaces costly pairwise channel interactions with a centralized aggregation-redistribution strategy, significantly reducing complexity while improving robustness in noisy settings.

To lay the groundwork for our proposed improvements, we first revisit the original STAR module in detail, explaining its architectural design and advantages. We then introduce STAR++, our enhanced version that forms the core of the SOFTS++ architecture. STAR++ builds on the foundational concepts of STAR, introducing several novel mechanisms, including learnable positional embedding control and multiple dropout layers, that together further improve generalization and efficiency.

Through this progression, we aim to demonstrate how centralized interaction mechanisms can be incrementally enhanced to deliver state-of-the-art performance on challenging long-term forecasting benchmarks, without compromising computational scalability.

4.1. STAR Module: STar Aggregate-Redistribute

A core contribution of the SOFTS architecture is the integration of the STAR module (STar Aggregate-Redistribute) which provides an efficient and robust mechanism for modelling inter-channel dependencies in multivariate time series. Unlike traditional distributed interaction schemes such as self-attention, which compute pairwise relationships among all channels and suffer from quadratic complexity, STAR employs a centralized interaction strategy. This significantly reduces computational overhead while improving generalization in the presence of noisy or anomalous channels.

Centralized interaction via global core. The STAR module operates by first aggregating representations across all input channels to construct a shared latent representation referred to as the core. This core serves as a central bottleneck that captures global context and facilitates indirect communication between channels. Formally, given input series representations $S \in \mathbb{R}^{C \times d}$ for $C$ channels and hidden dimension $d$, the core representation $o \in \mathbb{R}^{d'}$ is computed as:

$$o = \text{StochPool}(\text{MLP}_{1}(S))$$

Here, $\text{MLP}_{1}$ is a two-layer perceptron with GELU activation,³² and Stochastic Pooling is used instead of conventional mean or max pooling. This pooling strategy introduces stochasticity during training by sampling activations based on softmax probabilities, and uses a weighted average during inference, enhancing robustness and preventing overfitting to outliers.

Fusion of core and local representations. Once the core is computed, it is repeated and concatenated with each channel’s local representation to form a fused tensor:

$$F = \text{RepeatConcat}(S, o)$$

The fusion tensor $F \in \mathbb{R}^{C \times (d + d')}$ is then processed through another MLP to integrate the core context back into the local channel features:

$$S' = \text{MLP}_{2}(F) + S$$

This formulation not only ensures efficient cross-channel communication, but also preserves channel-local information via a residual connection.
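The three steps above (core construction, repeat-and-concatenate, fusion with a residual) can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: biases are omitted, weights are random, and all names (`star`, `stoch_pool`, `rand_matrix`) and sizes are hypothetical rather than taken from the SOFTS code base.

```python
import math
import random

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(W, x):
    """Matrix-vector product for W given as a list of rows (bias omitted)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W_in, W_out, x):
    """Two-layer perceptron with a GELU in between."""
    return linear(W_out, [gelu(h) for h in linear(W_in, x)])

def stoch_pool(Z, training, rng):
    """Pool C vectors into one core vector, feature by feature."""
    core = []
    for j in range(len(Z[0])):
        col = [z[j] for z in Z]
        m = max(col)
        w = [math.exp(v - m) for v in col]          # unnormalised softmax weights
        if training:
            core.append(rng.choices(col, weights=w, k=1)[0])
        else:
            core.append(sum(wi * v for wi, v in zip(w, col)) / sum(w))
    return core

def star(S, params, training=False, rng=random):
    """Map C x d channel representations S to S' of the same shape."""
    W1, W2, W3, W4 = params
    o = stoch_pool([mlp(W1, W2, s) for s in S], training, rng)  # o = StochPool(MLP1(S))
    out = []
    for s in S:
        f = s + o                                   # list concatenation: length d + d'
        out.append([a + b for a, b in zip(mlp(W3, W4, f), s)])  # MLP2(F) + S
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
C, d, d_core = 5, 4, 3                              # toy sizes (assumed)
params = (rand_matrix(d, d, rng), rand_matrix(d_core, d, rng),
          rand_matrix(d, d + d_core, rng), rand_matrix(d, d, rng))
S = rand_matrix(C, d, rng)
S_out = star(S, params)
```

Each channel is visited once when building the core and once when fusing it back, so the work grows with $C$ rather than $C^{2}$. If the second MLP outputs zeros, the residual connection returns $S$ unchanged.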

Advantages over distributed mechanisms. The key benefit of STAR’s centralized topology is its linear complexity with respect to the number of channels $C$ , in contrast to the $O (C^{2})$ complexity of attention-based architectures. Additionally, by decoupling the representation of each channel from direct pairwise interactions, STAR is more robust to anomalies, a frequent occurrence in real-world time series data. Empirical results confirm that replacing attention with STAR in models such as iTransformer, PatchTST, and Crossformer either preserves or improves forecasting performance, while reducing resource consumption.
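As a rough worked example of this scaling argument, take the Traffic dataset from Table 1, which has $C = 862$ channels. The count below ignores hidden-dimension factors and constant costs, so it is an order-of-magnitude illustration rather than a measured cost.

```latex
\underbrace{C^{2} = 862^{2} = 743{,}044}_{\text{pairwise channel interactions (attention)}}
\qquad \text{vs.} \qquad
\underbrace{2C = 2 \times 862 = 1{,}724}_{\text{aggregate and redistribute steps (STAR)}},
\qquad
\frac{C^{2}}{2C} = \frac{C}{2} = 431
```

Under these simplifying assumptions, the centralized scheme performs about 431 times fewer channel-level interactions on Traffic, and the gap widens linearly as $C$ grows.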

4.2. STAR++ Module: Enhanced STar Aggregate-Redistribute

To further improve the original STAR module’s accuracy, we propose an enhanced version named STAR++. This module extends STAR by integrating two key improvements: (1) optional positional embedding controlled by a learnable threshold and (2) multiple dropout layers for regularization.²⁸ STAR++ preserves the centralized interaction paradigm of STAR, while improving generalization and resilience to noise.

Stochastic learnable positional embedding. While the original STAR module omits positional embeddings, STAR++ optionally integrates sinusoidal (i.e., sine and cosine-based) positional embeddings. To avoid convergence issues associated with fixed positional embeddings, we introduce a learnable scalar threshold $τ \in [0, 1]$ . Positional information is injected only when a Bernoulli trial with probability $τ$ succeeds. This mechanism stochastically regulates the inclusion of positional context and enhances the stability of the training. As training progresses, epoch by epoch and batch by batch, the parameter is continuously updated, gradually becoming more precise in the sense that it learns to include positional information only when beneficial, maintaining stochasticity to preserve regularization benefits.

Core construction. Similar to STAR, we first project each channel’s representation using:

$$h_{c} = \text{GELU}(W_{1} x_{c}), \quad z_{c} = W_{2} h_{c}$$

where $x_{c} \in \mathbb{R}^{d_{series}}$, $h_{c} \in \mathbb{R}^{d_{series}}$, and $z_{c} \in \mathbb{R}^{d_{core}}$. We then apply stochastic pooling to form a global core representation. During training, this involves sampling based on softmax-derived probabilities; during inference, we use a weighted average. This pooling approach reduces sensitivity to outliers and avoids overfitting to noisy channels.

Fusion and regularization. The core representation is concatenated with each input channel and processed as:

$$f_{c} = \text{Concat}(x_{c}, o), \quad x_{c}' = W_{4}(\text{Dropout}(\text{GELU}(W_{3} f_{c})))$$

where multiple dropout layers are employed to improve regularization. A residual connection is added from the input to the output, facilitating stable optimization and gradient flow.

Summary of improvements. STAR++ enhances the original STAR module by:

Introducing a learnable threshold to regulate positional embedding injection,

Applying multi-stage dropout for better regularization,

Preserving linear computational complexity,

Improving robustness to anomalies and generalization across datasets.

These architectural improvements enable SOFTS++ to consistently outperform baseline models while maintaining efficiency and scalability. The final structure of the STAR++ module is described in Algorithm 2.

Occasional activation as regularization. The design of STAR++ draws inspiration from our previous work on HASPFormer, where a learnable threshold decides whether to apply self-attention or fall back to stochastic pooling. This conditional mechanism proved effective not only for improving accuracy, but also as a form of regularization. Similarly, in STAR++, the stochastic activation of positional embeddings introduces controlled variability during training, helping the model generalize better. More broadly, such selective activation strategies may represent a promising new class of regularization techniques, particularly in the context of large models where full self-attention at every layer is computationally expensive. By occasionally invoking complex operations like attention only when needed, models can reduce training time and resource usage without sacrificing performance.

Computational complexity. SOFTS has linear complexity in both time and memory.³³ The SOFTS++ model preserves the linear time and memory complexity of SOFTS. Both dropout layers and learnable stochastic positional embeddings introduce only element-wise operations and simple additive terms, which scale proportionally with the input size. As a result, these modifications do not alter the asymptotic complexity, and SOFTS++ remains a linear-complexity model with respect to the number of channels and sequence length.

Random omission of positional embeddings. During training, STAR++ applies positional embeddings with probability $p = \tau$, where $\tau$ is the learnable threshold. Writing $X$ for the input representation and $P$ for the positional embedding:

$$\tilde{X} = \begin{cases} X + P, & \text{if } u \leq \tau, \\ X, & \text{otherwise,} \end{cases} \qquad u \sim U(0, 1).$$

This acts as stochastic masking of the positional component, similar to Dropout²⁸ and Stochastic Depth.³⁴ Occasionally omitting positional information serves as regularization, reducing overfitting and improving robustness.
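The training-time gate can be sketched as below. This is an illustration under assumptions: `tau` is treated as a plain float because the way the threshold receives gradient updates is not spelled out in this section, one draw of $u$ is shared by the whole input, and the function name and toy values are hypothetical.

```python
import random

def lspe_gate(X, P, tau, rng=random):
    """Training-time gate: return X + P if u <= tau with u ~ U(0, 1), else a copy of X."""
    u = rng.random()
    if u <= tau:
        return [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, P)]
    return [list(row) for row in X]

X = [[1.0, 2.0], [3.0, 4.0]]        # toy input representation
P = [[0.0, 1.0], [0.5, 0.5]]        # toy positional embedding
tau = 0.7                           # assumed threshold value

# Empirical inclusion frequency over many forward passes.
rng = random.Random(0)
trials = 10_000
hits = sum(lspe_gate(X, P, tau, rng) != X for _ in range(trials))
rate = hits / trials
```

Over many forward passes the fraction of passes that include the positional embedding approaches `tau`, which is the sense in which the threshold sets the inclusion probability.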

5. Experiments and results

To evaluate the effectiveness of SOFTS++, we carried out a comprehensive set of experiments across 12 widely used multivariate long-term forecasting benchmarks. We retained the same core size, encoder depth, and other architectural hyperparameters as proposed in the original SOFTS implementation¹⁴ to ensure a fair comparison. All models were trained using the Adam optimizer with a learning rate of $10^{-4}$, and early stopping was applied based on validation loss to prevent overfitting. The training was conducted on a single NVIDIA A100 GPU with 40GB of VRAM, allowing us to handle both high-resolution datasets (e.g., PEMS) and long input sequences efficiently. We report both Mean Squared Error (MSE) and Mean Absolute Error (MAE) for each prediction length and dataset, and include robustness evaluations across multiple random seeds. To promote transparency and reproducibility, the complete experimental pipeline, including dataset preprocessing, training scripts, and evaluation tools, is publicly available in the project GitHub repository: https://github.com/hljubic/SOFTSpp.
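For reference, the two reported metrics can be computed as in the sketch below. It assumes forecasts and targets are given as equally shaped lists of rows (for example, time steps by channels) and averages over all entries; the helper names are hypothetical and this is not the repository's evaluation code.

```python
def _flatten(rows):
    """Flatten a list of rows into a single list of values."""
    return [v for row in rows for v in row]

def mse(y_true, y_pred):
    """Mean squared error over all entries."""
    t, p = _flatten(y_true), _flatten(y_pred)
    return sum((a - b) ** 2 for a, b in zip(t, p)) / len(t)

def mae(y_true, y_pred):
    """Mean absolute error over all entries."""
    t, p = _flatten(y_true), _flatten(y_pred)
    return sum(abs(a - b) for a, b in zip(t, p)) / len(t)

y_true = [[1.0, 2.0], [3.0, 4.0]]   # toy targets
y_pred = [[1.0, 2.0], [3.0, 6.0]]   # toy forecasts, one entry off by 2
print(mse(y_true, y_pred))          # 1.0
print(mae(y_true, y_pred))          # 0.5
```

MSE penalizes the single large error quadratically, while MAE weights it linearly, which is why the two metrics can rank models differently.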

5.1. Datasets and methodology

The datasets used in this study are adopted from Han et al.,¹⁴ which provides a comprehensive overview of time-series benchmarks relevant to forecasting and predictive modelling. Each dataset originates from a specific domain, including energy, transportation, and weather. For example, the ETT (Electricity Transformer Temperature) dataset contains hourly and 15-minute measurements of transformer temperatures from 2016 to 2018, along with seven features related to oil and load conditions.

The Traffic dataset includes hourly road occupancy rates on San Francisco freeways from 2015 to 2016. The Electricity dataset records hourly power usage for 321 clients between 2012 and 2014. The Weather dataset provides 21 weather indicators, such as air temperature and humidity, sampled every 10 minutes throughout 2020 in Germany. The Solar-Energy dataset includes 10-minute solar production data from 137 photovoltaic plants collected in 2006. Finally, the PEMS dataset offers 5-minute traffic data from sensor networks in California, commonly used for short-term traffic forecasting. Detailed attributes for each dataset, including the number of channels, sampling rate, prediction length, and train/validation/test splits, are provided in Table 1.

Table 1.
Detailed dataset descriptions.

Dataset	Channels	Prediction Length	Dataset Split	Granularity	Domain
ETTh1, ETTh2	7	{96, 192, 336, 720}	(8545, 2881, 2881)	Hourly	Electricity
ETTm1, ETTm2	7	{96, 192, 336, 720}	(34465, 11521, 11521)	15min	Electricity
Weather	21	{96, 192, 336, 720}	(36792, 5271, 10540)	10min	Weather
ECL	321	{96, 192, 336, 720}	(18317, 2633, 5261)	Hourly	Electricity
Traffic	862	{96, 192, 336, 720}	(12185, 1757, 3509)	Hourly	Transportation
Solar-Energy	137	{96, 192, 336, 720}	(36601, 5161, 10417)	10min	Energy
PEMS03	358	{12, 24, 48, 96}	(15617,5135,5135)	5min	Transportation
PEMS04	307	{12, 24, 48, 96}	(10172,3375,281)	5min	Transportation
PEMS07	883	{12, 24, 48, 96}	(16911,5622,468)	5min	Transportation
PEMS08	170	{12, 24, 48, 96}	(10690,3548,265)	5min	Transportation

5.2. Experimental analysis

Ablation study. Table 2 presents the results of the ablation study conducted on the SOFTS++ architecture, which was extended with LSPE (Learnable Stochastic Positional Embedding) and three dropout layers. The objective of this experiment was to assess the contribution of each component to the overall performance. When only the third dropout layer was removed (w/o D3), the results remained very close to the full model, suggesting that a single dropout layer does not have a dramatic impact on its own, but still contributes to model performance. Removing dropout layers 2 and 3 (w/o D2-3), or all three dropout layers (w/o D1-3), leads to a more noticeable degradation, indicating that the combined effect of dropout layers is essential for improved stability and generalization.

Table 2.
Ablation results for full model and its variants across all datasets with fixed lookback window length $L = 96$ . Results are averaged over prediction horizons.

Dataset	Full model		w/o D3		w/o D2-3		w/o D1-3		Only PE		Baseline
Metric	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE
ECL	0.165	0.258	0.167	0.260	0.170	0.263	0.170	0.262	0.172	0.264	0.175	0.266
ETTh1	0.446	0.441	0.445	0.441	0.446	0.441	0.446	0.442	0.445	0.441	0.452	0.446
ETTh2	0.381	0.406	0.383	0.406	0.382	0.406	0.381	0.406	0.381	0.406	0.380	0.404
ETTm1	0.390	0.400	0.392	0.402	0.395	0.403	0.395	0.403	0.397	0.403	0.397	0.405
ETTm2	0.287	0.329	0.287	0.329	0.286	0.328	0.285	0.328	0.284	0.326	0.289	0.332
PEMS03	0.096	0.201	0.095	0.200	0.096	0.200	0.096	0.201	0.096	0.196	0.107	0.213
PEMS04	0.090	0.195	0.090	0.194	0.090	0.194	0.090	0.194	0.084	0.185	0.104	0.210
PEMS07	0.076	0.170	0.078	0.172	0.078	0.172	0.076	0.170	0.077	0.165	0.090	0.189
PEMS08	0.124	0.208	0.126	0.209	0.125	0.208	0.128	0.210	0.176	0.240	0.141	0.223
Solar	0.227	0.256	0.227	0.256	0.228	0.257	0.228	0.257	0.226	0.254	0.230	0.256
Traffic	0.412	0.267	0.418	0.268	0.416	0.268	0.417	0.268	0.456	0.279	0.409	0.267
Weather	0.250	0.274	0.250	0.275	0.251	0.276	0.252	0.276	0.247	0.275	0.258	0.280
Average	0.245	0.284	0.247	0.284	0.247	0.285	0.247	0.285	0.253	0.286	0.253	0.291

Furthermore, when the LSPE component was removed and only the fixed positional embedding (PE) was retained (Only PE), performance degraded on average (average MSE of 0.253, equal to the baseline), although the effect was uneven across datasets. It was particularly severe on the Traffic dataset, where the MSE increased from 0.409 (baseline SOFTS) to 0.456 with fixed PE, and on PEMS08, where it increased from 0.141 to 0.176, suggesting that fixed positional encodings alone do not reliably capture the temporal patterns in these data. Although LSPE does not outperform the baseline across all datasets, it substantially mitigates this issue by bringing the error very close to the original value (0.412 on Traffic), supporting its role in providing more adaptive and stable temporal representations.

In comparison to the baseline model (plain SOFTS, i.e., SOFTS++ without LSPE and dropout), it is evident that each added component has a clear justification: dropout layers provide regularization and prevent overfitting, while LSPE enhances the adaptability of positional information. The final integrated model achieves the best overall performance (0.245 MSE and 0.284 MAE on average), demonstrating that the combination of all components yields the most consistent and robust improvements across datasets. All experiments were conducted with five different random seeds, and the reported results are averaged values.
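The averages quoted above can be re-derived from the per-dataset MSE values in Table 2. The short check below copies the Full model and Baseline MSE columns from that table; the variable names are illustrative, and the relative reduction is computed from the unrounded means.

```python
# Per-dataset MSE from Table 2, in row order (ECL ... Weather).
full_mse = [0.165, 0.446, 0.381, 0.390, 0.287, 0.096,
            0.090, 0.076, 0.124, 0.227, 0.412, 0.250]
base_mse = [0.175, 0.452, 0.380, 0.397, 0.289, 0.107,
            0.104, 0.090, 0.141, 0.230, 0.409, 0.258]

full_avg = sum(full_mse) / len(full_mse)
base_avg = sum(base_mse) / len(base_mse)

# Relative reduction in average MSE of the full model over the baseline.
rel_reduction = (base_avg - full_avg) / base_avg

# Number of datasets on which the full model has strictly lower MSE.
wins = sum(f < b for f, b in zip(full_mse, base_mse))

print(round(full_avg, 3), round(base_avg, 3), wins)
```

The recomputed means round to the Average row of Table 2, and the full model has the lower MSE on every dataset except ETTh2 and Traffic.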

Positional embedding strategies. Table 3 reports the ablation study of different positional embedding strategies within the full model configuration, where the dropout rate was fixed to 0.1 and LSPE was employed as the baseline approach; Table 4 reports the corresponding ablation over dropout rates. Five different embedding variants were evaluated across 12 datasets, and all experiments were conducted with five random seeds, with results averaged over prediction horizons. The first configuration, which uses the classical positional embedding that was initially adopted in our design, achieves the lowest average MSE (0.245) and ties for the lowest average MAE (0.284), although individual datasets occasionally favour other variants. This supports our initial choice, as it provides the most stable and robust temporal representations overall compared to the alternatives.

Table 3.

Ablation results for different positional embedding strategies across all datasets with fixed lookback window length $L = 96$ . Results are averaged over prediction horizons.

Dataset	Pos. Embed.		Learnable		Rotary		Relative		None
Metric	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE
ECL	0.165	0.258	0.167	0.260	0.167	0.259	0.173	0.263	0.178	0.266
ETTh1	0.446	0.441	0.449	0.443	0.451	0.444	0.452	0.446	0.444	0.439
ETTh2	0.381	0.406	0.386	0.409	0.381	0.405	0.380	0.405	0.381	0.406
ETTm1	0.390	0.400	0.390	0.401	0.389	0.400	0.390	0.400	0.388	0.399
ETTm2	0.287	0.329	0.283	0.325	0.282	0.326	0.286	0.329	0.280	0.325
PEMS03	0.096	0.201	0.093	0.198	0.096	0.200	0.102	0.208	0.119	0.220
PEMS04	0.090	0.195	0.086	0.191	0.088	0.194	0.100	0.206	0.118	0.222
PEMS07	0.076	0.170	0.076	0.169	0.079	0.173	0.088	0.186	0.116	0.207
PEMS08	0.124	0.208	0.125	0.208	0.134	0.216	0.134	0.215	0.160	0.235
Solar	0.227	0.256	0.225	0.256	0.229	0.258	0.228	0.255	0.241	0.268
Traffic	0.412	0.267	0.426	0.268	0.419	0.270	0.409	0.267	0.445	0.276
Weather	0.250	0.274	0.255	0.278	0.251	0.276	0.254	0.276	0.248	0.274
Average	0.245	0.284	0.247	0.284	0.247	0.285	0.250	0.288	0.260	0.295

Table 4.

Ablation results for different dropout rates across all datasets with fixed lookback window length $L = 96$ . Results are averaged over prediction horizons.

Dataset	0.05		0.1		0.2		0.3		0.5
Metric	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE
ECL	0.167	0.259	0.165	0.258	0.164	0.257	0.162	0.255	0.163	0.255
ETTh1	0.446	0.440	0.446	0.441	0.446	0.441	0.446	0.441	0.452	0.444
ETTh2	0.380	0.405	0.381	0.406	0.381	0.406	0.380	0.406	0.380	0.406
ETTm1	0.393	0.402	0.390	0.400	0.388	0.399	0.388	0.399	0.389	0.400
ETTm2	0.286	0.328	0.287	0.329	0.285	0.327	0.283	0.326	0.282	0.325
PEMS03	0.095	0.200	0.096	0.201	0.094	0.199	0.093	0.198	0.094	0.200
PEMS04	0.088	0.193	0.090	0.195	0.089	0.195	0.090	0.197	0.091	0.198
PEMS07	0.076	0.169	0.076	0.170	0.076	0.172	0.076	0.172	0.079	0.176
PEMS08	0.128	0.211	0.124	0.208	0.123	0.208	0.124	0.209	0.124	0.209
Solar	0.227	0.256	0.227	0.256	0.227	0.256	0.226	0.256	0.227	0.258
Traffic	0.414	0.267	0.412	0.267	0.412	0.267	0.412	0.268	0.414	0.269
Weather	0.250	0.274	0.250	0.274	0.249	0.274	0.248	0.272	0.246	0.271
Average	0.246	0.284	0.245	0.284	0.245	0.283	0.244	0.283	0.245	0.284

Replacing the classical embedding with a learnable embedding (Learnable) results in a slight degradation of performance, although the difference is not large (average MSE $=$ 0.247, MAE $=$ 0.284). Rotary embeddings (Rotary) show similar behaviour, with performance almost identical to the learnable variant (average MSE $=$ 0.247, MAE $=$ 0.285), indicating that these approaches can approximate but not surpass the classical formulation. The use of relative embeddings (Relative) leads to a more noticeable decline (average MSE $=$ 0.250, MAE $=$ 0.288), with the largest drops on ECL and the four PEMS datasets, which suggests that this type of embedding is less effective in our setup. It is important to note that the embeddings used here are not their original variants, but have been adapted to this specific type of task as described earlier in the paper.
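The averages quoted for each strategy can be checked directly against the per-dataset MSE columns of Table 3. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative, and the values are copied from the table (three-decimal rounding in the table means the recomputed averages agree only up to half a unit in the last digit).

```python
# Per-dataset MSE values copied from Table 3, in the table's row order
# (ECL, ETTh1, ETTh2, ETTm1, ETTm2, PEMS03, PEMS04, PEMS07, PEMS08, Solar, Traffic, Weather).
TABLE3_MSE = {
    "Pos. Embed.": [0.165, 0.446, 0.381, 0.390, 0.287, 0.096, 0.090, 0.076, 0.124, 0.227, 0.412, 0.250],
    "Learnable":   [0.167, 0.449, 0.386, 0.390, 0.283, 0.093, 0.086, 0.076, 0.125, 0.225, 0.426, 0.255],
    "Rotary":      [0.167, 0.451, 0.381, 0.389, 0.282, 0.096, 0.088, 0.079, 0.134, 0.229, 0.419, 0.251],
    "Relative":    [0.173, 0.452, 0.380, 0.390, 0.286, 0.102, 0.100, 0.088, 0.134, 0.228, 0.409, 0.254],
    "None":        [0.178, 0.444, 0.381, 0.388, 0.280, 0.119, 0.118, 0.116, 0.160, 0.241, 0.445, 0.248],
}

# "Average" row of Table 3 (MSE only).
REPORTED_AVG = {"Pos. Embed.": 0.245, "Learnable": 0.247, "Rotary": 0.247, "Relative": 0.250, "None": 0.260}


def column_average(values):
    """Arithmetic mean of one MSE column."""
    return sum(values) / len(values)


averages = {name: column_average(values) for name, values in TABLE3_MSE.items()}

for name, avg in averages.items():
    print(f"{name:12s} recomputed {avg:.4f}  reported {REPORTED_AVG[name]:.3f}")
```

Under these values the classical embedding has the lowest recomputed average and the configuration without positional embeddings has the highest, in line with the table's Average row.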

Finally, the removal of LSPE entirely (None) was included as a control to demonstrate the contribution of positional encoding itself. The results show a clear deterioration in performance (average MSE $=$ 0.260, MAE $=$ 0.295), confirming that positional information is critical for effective forecasting. Taken together, these results demonstrate that while alternative embedding strategies provide competitive outcomes, the original classical positional embedding remains the most effective and balanced choice within our architecture.
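To make the idea behind LSPE concrete, the sketch below shows one way of randomly omitting a positional embedding in some training forward passes. It is not the paper's implementation: the function name `apply_lspe`, the probability parameter `p_apply`, the per-forward-pass (rather than per-sample) decision, and the choice to always add the embedding at inference time are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import random


def apply_lspe(tokens, pos_embedding, p_apply, training, rng):
    """Add positional embeddings to tokens, skipping them at random while training.

    Assumption: at inference time (training=False) the embedding is always added.
    """
    if training and rng.random() >= p_apply:
        # This forward pass runs without positional information.
        return [row[:] for row in tokens]
    # Element-wise sum of each token vector and its positional embedding.
    return [[x + e for x, e in zip(row, emb)] for row, emb in zip(tokens, pos_embedding)]


# Hypothetical toy inputs: two tokens with two features each.
tokens = [[1.0, 2.0], [3.0, 4.0]]
pos_embedding = [[0.5, 0.25], [0.125, 1.0]]

rng = random.Random(0)
applied = sum(
    apply_lspe(tokens, pos_embedding, p_apply=0.5, training=True, rng=rng) != tokens
    for _ in range(1000)
)
print(f"embedding applied in {applied} of 1000 training passes")
print(apply_lspe(tokens, pos_embedding, p_apply=0.5, training=False, rng=rng))
```

In a real model the embedding would be a learnable parameter, so the passes in which it is skipped simply contribute no gradient to it.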

Sensitivity to dropout rate. After confirming the importance of all model components and establishing that the classical positional embedding is the most suitable choice, we further examined the sensitivity of the model to different dropout strengths before running the main experiments. The same dropout rate was applied uniformly across all three dropout layers. Results are presented in Table 4, averaged over 12 datasets and five random seeds. The comparison shows that very low dropout (0.05) and very high dropout (0.5) yield weaker results, as they either fail to provide sufficient regularization or remove too much information. Intermediate settings between 0.1 and 0.3 perform more consistently, with the rate of 0.3 achieving the best overall average performance (0.244 MSE and 0.283 MAE). Based on these findings, a dropout strength of 0.3 was selected for the subsequent main experiments.
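The selection step can be expressed as a small lookup over the Average row of Table 4. This is a sketch only: the tie-breaking rule (lowest average MSE first, then lowest average MAE) is an assumption, and the names are illustrative.

```python
# Average (MSE, MAE) per dropout rate, copied from the "Average" row of Table 4.
AVERAGE_ERROR = {
    0.05: (0.246, 0.284),
    0.1: (0.245, 0.284),
    0.2: (0.245, 0.283),
    0.3: (0.244, 0.283),
    0.5: (0.245, 0.284),
}


def select_dropout(table):
    """Return the rate with the lowest average MSE, using MAE to break ties.

    Tuples compare element by element, so (MSE, MAE) pairs sort by MSE first.
    """
    return min(table, key=lambda rate: table[rate])


best_rate = select_dropout(AVERAGE_ERROR)
print(best_rate, AVERAGE_ERROR[best_rate])
```

Because the averages differ only in the third decimal place, a different tie-breaking rule or a per-dataset criterion could favour a neighbouring rate such as 0.2.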

5.3. Accuracy results

We evaluated SOFTS++ on 12 standard long-term forecasting datasets and compared it against nine competitive baselines: SOFTS, iTransformer, PatchTST, TSMixer, Crossformer, TiDE,³⁵ TimesNet,³⁶ DLinear, and SCINet.³⁷ Since SOFTS++ builds directly upon the SOFTS model, ensuring a fair and consistent comparison between these two architectures was of particular importance. For this reason, we did not reuse the SOFTS results reported in the original paper, but instead re-ran the experiments for both SOFTS++ and SOFTS under identical conditions. This was necessary because the original experiments used different random seeds, which could introduce discrepancies. While the differences in average metrics are minimal and do not impact the summary table (Table 5), some variations are observed in the full results provided in the appendix (Table 7). Across both MSE and MAE metrics, SOFTS++ achieved the best average performance overall, with an average MSE of 0.244 and MAE of 0.283, outperforming all other models on the majority of datasets.

Table 5.
Multivariate forecasting results with prediction lengths $H \in \{12, 24, 48, 96\}$ for PEMS and $H \in \{96, 192, 336, 720\}$ for others and fixed lookback window length $L = 96$. Results are averaged over all prediction horizons. Full results are listed in Table 7.

Models	SOFTS++		SOFTS		iTransformer		PatchTST		TSMixer		Crossformer		TiDE		TimesNet		DLinear		SCINet
Metric	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE
ECL	0.162	0.255	0.175	0.266	0.178	0.270	0.189	0.276	0.186	0.287	0.244	0.334	0.251	0.344	0.192	0.295	0.212	0.300	0.268	0.365
Traffic	0.412	0.268	0.409	0.267	0.428	0.282	0.454	0.286	0.522	0.357	0.550	0.304	0.760	0.473	0.620	0.336	0.625	0.383	0.804	0.509
Weather	0.248	0.272	0.258	0.280	0.258	0.278	0.256	0.279	0.256	0.279	0.259	0.315	0.271	0.320	0.259	0.287	0.265	0.317	0.292	0.363
Solar-Energy	0.226	0.256	0.229	0.256	0.233	0.262	0.236	0.266	0.260	0.297	0.641	0.639	0.347	0.417	0.301	0.319	0.330	0.401	0.282	0.375
ETTm1	0.388	0.399	0.393	0.403	0.407	0.410	0.396	0.406	0.398	0.407	0.513	0.496	0.419	0.419	0.400	0.406	0.403	0.407	0.485	0.481
ETTm2	0.283	0.326	0.287	0.330	0.288	0.332	0.287	0.330	0.289	0.333	0.757	0.610	0.358	0.404	0.291	0.333	0.350	0.401	0.571	0.537
ETTh1	0.446	0.441	0.449	0.442	0.454	0.447	0.453	0.446	0.463	0.452	0.529	0.522	0.541	0.507	0.458	0.450	0.456	0.452	0.747	0.647
ETTh2	0.380	0.406	0.380	0.404	0.383	0.407	0.385	0.410	0.401	0.417	0.942	0.684	0.611	0.550	0.414	0.427	0.559	0.515	0.954	0.723
PEMS03	0.093	0.198	0.107	0.213	0.113	0.221	0.137	0.240	0.119	0.233	0.169	0.281	0.326	0.419	0.147	0.248	0.278	0.375	0.114	0.224
PEMS04	0.090	0.197	0.104	0.210	0.111	0.221	0.145	0.249	0.103	0.215	0.209	0.314	0.353	0.437	0.129	0.241	0.295	0.388	0.092	0.202
PEMS07	0.076	0.172	0.090	0.189	0.101	0.204	0.144	0.233	0.112	0.217	0.235	0.315	0.380	0.440	0.124	0.225	0.329	0.395	0.119	0.234
PEMS08	0.124	0.209	0.141	0.223	0.150	0.226	0.200	0.275	0.165	0.261	0.268	0.307	0.441	0.464	0.193	0.271	0.379	0.416	0.158	0.244
Average	0.244	0.283	0.252	0.290	0.259	0.297	0.274	0.308	0.273	0.313	0.443	0.427	0.422	0.433	0.294	0.320	0.373	0.396	0.407	0.409
$1^{st}$ Count	11	10	2	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

SOFTS++ obtained the best results on 11 out of 12 datasets in MSE and 10 out of 12 in MAE (Table 5, with ties counted). It showed particularly strong improvements on high-frequency traffic datasets (PEMS03, PEMS04, PEMS07, PEMS08), where the model consistently achieved significantly lower error values compared to all baselines. This suggests that SOFTS++ is especially effective in capturing fine-grained temporal dependencies in transportation data.
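The size of the gain on the PEMS datasets can be quantified as a relative MSE reduction with respect to SOFTS. The sketch below uses the averaged MSE values from Table 5; the function and variable names are illustrative.

```python
# Average MSE on the PEMS datasets, copied from Table 5.
SOFTS_PP_MSE = {"PEMS03": 0.093, "PEMS04": 0.090, "PEMS07": 0.076, "PEMS08": 0.124}
SOFTS_MSE = {"PEMS03": 0.107, "PEMS04": 0.104, "PEMS07": 0.090, "PEMS08": 0.141}


def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


reductions = {
    name: relative_reduction(SOFTS_MSE[name], SOFTS_PP_MSE[name]) for name in SOFTS_MSE
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower MSE than SOFTS")
```

With these table values, the reduction falls between roughly 12 and 16 percent on each of the four PEMS datasets.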

In contrast, the performance on datasets such as Traffic and ETTh2 was on par with or slightly below that of the SOFTS baseline, though the difference was marginal and did not substantially impact the overall average. Notably, the use of robust data embedding and regularization mechanisms in SOFTS++ contributed to its increased robustness across diverse domains and forecasting horizons. The strong and consistent performance of SOFTS++ confirms its state-of-the-art effectiveness for multivariate long-range time series forecasting.
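The first-place counts in the last row of Table 5 can be recomputed from the per-dataset averages. The sketch below does so for three of the ten models, assuming that a tie awards a first place to every tied model; the remaining baselines are omitted for brevity, since the table's count row gives them no first places. All names are illustrative.

```python
MODELS = ("SOFTS++", "SOFTS", "iTransformer")

# dataset: (MSE per model, MAE per model), copied from Table 5 in MODELS order.
TABLE5 = {
    "ECL": ((0.162, 0.175, 0.178), (0.255, 0.266, 0.270)),
    "Traffic": ((0.412, 0.409, 0.428), (0.268, 0.267, 0.282)),
    "Weather": ((0.248, 0.258, 0.258), (0.272, 0.280, 0.278)),
    "Solar-Energy": ((0.226, 0.229, 0.233), (0.256, 0.256, 0.262)),
    "ETTm1": ((0.388, 0.393, 0.407), (0.399, 0.403, 0.410)),
    "ETTm2": ((0.283, 0.287, 0.288), (0.326, 0.330, 0.332)),
    "ETTh1": ((0.446, 0.449, 0.454), (0.441, 0.442, 0.447)),
    "ETTh2": ((0.380, 0.380, 0.383), (0.406, 0.404, 0.407)),
    "PEMS03": ((0.093, 0.107, 0.113), (0.198, 0.213, 0.221)),
    "PEMS04": ((0.090, 0.104, 0.111), (0.197, 0.210, 0.221)),
    "PEMS07": ((0.076, 0.090, 0.101), (0.172, 0.189, 0.204)),
    "PEMS08": ((0.124, 0.141, 0.150), (0.209, 0.223, 0.226)),
}


def first_place_counts(table, metric_index):
    """Count, per model, the datasets on which it attains the lowest error.

    metric_index 0 selects MSE, 1 selects MAE. Tied models all receive a count.
    """
    counts = {model: 0 for model in MODELS}
    for scores in table.values():
        values = scores[metric_index]
        best = min(values)
        for model, value in zip(MODELS, values):
            if value == best:
                counts[model] += 1
    return counts


mse_counts = first_place_counts(TABLE5, 0)
mae_counts = first_place_counts(TABLE5, 1)
print("MSE:", mse_counts)
print("MAE:", mae_counts)
```

Counting ties for both models explains why the per-metric counts can sum to more than 12, for example on ETTh2 (MSE) and Solar-Energy (MAE).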

5.4. Robustness results

To evaluate the robustness of STAR++, we compared its forecasting performance with the original STAR module across six benchmark datasets and four prediction horizons. Table 6 presents the results averaged over five random seeds, reporting both the mean and standard deviation for MSE and MAE metrics. Overall, STAR++ consistently achieves a lower forecasting error than STAR across most datasets and horizons. However, when it comes to robustness, which refers to the consistency of results across different random seeds, the situation is more complex.
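Entries of the form mean ± standard deviation, as used in Table 6, can be produced from per-seed scores as sketched below. The per-seed values here are hypothetical placeholders rather than results from the paper, and the use of the sample standard deviation (`statistics.stdev`) instead of the population version is an assumption.

```python
import statistics


def summarize_seeds(scores):
    """Format per-seed scores as 'mean ± std' with three decimals.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return f"{mean:.3f} ± {std:.3f}"


# Hypothetical MSE values from five random seeds (illustrative only).
hypothetical_mse = [0.41, 0.43, 0.40, 0.44, 0.42]
summary = summarize_seeds(hypothetical_mse)
print(summary)
```

A smaller second number indicates that the result depends less on the random seed, which is the notion of robustness used in this section.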

Table 6.
Comparison of the robustness of the STAR++ module with the baseline (STAR) module.

Dataset		ETTm1	Weather	Traffic
Horizon	Module	MSE	MAE	MSE	MAE	MSE	MAE
$96$	Baseline	0.325 $\pm$ 0.008	0.363 $\pm$ 0.006	0.173 $\pm$ 0.003	0.214 $\pm$ 0.003	0.376 $\pm$ 0.002	0.251 $\pm$ 0.001
	STAR++	0.320 $\pm$ 0.008	0.357 $\pm$ 0.007	0.163 $\pm$ 0.002	0.204 $\pm$ 0.002	0.376 $\pm$ 0.001	0.250 $\pm$ 0.000
$192$	Baseline	0.378 $\pm$ 0.009	0.391 $\pm$ 0.005	0.221 $\pm$ 0.002	0.257 $\pm$ 0.002	0.398 $\pm$ 0.000	0.261 $\pm$ 0.000
	STAR++	0.369 $\pm$ 0.006	0.385 $\pm$ 0.005	0.210 $\pm$ 0.005	0.248 $\pm$ 0.003	0.401 $\pm$ 0.002	0.261 $\pm$ 0.000
$336$	Baseline	0.413 $\pm$ 0.011	0.415 $\pm$ 0.007	0.282 $\pm$ 0.002	0.300 $\pm$ 0.001	0.415 $\pm$ 0.001	0.269 $\pm$ 0.000
	STAR++	0.403 $\pm$ 0.005	0.409 $\pm$ 0.003	0.271 $\pm$ 0.002	0.293 $\pm$ 0.000	0.417 $\pm$ 0.002	0.270 $\pm$ 0.000
$720$	Baseline	0.471 $\pm$ 0.008	0.450 $\pm$ 0.004	0.358 $\pm$ 0.003	0.350 $\pm$ 0.002	0.448 $\pm$ 0.000	0.287 $\pm$ 0.000
	STAR++	0.462 $\pm$ 0.005	0.447 $\pm$ 0.004	0.348 $\pm$ 0.003	0.344 $\pm$ 0.002	0.452 $\pm$ 0.004	0.289 $\pm$ 0.001
Dataset		PEMS03	PEMS04	PEMS07
Horizon	Module	MSE	MAE	MSE	MAE	MSE	MAE
$12$	Baseline	0.068 $\pm$ 0.005	0.172 $\pm$ 0.008	0.074 $\pm$ 0.000	0.175 $\pm$ 0.000	0.059 $\pm$ 0.004	0.154 $\pm$ 0.007
	STAR++	0.062 $\pm$ 0.001	0.162 $\pm$ 0.002	0.069 $\pm$ 0.001	0.170 $\pm$ 0.001	0.057 $\pm$ 0.005	0.152 $\pm$ 0.007
$24$	Baseline	0.085 $\pm$ 0.001	0.190 $\pm$ 0.001	0.088 $\pm$ 0.000	0.194 $\pm$ 0.000	0.078 $\pm$ 0.007	0.178 $\pm$ 0.008
	STAR++	0.075 $\pm$ 0.001	0.179 $\pm$ 0.001	0.081 $\pm$ 0.007	0.187 $\pm$ 0.012	0.071 $\pm$ 0.001	0.170 $\pm$ 0.002
$48$	Baseline	0.119 $\pm$ 0.011	0.228 $\pm$ 0.014	0.114 $\pm$ 0.014	0.224 $\pm$ 0.020	0.100 $\pm$ 0.002	0.201 $\pm$ 0.006
	STAR++	0.102 $\pm$ 0.002	0.209 $\pm$ 0.002	0.100 $\pm$ 0.022	0.210 $\pm$ 0.028	0.080 $\pm$ 0.003	0.175 $\pm$ 0.006
$96$	Baseline	0.156 $\pm$ 0.002	0.263 $\pm$ 0.003	0.138 $\pm$ 0.002	0.245 $\pm$ 0.001	0.123 $\pm$ 0.005	0.222 $\pm$ 0.003
	STAR++	0.133 $\pm$ 0.004	0.241 $\pm$ 0.004	0.112 $\pm$ 0.003	0.220 $\pm$ 0.003	0.098 $\pm$ 0.004	0.190 $\pm$ 0.007

Table 7.

Multivariate forecasting results with prediction lengths $H \in \{12, 24, 48, 96\}$ for PEMS and $H \in \{96, 192, 336, 720\}$ for others and fixed lookback window length $L = 96$.

Models		SOFTS++		SOFTS		iTransformer		PatchTST		TSMixer		Crossformer		TiDE		TimesNet		DLinear		SCINet
Metric		MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE
ETTm1	96	0.320	0.357	0.325	0.363	0.334	0.368	0.329	0.365	0.323	0.363	0.404	0.426	0.364	0.387	0.338	0.375	0.345	0.372	0.418	0.438
	192	0.369	0.385	0.378	0.391	0.377	0.391	0.380	0.394	0.376	0.392	0.450	0.451	0.398	0.404	0.374	0.387	0.380	0.389	0.439	0.450
	336	0.403	0.409	0.413	0.415	0.426	0.420	0.400	0.410	0.407	0.413	0.532	0.515	0.428	0.425	0.410	0.411	0.413	0.413	0.490	0.485
	720	0.462	0.447	0.471	0.450	0.491	0.459	0.475	0.453	0.485	0.459	0.666	0.589	0.487	0.461	0.478	0.450	0.474	0.453	0.595	0.550
	Avg	0.388	0.399	0.397	0.405	0.407	0.410	0.396	0.406	0.398	0.407	0.513	0.496	0.419	0.419	0.400	0.406	0.403	0.407	0.485	0.481
ETTm2	96	0.178	0.259	0.181	0.263	0.180	0.264	0.184	0.264	0.182	0.266	0.287	0.366	0.207	0.305	0.187	0.267	0.193	0.292	0.286	0.377
	192	0.243	0.302	0.249	0.308	0.250	0.309	0.246	0.306	0.249	0.309	0.414	0.492	0.290	0.364	0.249	0.309	0.284	0.362	0.399	0.445
	336	0.308	0.344	0.314	0.350	0.311	0.348	0.308	0.346	0.309	0.347	0.597	0.542	0.377	0.422	0.321	0.351	0.369	0.427	0.637	0.591
	720	0.404	0.400	0.411	0.405	0.412	0.407	0.409	0.402	0.416	0.408	1.730	1.042	0.558	0.524	0.408	0.403	0.554	0.522	0.960	0.735
	Avg	0.283	0.326	0.289	0.332	0.288	0.332	0.287	0.330	0.289	0.333	0.757	0.610	0.358	0.404	0.291	0.333	0.350	0.401	0.571	0.537
ETTh1	96	0.379	0.398	0.384	0.402	0.386	0.405	0.394	0.406	0.401	0.412	0.423	0.448	0.479	0.464	0.384	0.402	0.386	0.400	0.654	0.599
	192	0.435	0.431	0.444	0.440	0.441	0.436	0.440	0.435	0.452	0.442	0.471	0.474	0.525	0.492	0.436	0.429	0.437	0.432	0.719	0.631
	336	0.477	0.451	0.478	0.454	0.487	0.458	0.491	0.462	0.492	0.463	0.570	0.546	0.565	0.515	0.491	0.469	0.481	0.459	0.778	0.659
	720	0.494	0.483	0.500	0.489	0.503	0.491	0.487	0.479	0.507	0.490	0.653	0.621	0.594	0.558	0.521	0.500	0.519	0.516	0.836	0.699
	Avg	0.446	0.441	0.452	0.446	0.454	0.447	0.453	0.446	0.463	0.452	0.529	0.522	0.541	0.507	0.458	0.450	0.456	0.452	0.747	0.647
ETTh2	96	0.297	0.347	0.296	0.347	0.297	0.349	0.288	0.340	0.319	0.361	0.745	0.584	0.400	0.440	0.340	0.374	0.333	0.387	0.707	0.621
	192	0.373	0.395	0.374	0.395	0.380	0.400	0.376	0.395	0.402	0.410	0.877	0.656	0.528	0.509	0.402	0.414	0.477	0.476	0.860	0.689
	336	0.420	0.432	0.425	0.432	0.428	0.432	0.440	0.451	0.444	0.446	1.043	0.731	0.643	0.571	0.452	0.452	0.594	0.541	1.000	0.744
	720	0.431	0.447	0.425	0.442	0.427	0.445	0.436	0.453	0.441	0.450	1.104	0.763	0.874	0.679	0.462	0.468	0.831	0.657	1.249	0.838
	Avg	0.380	0.406	0.380	0.404	0.383	0.407	0.385	0.410	0.401	0.417	0.942	0.684	0.611	0.550	0.414	0.427	0.559	0.515	0.954	0.723
ECL	96	0.135	0.228	0.144	0.235	0.148	0.240	0.164	0.251	0.157	0.260	0.219	0.314	0.237	0.329	0.168	0.272	0.197	0.282	0.247	0.345
	192	0.152	0.244	0.161	0.251	0.162	0.253	0.173	0.262	0.173	0.274	0.231	0.322	0.236	0.330	0.184	0.289	0.196	0.285	0.257	0.355
	336	0.167	0.261	0.178	0.270	0.178	0.269	0.190	0.279	0.192	0.295	0.246	0.337	0.249	0.344	0.198	0.300	0.209	0.301	0.269	0.369
	720	0.193	0.287	0.218	0.306	0.225	0.317	0.230	0.313	0.223	0.318	0.280	0.363	0.284	0.373	0.220	0.320	0.245	0.333	0.299	0.390
	Avg	0.162	0.255	0.175	0.266	0.178	0.270	0.189	0.276	0.186	0.287	0.244	0.334	0.251	0.344	0.192	0.295	0.212	0.300	0.268	0.365
Traffic	96	0.376	0.250	0.376	0.251	0.395	0.268	0.427	0.272	0.493	0.336	0.522	0.290	0.805	0.493	0.593	0.321	0.650	0.396	0.788	0.499
	192	0.401	0.261	0.398	0.261	0.417	0.276	0.454	0.289	0.497	0.351	0.530	0.293	0.756	0.474	0.617	0.336	0.598	0.370	0.789	0.505
	336	0.417	0.270	0.415	0.269	0.433	0.283	0.450	0.282	0.528	0.361	0.558	0.305	0.762	0.477	0.629	0.336	0.605	0.373	0.797	0.508
	720	0.452	0.289	0.448	0.287	0.467	0.302	0.484	0.301	0.569	0.380	0.589	0.328	0.719	0.449	0.640	0.350	0.645	0.394	0.841	0.523
	Avg	0.412	0.268	0.409	0.267	0.428	0.282	0.454	0.286	0.522	0.357	0.550	0.304	0.760	0.473	0.620	0.336	0.625	0.383	0.804	0.509
Weather	96	0.163	0.204	0.173	0.214	0.174	0.214	0.176	0.217	0.166	0.210	0.158	0.230	0.202	0.261	0.172	0.220	0.196	0.255	0.221	0.306
	192	0.210	0.248	0.221	0.257	0.221	0.254	0.221	0.256	0.215	0.256	0.206	0.277	0.242	0.298	0.219	0.261	0.237	0.296	0.261	0.340
	336	0.271	0.293	0.282	0.300	0.278	0.296	0.275	0.296	0.287	0.300	0.272	0.335	0.287	0.335	0.280	0.306	0.283	0.335	0.309	0.378
	720	0.348	0.344	0.358	0.350	0.358	0.347	0.352	0.346	0.355	0.348	0.398	0.418	0.351	0.386	0.365	0.359	0.345	0.381	0.377	0.427
	Avg	0.248	0.272	0.258	0.280	0.258	0.278	0.256	0.279	0.256	0.279	0.259	0.315	0.271	0.320	0.259	0.287	0.265	0.317	0.292	0.363
Solar-Energy	96	0.196	0.227	0.202	0.231	0.203	0.237	0.205	0.246	0.221	0.275	0.310	0.331	0.312	0.399	0.250	0.292	0.290	0.378	0.237	0.344
	192	0.227	0.256	0.231	0.255	0.233	0.261	0.237	0.267	0.268	0.306	0.734	0.725	0.339	0.416	0.296	0.318	0.320	0.398	0.280	0.380
	336	0.239	0.269	0.242	0.268	0.248	0.273	0.250	0.276	0.272	0.294	0.750	0.735	0.368	0.430	0.319	0.330	0.353	0.415	0.304	0.389
	720	0.244	0.273	0.245	0.271	0.249	0.275	0.252	0.275	0.281	0.313	0.769	0.765	0.370	0.425	0.338	0.337	0.356	0.413	0.308	0.388
	Avg	0.226	0.256	0.230	0.256	0.233	0.262	0.236	0.266	0.260	0.297	0.641	0.639	0.347	0.417	0.301	0.319	0.330	0.401	0.282	0.375
PEMS03	12	0.062	0.162	0.068	0.172	0.071	0.174	0.073	0.178	0.075	0.186	0.090	0.203	0.178	0.305	0.085	0.192	0.122	0.243	0.066	0.172
	24	0.075	0.179	0.085	0.190	0.093	0.201	0.105	0.212	0.095	0.210	0.121	0.240	0.257	0.371	0.118	0.223	0.201	0.317	0.085	0.198
	48	0.102	0.209	0.119	0.228	0.125	0.236	0.159	0.264	0.121	0.240	0.202	0.317	0.379	0.463	0.155	0.260	0.333	0.425	0.127	0.238
	96	0.133	0.241	0.156	0.263	0.164	0.275	0.210	0.305	0.184	0.295	0.262	0.367	0.490	0.539	0.228	0.317	0.457	0.515	0.178	0.287
	Avg	0.093	0.198	0.107	0.213	0.113	0.221	0.137	0.240	0.119	0.233	0.169	0.281	0.326	0.419	0.147	0.248	0.278	0.375	0.114	0.224
PEMS04	12	0.069	0.170	0.074	0.175	0.078	0.183	0.085	0.189	0.079	0.188	0.098	0.218	0.219	0.340	0.087	0.195	0.148	0.272	0.073	0.177
	24	0.081	0.187	0.088	0.194	0.095	0.205	0.115	0.222	0.089	0.201	0.131	0.256	0.292	0.398	0.103	0.215	0.224	0.340	0.084	0.193
	48	0.100	0.210	0.114	0.224	0.120	0.233	0.167	0.273	0.111	0.222	0.205	0.326	0.409	0.478	0.136	0.250	0.355	0.437	0.099	0.211
	96	0.112	0.220	0.138	0.245	0.150	0.262	0.211	0.310	0.133	0.247	0.402	0.457	0.492	0.532	0.190	0.303	0.452	0.504	0.114	0.227
	Avg	0.090	0.197	0.104	0.210	0.111	0.221	0.145	0.249	0.103	0.215	0.209	0.314	0.353	0.437	0.129	0.241	0.295	0.388	0.092	0.202
PEMS07	12	0.057	0.152	0.059	0.154	0.067	0.165	0.068	0.163	0.073	0.181	0.094	0.200	0.173	0.304	0.082	0.181	0.115	0.242	0.068	0.171
	24	0.071	0.170	0.078	0.178	0.088	0.190	0.102	0.201	0.090	0.199	0.139	0.247	0.271	0.383	0.101	0.204	0.210	0.329	0.119	0.225
	48	0.080	0.175	0.100	0.201	0.110	0.215	0.170	0.261	0.124	0.231	0.311	0.369	0.446	0.495	0.134	0.238	0.398	0.458	0.149	0.237
	96	0.098	0.190	0.123	0.222	0.139	0.245	0.236	0.308	0.163	0.255	0.396	0.442	0.628	0.577	0.181	0.279	0.594	0.553	0.141	0.234
	Avg	0.076	0.172	0.090	0.189	0.101	0.204	0.144	0.233	0.112	0.217	0.235	0.315	0.380	0.440	0.124	0.225	0.329	0.395	0.119	0.234
PEMS08	12	0.071	0.169	0.075	0.174	0.079	0.182	0.098	0.205	0.083	0.189	0.165	0.214	0.227	0.343	0.112	0.212	0.154	0.276	0.087	0.184
	24	0.098	0.199	0.106	0.204	0.115	0.219	0.162	0.266	0.117	0.226	0.215	0.260	0.318	0.409	0.141	0.238	0.248	0.353	0.122	0.221
	48	0.122	0.216	0.164	0.253	0.186	0.235	0.238	0.311	0.196	0.299	0.315	0.355	0.497	0.510	0.198	0.283	0.440	0.470	0.189	0.270
	96	0.205	0.251	0.219	0.262	0.221	0.267	0.303	0.318	0.266	0.331	0.377	0.397	0.721	0.592	0.320	0.351	0.674	0.565	0.236	0.300
	Avg	0.124	0.209	0.141	0.223	0.150	0.226	0.200	0.275	0.165	0.261	0.268	0.307	0.441	0.464	0.193	0.271	0.379	0.416	0.158	0.244
$1^{st}$ Count		38	38	5	11	0	1	4	3	0	0	2	0	0	0	0	0	1	0	0	0

A quantitative comparison between SOFTS and SOFTS++ shows that the differences are relatively small but highlight meaningful trends. On average, SOFTS achieves slightly lower mean values (0.04 for both MSE and MAE) compared to SOFTS++ (0.05 for both metrics), suggesting marginally higher robustness in terms of mean performance. However, when considering the median, which better reflects typical behaviour by reducing the impact of outliers, SOFTS++ demonstrates a clear advantage: for MSE the median is equal to SOFTS (0.03), but for MAE it improves to 0.02 compared to SOFTS’s 0.03, indicating more consistent gains in standard use cases.

Looking at variability across datasets, SOFTS again shows slightly better stability, with lower standard deviations (MSE: 0.004462, MAE: 0.004063) than SOFTS++ (MSE: 0.006361, MAE: 0.006391). Taken together, these findings suggest that while SOFTS can be considered more robust due to its lower variability and mean values, SOFTS++ provides consistent improvements in median performance, which indicates stronger results in typical real-world scenarios. This makes it a more practical choice for applications, even though SOFTS still has a slight advantage in overall robustness.
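The difference between mean-based and median-based summaries of variability can be illustrated with a single column of Table 6. The sketch below takes the seed-level MSE standard deviations of STAR++ on PEMS04 across the four horizons; it covers only this one column, not the full set of setups behind the statistics quoted above, and the variable names are illustrative.

```python
import statistics

# Seed-level MSE standard deviations of STAR++ on PEMS04
# for horizons 12, 24, 48 and 96, copied from Table 6.
stds = [0.001, 0.007, 0.022, 0.003]

mean_std = statistics.mean(stds)      # pulled upwards by the single large value at horizon 48
median_std = statistics.median(stds)  # middle of the sorted values, less affected by it

print(f"mean of stds:   {mean_std:.5f}")
print(f"median of stds: {median_std:.5f}")
```

Here one unstable configuration raises the mean well above the median, which is why the two summaries can rank the models differently.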

6. Conclusions and future work

In this paper, we present SOFTS++, an enhanced linear model for multivariate long-term time-series forecasting that builds on the SOFTS architecture. By integrating selectively applied positional embeddings, multiple dropout layers, and a robust variant of the STAR module (STAR++), our approach improves predictive accuracy across a wide range of benchmark datasets while maintaining linear complexity and low computational overhead.

Extensive experiments conducted on twelve standard datasets show that SOFTS++ outperforms competitive baselines, achieving the best average performance on 10 out of 12 datasets in MAE and 11 out of 12 in MSE. SOFTS++ achieves notably strong results on high-frequency traffic datasets (PEMS03–08), which are challenging due to their large number of channels and short sampling intervals. It consistently outperforms competing models such as iTransformer and PatchTST, especially at longer forecasting horizons. This advantage stems from the centralized STAR++ module, which reduces noise sensitivity and overfitting, while stochastic positional embeddings enhance generalization. The linear complexity of the model proves to be highly effective in handling datasets with many input channels, where traditional attention-based architectures struggle to scale.

Although SOFTS++ demonstrates improved accuracy, the robustness analysis revealed inconsistent behaviour: while the model often maintains lower average errors, it occasionally exhibits higher variance across random seeds in specific configurations. A detailed study across all 48 forecasting setups showed a clear trade-off: SOFTS has lower variance, but SOFTS++ consistently achieves higher average accuracy. This indicates that SOFTS++ favours performance at the cost of slightly higher variability, which is an important aspect for practitioners to consider depending on their application.

Beyond accuracy and robustness, a key strength of SOFTS++ is its efficiency: the model retains linear complexity and was trained on a single GPU with modest runtime. This makes it suitable for resource-constrained settings such as edge devices and real-time traffic monitoring. Finally, our most significant finding is that positional embeddings need not be applied at every step; when used selectively, they yield superior results.

Future work. Our primary goal is to further improve the robustness of SOFTS++ without compromising efficiency. While the model achieves strong average performance, some datasets and horizons reveal higher variability across random seeds, making variance reduction an important next step. Future research will therefore explore new regularization methods, improved initialization, and alternative pooling mechanisms that could lower variance while preserving the linear-time and low-memory advantages of SOFTS++. Equally important is interpretability: although SOFTS++ is lightweight and scalable, understanding how individual inputs influence predictions is crucial in sensitive domains. Integrating the STAR++ module with tools such as SHAP³⁸ could provide fine-grained feature attributions and enhance transparency, which is valuable not only in energy forecasting and anomaly detection but also in traffic management, healthcare, and smart city applications.

Acknowledgements

We gratefully acknowledge the support of the HR-ZOO project for providing access to NVIDIA A100 GPUs, which significantly contributed to the computational resources used in this work.

ORCID iDs

Hrvoje Ljubić

Goran Martinović

Tomislav Volarić

Robert Rozić

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Appendix A. Supplementary materials

References

1. Katsarou. Sequential machine learning for textual and time-series data. 2025.

2. Miller, Aldosari, Saeed, et al. A survey of deep learning and foundation models for time series forecasting. arXiv preprint arXiv:2401.13912, 2024.

3. Wang, Dong, et al. Deep time series models: A comprehensive survey and benchmark. arXiv preprint arXiv:2407.13278, 2024.

4. Lim, Zohren. Time-series forecasting with deep learning: a survey. Philosop Trans R Soc A 2021; 379: 20200209.

5. Liu, Zhang, et al. itransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023.

6. Nie, Nguyen, Sinthong, et al. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022.

7. Côté, Liu. Saits: Self-attention-based imputation for time series. Expert Syst Appl 2023; 219: 119619.

8. Tuli, Casale, Jennings. Tranad: Deep transformer networks for anomaly detection in multivariate time series data. arXiv preprint arXiv:2201.07284, 2022.

9. Wen, Zhou, Zhang, et al. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125, 2022.

10. Lan, Alyakin, Oermann. Gateformer: Advancing multivariate time series forecasting through temporal and variate-wise attention with gated representations. arXiv preprint arXiv:2505.00307, 2025.

11. Wang, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv Neural Inf Process Syst 2021; 34: 22419–22430.

12. Zhou, Zhang, Peng, et al. Informer: Beyond efficient transformer for long sequence time-series forecasting. In: Proceedings of the AAAI conference on artificial intelligence, Vol. 35, 2021, pp.11106–11115.

13. Zeng, Chen, Zhang, et al. Are transformers effective for time series forecasting? In: Proceedings of the AAAI conference on artificial intelligence, Vol. 37, 2023, pp.11121–11128.

14. Han, Chen, et al. Softs: Efficient multivariate time series forecasting with series-core fusion. arXiv preprint arXiv:2404.14197, 2024a.

15. Zeiler, Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.

16. Hochreiter, Schmidhuber. Long short-term memory. Neural Comput 1997; 9: 1735–1780.

17. Cho, Van Merriënboer, Gulcehre, et al. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

18. Pagliaro. Artificial intelligence vs. efficient markets: A critical reassessment of predictive models in the big data era. Electronics (2079-9292) 2025; 14.

19. Spiliotis. Time series forecasting with statistical, machine learning, and deep learning methods: Past, present, and future. In: Forecasting with artificial intelligence: Theory and applications, 2023, pp.49–75. Springer.

20. Waqas, Humphries. A critical review of rnn and lstm variants in hydrological time series predictions. MethodsX 2024; 13: 102946.

21. Sun. The evolution of transformer models from unidirectional to bidirectional in natural language processing. Appl Comput Eng 2024; 42: 281–289.

22. Wibawa, Kurniawan, et al. Advancements in natural language processing: Implications, challenges, and future directions. Telemat Informat Rep 2024; 16: 100173.

23. Zhang, Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In: The eleventh international conference on learning representations, 2023.

24. Chen, Yoder, et al. Tsmixer: An all-mlp architecture for time series forecasting. arXiv preprint arXiv:2303.06053, 2023.

25. Zhou, Wen, et al. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In: International conference on machine learning, 2022, pp.27268–27286. PMLR.

26. Liu, Wang, Vaidya, et al. Kan: Kolmogorov-arnold networks. arXiv preprint arXiv:2404.19756, 2024.

27. Han, Zhang, et al. Kan4tsf: Are kan and kan-based models effective for time series forecasting? arXiv preprint arXiv:2408.11306, 2024b.

28. Srivastava, Hinton, Krizhevsky, et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014; 15: 1929–1958.

29. Vaswani, Shazeer, Parmar, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30: 5998–6008.

30. Ahmed, et al. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024; 568: 127063.

31. Gehring, Auli, Grangier, et al. Convolutional sequence to sequence learning. In: International conference on machine learning, 2017, pp.1243–1252. PMLR.

32. Hendrycks, Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.

33. Luo, Xie, et al. Softs: Efficient multivariate time series forecasting with series-core fusion. arXiv preprint arXiv:2404.14197, 2024.

34. Huang, Sun, Liu, et al. Deep networks with stochastic depth. In: European conference on computer vision, 2016, pp.646–661. Springer.

35. Das, Kong, Leach, et al. Long-term forecasting with tide: Time-series dense encoder. arXiv preprint arXiv:2304.08424, 2023.

36. Liu, et al. Timesnet: Temporal 2D-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186, 2022.

37. Liu, Zeng, Chen, et al. Scinet: Time series modeling and forecasting with sample convolution and interaction. Adv Neural Inf Process Syst 2022; 35: 5816–5828.

38. Lundberg, Lee. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst 2017; 30: 4768–4777.