Abstract
Autonomous anomaly detection in multivariate time-series is essential for ensuring safety in industrial and aerospace systems, yet is hindered by high-dimensional signals and limited fault samples. This paper proposes a self-supervised autoencoder that integrates multi-scale convolutions (kernel sizes 3/5/7) with a Transformer encoder augmented by T5-style relative position bias, enabling joint modeling of local transient patterns and long-range temporal dependencies. To enhance representation learning under scarce anomaly labels, we leverage SimCLR-based contrastive pretraining to refine normal-data feature distributions. To address fragmented detections, we design an event-level aggregation strategy that merges window-level reconstruction errors via Top-k peak averaging and rule-based filtering. Evaluated on the SMAP benchmark and our in-house industrial dataset, named CAFUC2, our method achieves event-level F1 of 0.97 and 0.83, respectively, outperforming traditional baseline methods and demonstrating robust generalization to real-world flight anomalies.
Keywords
Introduction
Background
Time-series data are widely applied in industrial manufacturing, aerospace, and financial risk control, serving as a foundation for monitoring equipment status, operational performance, and business indicators (Chevtchenko et al., 2023; Elía & Pagola, 2024; Hilal et al., 2022). In the Industrial Internet of Things (IIoT), tens of thousands of sensors continuously generate large-scale, dynamic, and heterogeneous streams, where timely anomaly detection is critical for preventing equipment failures and network incidents (Chatterjee & Ahmed, 2022; Chevtchenko et al., 2023). In aerospace, multivariate telemetry sequences must be monitored for anomalies to predict potential failures and ensure mission safety (Fejjari et al., 2025). In financial markets, anomalies in transactions or indicators may signal fraud or market disturbances, requiring prompt detection for effective risk control (Hilal et al., 2022). Therefore, reliable anomaly detection in time-series data has become a key requirement across safety-critical domains.
Multivariate anomaly detection is more complex than univariate detection, as it involves high-dimensional, heterogeneous data with intricate correlations and temporal dependencies (Wang et al., 2025). Recent studies emphasize that two core challenges are capturing inter-variable dependencies and coping with the scarcity of labeled anomalies. Attention-based autoencoders and hybrid convolution–Transformer models have demonstrated the ability to model both local and global structures without reliance on labels (Liu et al., 2024; Vaswani et al., 2017). However, single-scale convolutions in existing hybrid models fail to capture heterogeneous anomalies (e.g., short-term spikes and long-term drifts in flight telemetry), while vanilla Transformers lack inductive bias for local pattern extraction. This motivates the fusion of multi-scale convolutions and position-aware self-attention to achieve complementary strengths in local and global modeling.
At the same time, self-supervised learning, particularly contrastive pretraining, provides a theoretical basis for mitigating class imbalance by enforcing representation consistency across augmented views of normal data (Chen et al., 2020). This enhances the ability to detect rare or unseen anomalies during deployment. SimCLR mitigates the limited availability of anomaly samples through two complementary mechanisms: Contrastive invariance: by maximizing the consistency between augmented views (e.g., cropping, jittering) of normal data via Normalized Temperature-scaled Cross Entropy (NT-Xent) loss, the encoder learns compact, class-discriminative embeddings that reduce overfitting to normal-data noise; Latent-space compactness: the pretrained encoder maps normal samples to dense clusters in latent space, such that rare anomalies—unseen during pretraining—tend to be separated due to larger embedding distances. This provides an unsupervised mechanism to alleviate the imbalance between normal and anomalous samples.
To address the above challenges, this paper introduces a self-supervised anomaly detection framework, with the following contributions:
Core Contribution: Multi-Scale Convolutional Autoencoder. We design a multi-scale convolutional autoencoder (MSCA-Encoder) that fuses 1D convolutions (kernel sizes 3/5/7) with Transformer self-attention augmented by T5-style relative position bias. Unlike existing hybrid models (e.g., Anomaly Transformer Xu et al., 2021, which highlights temporal association discrepancies, or vanilla Transformer AE lacking local inductive bias), this architecture achieves complementary strengths: multi-scale convolutions capture heterogeneous anomalies (short-term spikes from hard landings, long-term drifts from sensor degradation), while T5-style position bias enhances the Transformer's ability to model temporal order, enabling joint learning of local transient patterns and long-range dependencies.
Contrastive Pretraining for Scarce Anomalies. We propose an end-to-end integration of SimCLR-based pretraining with MSCA-Encoder. Specifically, the encoder is first pretrained on normal data via NT-Xent contrastive loss to learn compact and discriminative representations, and then the full autoencoder (encoder + symmetric decoder) is fine-tuned with reconstruction loss. This strategy mitigates the scarcity of labeled anomalies by reducing overfitting to normal-data noise.
Event-Level Aggregation for Coherent Detection. We develop a two-stage aggregation strategy to address granularity inconsistency in window-level anomaly scoring: (i) Window-level scoring: compute reconstruction errors for each sliding window and aggregate the top-3 error peaks to suppress isolated noise; (ii) Event-level aggregation: merge adjacent anomalous windows (gap ≤ window length L = 50) and filter short events (<50 time steps) via a 90% quantile rule. This strategy significantly reduces fragmented alarms compared to fixed-window thresholding methods (e.g., LSTM-NDT Hundman et al., 2018).
Extensive Validation on Benchmark and Industrial Data. We evaluate MSCA-AD on two datasets: (i) SMAP benchmark (NASA telemetry data), achieving event-level F1 = 0.97 and outperforming traditional baselines such as OmniAnomaly and LSTM-VAE; (ii) our in-house industrial dataset, named CAFUC2 (flight telemetry with four anomaly scenarios including Mixed Faults and Hard Landings), achieving an average event-level F1 = 0.83 and demonstrating robustness and adaptability to real-world aerospace anomalies.
Recent advances in multivariate time-series anomaly detection have leveraged deep generative and attention-based architectures to capture both local patterns and global dependencies in complex, high-dimensional data.
Variational autoencoder–based methods. OmniAnomaly employs a stochastic recurrent neural network that integrates GRU cells with a variational autoencoder to learn probabilistic latent representations of normal time-series behavior, enabling robust anomaly scoring via reconstruction probability (Su et al., 2019). LSTM-VAE extends this line by combining recurrent encoders with variational latent variables to improve sequence reconstruction, though it is prone to underestimating rare, long-lasting anomalies (Park et al., 2018).
Probabilistic mixture–based approaches. DAGMM fuses an autoencoder with Gaussian mixture modeling in the latent space, thereby combining feature compression with probabilistic density estimation. While effective for compact representations, it often struggles on high-dimensional multivariate streams with long-range dependencies (Zong et al., 2018). LSTM-NDT instead adopts a forecasting-based paradigm, where an LSTM network predicts future values and anomaly scores are derived from prediction residuals (Hundman et al., 2018). This method improves interpretability but may suffer from error accumulation over long sequences.
Adversarial and attention-based models. USAD amplifies the reconstruction error of anomalous instances via a dual autoencoder trained adversarially, but suffers from training stability issues (Audibert et al., 2020). Recently, Anomaly Transformer replaces the standard self-attention mechanism with a dual-branch anomaly attention mechanism that explicitly quantifies the discrepancy between prior association and sequence association, thereby revealing temporal deviations (Xu et al., 2021). While it achieves state-of-the-art point-wise accuracy, the lack of explicit event-level aggregation may lead to alarm fragmentation when anomalies span multiple subsequences. Its performance also relies on a specialized “association discrepancy” mechanism, and the complex attention discrepancy calculations result in high computational and memory costs in long-sequence or high-frequency sliding-window scenarios. In contrast, our method simultaneously captures local and long-range features through multi-scale convolution and self-attention with relative position bias, and combines Top-3 peak aggregation, an event-level merging strategy, and SimCLR contrastive pretraining to improve event-level interpretability and engineering efficiency while reducing hyperparameter sensitivity and computational burden.
Other generative paradigms. Deep SVDD adapts one-class support vector data description to deep networks by learning hyperspherical decision boundaries around normal data (Ruff et al., 2018). Normalizing flow–based methods, such as AFNF, combine flow-based density estimation with separable attention modules to provide exact likelihood scores (Guan et al., 2023). GAN-based models (e.g., MAD-GAN) use adversarial sequence generation and discrimination to capture complex temporal correlations (Li et al., 2019). Self-supervised contrastive learning frameworks like SimCLR enhance representation quality by maximizing agreement between augmented views, improving robustness in anomaly detection (Chen et al., 2020).
Building on SimCLR-style contrastive pretraining, recent studies have explored broader self-supervised paradigms for time-series representation learning. TimeGPT introduced a foundation model pretrained on millions of cross-domain time-series, demonstrating that large-scale temporal pretraining can generalize to forecasting and anomaly detection with minimal fine-tuning (Vishwas & Macharla, 2025). The TS2Vec family validates the effectiveness of hierarchical contrastive learning for universal sequence embeddings (Yue et al., 2022). FEAT proposed a feature-aware temporal encoder that jointly models inter-variable relations and temporal structure, boosting multivariate robustness under limited supervision (Kim et al., 2023). More recent diffusion- or generative-based self-supervised methods (e.g., Diffusion Auto-regressive Transformer Wang et al., 2024) further extend temporal pretraining into probabilistic generative regimes. Collectively, these works highlight a growing trend toward general-purpose, pretrain–finetune frameworks for time-series anomaly detection, aligning with MSCA-AD's design philosophy of leveraging self-supervised contrastive pretraining to enhance downstream reconstruction-based detection.
Taken together, the aforementioned approaches exhibit complementary strengths in capturing local irregularities, long-range dependencies, and probabilistic uncertainty. However, most methods emphasize either short-term or long-term patterns, or focus solely on window-level scoring. In contrast, our approach couples a multi-scale convolutional encoder with Transformer layers enhanced by relative position bias, and explicitly performs event-level aggregation to merge window-level evidence into coherent anomaly events. This combination improves sensitivity to both short transient spikes and long-term drifts, while reducing fragmented detections in practical deployments. A concise comparison of representative methods and their main characteristics is provided in Table 1.
Comparison with Representative Anomaly Detection Methods.
Overall Structure
Figure 1 presents our three-stage anomaly-detection pipeline at a glance. We begin by standardizing the raw multichannel SMAP signals with both global and channel-wise Z-score normalization, then segment them into overlapping sliding windows to form the training and test sets. In the training stage, three baseline autoencoders (ConvAE, TransformerAE, and Conv + Attention AE) are trained under the same optimization schedule with dynamic thresholding. The best performer is chosen and its encoder is replaced by our multi-scale Conv + Attention design, which is further refined via SimCLR self-supervised pretraining: each sliding window is augmented into two views, and the encoder learns view-invariant representations through the NT-Xent loss, producing compact, discriminative latent features. At inference time, we compute the reconstruction error for every window, apply a dynamic percentile threshold, average the top three error peaks, and merge adjacent detections using a 90% quantile rule to produce high-precision anomaly calls.

Overall workflow of the proposed multivariate time-series anomaly detection method.
The SMAP dataset consists of 54 multivariate sensor channels, covering both normal operations and abnormal events with annotated time intervals. The dataset is divided into training and test subsets.
Let
Where
In our experiments, Z-score normalization is essential to mitigate scale differences across sensors and to stabilize the training of neural networks. By centering each channel at zero with unit variance, we ensure that all features contribute comparably to the learning process and prevent channels with large absolute readings from dominating the gradient updates.
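The channel-wise normalization described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the `eps` guard for constant channels, and the choice to fit statistics on the training split only are assumptions.

```python
import numpy as np

def zscore_fit(train: np.ndarray, eps: float = 1e-8):
    """Estimate per-channel mean and std from training data of shape (T, C)."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    # Assumption: clamp tiny stds so constant channels do not divide by zero.
    return mu, np.maximum(sigma, eps)

def zscore_apply(x: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Center each channel at zero and scale it to unit variance."""
    return (x - mu) / sigma

# Synthetic three-channel signal with very different scales per channel.
rng = np.random.default_rng(0)
train = rng.normal(loc=[0.0, 100.0, -5.0], scale=[1.0, 50.0, 0.1], size=(1000, 3))
mu, sigma = zscore_fit(train)
train_norm = zscore_apply(train, mu, sigma)
```

After this step every channel has comparable magnitude, so no single sensor dominates the gradient updates.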
After normalization, we segment the sequence into overlapping windows using a sliding window approach. We set the window length to
This procedure produces a window set
Each window

Sliding-window segmentation of the input sequence.
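As a rough sketch of the segmentation step, the snippet below cuts a normalized sequence into overlapping windows. The window length of 50 follows the value stated in the contributions; the stride of 10 and the function name are illustrative assumptions.

```python
import numpy as np

def sliding_windows(x: np.ndarray, length: int = 50, stride: int = 10) -> np.ndarray:
    """Cut a (T, C) sequence into overlapping windows of shape (N, length, C)."""
    total = x.shape[0]
    if total < length:
        raise ValueError("sequence is shorter than one window")
    # Window start indices; a trailing remainder shorter than `length` is dropped.
    starts = np.arange(0, total - length + 1, stride)
    return np.stack([x[s:s + length] for s in starts])

x = np.arange(200 * 2, dtype=float).reshape(200, 2)   # toy sequence, T=200, C=2
windows = sliding_windows(x, length=50, stride=10)
```

With T = 200, a length of 50, and a stride of 10, this yields floor((200 - 50) / 10) + 1 = 16 windows.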
Our encoder consists of two components: multi-scale convolution and Transformer self-attention. Given an input window
Then, we concatenate the outputs along the channel dimension:
Here,
To concatenate multi-scale features without temporal distortion, we apply symmetric zero-padding before convolution. Specifically, for kernel size
Next, the feature map
Where
To compute the attention score between position
Attention weights are then obtained by applying the softmax over the key dimension:
While convolution provides strong inductive bias for local patterns, its receptive field is still limited. Self-attention complements this by building direct connections between any two time steps in the window, allowing the network to model long-range dependencies—crucial for detecting anomalies that depend on the broader context (e.g., a subtle drift only anomalous relative to remote history).
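To make the role of the relative position bias concrete, here is a minimal single-head sketch in NumPy. It is an illustration under several assumptions: the bias is one scalar per clipped offset (a simplification of T5's bucketed, per-head bias), the logits use the 1/sqrt(d) scaling of Vaswani et al. (2017), and all names and dimensions are hypothetical.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rel_bias_attention(h, Wq, Wk, Wv, bias_table, max_dist):
    """Single-head self-attention with an additive relative position bias.

    h: (L, d) window features; bias_table: (2 * max_dist + 1,) scalars,
    one per clipped relative offset j - i.
    """
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    n = h.shape[0]
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]   # entry (i, j) holds j - i
    idx = np.clip(offsets, -max_dist, max_dist) + max_dist    # map offsets to table rows
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias_table[idx]
    weights = softmax(scores, axis=-1)                        # normalize over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
L, d = 50, 16
h = rng.normal(size=(L, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
bias_table = rng.normal(size=(2 * 8 + 1,))
out, weights = rel_bias_attention(h, Wq, Wk, Wv, bias_table, max_dist=8)
```

Because the bias depends only on the offset j - i, the same table entry is reused wherever two time steps are equally far apart, which is what gives the attention its sense of temporal order.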
We fuse these two representations via a residual connection and layer normalization:
The residual shortcut ensures that convolutional features are preserved even if attention weights are small, and layer normalization stabilizes training by keeping the combined features on a consistent scale.
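The fusion step can be illustrated with a few lines of NumPy. This sketch assumes a layer normalization without learnable gain and bias, and that the convolutional and attention features share the same shape; the names are hypothetical.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each time step across its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def fuse(conv_feat: np.ndarray, attn_out: np.ndarray) -> np.ndarray:
    """Residual shortcut followed by layer normalization."""
    return layer_norm(conv_feat + attn_out)

rng = np.random.default_rng(0)
conv_feat = rng.normal(size=(50, 96))     # (L, channels) convolutional features
attn_out = np.zeros_like(conv_feat)       # extreme case: attention contributes nothing
fused = fuse(conv_feat, attn_out)
```

In the extreme case shown, where the attention branch outputs zeros, the fused features reduce to the normalized convolutional features, which is the preservation property the residual shortcut is meant to provide.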
The decoder mirrors the encoder structure using a symmetric 1D convolutional network (or optionally an MLP), and produces a reconstruction
In our experiments, this combined conv–attention encoder achieved significantly lower reconstruction error on normal windows while amplifying errors on anomalous windows, leading to improved F1 over single-scale or purely convolutional/transformer baselines. Figure 3 illustrates the overall architecture of the proposed encoder.

Overall architecture of the proposed encoder.
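The multi-scale convolution stage of this architecture, with symmetric zero-padding so that all branches keep the window length, might look like the sketch below. Kernel sizes 3/5/7, 25 input channels, and 32 output channels per branch follow the text; the ReLU activation, the random weights, and the function names are assumptions.

```python
import numpy as np

def conv1d_same(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """1D cross-correlation with symmetric zero-padding.

    x: (L, C_in); w: (k, C_in, C_out) with odd k; returns (L, C_out).
    """
    k = w.shape[0]
    p = (k - 1) // 2                                  # pad p zeros on each side
    xp = np.pad(x, ((p, p), (0, 0)))
    out = np.stack([np.einsum("kc,kco->o", xp[t:t + k], w) for t in range(x.shape[0])])
    return out + b

def multi_scale_conv(x, weights, biases):
    """Run parallel branches and concatenate them along the channel axis."""
    feats = [np.maximum(conv1d_same(x, w, b), 0.0)    # ReLU is an assumption
             for w, b in zip(weights, biases)]
    return np.concatenate(feats, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 25))                         # one window: L=50, C_in=25
weights = [rng.normal(size=(k, 25, 32)) * 0.1 for k in (3, 5, 7)]
biases = [np.zeros(32) for _ in range(3)]
features = multi_scale_conv(x, weights, biases)
```

Each branch returns 50 time steps regardless of kernel size, so the three outputs can be concatenated into a (50, 96) feature map without any temporal misalignment.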
To overcome the scarcity of labeled anomalies and learn robust representations from abundant unlabeled telemetry data, we pretrain the entire multi-scale convolution + Transformer encoder using a SimCLR-style contrastive objective. This integration ensures that both the convolutional layers (capturing local variations at different temporal resolutions) and the attention layers (capturing long-range dependencies) are jointly optimized to produce invariant and discriminative representations. In contrast to applying contrastive learning on a standalone encoder, our approach aligns the pretraining objective directly with the architecture used in downstream anomaly detection.
Augmentation Strategy. For each input window Random cropping: sample a contiguous sub-window of length Temporal jittering: add Gaussian noise Scaling: multiply the entire window by a random factor drawn from
These augmentations preserve the core normal patterns while creating diverse positive pairs for contrastive learning.
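A minimal sketch of the three augmentations is given below. The crop length, noise standard deviation, and scaling range are placeholder values, and the uniform distribution for the scaling factor is an assumption, since the paper's exact settings are not reproduced here.

```python
import numpy as np

def random_crop(x, crop_len, rng):
    """Sample a contiguous sub-window of `crop_len` time steps from a (L, C) window."""
    start = rng.integers(0, x.shape[0] - crop_len + 1)
    return x[start:start + crop_len]

def jitter(x, sigma, rng):
    """Add zero-mean Gaussian noise to every entry."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, low, high, rng):
    """Multiply the whole window by one random factor (uniform is an assumption)."""
    return x * rng.uniform(low, high)

def two_views(x, rng, crop_len=40, sigma=0.05, low=0.9, high=1.1):
    """Produce two independently augmented views of the same window."""
    def one_view():
        return scale(jitter(random_crop(x, crop_len, rng), sigma, rng), low, high, rng)
    return one_view(), one_view()

rng = np.random.default_rng(0)
window = rng.normal(size=(50, 25))
view_a, view_b = two_views(window, rng)
```

The two views share the same underlying normal pattern but differ in offset, noise, and amplitude, which is what makes them a useful positive pair.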
Contrastive Loss. Let
We adopt the NT-Xent loss (Chen et al., 2020) to enforce invariance across augmented views. For a positive pair
Training Protocol. Pretraining is performed for 100 epochs using the Adam optimizer (learning rate
Benefits and Ablation. Through this design, SimCLR aligns the convolutional and attention components of the encoder to produce perturbation-invariant yet discriminative features. During fine-tuning, these pretrained features provide a stable representation space where reconstruction errors concentrate around abnormal segments. As illustrated in Figure 4, two augmented views of each input window are encoded and projected.

SimCLR-based self-supervised pretraining pipeline. Two augmented views of the same window are encoded and projected, and the NT-Xent loss pulls positive pairs together while pushing apart other samples.
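The NT-Xent objective of Chen et al. (2020) can be written compactly in NumPy, as sketched below. The temperature of 0.5, the batch size, and the embedding dimension are illustrative assumptions and not the paper's settings.

```python
import numpy as np

def nt_xent(z1: np.ndarray, z2: np.ndarray, tau: float = 0.5) -> float:
    """NT-Xent loss for N positive pairs; row i of z1 pairs with row i of z2."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = (z @ z.T) / tau
    np.fill_diagonal(sim, -np.inf)                     # a sample is never its own negative
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Stable log-sum-exp over all other samples in the batch.
    m = sim.max(axis=1, keepdims=True)
    log_denominator = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(2 * n), pos]))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = nt_xent(z, z + 0.01 * rng.normal(size=z.shape))   # views agree
loss_random = nt_xent(z, rng.normal(size=z.shape))               # views unrelated
```

When the two views of each window map to nearby embeddings the loss is small, and it grows when positives are no closer than the other samples in the batch.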
During the SimCLR pretraining phase, the NT-Xent loss enables the encoder to learn feature representations that are “robust to perturbations in normal data and sensitive to anomalous patterns”. This not only provides high-quality initialized weights for fine-tuning but also offers dual support for accurate anomaly detection:
On one hand, SimCLR pretraining optimizes the starting point and process of fine-tuning through three core mechanisms: Representation compactification: Treating different data augmentations of the same normal sample as positive pairs, the encoder is encouraged to map normal samples into denser clusters. This forms a normal sample manifold with clear boundaries in the reconstruction space, providing an explicit reference for subsequent anomaly discrimination. Noise invariance: Through augmentation consistency constraints, interference from harmless perturbations (e.g., jitter, scaling) on the encoded representations is suppressed. This allows the reconstructor during fine-tuning to avoid fitting random noise in normal data and focus more on the reconstruction deviations of anomalous signals. Accelerated convergence and overfitting resistance: Pretraining drives the encoder parameters to an initial state conducive to generalization, significantly reducing the magnitude of parameter updates during fine-tuning. This not only shortens the convergence cycle but also decreases the sensitivity to thresholds/hyperparameters on small-scale validation sets.
On the other hand, the pretrained multi-scale convolution + Transformer encoder provides feature representations that are both perturbation-invariant and temporally discriminative. During fine-tuning, convolutional layers emphasize localized deviations in telemetry signals, while Transformer attention highlights long-range contextual inconsistencies. This division of labor, already shaped by contrastive pretraining, enables the reconstruction loss to focus on structured anomalies rather than benign perturbations. As a result, errors become more concentrated on abnormal segments, forming a robust basis for subsequent Top-k aggregation.
After self-supervised pretraining, we fine-tune the encoder in an end-to-end autoencoder for anomaly detection. Specifically, we attach a symmetric decoder and optimize the per-window reconstruction loss:
At inference time, each window's anomaly score
To reduce sensitivity to isolated spurious spikes, we aggregate the top-
This Top-k strategy naturally complements SimCLR pretraining: by averaging only the largest errors, it emphasizes persistent deviations that the pretrained encoder cannot reconcile with normal patterns, while suppressing transient noise that contrastive training has already rendered invariant. Hence, the joint effect is a scoring mechanism that highlights true anomalies without being distracted by minor perturbations. A dynamic threshold

Fine-tuning and Top-k scoring pipeline. The pretrained encoder is coupled with a decoder, reconstruction errors are computed, and the Top-k peak averaging produces robust anomaly scores.
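The scoring stage can be sketched as follows. This is one plausible reading, not the authors' implementation: the per-time-step error is taken as the squared error averaged over channels, the k = 3 largest values in each window are averaged, and the dynamic threshold is approximated by a fixed percentile (q = 90 is an assumption).

```python
import numpy as np

def topk_window_score(x: np.ndarray, x_hat: np.ndarray, k: int = 3) -> np.ndarray:
    """Mean of the k largest per-time-step errors in each window.

    x, x_hat: (N, L, C) input windows and their reconstructions.
    """
    err = ((x - x_hat) ** 2).mean(axis=-1)     # (N, L) error per time step
    topk = np.sort(err, axis=1)[:, -k:]        # k largest errors in each window
    return topk.mean(axis=1)

def percentile_threshold(scores: np.ndarray, q: float = 90.0) -> float:
    """Percentile-based threshold over the window scores."""
    return float(np.percentile(scores, q))

# Toy example: 5 windows, perfect reconstruction except three steps of window 2.
x = np.zeros((5, 50, 4))
x_hat = x.copy()
x_hat[2, [10, 11, 12], :] += 2.0
scores = topk_window_score(x, x_hat, k=3)
threshold = percentile_threshold(scores, q=90.0)
flags = scores > threshold
```

Averaging the top three peaks rather than taking the single maximum means one isolated spike is diluted by its two smaller neighbours, while a deviation that persists over several steps keeps a high score.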
To transform window-level scores into coherent anomaly events, we apply a simple gap-based merging and length filtering. Thresholding: mark a window as anomalous when its score exceeds the dynamic threshold. Gap-based merging: consecutive anomalous windows whose start-to-start distance does not exceed the window length L are merged into a single event. Length filtering: discard any merged event shorter than L time steps to suppress isolated noise.
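The merging and filtering steps can be sketched as below. Several details are assumptions made for illustration: the stride of 10, the convention that an event spans from its first flagged window start to one stride past its last flagged start, and the half-open interval representation.

```python
def merge_events(starts, flags, stride=10, max_gap=50, min_len=50):
    """Merge flagged windows into events and drop short ones.

    starts: window start indices in increasing order; flags: booleans per window.
    Returns a list of half-open (begin, end) intervals in time steps.
    """
    events = []                                  # each entry: [first_start, last_start]
    for s, flagged in zip(starts, flags):
        if not flagged:
            continue
        if events and s - events[-1][1] <= max_gap:
            events[-1][1] = s                    # close enough: extend the current event
        else:
            events.append([s, s])                # otherwise open a new event
    # Assumed extent convention: one stride past the last flagged start.
    spans = [(first, last + stride) for first, last in events]
    return [(a, b) for a, b in spans if b - a >= min_len]

starts = list(range(0, 300, 10))
flagged_starts = {20, 30, 40, 50, 60, 200}
flags = [s in flagged_starts for s in starts]
events = merge_events(starts, flags)
```

In this toy run the five consecutive flagged windows merge into one event covering steps 20 to 70, while the lone flagged window at step 200 is discarded by the length filter.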
For evaluation, we adopt the overlap-based criterion of Hundman et al. (2018). Each predicted event is compared to ground-truth anomaly intervals: if it overlaps any true interval, it counts as a True Positive (TP); otherwise, as a False Positive (FP). Any ground-truth interval not overlapped by any prediction is a False Negative (FN).
Precision, Recall, and F1 are then computed at the event level:
This evaluation better reflects real-world requirements by rewarding the detection of contiguous anomaly segments and penalizing fragmented or spurious alerts.
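The overlap-based counting rule described above translates directly into code. The sketch below follows the stated definitions of TP, FP, and FN; the half-open interval convention and the function names are assumptions.

```python
def overlaps(a, b):
    """True if half-open intervals a = (a0, a1) and b = (b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def event_prf(pred, truth):
    """Event-level precision, recall, and F1 under the overlap criterion."""
    tp = sum(any(overlaps(p, t) for t in truth) for p in pred)      # predictions hitting a true event
    fp = len(pred) - tp                                             # predictions hitting nothing
    fn = sum(not any(overlaps(t, p) for p in pred) for t in truth)  # true events never hit
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred = [(20, 70), (150, 160), (400, 450)]
truth = [(30, 60), (300, 350)]
precision, recall, f1 = event_prf(pred, truth)
```

Here one prediction overlaps a true event and two do not, and one true event is missed, giving precision 1/3, recall 1/2, and F1 = 0.4.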
Experimental Results and Discussion
This section presents the detailed implementation of the proposed unsupervised anomaly detection framework based on a multi-scale convolutional and self-attention autoencoder, along with the experimental settings and baseline comparisons.
Experimental Setup
Datasets and Preprocessing
We evaluated our method on two datasets: the SMAP dataset and our internal CAFUC2 industrial dataset.
For SMAP, we normalized the 25 multivariate telemetry channels using Z-score normalization and then partitioned them into overlapping windows of length
Sliding-Window Statistics of the SMAP Dataset.
The CAFUC2 dataset was collected from 10 civil aviation training aircraft during actual flight missions. It contains over 50 heterogeneous telemetry channels, including positional data (latitude, longitude, altitude), attitude information (pitch, roll, yaw angles), and engine performance metrics (RPM, fuel flow, exhaust gas temperature). The signals are sampled at 1 Hz.
Implementation and hyperparameters are summarized in Tables 3 and 4. Table 3 details our hardware platform and model architecture. Table 4 reports the settings for SimCLR pretraining, downstream fine-tuning, and the Top-3 scoring and thresholding strategies (including window and event-level thresholds).
Hardware and Model Architecture.
Training Stages, Settings, and Thresholding.
To quantify the computational cost of MSCA-AD, we analyze both its training and inference complexity. The multi-scale convolutional encoder employs three parallel 1-D convolution branches (kernel sizes 3/5/7, input channels = 25, output = 32 each), resulting in approximately O(3 × L × C_in × C_out) ≈ 1.2 × 10^6 multiply-accumulate (MAC) operations per window of length L = 50.
The Transformer component consists of three encoder layers (hidden dimension = 128, 8 heads), each contributing roughly O(L^2 × d_model) ≈ 3.2 × 10^5 MACs per layer.
Overall, the full MSCA-AD forward pass requires approximately 4.2 × 10^6 FLOPs per window, which is comparable to a 3-layer ConvAE and significantly lighter than the Anomaly Transformer (≈8.7 × 10^6 FLOPs).
On an NVIDIA T4 GPU, SimCLR pretraining for 20 epochs over 135k SMAP windows required approximately 2.8 GPU-hours, while fine-tuning took about 2.5 GPU-hours, leading to a total training cost of ≈5 GPU-hours. During inference, the average per-window latency was 1.8 ms.
Module Ablation Study
Starting from the Conv + Attention AE baseline (F1 = 0.37), we incrementally add each component to quantify its contribution (Table 5).
Event-Level F1 of Model Variants. Relative Gain Over Conv + Attention AE Baseline.
First, integrating multi-scale convolutions with 3/5/7 kernels improves the F1 to 0.42 (+13.5%). Next, incorporating the T5-style relative positional bias into the attention score computation further raises performance to 0.48 (+29.7%).
These results indicate that while multi-scale convolution captures complementary temporal patterns, relative position bias provides additional temporal order information that enhances anomaly discrimination.
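The relative gains quoted above follow from the usual definition, (F1 − F1_baseline) / F1_baseline. A short check using only the F1 values stated in the text (the variable names are illustrative):

```python
baseline_f1 = 0.37   # Conv + Attention AE baseline
variants = {
    "multi-scale conv (3/5/7)": 0.42,
    "+ relative position bias": 0.48,
}

# Relative gain over the baseline, in percent, rounded to one decimal place.
relative_gain = {
    name: round((f1 - baseline_f1) / baseline_f1 * 100, 1)
    for name, f1 in variants.items()
}
```

This reproduces the +13.5% and +29.7% figures, both measured against the 0.37 baseline rather than against the preceding variant.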
Applying Top-3 Peak Aggregation (mean of Top-3 window errors), gap merging
Top-k Aggregation. We vary

F1 sensitivity to kernel-set and channel count: single, two, three scales.
Top-k Sensitivity on SMAP (
Sliding-Window Length and Stride Sensitivity. We conduct ablation studies on the sliding-window length
Impact of Window Length
We also analyze the effect of stride s on detection and computational cost (Table 8). Smaller strides provide finer-grained coverage, capturing short anomalies that span window boundaries, while larger strides reduce computation at the expense of recall. Based on this study,
Stride Sensitivity Analysis on SMAP Telemetry.
To study convolutional design choices, Figure 6 plots event-level F1 for varying channel counts (8, 16, 32, 64, 128) across single-, two-, and three-scale kernel sets. Three-scale
Noise and Missing-Data Robustness
To evaluate robustness, we conducted controlled experiments under two perturbation settings: (1) Gaussian noise injection with standard deviations
These results suggest that MSCA-AD maintains stable event-level detection accuracy up to moderate noise (
Event-Level Threshold Sensitivity
We vary the event-merge percentile

Sensitivity of Event-Merge Percentile
At lower
We assess event-level performance using Precision, Recall, and F1. Table 11 compares our best configuration (Multi-Scale + SimCLR + Top-3) against representative baselines from the literature. Baseline results are taken from the original publications and not re-implemented.
Noise Robustness of MSCA-AD Under Gaussian Perturbations on the SMAP Dataset.
Missing-Data Robustness of MSCA-AD Under Random Channel Dropout on the SMAP Dataset.
Event-Level Detection Performance Comparison.
Compared to OmniAnomaly (Su et al., 2019), which uses a stochastic recurrent VAE to model latent distributions, our multi-scale convolutions yield richer local features and SimCLR pretraining produces more discriminative embeddings—resulting in +20 pp precision and +13 pp F1 improvement. Against LSTM-VAE (Park et al., 2018), our self-attention encoder captures long-range dependencies missed by pure recurrence, boosting recall by 25 pp. DAGMM (Zong et al., 2018) jointly models clustering and reconstruction but lacks explicit temporal context mechanisms; our Top-3 peak averaging captures persistent error signals, improving F1 by 26 pp. Finally, LSTM-NDT (Hundman et al., 2018) applies nonparametric thresholding on LSTM forecasts—while balanced, it achieves F1 = 0.89 versus our 0.97, demonstrating the strength of combining self-supervised pretraining with robust aggregation. Anomaly Transformer (Xu et al., 2021) also reports F1 = 0.97, but relies on a specialized discrepancy-based attention mechanism, whereas our approach attains the same accuracy through a simpler multi-scale and contrastive learning design that is easier to integrate and interpret.
These results confirm that integrating multi-scale feature extraction, contrastive self-supervision, and peak-based event aggregation is essential for state-of-the-art unsupervised multivariate time-series anomaly detection.
To evaluate MSCA-AD's generalizability, we tested it on four representative anomaly scenarios from the CAFUC2 dataset, each comprising a normal flight segment followed by a labeled anomaly: (1) Mixed Faults, (2) Hard Landing, (3) Aggressive Pitch Maneuver, and (4) Dolphin Jump. Figure 8 visualizes the 3D trajectories with detected anomalies highlighted in red, and Tables 12–15 report the corresponding event-level Precision, Recall, and F1.

3D Flight Trajectories for Four Anomaly Scenarios. Detected anomalies are indicated by the non-smooth trajectory segments: (a) Mixed Faults, (b) Hard Landing, (c) Aggressive Pitch Maneuver, and (d) Dolphin Jump.
Mixed Faults: Performance Comparison.
Hard Landing: Performance Comparison.
Aggressive Pitch Maneuver: Performance Comparison.
Dolphin Jump: Performance Comparison.
We evaluate MSCA-AD against representative baselines on four flight anomaly scenarios from the CAFUC2 dataset. Each table presents Precision, Recall, F1, and AUROC for the corresponding scenario.
Across all four scenarios, MSCA-AD consistently attains the highest F1 and AUROC, demonstrating improved robustness in anomaly localization and classification. Although the Anomaly Transformer also performs strongly, our method's integration of multi-scale convolution, T5-style relative positional bias, and SimCLR-based pretraining enables more effective modeling of both short transients and long-term drifts, thereby reducing fragmented alarms in dynamic flight trajectories.
Across these scenarios, MSCA-AD achieves an average F1 of 0.83, consistent with our controlled-dataset experiments. Notably: Mixed Faults: Balanced precision and recall (0.82/0.80) indicate the model's ability to disentangle multiple simultaneous anomalies. Hard Landing: Highest F1 (0.85) reflects the efficacy of multi-scale convolutional filters in capturing the abrupt, high-magnitude shock signatures. Aggressive Pitch Maneuver: Slightly lower precision (0.78) but strong recall (0.82) suggest that Top-3 aggregation effectively identifies brief, high-amplitude deviations while maintaining false-alarm control. Dolphin Jump: Replicates high F1 (0.85), demonstrating that self-attention with relative positional bias and contrastive pretraining generalize to detect cyclic vertical oscillations.
The narrow F1 variance (0.80–0.85) across diverse anomaly types underscores MSCA-AD's robustness: its combination of local feature extraction, global context modeling, and event-level aggregation reliably identifies heterogeneous flight anomalies without scenario-specific tuning.
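As a consistency check on the figures quoted above, the per-scenario F1 values can be combined into the reported average. For Mixed Faults and Aggressive Pitch Maneuver only precision and recall are quoted in the text, so their F1 is derived here from the harmonic mean; that derivation is an assumption about how the tabulated values were obtained.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

scenario_f1 = {
    "Mixed Faults": f1_score(0.82, 0.80),                # derived from quoted P/R
    "Hard Landing": 0.85,                                # quoted directly
    "Aggressive Pitch Maneuver": f1_score(0.78, 0.82),   # derived from quoted P/R
    "Dolphin Jump": 0.85,                                # quoted directly
}
mean_f1 = sum(scenario_f1.values()) / len(scenario_f1)
```

The derived values fall at roughly 0.81 and 0.80, inside the stated 0.80–0.85 band, and the mean rounds to the reported 0.83.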
We presented MSCA-AD, an intelligent anomaly-detection framework for multivariate telemetry that unifies multi-scale convolution, self-attention with relative positional bias, Top-k peak aggregation, and self-supervised pretraining. On the NASA SMAP benchmark, the event-level F1 of the model variants moves from 0.37 (baseline autoencoders) to 0.48 (multi-scale + RelPos), 0.83 (Top-3 aggregation), 0.62 (masked-AE), and ultimately 0.97 (SimCLR), demonstrating high detection accuracy and low false-alarm rates.
By fusing local and global features and leveraging contrastive self-supervision, MSCA-AD realizes intelligent decision-making under uncertainty. Its rule-based event aggregation effectively consolidates anomaly windows into coherent events, enhancing both precision and recall.
However, the multi-stage training pipeline entails significant computational cost, and the framework's robustness to extreme noise, heterogeneous sensor modalities, and hyperparameter choices (e.g., mask ratio, k) requires further validation.
Future research will explore:
Integration with Time-Series Foundation Models: Recent large-scale pretrained models such as TimeGPT and Time-Series Transformer (TST) provide generalizable temporal embeddings learned from millions of sequences. Integrating such pretrained backbones with the MSCA-AD framework offers two potential advantages: (1) replacing the Transformer encoder in our model with a frozen or partially fine-tuned TimeGPT/TST module could further improve representation quality under scarce data regimes; and (2) the contrastive pretraining stage could be reformulated as domain-adaptive fine-tuning of pretrained temporal embeddings, reducing pretraining cost while retaining cross-domain generalization. We plan to explore this hybrid direction in future work, focusing on pretrained temporal representation reuse combined with our multi-scale convolutional fusion and Top-3 aggregation mechanisms.
Fuzzy-Logic Integration: Inspired by human perception of uncertainty in telemetry interpretation, a promising direction is to incorporate fuzzy logic into the anomaly aggregation process. One feasible approach is fuzzy-weighted Top-3 fusion, where each reconstruction-error peak is assigned a membership degree:
Lightweight, online deployment: Although this work prioritizes detection performance during the research phase, MSCA-AD is well suited to engineering deployment. First, the parallel multi-scale convolution branches are amenable to structured channel or kernel sparsity, architecture pruning, and low-rank factorization at inference time, and therefore to standard model compression techniques (e.g., quantization, pruning). Second, because SimCLR pretraining yields robust, general temporal features, an edge deployment strategy of “frozen encoder + lightweight decoder” becomes feasible and could reduce inference latency several-fold while maintaining detection performance. Furthermore, future development could combine incremental learning or online threshold adaptation mechanisms to achieve continual learning and real-time alarm updates on resource-constrained devices.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the General Aviation Safety Operation Intelligent Research Institute, Digital Operation Safety Solution and Demonstration for China’s Civil Aviation Pilot Training System, Digital Twin of Flight Scenarios for Accident Investigation, Integrated Digital Operation System for Civil Aviation Pilot Training, (grant number TD2025DZ06, ASSA2024/30, FZ2022ZX55, 24CAFUC08003).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
