Cross-Subject Event-Related Potential Classification via Multi-View Based Contrastive Learning

Abstract

Background:

Event-related potentials (ERPs) provide implicit feedback and error-correction signals that are valuable for brain–computer interfaces (BCIs). However, models trained on source-domain subject data are vulnerable to inter-subject variability and acquisition noise, which substantially degrades generalization to unseen subjects.

Objective:

We propose a multi-view contrastive learning domain generalization (MVCLDG) method to improve cross-subject generalization in ERP recognition by jointly exploiting discriminative feature extraction and domain-invariant representation learning.

Methods:

MVCLDG employs a multi-view feature-extraction module that fuses raw electroencephalography with phase information derived from the Hilbert transform via multi-scale inception blocks, thereby capturing both amplitude and phase features. The model then applies domain-alignment and contrastive-learning constraints to reduce distributional discrepancy across domains, compact within-class representations, and enlarge between-class separability. The approach was evaluated on a public Error-Related Negativity (ERN) dataset and a self-collected semantic–syntactic violation dataset; performance was assessed in cross-subject settings, and ablation and visualization analyses were conducted to probe the contributions of components and neurophysiological interpretability.

Results:

MVCLDG outperformed baseline and representative domain generalization methods in cross-subject ERP recognition without requiring additional target-domain adaptation. Ablation experiments confirmed the effectiveness of each component. Eigen-class activation map (Eigen-CAM) visualizations indicate consistency between the model-attended electrodes and known neurophysiological scalp patterns, supporting both the model’s generalization mechanism and its biological interpretability.

Conclusions:

MVCLDG offers an effective strategy for integrating phase-aware multi-view feature mining with contrastive domain generalization, yielding improved and interpretable cross-subject ERP recognition. The method advances the feasibility of ERP-based closed-loop BCIs that generalize across users.

Impact Statement

This study introduces a novel multi-view contrastive learning domain generalization (MVCLDG) framework. By integrating raw electroencephalography (EEG) with Hilbert-transformed EEG as complementary views, MVCLDG captures joint amplitude–phase features and, via multi-view contrastive learning, extracts domain-invariant representations. Together, these mechanisms improve cross-subject generalization by integrating discriminative feature mining with domain-invariant representation learning. In addition, we constructed a Chinese semantic–syntactic violation event-related potential (ERP) dataset, thereby addressing a critical gap in language-related ERP resources. This work not only provides new directions for improving cross-subject ERP recognition but also lays the groundwork for generalizing ERP-based closed-loop brain–computer interfaces across users.

Keywords

BCI; contrastive learning; domain generalization; ERP; multi-view learning

Introduction

Brain–computer interfaces (BCIs) establish a direct communication pathway between neural activity and external devices by decoding brain signals into executable commands (Chaudhary et al., 2016). This technology enables the use of neural signals in applications such as neurorehabilitation, rehabilitation engineering, and emotion recognition (Nicolas-Alonso and Gomez-Gil, 2012). Owing to the noninvasiveness, cost-effectiveness, and high temporal resolution of electroencephalography (EEG) (Holz et al., 2013), most BCI systems are developed based on EEG signals.

The performance of traditional EEG-based BCI systems relies heavily on the user’s long-term training and ability to adapt, a process that is typically slow and inefficient. The core challenge is that such systems lack an efficient intrinsic feedback and correction mechanism, preventing them from optimizing in real time using error signals, as occurs in motor learning. ERPs are phase-locked EEG responses elicited by specific stimuli, characterized by a high signal-to-noise ratio and distinct features (Polich, 2007). When users perceive an error generated by the BCI system, the brain automatically produces specific ERP components, such as error-related negativity, without requiring any overt behavioral response. This neural response, which reflects the brain’s intrinsic mechanisms for conflict monitoring and error detection, offers a transformative opportunity for BCI systems to achieve closed-loop adaptive correction (Mak and Wolpaw, 2009).

With the rapid progress of deep learning, the adoption of deep neural networks in EEG analysis has opened new avenues for ERP research. Compared with conventional approaches, models such as EEGNet and the Lightweight Multi-Dimensional Attention Network, which integrate convolutional neural networks (CNNs) and attention mechanisms, have demonstrated superior performance in ERP recognition (Lawhern et al., 2018; Miao et al., 2023). However, these general-purpose EEG model architectures remain limited in addressing two challenges that are specific to ERP signal recognition:

1. Insufficient feature extraction: Compared with conventional EEG signals, valid ERP trials are shorter in duration, and informative content is often concentrated within a specific post-stimulus interval. Moreover, ERP signals exhibit phase-locking and waveform-specific properties. Generic architectures, such as EEGNet, primarily rely on spatial–temporal convolutions of raw EEG signals and fail to explicitly model and efficiently integrate ERP-discriminative features, including phase-related information and peak density.

2. High inter-subject variability: ERP signals exhibit substantial variability across individuals, which arises from differences in signal noise, cognitive state, and fatigue level. Such variability leads to feature distribution shifts between the source and target subject domains, thereby reducing the generalization performance of classification models trained on source subjects. Although models such as EEGNet demonstrate strong performance on within-subject data, their architectures are not explicitly designed to extract domain-invariant representations shared across subjects, which limits their generalization capability when confronted with cross-subject distribution shifts.

To overcome the limitations of existing ERP recognition methods in feature extraction and inter-subject variability, we propose a multi-view contrastive learning domain generalization (MVCLDG) approach that extracts domain-invariant features from two complementary views: raw EEG signals and Hilbert-transformed EEG (HT-EEG) signals. The main contributions of this work are as follows:

1. We use raw EEG signals and HT-EEG signals as two input views to integrate phase information with amplitude information, thereby addressing the limitation of insufficient feature extraction in existing methods.

2. We develop the MVCLDG model for multi-view EEG signals, incorporating a multi-scale spatiotemporal feature extraction module and a contrastive learning-based domain-invariant feature extraction module to enhance cross-subject recognition performance.

3. We conduct an experiment involving the recognition of sentences with semantic and syntactic violations to construct the violation dataset, thereby addressing the lack of Chinese semantic–syntactic recognition datasets and further demonstrating the practical effectiveness of our recognition algorithm.

4. We analyze the effectiveness of the MVCLDG model by visualizing intermediate features and identifying task-relevant domain-invariant ERP components.

Related Work

ERP signal analysis

Extensive research on ERP detection has primarily focused on developing effective recognition algorithms (Lotte et al., 2007). Traditional approaches, such as support vector machines (Müller et al., 2018) and linear discriminant analysis (Lotte et al., 2007; Krusienski et al., 2008), have been widely applied in ERP analysis. These methods classify samples by constructing high-dimensional feature spaces, showing particularly strong performance on small datasets. However, as dataset size increases and user diversity grows, these traditional methods often suffer from performance degradation, especially in scenarios characterized by pronounced individual differences. To overcome these limitations, researchers have developed advanced algorithms. For instance, xDAWN-based methods enhance ERP features to improve the signal-to-noise ratio, thereby increasing classification accuracy (Barachant et al., 2012; Rivet et al., 2009). In addition, Riemannian geometry–based ERP classifiers exploit geometric representations to capture the intrinsic structure of the data, demonstrating promising results on complex datasets (Barachant et al., 2012; Congedo et al., 2017).

With the rapid advancement of deep learning, significant progress has been achieved in pattern recognition and feature engineering (LeCun et al., 2015). Deep neural networks, particularly convolutional neural networks, have been extensively used to extract temporal and spatial information from EEG signals, yielding remarkable results (Craik et al., 2019; Roy et al., 2019). Classical architectures such as Deep ConvNets employ deep spatiotemporal convolutional layers for EEG feature extraction (Schirrmeister et al., 2017). Building on this foundation, EEGNet introduced separable convolutions to design a compact yet efficient network, which has achieved excellent results across multiple experiments and has become a benchmark model in EEG processing (Lawhern et al., 2018). Subsequently, EEG-TCNet reduced the number of trainable parameters, enabling greater efficiency on resource-constrained devices (Ingolfsson et al., 2020). Santamaría-Vázquez et al. further proposed EEG-Inception, which integrates the Inception module from computer vision and achieved significant improvements in ERP recognition (Santamaría-Vázquez et al., 2020).

Despite these advances, unresolved challenges remain in ERP research, which continue to limit the performance of recognition tasks using such data. We focus on two key challenges:

1. Compared with conventional EEG signals, ERP signals offer a higher signal-to-noise ratio but a substantially shorter effective duration, which requires efficient extraction of discriminative features within a limited temporal window. However, existing methods have not sufficiently leveraged the rich phase information, specific waveforms, and other attributes embedded in ERP signals.

2. ERP signals also exhibit pronounced cross-subject variability, arising from differences in noise levels, cognitive states, fatigue, and physiological conditions. Such factors cause domain shifts when models trained on source subjects are applied to target subjects, thereby substantially impairing generalization performance.

Multi-view learning

Multi-view learning (MVL) aims to exploit complementary information provided by different views of the same phenomenon. In MVL, data are typically represented as multiple views that integrate different modalities or feature sets (Zhang et al., 2020a). MVL was initially applied to multimedia data, which commonly originate from heterogeneous sources such as text, video, and audio or are characterized by multiple representations (e.g., time-domain and frequency-domain features) (Zhang et al., 2020b; Li et al., 2022; Tang et al., 2020). Existing studies indicate that fusing diverse views to leverage complementary information can improve model accuracy (Tao et al., 2020; Tang et al., 2023). This benefit has been demonstrated across tasks. For example, in text classification, the Multimodal Bitransformer model fuses textual and visual views to leverage image features associated with accompanying text (Kiela et al., 2020). In the image domain, Tian et al. proposed a multi-view contrastive learning framework called Contrastive Multiview Coding, which learns compact representations for raw images and optical flow to maximize the mutual information between different views of the same scene (Tian et al., 2020).

Recently, several studies have extended MVL to EEG signal analysis. Ye et al. constructed temporal and spectral views of EEG signals and exploited their complementarity to mine additional positive pairs, thereby improving contrastive learning performance (Ye et al., 2022). An et al. proposed an amplitude-temporal dual-view fusion method for temporal-feature learning in automatic sleep staging (An et al., 2024a). Zhao et al. investigated the relationship between temporal and time–frequency views and developed a self-supervised multi-view representation learning framework for sleep-stage classification (Zhao et al., 2024). However, existing multi-view approaches in the EEG domain have not been extended to ERP recognition. A robust view-generation and fusion strategy tailored to ERP signals is still lacking.

Domain generalization

Owing to the pronounced inter-subject variability in EEG signals, numerous studies have applied domain generalization (DG) approaches to cross-subject classification tasks. These approaches aim to learn domain-invariant representations from source domains and exploit neural networks to capture shared inter-subject features, thereby improving classification performance (Santamaría-Vázquez et al., 2019). Within existing DG research, Pan et al. employed prototype learning to extract domain-level feature prototypes, which were then used to predict outcomes for new subjects, achieving promising results in zero-shot learning (Pan et al., 2023; An et al., 2024b). Domain-adversarial neural networks (DANNs), which learn domain-invariant representations through adversarial training between a discriminator and a generator, have also demonstrated substantial utility in DG tasks. Li et al. introduced the bi-hemisphere DANN, which incorporates one global and two local domain discriminators to extract emotion-discriminative features specific to each hemisphere (Li et al., 2021b). More recently, contrastive learning has been applied to DG tasks. Shen et al. developed a CNN-based contrastive learning framework that outperformed existing DG methods in cross-subject emotion recognition, thereby demonstrating the effectiveness of contrastive learning strategies in DG, particularly in cross-subject scenarios (Shen et al., 2023).

As an unsupervised learning paradigm, contrastive learning seeks to derive more discriminative feature representations by maximizing similarity between positive pairs while simultaneously minimizing similarity between negative pairs. Contrastive learning has demonstrated strong performance across multiple domains, including computer vision (Chen et al., 2020), natural language processing (NLP; Devlin et al., 2019), and bioinformatics (Li et al., 2021a; Liu et al., 2021). Initial applications of contrastive learning in the EEG domain typically adopted frameworks originally designed for computer vision or NLP. A representative example is the work of Mohsenvand et al., who adapted the Simple Framework for Contrastive Learning of Visual Representations (SimCLR; Chen et al., 2020) to model similarities among multiple augmented views of the same raw EEG data (Mohsenvand et al., 2020), achieving favorable performance across several downstream tasks, including sleep stage classification, clinical anomaly detection, and emotion recognition.

In the context of cross-subject EEG recognition, Shen et al., inspired by inter-subject correlation studies (Hasson et al., 2004; Dmochowski et al., 2012), proposed a CNN-based contrastive learning framework designed to extract domain-invariant features. Without relying on extensive external data, their method achieved state-of-the-art performance in cross-subject emotion recognition (Shen et al., 2023). Deng et al. proposed the multi-source contrastive learning transfer model, which leverages contrastive learning to capture interindividual variability and incorporates domain adaptation and multi-domain feature learning to enhance model generalization across subjects (Deng et al., 2024). Zhi et al. introduced a supervised contrastive learning-based domain generalization network for cross-subject motor imagery and motor execution decoding, which extracts both domain-invariant and class-relevant discriminative representations from multiple EEG frequency bands (Zhi et al., 2025). Collectively, these studies further underscore the potential of contrastive learning in cross-subject EEG signal recognition.

Methods

In this section, we present the proposed MVCLDG framework, as illustrated in Figure 1. First, the Hilbert transform is applied to the raw ERP signals to obtain HT-EEG signals, thereby capturing richer feature information. Next, we design an efficient dual-path, multi-scale spatiotemporal convolutional module to learn class-related information from both the raw EEG and HT-EEG views. Contrastive learning is then performed on the dual-view and fused-view features. The subsequent subsections provide a detailed description of each component.

FIG. 1.

An overview of the MVCLDG architecture, which comprises three main components: the signal processing block, the feature encoder, and the contrastive learning block. The signal processing block generates mini-batches from the raw signals for contrastive learning and applies the Hilbert transform to Raw-EEG to construct the HT-EEG view. The feature encoder extracts spatiotemporal representations from both views. Subsequently, the contrastive learning block learns domain-invariant representations. Finally, the network is jointly optimized using three objectives: the classification loss ( $L_{C}$ ), the domain alignment loss ( $L_{D A}$ ), and the contrastive learning loss ( $L_{C L}$ ) across different views. EEG, electroencephalography; MVCLDG, multi-view contrastive learning domain generalization.

Signal transformation

ERPs exhibit phase locking; that is, under specific stimulus conditions, participants’ EEG responses show a high degree of phase consistency (Helfrich and Knight, 2019). Given the importance of phase information for ERP recognition, we exploit phase as one of the bases for constructing multi-view representations. When using phase information, reliance on phase-synchronization indices (e.g., phase-locking value and phase-lag index) can oversimplify the information and may yield suboptimal performance (Moon et al., 2018; Wang et al., 2019). Therefore, we use both raw EEG and HT-EEG as two complementary input views to the network so as to exploit phase information more fully.

The Hilbert transform is a widely used method for extracting phase information from raw EEG. A two-view architecture enables us to retain amplitude information from the raw EEG while simultaneously deriving phase information from the HT-EEG view (Kim and Im, 2024). Specifically, the Hilbert transform is a mathematical operation that maps a real-valued signal to a signal that is orthogonal to the original. For a real-valued signal x(t), the Hilbert transform H(x(t)) is defined as:

H (x (t)) = \frac{1}{π} \, p.v. \int_{- \infty}^{+ \infty} \frac{x (τ)}{t - τ} \, d τ

(1)

where “p.v.” denotes the Cauchy principal value. Equivalently, the Hilbert transform can be expressed as the convolution of x(t) with the kernel $\frac{1}{π t}$, that is,

H (x (t)) = \frac{1}{π t} * x (t)

(2)

H(x(t)) is itself real-valued. Combining it with the original signal gives the complex-valued analytic signal z(t) = x(t) + iH(x(t)), whose real part is the original signal x(t) and whose imaginary part is the Hilbert-transformed signal (denoted here as x’(t)). The magnitude and argument (phase) of z(t) correspond to the instantaneous amplitude and phase, respectively; thus, both amplitude and phase information of the original EEG signal are preserved. Unlike other approaches (e.g., wavelet transforms) that require a predefined time window, the Hilbert transform is data driven and computationally efficient, rendering it suitable for the real-time constraints of online BCI systems. By employing the Hilbert transform, our CNN is expected to capture both amplitude and phase features from EEG data, thereby improving the model’s capacity to extract class-discriminative features.
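The construction of the two views can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `hilbert_view`, the FFT-based computation, and the synthetic 8 Hz test signal are assumptions chosen for illustration (library routines such as `scipy.signal.hilbert` compute the same analytic signal). The sketch builds the analytic signal in the frequency domain and keeps its imaginary part as the HT-EEG view.

```python
import numpy as np


def hilbert_view(eeg):
    """Return H(x(t)) for each channel of a real array shaped (channels, samples)."""
    n = eeg.shape[-1]
    spectrum = np.fft.fft(eeg, axis=-1)
    # Analytic-signal filter: keep DC (and Nyquist for even n), double the
    # positive frequencies, and zero the negative frequencies.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spectrum * gain, axis=-1)
    # The real part reproduces the input; the imaginary part is H(x(t)).
    return analytic.imag


# Synthetic single-channel example (assumed values): 1.25 s at 128 Hz.
fs = 128
t = np.arange(160) / fs
raw_view = np.cos(2 * np.pi * 8 * t)[np.newaxis, :]  # raw EEG view
ht_view = hilbert_view(raw_view)                     # HT-EEG view

# Instantaneous amplitude and phase recovered from the two views.
amplitude = np.hypot(raw_view, ht_view)
phase = np.arctan2(ht_view, raw_view)
```

For this cosine input the HT-EEG view is the corresponding sine, so the pair of views carries the same information as the instantaneous amplitude and phase.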

Multi-view contrastive learning domain generalization

Dual-view feature encoder

The ability of an EEG feature encoder to extract discriminative representations underpins robust performance in cross-subject scenarios. Motivated by the multi-branch design of Inception blocks, we design a dual-view feature encoder to learn discriminative representations. Specifically, the encoder comprises two parallel sub-encoders—one for the raw EEG view and one for the HT-EEG view. Each view is processed by a base feature encoder with an identical Inception-style architecture; the two independent extraction paths preserve view-specific information.

The dual-view design allows the model to retain amplitude information from the raw EEG while exploiting phase information provided by the HT-EEG view. In an Inception module, the input from the previous layer is routed through parallel paths consisting of multi-scale convolutions and pooling operations to produce features, which are subsequently concatenated into a single output. In MVCLDG, the raw EEG and HT-EEG views correspond to one another and are processed by parallel, Inception-like feature-extraction paths operating at the view level. Concretely, the feature-extraction pipeline comprises two EEG inputs (raw EEG and HT-EEG), parallel input paths, and a subsequent filter-wise concatenation stage. Combining the real-valued raw EEG with its Hilbert transform yields a complex analytic signal whose real part is the original waveform and whose imaginary part is the Hilbert-transformed signal. We split this complex signal into two real-valued inputs: the real part (raw EEG) is used directly, and the imaginary part serves as the HT-EEG input. Each base feature encoder processes its respective raw EEG or HT-EEG input to produce output feature maps. Output feature maps from the two input paths are concatenated along the depth dimension; after this filter-wise concatenation, the depth dimension is effectively doubled.

Each base encoder comprises a multi-scale temporal convolutional block followed by a spatial convolutional block to extract spatiotemporal features. In the multi-scale temporal block, three branches run in parallel, each with a 2D temporal kernel of a different size: (1, T/2), (1, T/4), and (1, T/8). The different scales capture both transient events and rhythmic oscillatory patterns. Each branch produces D feature maps. Following the temporal convolutions, depthwise separable convolutions are applied to learn spatial patterns across EEG channels. To aggregate spatial information across input channels, 2D convolutional kernels of size (C, 1) are employed; channel grouping and weight constraints (cf. EEGNet) are used to promote spatial feature disentanglement. The outputs of the three branches are downsampled in time via average pooling with kernel size (1, 4) and stride (1, 4). All output feature maps are concatenated along the depth dimension to form the feature representation f, where $f \in R^{d \times C \times (T // 4)}$. Here $d$ denotes the depth (number of feature maps), $C$ the channel dimension, and $T // 4$ the temporal dimension after pooling. The feature representation f is subsequently used by the downstream classifier and domain-generalization modules.
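The shape bookkeeping of the base encoder can be checked with a small NumPy sketch. This is only an illustration under stated assumptions: the helper names are hypothetical, the learned convolutions are replaced by random arrays of the shape the text implies, "same" padding is assumed so that each branch keeps T time samples, and D = 8 is an arbitrary choice. C = 56 and T = 160 correspond to Dataset I (56 channels, 1.25 s at 128 Hz).

```python
import numpy as np


def temporal_kernel_sizes(T):
    """Kernel sizes of the three temporal branches for a trial of T samples."""
    return [(1, T // 2), (1, T // 4), (1, T // 8)]


def avg_pool_time(x, k=4):
    """Average-pool the last (time) axis with kernel k and stride k."""
    T = x.shape[-1]
    usable = (T // k) * k  # drop trailing samples that do not fill a window
    return x[..., :usable].reshape(*x.shape[:-1], T // k, k).mean(axis=-1)


C, T, D = 56, 160, 8  # D is a hypothetical number of feature maps per branch
kernels = temporal_kernel_sizes(T)

# Stand-ins for the three branch outputs, each with D feature maps.
rng = np.random.default_rng(0)
branches = [rng.standard_normal((D, C, T)) for _ in kernels]

# Pool each branch in time, then concatenate along the depth axis.
pooled = [avg_pool_time(b, k=4) for b in branches]
f = np.concatenate(pooled, axis=0)  # depth d = 3 * D, time T // 4
```

With these values the three temporal kernels span 80, 40, and 20 samples, and the pooled temporal length is 40.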

Domain-generalization module

To improve cross-subject performance, our model incorporates a domain-generalization module that learns class-related, subject-invariant features from ERP signals, thereby enhancing robustness to individual variability. The domain-generalization module comprises two components: a domain-alignment component and a contrastive-learning component, both intended to address distributional discrepancies between domains. The domain-alignment component minimizes distributional differences among source subjects, while the contrastive-learning component facilitates learning of domain-invariant representations.

The domain-alignment component is specifically designed to mitigate intersubject distributional variability. Unlike domain-adaptation approaches, which align source and target distributions using target data, domain generalization aims to achieve robustness without access to target-domain samples. Consequently, we perform alignment across pairs (or groups) of source subjects: the rationale is that representations that are invariant to source-domain perturbations are more likely to generalize to unseen target subjects (Zhou et al., 2023). To realize domain alignment, we integrate the Deep Correlation Alignment (CORAL) loss into our framework to align second-order statistics across distributions. Deep CORAL promotes domain-invariant feature extraction by aligning the covariance matrices of the learned feature representations. Let $f (x) \in R^{d}$ denote the feature representation produced by the network for input x, where d is the feature dimensionality. For two source domains s1 and s2, let $C_{s 1}$ and $C_{s 2}$ denote the empirical covariance matrices of their feature sets. The CORAL loss between s1 and s2 is defined as:

L_{CORAL} = \frac{1}{4 N^{2}} \| C_{s 1} - C_{s 2} \|_{F}^{2}

(3)

where N denotes the number of source domains and $\| \cdot \|_{F}$ denotes the Frobenius norm. By minimizing $L_{CORAL}$ jointly with the other objectives, the network aligns second-order statistics across domains. With N source domains, we perform global alignment by averaging pairwise CORAL losses:

L_{D A} = \frac{1}{N \times (N - 1) / 2} \sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} L_{CORAL} (C_{s i}, C_{s j})

(4)
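Equations (3) and (4) can be sketched in NumPy as follows. The function names and the random test features are assumptions for illustration, and the sketch follows the normalization written in Equation (3), with N taken as the number of source domains. For comparison, the original Deep CORAL formulation normalizes by the feature dimension, 1/(4d^2); the `n_domains` argument is kept explicit so that either constant can be passed.

```python
from itertools import combinations

import numpy as np


def coral_loss(f_a, f_b, n_domains):
    """Equation (3): scaled squared Frobenius distance between covariances.

    f_a, f_b: feature matrices shaped (samples, d) from two source subjects.
    """
    c_a = np.cov(f_a, rowvar=False)  # (d, d) empirical covariance
    c_b = np.cov(f_b, rowvar=False)
    return np.sum((c_a - c_b) ** 2) / (4.0 * n_domains ** 2)


def domain_alignment_loss(features_by_subject):
    """Equation (4): mean CORAL loss over all N(N-1)/2 subject pairs."""
    n = len(features_by_subject)
    pairs = list(combinations(range(n), 2))
    total = sum(
        coral_loss(features_by_subject[i], features_by_subject[j], n)
        for i, j in pairs
    )
    return total / len(pairs)


# Three hypothetical source subjects with 50 feature vectors of dimension 4;
# the third subject has a different scale, hence a different covariance.
rng = np.random.default_rng(0)
subject_features = [rng.standard_normal((50, 4)) * s for s in (1.0, 1.0, 3.0)]
l_da = domain_alignment_loss(subject_features)
```

The loss is zero when all source subjects share the same feature covariance and grows as their second-order statistics diverge.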

Contrastive learning aims to obtain domain-invariant representations by pulling together representations of samples from the same class (positive pairs) while pushing apart representations from different classes (negative pairs) in the latent space. To this end, a feature projection head is inserted between the base encoder and the contrastive loss to better shape representations for downstream prediction. The projection head architecture popularized by SimCLR (a nonlinear multilayer perceptron, MLP) is widely used because it helps the base encoder learn improved representations. However, because ERP datasets are relatively small, MLP-based projection heads tend to be over-parameterized and prone to overfitting. Motivated by modular architectural design, we implement the projection head using convolutional units analogous to those in the base encoder. The projection head comprises two stages: (1) multi-scale temporal convolutions with kernels (1, T/8), (1, T/16), and (1, T/32) and (2) a depthwise (channel-wise) convolution stage with kernel size (2D, 1) to model relationships across channels. This design reduces parameter count while preserving spatiotemporal feature extraction capacity. Each convolutional stage is followed by batch normalization (for feature standardization), an ELU activation (for nonlinearity), and appropriate regularization (e.g., dropout) to mitigate overfitting. After projection, features are vectorized so that inner products in the projection space can be used to measure similarity. The projection head is discarded after training; at inference time, only the feature encoder and the classifier C are used for ERP recognition.

Contrastive losses

The MVCLDG framework defines loss functions for the raw EEG view, the HT-EEG view, and the fused view. For each view, three objectives are jointly optimized: the classification loss $L_{C}$, the domain-alignment loss $L_{DA}$, and the contrastive-learning loss $L_{CL}$.

The classification loss is the cross-entropy loss, which minimizes the discrepancy between predicted and ground-truth labels. For a batch of N samples, the cross-entropy loss is defined as:

L_{C} = - \frac{1}{N} \sum_{i = 1}^{N} y_{i} \log (y_{i}^{'})

(5)

Here $y_{i}$ denotes the ground-truth label indicator for input $X_{i}$ , $y_{i}^{'}$ denotes the model’s predicted probability, and N is the batch size.

To construct effective batches for contrastive learning, we employ a stratified sampling strategy. Let the source domain consist of n subjects. For each subject, k samples are randomly and uniformly selected from each of the C experimental classes, resulting in a total batch size of N = n × C × k. This design ensures strict class balance within each batch and avoids sample duplication. The proposed strategy mitigates potential bias caused by class imbalance and encourages each batch to encompass diverse intersubject variations, thereby facilitating the learning of domain-invariant representations.
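The stratified sampling strategy can be sketched as follows. The function name `stratified_batch` and the toy trial counts are assumptions for illustration; the sketch draws k trials without replacement for every (subject, class) cell, which yields a batch of n × C × k indices.

```python
import numpy as np


def stratified_batch(subjects, labels, k, rng):
    """Return trial indices with k trials per (subject, class) cell.

    Sampling is without replacement, so no trial is duplicated. rng.choice
    raises ValueError if a cell holds fewer than k trials.
    """
    subjects = np.asarray(subjects)
    labels = np.asarray(labels)
    batch = []
    for s in np.unique(subjects):
        for c in np.unique(labels):
            pool = np.flatnonzero((subjects == s) & (labels == c))
            batch.extend(rng.choice(pool, size=k, replace=False))
    return np.array(batch)


# Toy source domain (assumed sizes): n = 3 subjects, C = 2 classes,
# 10 trials per subject and class.
subject_ids = np.repeat(np.arange(3), 20)
class_ids = np.tile(np.repeat([0, 1], 10), 3)

rng = np.random.default_rng(0)
batch_idx = stratified_batch(subject_ids, class_ids, k=4, rng=rng)
```

Here the batch contains 3 × 2 × 4 = 24 trials, with every subject and class represented equally.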

Following the batch construction strategy described above, positive and negative sample pairs for contrastive learning are formally defined. A positive pair is formed by two projection vectors $p_{si}^{+}$ and $p_{sj}^{+}$ that belong to the same class but originate from different subjects si and sj. Samples from different classes are treated as negatives. In this way, each positive pair is contrasted against a set of negative examples. Features extracted from the raw EEG view (denoted $F_{raw}$ ) are mapped through a nonlinear projection head to obtain projected representations $P_{raw}$ , which are used for the contrastive objective. We employ the NT-Xent loss (as in SimCLR) to define the contrastive loss:

l (s_{i}^{+}, s_{j}^{+}) = - \log \frac{\exp (sim (p_{si}^{+}, p_{sj}^{+}) / τ)}{\sum_{k=1}^{2N} 1_{[k \neq i]} \exp (sim (p_{si}, p_{sk}) / τ)}

(6)

L_{CL} = \sum_{i = 1}^{N} \sum_{j \neq i, \, C_{i} = C_{j}}^{N} [l (s_{i}^{+}, s_{j}^{+}) + l (s_{j}^{+}, s_{i}^{+})]

(7)

In Equations (6) and (7), $τ$ is the temperature hyperparameter and $1_{[k \neq i]} \in \{0, 1\}$ is an indicator that excludes the anchor itself from the denominator. $sim (u, v) = \frac{u^{T} v}{\|u\| \|v\|}$ is the cosine similarity function. The total loss $L_{raw}$ of the raw EEG view is as follows:

L_{raw} = L_{C - raw} + λ_{1} L_{D A - raw} + λ_{2} L_{C L - raw}

(8)

The weight coefficients $λ_{1}$ and $λ_{2}$ are used to balance the contributions of each loss. Analogous computations are performed for the HT-EEG view and for the fused-view contrastive objective. The overall loss of the MVCLDG framework is as follows:

L_{total} = L_{raw} + L_{HT} + L_{fusion}

(9)

By jointly optimizing these objectives, the model converges toward representations that are both class-discriminative and domain-invariant.
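The contrastive objective of Equations (6) and (7) can be sketched in NumPy as follows. Several details are assumptions rather than the authors' settings: the function name, the temperature of 0.5, and the toy projections are illustrative; the denominator sums over all other samples in the batch (Equation (6) writes its upper limit as 2N); and positives are restricted to same-class samples from different subjects, as described in the text.

```python
import numpy as np


def contrastive_loss(p, labels, subjects, tau=0.5):
    """Sum of l(i, j) + l(j, i) over positive pairs, following Eq. (7).

    p: projection vectors shaped (N, dim), with N >= 2.
    Positives share a class label but come from different subjects.
    """
    p = np.asarray(p, dtype=float)
    labels = np.asarray(labels)
    subjects = np.asarray(subjects)

    z = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = z @ z.T / tau  # cosine similarity divided by temperature

    # Log of the denominator: sum over k != i of exp(sim(p_i, p_k) / tau),
    # computed with the log-sum-exp trick for numerical stability.
    not_self = ~np.eye(len(z), dtype=bool)
    masked = np.where(not_self, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))

    pair_loss = log_denom - logits  # pair_loss[i, j] = l(i, j)
    positive = (labels[:, None] == labels[None, :]) & (
        subjects[:, None] != subjects[None, :]
    )
    return float(np.sum((pair_loss + pair_loss.T)[positive]))


# Toy projections: two subjects, one trial per class each.
proj = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
subj = np.array([0, 0, 1, 1])
loss_aligned = contrastive_loss(proj, labels=[0, 1, 0, 1], subjects=subj)
loss_mismatched = contrastive_loss(proj, labels=[0, 1, 1, 0], subjects=subj)
```

When same-class projections from different subjects coincide (`loss_aligned`), the loss is lower than when the class labels disagree with the geometry (`loss_mismatched`), which is the behavior the objective is meant to reward. The per-view total in Equation (8) would then add this term, weighted by $λ_{2}$, to the classification and domain-alignment losses.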

Experiments and Results

This section evaluates the effectiveness of the proposed method on two ERP classification datasets, including one collected in-house. The experimental setup and results are described in detail.

Datasets

Dataset I (feedback error-related negativity)

In the Kaggle-ERN dataset, EEG signals were recorded from 26 healthy participants using 56 electrode channels placed according to the extended 10–20 system, at a sampling rate of 600 Hz (Margaux et al., 2012). During the experiment, participants were presented with a 6 × 6 matrix containing 36 alphanumeric characters to elicit P300 responses (Krusienski et al., 2008). This dataset is designed to identify neural responses associated with erroneous selections by analyzing post-feedback EEG signals. Each participant completed 340 trials. The median number of error-feedback trials was 115.5 (range: 20–199), whereas the median number of correct-feedback trials was 237.5 (range: 141–320), yielding a median error-to-correct trial ratio of 0.51 (range: 0.06–1.41). No demographic information (e.g., age or gender) is available in this public dataset. Prior to analysis, EEG data were band-pass filtered between 1 and 40 Hz using a finite impulse response (FIR) filter in EEGLAB (Delorme and Makeig, 2004) and then down-sampled to 128 Hz. EEG trials corresponding to correct and incorrect feedback were extracted from the [0, 1.25] s time window following feedback presentation and used as features for a binary classification task.

Dataset II (Chinese semantic–syntactic violation ERP)

The Chinese semantic–syntactic violation ERP (CSSV-ERP) dataset was collected in-house. EEG signals were recorded from 12 healthy participants using 64 electrodes placed according to the 10–20 system, at a sampling rate of 500 Hz. As illustrated in Figure 2, Chinese sentences were presented to participants one word at a time. Each word was displayed for 400 ms (stimulus), followed by a 100 ms blank screen (fixation). Participants completed the task under three conditions (Zhu et al., 2022): (1) correct sentences (CORR), that is, well-formed subject–verb–object sentences; (2) local syntactic violation (SYN-P), in which a degree adverb that normally modifies an adjective is erroneously inserted before the object noun, rendering the phrase syntactically ill-formed; (3) semantic violation (SEM), in which the object conflicts with the verb’s selectional restrictions. This dataset aims to probe participants’ cognitive processing of semantic and syntactic violations by analyzing EEG signals recorded after participants’ button presses. In BCIs that select characters by detecting users’ cognitive responses, semantic processing (N400) can serve either as an additional control signal or as a monitoring indicator of the appropriateness of system instructions. Each participant completed a total of 192 trials, with 64 trials per stimulus category. Thirteen participants were initially recruited for this study; one was excluded because excessive EEG artifacts affected more than 25% of the total trials. The final sample consisted of 10 males and 2 females aged 22–26 years, all of whom held a bachelor’s degree. Before analysis, EEG recordings were band-pass filtered between 0.1 and 60 Hz using an FIR filter in EEGLAB. Independent component analysis was applied to remove ocular and muscular artifacts, and the data were re-referenced to TP9 and TP10. 
After preprocessing, a median of 68.5 correct trials (range: 54–63), 61.5 semantic violation trials (range: 48–64), and 61.5 syntactic violation trials (range: 47–63) were retained per subject. EEG segments corresponding to semantic and syntactic violations were extracted from the [−0.15, 1] s time window relative to the critical word onset (first object noun position) and used as features for a multi-class classification task, with baseline correction applied using the −150 to 0 ms interval. Detailed information about the CSSV-ERP dataset can be found in the dataset card (Table 1).
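The epoching and baseline-correction step for Dataset II can be illustrated with NumPy. This is a minimal sketch under stated assumptions: `epoch_with_baseline` and its arguments are hypothetical names, the onset is given as a sample index, and the recording is a (channels, samples) array at 500 Hz.

```python
import numpy as np


def epoch_with_baseline(eeg, onset_idx, fs=500, tmin=-0.15, tmax=1.0):
    """Extract the [tmin, tmax] s epoch around onset_idx and subtract the pre-onset mean."""
    start = onset_idx + int(round(tmin * fs))   # 75 samples before the critical word
    stop = onset_idx + int(round(tmax * fs))    # 500 samples after the critical word
    epoch = eeg[:, start:stop].astype(float)
    n_base = onset_idx - start                  # length of the -150 to 0 ms interval
    baseline = epoch[:, :n_base].mean(axis=1, keepdims=True)
    return epoch - baseline                     # per-channel baseline correction
```

Each resulting epoch has 64 channels and 575 samples, and the mean of every channel over the baseline interval is zero by construction.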

FIG. 2.

Example stimuli from the semantic–syntactic violation paradigm. (a) Words highlighted in the corresponding colors indicate the keywords analyzed. Correct sentences (CORR, red); local syntactic violations (SYN-P, orange); semantic violations (SEM, blue). (b) Black bars indicate word presentation duration (400 ms); purple bars denote fixation duration (100 ms).

Table 1.

Dataset Card for Chinese Language Event-Related Potential Dataset

Item	Description
Purpose	To investigate the neural mechanisms underlying Chinese semantic (N400) and syntactic (P600) violations, providing benchmark data for developing language cognition–based brain–computer interfaces.
Subjects	12 healthy right-handed subjects (10 male, 2 female), aged 22–26 years, all holding bachelor’s degrees.
Paradigm and conditions	Single-word sequence presentation. Three conditions: correct sentences, local syntactic violation sentences, and semantic violation sentences, with 64 trials per condition per subject.
Acquisition parameters	• Device: Brain Products actiCHamp Plus • EEG cap model: Brain Products actiCAP slim/snap • Electrode montage: 64 channels, International 10–20 system • Sampling rate: 500 Hz • Key markers: S1/S2/S4 (stimulus presentation for correct/semantic violation/syntactic violation conditions), S7/S8 (correct/error response)
Preprocessing pipeline	1. 0.1–60 Hz band-pass filtering 2. ICA for removal of ocular and muscular artifacts 3. Re-referencing to TP9 and TP10 4. Epoch extraction [−0.15, 1] s, baseline correction based on [−0.15, 0] s interval
Recommended data split	Leave-one-subject-out cross-validation, to strictly evaluate cross-subject generalization capability.
Access	DOI: https://doi.org/10.5281/zenodo.18362005

EEG, electroencephalography; ICA, independent component analysis.

Experiment details

Network settings

The code was implemented in PyTorch (Paszke et al., 2019) and evaluated on an NVIDIA RTX 5000 GPU. The training objective combined cross-entropy loss, domain-alignment loss, and contrastive loss. The maximum number of training epochs was set to 100. The learning rate was selected via grid search from the set {1e-3, 1e-4, 1e-5}, with 1e-3 yielding the best performance. Similarly, the dropout rate was chosen from {0.1, 0.25, 0.5} and set to 0.5. During training, each mini-batch was required to include samples from every class of each source subject; therefore, the batch size was determined by the number of source-domain subjects and stimulus categories (batch size = number of source subjects × the number of stimulus categories × 2). The Adam optimizer was used for optimization, and early stopping was applied to reduce training time and prevent overfitting. The loss weighting coefficients $λ_{1}$ and $λ_{2}$ were selected from the set {1, 0.1, 0.01}, with the final values set to $λ_{1}$ = 1 and $λ_{2}$ = 0.1. The temperature parameter for the contrastive loss was set to 0.07. This value was selected via grid search from the candidate set {0.05, 0.07, 0.1, 0.2, 0.5}. The random seed was fixed at 42 to ensure reproducibility. Our implementations are available at https://doi.org/10.5281/zenodo.18405199.
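The batch-size rule above can be made concrete with a small sampler. The sketch below is an assumption about how such a batch could be assembled, not the released code: `balanced_batch` is a hypothetical helper that draws two trials for every (source subject, class) pair, which gives 25 × 2 × 2 = 100 trials per batch on Dataset I and 11 × 3 × 2 = 66 on Dataset II.

```python
import numpy as np


def balanced_batch(subject_ids, labels, per_cell=2, rng=None):
    """Return trial indices with `per_cell` samples for every (subject, class) pair."""
    rng = np.random.default_rng(rng)
    subject_ids, labels = np.asarray(subject_ids), np.asarray(labels)
    batch = []
    for s in np.unique(subject_ids):
        for c in np.unique(labels):
            pool = np.flatnonzero((subject_ids == s) & (labels == c))
            # Assumes every subject has at least one trial per class;
            # sample with replacement only when a cell is smaller than per_cell.
            picks = rng.choice(pool, size=per_cell, replace=len(pool) < per_cell)
            batch.extend(picks)
    return np.array(batch)
```

Sampling per cell rather than uniformly over trials guarantees that every class of every source subject appears in each mini-batch, as the text requires.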

Comparison models

To validate our method, we compared it with standard deep-learning baselines and representative domain-generalization techniques. For deep-learning baselines, we used EEGNet (Lawhern et al., 2018), EEG-Inception (Santamaría-Vázquez et al., 2020), EEG-Conformer (Song et al., 2023), and HiReNet (Kim and Im, 2024): EEGNet was chosen for its established role as a compact benchmark architecture for EEG, while EEG-Inception adapts Inception-style multi-scale parallel convolutions to EEG data. EEG-Conformer integrates convolutional architectures commonly used in EEG analysis with transformer-based modeling. Building upon a ResCNN backbone, HiReNet incorporates raw EEG signals and their Hilbert-transformed counterparts as dual inputs. For domain generalization methods, we evaluated Mixup (Zhang et al., 2018), maximum mean discrepancy (MMD) (Li et al., 2018b), invariant risk minimization (IRM) (Arjovsky et al., 2020), meta-learning for domain generalization (MLDG) (Li et al., 2018a), the domain-adversarial neural network (DANN) (Ozdenizci et al., 2020), and SelfReg (Kim et al., 2021). Mixup performs convex interpolation between randomly selected sample–label pairs to synthesize training examples, thereby regularizing the decision boundary and improving cross-subject robustness. MMD minimizes distributional discrepancy between source domains in a reproducing kernel Hilbert space (RKHS), encouraging domain-invariant latent representations without access to target data. IRM enforces a shared classifier across domains that depends only on features exhibiting a stable causal relationship with labels, thus mitigating spurious EEG artifacts. MLDG splits source domains into “meta-train” and “meta-test” partitions and uses bi-level optimization to promote rapid adaptation to unseen domains. DANN employs a domain-adversarial training framework to learn subject-invariant representations across individuals. 
SelfReg incorporates contrastive and other regularization terms during self-supervised pretraining to align cross-domain feature distributions while preserving discriminative information.

Training procedure

This study focuses on cross-subject domain-generalization scenarios for deep models. To assess effectiveness and generalization, we adopt the leave-one-subject-out (LOSO) protocol. In each LOSO round, data from one subject are held out as the test set, and data from all remaining subjects are pooled as source data; the pooled source data are split into 80% training and 20% validation subsets. All trials from any given subject were strictly confined to either the training, validation, or test set, thereby preventing subject-level information leakage. After training for the maximum number of epochs, the model achieving the best performance on the validation subset is selected for final testing on the held-out subject. We report the average cross-subject test performance across all LOSO rounds.
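The LOSO protocol can be sketched as a generator of index splits. The text leaves open exactly how the 80/20 split is drawn; because it states that each subject's trials stay within a single subset, this sketch assumes a subject-level split in which roughly 20% of the source subjects form the validation set. The function name `loso_splits` and its parameters are hypothetical.

```python
import numpy as np


def loso_splits(subject_ids, val_fraction=0.2, seed=42):
    """Yield (train_idx, val_idx, test_idx) for each held-out subject."""
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    rng = np.random.default_rng(seed)
    for test_subj in subjects:
        source = rng.permutation(subjects[subjects != test_subj])
        n_val = max(1, int(round(val_fraction * len(source))))  # subject-level split (assumption)
        val_subj, train_subj = source[:n_val], source[n_val:]
        yield (
            np.flatnonzero(np.isin(subject_ids, train_subj)),
            np.flatnonzero(np.isin(subject_ids, val_subj)),
            np.flatnonzero(subject_ids == test_subj),
        )
```

Splitting by subject identifier rather than by trial is what rules out subject-level leakage between the three subsets.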

Results

We adopted the LOSO scheme, in which data from one subject were used for testing, while data from all other subjects were used for training the classifier, in order to evaluate the proposed method. The LOSO procedure resembles real-world cross-subject applications, where the model is trained on available data and validated on unseen test subjects. Since the number of target samples is much smaller than that of nontarget samples, the area under the receiver operating characteristic (ROC) curve (AUC) was selected as the evaluation metric for binary classification. The AUC results on Dataset I and Dataset II are presented in Tables 2 and 3, respectively. For the three-class classification task on Dataset II, the one-vs-rest strategy is used to compute the multi-class AUC.
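The one-vs-rest multi-class AUC can be written out explicitly. The sketch below computes each binary AUC as the probability that a positive trial receives a higher score than a negative one (ties count one half) and then averages over classes. The unweighted (macro) average is an assumption, since the text does not state the averaging scheme, and the function names are hypothetical.

```python
import numpy as np


def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def ovr_macro_auc(y_true, proba):
    """One-vs-rest AUC: treat each class as positive in turn, then average (macro)."""
    y_true, proba = np.asarray(y_true), np.asarray(proba, dtype=float)
    aucs = [binary_auc((y_true == k).astype(int), proba[:, k])
            for k in range(proba.shape[1])]
    return float(np.mean(aucs))
```

A chance-level classifier scores 0.5 under this metric regardless of class imbalance, which is why AUC is preferred over accuracy here.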

Table 2.

Classification Performance (Area Under the ROC Curve) on Dataset I (ERN)

Method	Deep learning				Domain generalization					Contrastive learning
Method	EEGNet	EEG-Inception	EEG-Conformer	HiReNet	Mixup	MMD	IRM	MLDG	DANN	SelfReg	Ours
sub1	0.5623	0.6752	0.6097	0.6891	0.5847	0.6612	0.6516	0.5847	0.6730	0.6959	0.6304
sub2	0.5397	0.5065	0.7383	0.5563	0.5189	0.4753	0.5239	0.5326	0.5125	0.519	0.6973
sub3	0.7082	0.703	0.7699	0.7294	0.7343	0.6775	0.6786	0.7428	0.7687	0.7131	0.8261
sub4	0.6872	0.725	0.705	0.6974	0.7018	0.6813	0.691	0.7191	0.6855	0.7307	0.754
sub5	0.5686	0.6056	0.5982	0.5918	0.5631	0.5882	0.5879	0.5889	0.6040	0.6099	0.5386
sub6	0.7585	0.7917	0.6645	0.7701	0.7748	0.7679	0.7705	0.7691	0.8189	0.7831	0.7986
sub7	0.5996	0.5641	0.6727	0.6179	0.6211	0.5295	0.5255	0.5818	0.6420	0.5752	0.6524
sub8	0.6917	0.6798	0.6598	0.6583	0.6892	0.6637	0.6639	0.6895	0.6450	0.687	0.7062
sub9	0.8068	0.7978	0.712	0.8329	0.7999	0.8074	0.8082	0.807	0.7958	0.8261	0.8394
sub10	0.6745	0.6621	0.6547	0.6219	0.6649	0.6663	0.6656	0.655	0.6467	0.6369	0.6658
sub11	0.6848	0.6825	0.6085	0.6567	0.6636	0.6314	0.6282	0.6579	0.6774	0.6802	0.7311
sub12	0.6717	0.7593	0.7798	0.8332	0.8253	0.7051	0.7226	0.8543	0.8342	0.8016	0.8611
sub13	0.8302	0.8403	0.8188	0.7901	0.8562	0.8197	0.8643	0.7755	0.8004	0.8504	0.8414
sub14	0.7596	0.7152	0.6271	0.693	0.7732	0.7371	0.7157	0.7759	0.7497	0.7422	0.7871
sub15	0.681	0.6945	0.5835	0.6366	0.6987	0.638	0.6395	0.7096	0.6602	0.7048	0.6862
sub16	0.5268	0.5334	0.5216	0.5367	0.5511	0.5898	0.591	0.5421	0.5990	0.543	0.5779
sub17	0.7030	0.6761	0.763	0.7336	0.6771	0.7062	0.6979	0.6667	0.7194	0.6887	0.7676
sub18	0.7526	0.8387	0.5832	0.7905	0.8001	0.7559	0.7043	0.7800	0.6998	0.8330	0.7586
sub19	0.8534	0.9067	0.9211	0.8615	0.8838	0.8575	0.8666	0.8950	0.9048	0.9117	0.9084
sub20	0.5158	0.5076	0.5247	0.5284	0.5445	0.5182	0.5095	0.5393	0.5697	0.5810	0.6112
sub21	0.7226	0.789	0.749	0.7747	0.8248	0.7329	0.6947	0.7979	0.8492	0.7732	0.7659
sub22	0.8120	0.7245	0.8918	0.7585	0.8124	0.8116	0.7569	0.8104	0.8280	0.7214	0.8739
sub23	0.8889	0.865	0.888	0.8957	0.9003	0.8899	0.903	0.9041	0.9052	0.8831	0.895
sub24	0.8766	0.8503	0.8377	0.8498	0.8412	0.8742	0.8072	0.8412	0.8813	0.8520	0.9016
sub25	0.7381	0.7463	0.6213	0.6709	0.7294	0.7356	0.6967	0.7288	0.5924	0.7595	0.7223
sub26	0.6872	0.7066	0.6549	0.7151	0.6976	0.6858	0.7018	0.6859	0.6907	0.7264	0.7078
Mean	0.7039	0.7133	0.6984	0.7112	0.7205	0.7003	0.6949	0.7167	0.7213	0.7242	0.7502
SD	0.1060	0.1082	0.1112	0.1023	0.1104	0.1081	0.1027	0.1093	0.1092	0.1044	0.1019

Values in bold indicate the best performance in each row.

DANN, domain-adversarial neural network; EEG, electroencephalography; IRM, invariant risk minimization; MLDG, meta-learning for domain generalization; MMD, maximum mean discrepancy; SD, standard deviation.

Table 3.

Classification Performance (Multi-Class Area Under the ROC Curve; Macro-F1) on Dataset II (Chinese Semantic–Syntactic Violation Event-Related Potential)

Method	Deep learning				Domain generalization					Contrastive learning
Method	EEGNet	EEG-Inception	EEG-Conformer	HiReNet	Mixup	MMD	IRM	MLDG	DANN	SelfReg	Ours
sub1	0.5398	0.5633	0.5177	0.5327	0.5464	0.5139	0.5441	0.5344	0.5985	0.5855	0.5277
sub2	0.5243	0.5213	0.5499	0.5007	0.5766	0.5563	0.5783	0.5592	0.5829	0.5818	0.6051
sub3	0.6919	0.7429	0.6441	0.6548	0.7152	0.6942	0.7303	0.7104	0.7430	0.6747	0.7219
sub4	0.5752	0.6619	0.5486	0.6748	0.5555	0.659	0.6559	0.5644	0.7162	0.7247	0.6908
sub5	0.5235	0.576	0.5011	0.5946	0.5513	0.5412	0.5427	0.5426	0.5956	0.6587	0.671
sub6	0.5713	0.5615	0.5506	0.5552	0.5563	0.5481	0.5539	0.5594	0.5919	0.5571	0.5822
sub7	0.5452	0.5116	0.5402	0.5929	0.6002	0.6082	0.6021	0.5906	0.5886	0.5794	0.622
sub8	0.6464	0.6298	0.6217	0.6121	0.5913	0.5908	0.5962	0.6103	0.6304	0.6623	0.6725
sub9	0.6149	0.6307	0.6076	0.6177	0.5838	0.6229	0.6162	0.6005	0.6652	0.6468	0.7006
sub10	0.6547	0.6975	0.6211	0.6264	0.6473	0.6349	0.6279	0.6819	0.6966	0.6994	0.7335
sub11	0.5862	0.58	0.5713	0.6241	0.6257	0.639	0.6484	0.6348	0.6614	0.6354	0.6951
sub12	0.5029	0.5232	0.5187	0.5326	0.547	0.5491	0.5356	0.5503	0.6086	0.5446	0.6045
Mean	0.5814	0.6000	0.5661	0.5932	0.5914	0.5965	0.6026	0.5949	0.6399	0.6292	0.6522
SD	0.0595	0.0734	0.0469	0.0529	0.0505	0.0553	0.0577	0.0561	0.0554	0.0583	0.0630
Macro-F1^a	0.3893	0.3814	0.3493	0.3757	0.3416	0.3553	0.3635	0.3600	0.3925	0.3833	0.4412

Values in bold indicate the best performance in each row.

The macro-F1 value represents the average macro-F1 score of the model across all subjects.

In Dataset I, the AUC values of the deep learning methods EEG-Inception and EEGNet were 0.7133 ± 0.1082 and 0.7039 ± 0.1060, respectively, demonstrating their ability to capture the spatial and temporal representations of ERP signals. By exploiting phase information derived from the Hilbert transform, HiReNet achieves an AUC of 0.7112 ± 0.1023. EEG-Conformer, which is based on a Transformer architecture and is optimized for modeling long-range dependencies, shows limited effectiveness on short-duration ERP signals, yielding an AUC of 0.6984 ± 0.1112. Among the domain generalization methods, DANN achieves the highest performance, with an AUC of 0.7213 ± 0.1092. The Mixup and MLDG methods achieved AUCs of 0.7205 ± 0.1104 and 0.7167 ± 0.1093, respectively, outperforming the standard deep learning baselines and indicating that they can better learn domain-invariant features. MMD (0.7003 ± 0.1081) and IRM (0.6949 ± 0.1027) tend to perform poorly on small-sample datasets. MMD enforces alignment of feature distributions across domains; under limited-sample conditions, this can over-regularize the learned representations and collapse class-specific structure, thereby blurring class boundaries. IRM is particularly vulnerable when sample sizes are small: high variance in domain-wise risk estimates can destabilize optimization and cause the model to converge to poor local minima. The contrastive learning–based approaches, SelfReg and MVCLDG, further improved the results, yielding AUCs of 0.7242 ± 0.1044 and 0.7502 ± 0.1019, respectively. Notably, MVCLDG achieved the highest average AUC, and paired-sample t-tests demonstrated a statistically significant performance improvement over existing methods (p < 0.05). This suggests that our method can extract essential domain-invariant representations more effectively through contrastive learning. Moreover, MVCLDG shows the lowest standard deviation across subjects (0.1019), which may indicate somewhat lower sensitivity to subject-specific domain shifts. 
Overall, these experiments support the advantage of the proposed algorithm on the public ERN dataset.

On our self-collected Dataset II, standard deep learning models remained suboptimal: although EEG-Inception (AUC = 0.6000 ± 0.0734) outperformed EEGNet (AUC = 0.5814 ± 0.0595), both were surpassed by specialized domain generalization algorithms and our proposed approach. Owing to the smaller size of this dataset, the performance of EEG-Conformer degrades markedly. Among the domain generalization algorithms, the aforementioned drawbacks of MMD and IRM on small-sample datasets are once again evident. MVCLDG achieved the highest average AUC of 0.6522 ± 0.0630 and the highest average macro-F1 score of 0.4412 ± 0.0644, representing an improvement over the strongest baseline method, DANN (AUC = 0.6399 ± 0.0554), and an even larger advantage compared with standard deep learning models. Importantly, this performance gain was achieved while maintaining relatively high stability.

To rigorously assess the performance gains of MVCLDG, paired-sample t-tests were conducted to compare it against each of the 10 baseline models. To control the family-wise error rate under multiple comparisons, all p values were adjusted using the Holm–Bonferroni correction, and statistical significance reported throughout the article is based on the corrected values. In addition, we report the mean AUC differences (ΔAUC) together with their 95% confidence intervals, estimated via bootstrapping. The results are summarized in Tables 4 and 5.
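The statistical procedure can be sketched with SciPy and NumPy, using the per-subject AUCs of MVCLDG and DANN from Table 3 as input. The helper names, the percentile bootstrap, the number of resamples, and the seed are assumptions, so the interval will not match the published one digit for digit. The mean difference (0.0123) and the t statistic (about 1.08) follow directly from the tabulated values.

```python
import numpy as np
from scipy.stats import ttest_rel

# Per-subject AUCs on Dataset II, copied from Table 3 (columns "Ours" and "DANN").
ours = np.array([0.5277, 0.6051, 0.7219, 0.6908, 0.6710, 0.5822,
                 0.6220, 0.6725, 0.7006, 0.7335, 0.6951, 0.6045])
dann = np.array([0.5985, 0.5829, 0.7430, 0.7162, 0.5956, 0.5919,
                 0.5886, 0.6304, 0.6652, 0.6966, 0.6614, 0.6086])


def holm_adjust(pvals):
    """Holm-Bonferroni step-down adjustment; returns adjusted p values in input order."""
    p = np.asarray(pvals, dtype=float)
    m, adj, running = len(p), np.empty(len(p)), 0.0
    for rank, idx in enumerate(np.argsort(p)):
        running = max(running, (m - rank) * p[idx])  # enforce monotonicity
        adj[idx] = min(running, 1.0)
    return adj


def bootstrap_ci(diff, n_boot=10000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean paired difference (resampling subjects)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    return np.quantile(diff[idx].mean(axis=1), [alpha / 2, 1 - alpha / 2])


diff = ours - dann
t_stat, p_raw = ttest_rel(ours, dann)   # paired-sample t-test
ci_low, ci_high = bootstrap_ci(diff)
```

In the full analysis, `holm_adjust` would be applied once to the ten raw p values of a dataset, so that each comparison is corrected against the whole family.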

Table 4.

Pairwise Statistical Comparisons Between Multi-View Contrastive Learning Domain Generalization and Baseline Models on Dataset I

Comparison	Mean difference (ΔAUC)	95% CI	t	Adjusted p
MVCLDG versus IRM	0.0554	[0.0337, 0.0778]	4.85	<0.001***a
MVCLDG versus EEG-Conformer	0.0518	[0.0286, 0.0754]	4.26	0.002**
MVCLDG versus MMD	0.0500	[0.0278, 0.0736]	4.13	0.002**
MVCLDG versus EEGNet	0.0463	[0.0280, 0.0665]	4.68	<0.001***
MVCLDG versus HiReNet	0.0391	[0.0210, 0.0573]	4.16	0.002**
MVCLDG versus EEG-Inception	0.0369	[0.0132, 0.0610]	2.89	0.024*
MVCLDG versus MLDG	0.0335	[0.0165, 0.0518]	3.64	0.006**
MVCLDG versus Mixup	0.0298	[0.0124, 0.0484]	3.12	0.018*
MVCLDG versus DANN	0.0289	[0.0083, 0.0497]	2.69	0.025*
MVCLDG versus SelfReg	0.0260	[0.0025, 0.0493]	2.16	0.041*

Note: *p < 0.05; **p < 0.01; ***p < 0.001.

AUC, area under the ROC curve; CI, confidence interval; DANN, domain-adversarial neural network; EEG, electroencephalography; IRM, invariant risk minimization; MLDG, meta-learning for domain generalization; MMD, maximum mean discrepancy.

Table 5.

Pairwise Statistical Comparisons Between Multi-View Contrastive Learning Domain Generalization and Baseline Models on Dataset II

Comparison	Mean difference (ΔAUC)	95% CI	t	Adjusted p
MVCLDG versus EEG-Conformer	0.0862	[0.0616, 0.1116]	6.48	<0.001***a
MVCLDG versus EEGNet	0.0709	[0.0451, 0.0961]	5.17	0.002**
MVCLDG versus Mixup	0.0609	[0.0350, 0.0866]	4.31	0.006**
MVCLDG versus HiReNet	0.0590	[0.0384, 0.0777]	5.84	0.001***
MVCLDG versus MLDG	0.0573	[0.0349, 0.0807]	4.67	0.004**
MVCLDG versus MMD	0.0558	[0.0372, 0.0762]	5.44	0.002**
MVCLDG versus EEG-Inception	0.0523	[0.0255, 0.0783]	3.70	0.010*
MVCLDG versus IRM	0.0496	[0.0259, 0.0739]	3.89	0.010*
MVCLDG versus SelfReg	0.0230	[0.0014, 0.0408]	2.18	0.104
MVCLDG versus DANN	0.0123	[−0.0097, 0.0332]	1.08	0.303

Note: *p < 0.05; **p < 0.01; ***p < 0.001.

The results indicate that on Dataset I, MVCLDG significantly outperforms all 10 baseline methods (corrected p < 0.05), with mean AUC improvements ranging from 0.026 to 0.055. On Dataset II, MVCLDG achieves statistically significant improvements over 8 of the 10 baselines. Notably, when compared with the strongest competitors, DANN and SelfReg, the average differences remain positive, but the corrected p values do not meet the significance criterion (0.303 and 0.104, respectively); the 95% confidence interval includes zero for DANN and only narrowly excludes it for SelfReg ([0.0014, 0.0408]). This outcome highlights the inherent difficulty of obtaining further gains on this more challenging task.

To evaluate the performance of MVCLDG on the CSSV-ERP test set, we aggregated the LOSO test results across 12 subjects to construct the aggregated confusion matrix shown in Figure 3.

FIG. 3.

Aggregated confusion matrices of the MVCLDG model on the CSSV-ERP test set across 12 subjects under the leave-one-subject-out (LOSO) evaluation protocol. The left matrix shows the raw classification counts, while the right matrix presents recall rates normalized by the true classes for the three-class classification task. CSSV-ERP, Chinese semantic–syntactic violation event-related potential; MVCLDG, multi-view contrastive learning domain generalization.

Although MVCLDG achieved the best performance relative to all baseline methods, the confusion matrix in Figure 3 still reveals the inherent challenges of this classification task. The model demonstrates a measurable ability to discriminate between semantic and syntactic violations, suggesting that it may capture neural features specific to different violation types. Among the three classes, semantic violations are recognized most reliably, achieving a recall rate of 60.6%. Meanwhile, the model exhibits complex multidirectional error patterns, with confusion between correct trials and syntactic violations being the dominant error type. This phenomenon may arise because the recognition of syntactic violations depends more strongly on contextual and individual factors, rendering their neural patterns more difficult to distinguish from those of correct sentences at the single-trial level. In contrast, direct confusion between semantic and syntactic violations remains relatively low, suggesting that the model may have learned neural representations that differentiate between distinct types of anomalies.

Overall, these findings are consistent with those reported in Santamaría-Vázquez et al. (2020). Since single-trial decisions often fail to achieve the required accuracy, repeated trials are necessary to ensure reliable classification. Although improvements were observed compared with existing algorithms, the current state of research remains insufficient to achieve satisfactory discriminative capability in cross-subject single-trial settings. Furthermore, substantial intersubject variability resulted in relatively high algorithm variance.

Discussion

Ablation study

An ablation study was conducted within our model framework to assess the contribution of each component to overall performance. To ensure a fair comparison, all hyperparameters were kept identical, while specific components or architectural modules were removed or replaced, with the remaining parts of the model adjusted accordingly. Six ablated variants were examined: (a) without the multi-scale temporal extraction module; (b) without raw EEG input (HT-EEG only); (c) without HT-EEG input (raw EEG only); (d) without the contrastive learning loss module; (e) replacing the projection head with an MLP; and (f) using a single-path contrastive loss (fusion view only).

The results in Table 6 highlight several key findings.

Table 6.

Results of the Ablation Study on Dataset II

Method	Without multi-scale temporal convolution module	Only raw EEG input	Only HT EEG input	Without contrastive loss	With MLP projection	Only fusion-view loss	Ours
sub1	0.5514	0.5454	0.5318	0.5505	0.5127	0.5139	0.5277
sub2	0.5282	0.5511	0.533	0.539	0.5487	0.5674	0.6051
sub3	0.6822	0.6707	0.6958	0.6764	0.7321	0.7042	0.7219
sub4	0.6368	0.6483	0.6413	0.6685	0.6946	0.6937	0.6908
sub5	0.6727	0.6994	0.6445	0.6701	0.6372	0.6976	0.671
sub6	0.569	0.6348	0.6151	0.5792	0.5648	0.5845	0.5822
sub7	0.5419	0.6209	0.671	0.5674	0.6087	0.6099	0.622
sub8	0.5845	0.6189	0.6149	0.6711	0.6305	0.6317	0.6725
sub9	0.6382	0.6286	0.6495	0.6477	0.6665	0.6755	0.7006
sub10	0.6629	0.6859	0.6726	0.6925	0.7045	0.6971	0.7335
sub11	0.6284	0.6775	0.6472	0.6185	0.6504	0.6324	0.6951
sub12	0.5716	0.5567	0.5006	0.5456	0.5616	0.583	0.6045
Mean	0.6057	0.6282	0.6181	0.6189	0.6260	0.6326	0.6522
SD	0.0541	0.0532	0.0628	0.0588	0.0684	0.0623	0.0630

EEG, electroencephalography; HT, Hilbert-transformed; SD, standard deviation.

Multi-scale temporal extraction module: Employing temporal filters of different scales allowed the activation of features across multiple frequency bands, thereby enhancing the model’s temporal feature extraction capacity. Specifically, the AUC achieved by using the single-scale temporal extraction module is 0.6057 ± 0.0541, which lags behind our method by 4.65 percentage points.

Multi-view feature extraction module: Extracting features jointly from raw EEG and HT-EEG enabled the model to leverage critical phase information inherent in ERP signals. Of the two single-view variants, excluding raw EEG led to the larger decrease in average multi-class AUC, and both variants fell below the full model, confirming that the combination of raw and HT EEG inputs is particularly effective. The AUC values obtained using only raw EEG and only HT-EEG are 0.6282 ± 0.0532 and 0.6181 ± 0.0628, respectively, which are 2.40 and 3.41 percentage points lower than our method.

Contrastive learning loss module: Incorporating contrastive loss facilitated the extraction of domain-invariant features during training, improving the model’s performance under cross-subject conditions. Without using the contrastive learning loss, the AUC is 0.6189 ± 0.0588, which is reduced by 3.33 percentage points.

Incep2Incep module with MLP projection: Under conditions with limited EEG samples, replacing the projection head with an MLP considerably increased the number of parameters, thereby raising the risk of overfitting and potentially impairing the ability of contrastive loss to extract domain-invariant features. Using the Incep2Incep module instead of the MLP projection raises the AUC from 0.6260 ± 0.0684 to 0.6522 ± 0.0630, a gain of approximately 2.62 percentage points.

Multi-path contrastive learning loss: The three-path loss explicitly maintained and regularized the discriminative subspace of each modality, preventing detail loss due to the “fusion bottleneck.” Moreover, the independent losses of raw and phase views acted as a form of soft supervision, reducing the risk of the fusion loss being disproportionately biased toward a single modality. The AUC achieved using only the fused-view loss is 0.6326 ± 0.0623, which is 1.96 percentage points lower than our method.

Overall, our framework integrates modules that enhance feature extraction breadth, improve domain-invariant representation learning, and mitigate overfitting. The ablation experiments collectively demonstrate the contribution of each component and underscore their potential in advancing cross-subject ERP recognition tasks.

Feature distribution visualization

To illustrate the effect of contrastive learning, we present two t-SNE (t-distributed stochastic neighbor embedding) (van der Maaten and Hinton, 2008) visualizations of features extracted from subjects by different models. These visualizations help elucidate how the features are separated and clustered in the embedding space. Figure 4 compares the t-SNE projections of the baseline model and our model on the Dataset I training set. With the baseline model, features extracted from EEG signals disperse around distinct cluster centers that largely align with individual subjects (Fig. 4a). Clusters from different subjects that correspond to the same task class lie far apart in the embedding space; for example, features for the same task from Subject 1 and Subject 2 are distant, yet they can be closer to features from different tasks of the same subject.

FIG. 4.

t-SNE visualization of features for (a) Baseline and (b) MVCLDG on Dataset I. Dots denote projected features from the training set (source domain). Marker color and shape indicate class and subject, respectively. Arrows highlight the true subject/class associated with each cluster. MVCLDG, multi-view contrastive learning domain generalization; t-SNE, t-distributed stochastic neighbor embedding.

By contrast, MVCLDG yields features with clearer separation across stimulus classes (Fig. 4b). We attribute this to our domain-alignment mechanism together with the contrastive loss, which enlarges interclass differences while reducing intra-class variance. These results indicate that the model effectively learns domain-invariant representations, mitigates intersubject variability without sacrificing class separability, and thereby facilitates cross-subject ERP recognition. Quantitative evidence further substantiates the enhanced feature separability: Our model achieves a separation ratio (defined as the ratio of interclass to intra-class feature distances) of 1.8031, corresponding to a 9.79% improvement over the baseline model (1.6423).
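The separation ratio can be illustrated as follows. The text defines it only as the ratio of interclass to intra-class feature distances, so the concrete choice here (mean pairwise Euclidean distance between samples of different classes divided by the mean pairwise distance between samples of the same class) is an assumption, as is the function name.

```python
import numpy as np


def separation_ratio(features, labels):
    """Mean between-class pairwise distance divided by mean within-class pairwise distance."""
    features, labels = np.asarray(features, dtype=float), np.asarray(labels)
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)   # drop zero self-distances
    intra = dists[same & off_diag].mean()
    inter = dists[~same].mean()
    return inter / intra


# Relative improvement implied by the two reported ratios.
improvement_pct = (1.8031 / 1.6423 - 1.0) * 100   # about 9.79
```

A ratio above 1 means that samples are, on average, farther from other classes than from their own class; larger values indicate cleaner class structure.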

Furthermore, Figure 5 contrasts the t-SNE projections of the baseline and our model on the Dataset I test set. The baseline produces test-sample features that are noticeably more dispersed and lack explicit cross-subject alignment; in contrast, MVCLDG forms tighter within-class clusters, and the concentric arrangement suggests that contrastive learning induces a hyperspherical feature space. Taken together, these t-SNE visualizations confirm that our approach learns features that are both domain-invariant and class-discriminative, and that this generalization extends to unseen test samples.

FIG. 5.

t-SNE visualization of features for (a) Baseline and (b) MVCLDG on Dataset I. Dots denote projected features from the test set (target domain). Different colors indicate different classes. MVCLDG, multi-view contrastive learning domain generalization; t-SNE, t-distributed stochastic neighbor embedding.

Interpretability

We employed class activation maps (CAM) (Chattopadhay et al., 2018; Selvaraju et al., 2017) to visualize the spatiotemporal activation patterns generated by MVCLDG and validated them against established neurophysiological findings. Among existing CAM methods, Eigen-CAM (Muhammad and Yeasin, 2020) visualizes the principal components of learned features without altering the network architecture or requiring gradient backpropagation, thereby providing a robust and reliable localization of discriminative regions. It has also been shown to outperform alternative approaches in identifying critical areas (Muhammad and Yeasin, 2020). Therefore, we adopted Eigen-CAM as our visualization tool to accurately identify the key features captured by MVCLDG.

We conducted interpretability analysis of MVCLDG on ERN (Dataset I) and semantic/syntactic violation (Dataset II) using the following procedure: high-confidence correctly classified trials were first selected as input, and Eigen-CAM was applied to obtain CAMs $S_{ij}$ and $T_{ij}$ from the spatial and temporal convolutional layers, respectively. The maps were then averaged across trials, and temporal maps were further averaged across channels, yielding the mean temporal features $T_{ij}^{'}$ and mean spatial features $S_{ij}^{'}$ for each subject–class combination. Subsequently, all $T_{ij}^{'}$ were aggregated and normalized to generate temporal heatmaps. At the peak time points $t_{j}$ of these heatmaps, $S_{ij}$ were sampled and normalized to produce category-specific topographical maps. For comparison with neurophysiological evidence, topographies at peak moments were also extracted from the raw EEG signals.
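The core Eigen-CAM computation can be sketched in NumPy: the feature maps of one layer are projected onto their first principal direction, which requires no gradients. This is a generic illustration of the principal-component idea rather than the authors' code; the function name, the mean-centering step, and the final rectification and scaling are assumptions, and the sign of a singular vector is arbitrary, so practical implementations may need to fix its orientation.

```python
import numpy as np


def eigen_cam(activations):
    """Project (filters, channels, time) feature maps onto their first principal direction."""
    c, h, w = activations.shape
    flat = activations.reshape(c, h * w).T         # rows: positions, columns: filters
    flat = flat - flat.mean(axis=0)                # center each filter response
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    cam = (flat @ vt[0]).reshape(h, w)             # first principal component scores
    cam = np.maximum(cam, 0.0)                     # keep positive evidence only
    return cam / cam.max() if cam.max() > 0 else cam
```

For an EEG layer whose output has shape (filters, electrodes, time samples), averaging the returned map over its first axis gives a temporal profile, and reading one column at a peak time point gives a per-electrode weight vector that can be plotted as a topography.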

The ERN is an electrophysiological signal closely associated with error feedback. As illustrated in Figure 6, panels A1 and A2 display temporal heatmaps of spatial convolutional features for correct and incorrect feedback trials, with time zero indicating feedback onset. The black dashed line marks the moment of maximum Eigen-CAM response, denoted as the “most salient moment.” The salient moments for correct and incorrect feedback were 0.335 and 0.390 s, respectively. The highlighted regions correspond precisely to two well-established ERN components (Margaux et al., 2012): the pos-ErrP and neg-ErrP. These findings indicate that the discriminative features extracted by MVCLDG align with prior neuroscientific knowledge and effectively capture class-related information.

FIG. 6.

Temporal heatmaps and scalp topographies visualized with Eigen-CAM on Dataset I. (A1, A2) Temporal heatmaps of the spatial convolutional layer in the MVCLDG model for correct and error feedback trials. The black dashed lines indicate the most salient moment (correct: 0.335 s; error: 0.390 s), which corresponds to the positions of the known pos-ErrP and neg-ErrP components. (B1, B2) Original EEG topographic maps corresponding to the most salient time points. (C1, C2) and (D1, D2) Topographic visualizations of the temporal convolutional layer outputs at the same time points for the baseline model and the MVCLDG model, respectively. CAM, class activation maps; EEG, electroencephalography; MVCLDG, multi-view contrastive learning domain generalization.

While learning class-relevant representations, MVCLDG simultaneously reduces its focus on task-irrelevant features. Panels B1 and B2 illustrate the scalp topographies of raw EEG data at the salient moments for correct and incorrect feedback trials. Panels C1 and C2, together with D1 and D2, present the temporal convolutional features of correct and incorrect feedback trials visualized from the baseline (MLDG) and MVCLDG models, respectively. Panels B1 and B2 suggest that, at salient time points, the topographies of raw EEG signals offer limited interpretability and fail to directly reveal channel relevance, often necessitating comparisons between target and nontarget conditions to identify informative electrodes. In contrast, CAM-based temporal feature visualizations explicitly highlight the relative importance of individual channels. Panels D1 and D2 display MVCLDG’s spatial feature visualizations at the salient moments. Panel D1 shows that MVCLDG assigns strong importance to channel P7 for correct feedback, whereas panel D2 indicates that Cz plays a dominant role in distinguishing erroneous feedback. The channels highlighted in the visualization exhibit scalp-surface topographic distributions that are consistent with those of brain regions well established in ERN research, thereby supporting the validity of MVCLDG’s spatial feature extraction. Compared with the baseline results in panels C1 and C2, MVCLDG places less emphasis on task-irrelevant regions and greater emphasis on task-relevant ones, underscoring its stronger capability for extracting domain-invariant features.

A semantic or syntactic violation occurs during sentence processing when a word or structure breaks semantic or grammatical rules. Processing the linguistic anomaly recruits additional neural resources and produces characteristic electrophysiological responses. As illustrated in Figure 7, panels A1 and A2 present temporal heatmaps of spatial convolutional features for semantic and syntactic violation trials visualized by Eigen-CAM, with time zero marking the critical word onset. Because MVCLDG captured a greater number of salient moments in the more complex neural dynamics of Dataset II, the black dashed lines indicate where the Eigen-CAM highlighted regions overlap with well-known ERP components (N400 and P600); we refer to these as “prior salient moments.” The prior salient moments for semantic and syntactic violations occurred at 0.374 and 0.595 s, respectively, again capturing features consistent with prior neuroscientific knowledge.
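As a simple sanity check on the latencies reported above, the sketch below tests whether each salient moment falls inside a latency window for the N400 or P600. The window boundaries (0.300–0.500 s and 0.500–0.800 s) are commonly used approximations assumed here for illustration, not values taken from this study, and `components_at` is a hypothetical helper.

```python
# Assumed approximate latency windows in seconds (not taken from this study).
COMPONENT_WINDOWS = {"N400": (0.300, 0.500), "P600": (0.500, 0.800)}

# Salient moments reported for Dataset II, in seconds after critical word onset.
SALIENT_MOMENTS = {"semantic": 0.374, "syntactic": 0.595}


def components_at(latency, windows=COMPONENT_WINDOWS):
    """Return the names of all components whose window contains the latency."""
    return [
        name
        for name, (start, end) in windows.items()
        if start <= latency <= end
    ]


for condition, latency in SALIENT_MOMENTS.items():
    print(condition, latency, components_at(latency))
```

Under these assumed windows, the semantic latency falls in the N400 window and the syntactic latency in the P600 window.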

FIG. 7.

Temporal heatmaps and scalp topographies visualized with Eigen-CAM on Dataset II. (A1, A2) Temporal heatmaps of the spatial convolutional layer in the MVCLDG model for semantic violation and syntactic violation trials. The prior salient time points captured by the model (semantic: 0.374 s; syntactic: 0.595 s) correspond to the time windows of the classical event-related potential components N400 and P600 associated with language processing. (B1, B2) Original EEG topographic maps corresponding to the prior salient time points. (C1, C2) Spatial feature visualizations of the MVCLDG model at the same time points. CAM, class activation maps; EEG, electroencephalography; MVCLDG, multi-view contrastive learning domain generalization.

Panels C1 and C2 show MVCLDG’s spatial feature visualizations at the prior salient moments. In contrast to the raw EEG topographies in panels B1 and B2, panel C1 shows that MVCLDG assigns strong importance to channels Fz and F1 during semantic violations, whereas panel C2 indicates heightened attention to channel CP1 during syntactic violations. The scalp-topographic emphasis on fronto-central (e.g., F1/Fz) and centro-parietal (e.g., CP1) electrodes for semantic and syntactic violations, respectively, is broadly consistent with the canonical distributions of the N400 and P600 components. This provides indirect support that the model learns features whose inferred neural generators align with regions traditionally associated with each process, and it suggests that the model captures domain-invariant neurophysiological patterns.

BCI practicality

The numbers of trainable parameters during training and inference on Dataset I, for the proposed method and all compared methods, are summarized in Table 7. As a canonical baseline, EEGNet is the most parameter-efficient model among all compared methods. Conventional domain generalization methods (Mixup, MMD, IRM, and MLDG) do not modify the network architecture; therefore, their parameter counts remain identical to that of the baseline EEG-Inception model (42,548). Notably, DANN and contrastive learning–based methods have different parameter counts during training and inference: DANN removes the adversarial discriminator at inference time, whereas contrastive learning approaches discard the projection head.

Table 7.

Comparison of Model Complexity Across Different Methods

Method	Training parameters	Inference parameters
EEGNet	2,746	2,746
EEG-Inception	42,548	42,548
EEG-Conformer	211,306	211,306
HiReNet	14,298	14,298
Mixup	42,548	42,548
MMD	42,548	42,548
IRM	42,548	42,548
MLDG	42,548	42,548
DANN	46,467	42,548
SelfReg	87,438	42,254
Ours	103,940	28,492

DANN, domain-adversarial neural network; EEG, electroencephalography; IRM, invariant risk minimization; MLDG, meta-learning for domain generalization; MMD, maximum mean discrepancy.
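To make the training/inference gap in Table 7 explicit, the short sketch below computes, for each method, the number of parameters that exist only during training and the fraction of training parameters retained at inference. The dictionary simply transcribes Table 7; the names `TABLE_7`, `training_only`, and `retained_fraction` are illustrative and not part of the original implementation.

```python
# (training parameters, inference parameters), transcribed from Table 7.
TABLE_7 = {
    "EEGNet": (2746, 2746),
    "EEG-Inception": (42548, 42548),
    "EEG-Conformer": (211306, 211306),
    "HiReNet": (14298, 14298),
    "Mixup": (42548, 42548),
    "MMD": (42548, 42548),
    "IRM": (42548, 42548),
    "MLDG": (42548, 42548),
    "DANN": (46467, 42548),
    "SelfReg": (87438, 42254),
    "Ours": (103940, 28492),
}


def training_only(method):
    """Parameters present during training but removed at inference."""
    train, infer = TABLE_7[method]
    return train - infer


def retained_fraction(method):
    """Fraction of the training parameters kept at inference."""
    train, infer = TABLE_7[method]
    return infer / train


for name in TABLE_7:
    print(f"{name:14s} training-only={training_only(name):6d} "
          f"retained={retained_fraction(name):.2f}")
```

For example, `training_only("DANN")` gives 3,919, and the proposed method retains roughly 27% of its training parameters at inference.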

Built upon multi-view contrastive learning, our method achieves a distinctive balance between model capacity and deployment efficiency. During training, an additional projection module termed Incep2Incep is introduced, comprising 57,012 parameters. This module mirrors the encoder architecture while incorporating additional nonlinearity, thereby enhancing feature learning and improving cross-subject generalization. At inference, the projection module is entirely removed, leaving only the encoder and classifier, resulting in a total of 28,492 inference parameters. This design reflects the fact that the projection module primarily serves to structure the feature space during training; once optimization is completed, this role is effectively absorbed by the encoder parameters.

From a practical BCI perspective, this design offers two key advantages. First, the inference model is smaller than the EEG-Inception baseline (28,492 vs. 42,548 parameters; Table 7), which favors faster response times in real-world closed-loop systems. Second, despite the reduced parameter count during inference, the higher model capacity during training allows MVCLDG to achieve better classification performance than all baseline approaches.

Limitations and future work

This study still has considerable scope for further improvement. First, MVCLDG currently exploits only the raw signal and the Hilbert-transformed signal, without fully leveraging the potential advantages of incorporating multiple views. Future work may focus on exploring additional representative views, such as time–frequency views and amplitude views, to facilitate the extraction of domain-invariant features. Second, MVCLDG is primarily designed for recognizing different stimuli in cross-subject scenarios within a single dataset. However, its effectiveness across datasets and, more broadly, across tasks has yet to be systematically explored and validated. In future work, we plan to directly apply the model trained on the dataset in this study to publicly available ERP datasets with similar paradigms (e.g., applying the model trained on Dataset I to a P300 speller dataset) for cross-dataset validation.

Rehabilitation medicine increasingly acknowledges the value of EEG analysis in supporting patient recovery. Nevertheless, the neurophysiological significance of MVCLDG remains limited. Further investigations leveraging multimodal techniques, such as simultaneous EEG–fMRI acquisition, to more precisely characterize task-related and violation-related brain activity may provide valuable opportunities for real-time monitoring and assessment of rehabilitation progress.

Advancing the clinical translation of the proposed approach represents an important direction for future research. For example, the MVCLDG framework could be integrated into portable EEG systems to facilitate the assessment of post-stroke cognitive function. Patients would perform standardized semantic tasks (e.g., the Dataset II paradigm), while the system analyzes their ERP components and compares them against normative baselines derived from healthy cohorts, thereby providing clinicians with quantitative indicators of cognitive impairment and recovery. A key strength of MVCLDG lies in its cross-subject generalization capability, which allows models pre-trained on large-scale healthy datasets to be directly deployed for new patients, enabling reliable detection without the need for individual calibration. This property is of particular clinical relevance for patient populations with language or motor impairments, for whom active behavioral feedback is difficult or infeasible.

Conclusion

This article presents MVCLDG, a multi-view contrastive learning domain generalization framework designed to enhance the classification performance of ERP signals in cross-subject scenarios. Our approach leverages both raw EEG signals and Hilbert-transformed signals as complementary views, thereby emphasizing critical phase information in ERP data. It incorporates both view-independent and view-fusion contrastive losses to ensure the learning of discriminative domain-invariant features from different views of the same sample while promoting mutual knowledge transfer between views. The proposed multi-view cross-prediction mechanism yields more discriminative interclass features and more consistent intra-class features for downstream tasks. The effectiveness of the method was validated on both a public ERN dataset and a self-collected semantic–syntactic violation dataset, where it outperformed existing domain generalization models. In addition, Eigen-CAM–based visualization analysis indicates that the key spatiotemporal patterns exploited by the model (such as the pronounced focus on the central Cz electrode in the ERN task) exhibit a high degree of correspondence with the canonical scalp topographic distributions of the associated ERP components. This scalp-level topographic consistency provides neurophysiological support for the effectiveness of the proposed model and demonstrates its potential to advance ERP signal recognition.

Authors’ Contributions

C.C.: Conceptualization, methodology, formal analysis, investigation, writing—original draft, and visualization. L.X.: Investigation. J.Z.: Conceptualization, methodology, writing—review and editing, and supervision. Q.Q.: Methodology, investigation, resources, and writing—review and editing. H.J. and J.L.: Conceptualization, methodology, resources, writing—review and editing, supervision, and funding acquisition.

Footnotes

Ethics Statement

This work involved human subjects in its research. Approval of all ethical and experimental procedures and protocols was granted by the Ethics Review Committee of Shanghai Yangzhi Rehabilitation Hospital under Application No. Yangzhi Lun Shen Zi [2023]036 and performed in line with the Declaration of Helsinki.

Author Disclosure Statement

The authors declare that they have no conflicts of interest.

Funding Information

This work was supported in part by the National Natural Science Foundation of China under Grant 62472319, the Second Round of the “Three-Year Action Plan to Promote Clinical Skills and Clinical Innovation in Municipal Hospitals” Research Physician Innovation Transformation Capability Training Program under Grant SHDC2023CRT001, the National Clinical Key Specialty Construction Project of China under Grant Z155080000004, the Shanghai Research Center of Rehabilitation Medicine (Top Priority Research Center of Shanghai) under Grant 2023ZZ02027, and the Shanghai Disabled Persons’ Federation Key Laboratory of Intelligent Rehabilitation Assistive Appliance and Technology.

References

An, Zhao, Du, et al. Amplitude–time dual-view fused EEG temporal feature learning for automatic sleep staging. IEEE Trans Neural Netw Learn Syst 2024a;35(5):6492–6506; doi: 10.1109/tnnls.2022.3210384

An, Kim, Chikontwe, et al. Dual attention relation network with fine-tuning for few-shot EEG motor imagery classification. IEEE Trans Neural Netw Learn Syst 2024b;35(11):15479–15493; doi: 10.1109/tnnls.2023.3287181

Arjovsky, Bottou, Gulrajani, et al. Invariant risk minimization. ArXiv 2020. Available from: https://arxiv.org/abs/1907.02893

Barachant, Bonnet, Congedo, et al. Multiclass brain–computer interface classification by Riemannian geometry. IEEE Trans Biomed Eng 2012;59(4):920–928; doi: 10.1109/tbme.2011.2172210

Chattopadhay, Sarkar, Howlader, et al. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2018; doi: 10.1109/wacv.2018.00097

Chaudhary, Birbaumer, Ramos-Murguialday. Brain–computer interfaces for communication and rehabilitation. Nat Rev Neurol 2016;12(9):513–525; doi: 10.1038/nrneurol.2016.113

Chen, Kornblith, Norouzi, et al. A simple framework for contrastive learning of visual representations. 2020. Available from: http://proceedings.mlr.press/v119/chen20j.html

van der Maaten, Hinton. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–2605. Available from: http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Congedo, Barachant, Bhatia. Riemannian geometry for EEG-based brain-computer interfaces; a primer and a review. Brain-Computer Interfaces 2017;4(3):155–174; doi: 10.1080/2326263x.2017.1297192

Craik, He, Contreras-Vidal. Deep learning for electroencephalogram (EEG) classification tasks: A review. J Neural Eng 2019;16(3):e031001; doi: 10.1088/1741-2552/ab0ab5

Delorme, Makeig. EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J Neurosci Methods 2004;134(1):9–21; doi: 10.1016/j.jneumeth.2003.10.009

Deng, Li, Hong, et al. A novel multi-source contrastive learning approach for robust cross-subject emotion recognition in EEG data. Biomed Signal Process Control 2024;97:106716; doi: 10.1016/j.bspc.2024.106716

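Devlin, Chang, Lee, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics; 2019; pp. 4171–4186; doi: 10.18653/v1/n19-1423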

Dmochowski, Sajda, Dias, et al. Correlated components of ongoing EEG point to emotionally laden attention – a possible marker of engagement? Front Hum Neurosci 2012;6:112; doi: 10.3389/fnhum.2012.00112

Hasson, Nir, Levy, et al. Intersubject synchronization of cortical activity during natural vision. Science 2004;303(5664):1634–1640; doi: 10.1126/science.1089506

Helfrich, Knight. Cognitive neurophysiology: Event-related potentials. Handb Clin Neurol 2019;160:543–558; doi: 10.1016/B978-0-444-64032-1.00036-9

Holz, Höhne, Staiger-Sälzer, et al. Brain–computer interface controlled gaming: Evaluation of usability by severely motor restricted end-users. Artif Intell Med 2013;59(2):111–120; doi: 10.1016/j.artmed.2013.08.001

Ingolfsson, Hersche, Wang, et al. EEG-TCNet: An accurate temporal convolutional network for embedded motor-imagery brain–machine interfaces. In: Repository for Publications and Research Data (ETH Zurich). IEEE; 2020; doi: 10.1109/smc42975.2020.9283028

Kiela, Bhooshan, Firooz, et al. Supervised multimodal bitransformers for classifying images and text. ArXiv 2020; doi: 10.48550/arXiv.1909.02950

Kim, Yoo, Park, et al. SelfReg: Self-supervised contrastive regularization for domain generalization. Thecvf.com; 2021; pp. 9619–9628. Available from: http://openaccess.thecvf.com/content/ICCV2021/html/Kim_SelfReg_Self-Supervised_Contrastive_Regularization_for_Domain_Generalization_ICCV_2021_paper.html

Kim, Im. HiReNet: Novel convolutional neural network architecture using Hilbert-transformed and raw electroencephalogram (EEG) for subject-independent emotion classification. Comput Biol Med 2024;178:108788; doi: 10.1016/j.compbiomed.2024.108788

Krusienski, Sellers, McFarland, et al. Toward enhanced P300 speller performance. J Neurosci Methods 2008;167(1):15–21; doi: 10.1016/j.jneumeth.2007.07.017

Lawhern, Solon, Waytowich, et al. EEGNet: A compact convolutional neural network for EEG-based brain–computer interfaces. J Neural Eng 2018;15(5):e056013; doi: 10.1088/1741-2552/aace8c

LeCun, Bengio, Hinton. Deep learning. Nature 2015;521(7553):436–444; doi: 10.1038/nature14539

Li, Yang, Song, et al. Learning to generalize: Meta-learning for domain generalization. AAAI 2018a;32(1); doi: 10.1609/aaai.v32i1.11596

Li, Pan, Wang, et al. Domain generalization with adversarial feature learning. Thecvf.com; 2018b; pp. 5400–5409. Available from: http://openaccess.thecvf.com/content_cvpr_2018/html/Li_Domain_Generalization_With_CVPR_2018_paper.html

Li, Wang, Qiao, et al. An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Brief Bioinform 2021a;22(6):bbab109; doi: 10.1093/bib/bbab109

Li, Zheng, Zong, et al. A bi-hemisphere domain adversarial neural network model for EEG emotion recognition. IEEE Trans Affective Comput 2021b;12(2):494–504; doi: 10.1109/taffc.2018.2885474

Li, Tang, Liu, et al. Consensus graph learning for multi-view clustering. IEEE Trans Multimedia 2022;24:2461–2472; doi: 10.1109/tmm.2021.3081930

Liu, Luo, Li, et al. Deep geometric representations for modeling effects of mutations on protein-protein binding affinity. PLoS Comput Biol 2021;17(8):e1009284; doi: 10.1371/journal.pcbi.1009284

Lotte, Congedo, Lécuyer, et al. A review of classification algorithms for EEG-based brain–computer interfaces. J Neural Eng 2007;4(2):R1–R13; doi: 10.1088/1741-2560/4/2/r01

Mak, Wolpaw. Clinical applications of brain-computer interfaces: Current state and future prospects. IEEE Rev Biomed Eng 2009;2:187–199; doi: 10.1109/RBME.2009.2035356

Margaux, Emmanuel, Sébastien, et al. Objective and subjective evaluation of online error correction during P300-based spelling. Adv Hum Comput Interact 2012;2012:1–13; doi: 10.1155/2012/578295

Miao, Zhao, Zhang, et al. LMDA-Net: A lightweight multi-dimensional attention network for general EEG-based brain-computer interfaces and interpretability. Neuroimage 2023;276:120209; doi: 10.1016/j.neuroimage.2023.120209

Mohsenvand, Izadi, Maes. Contrastive representation learning for electroencephalogram classification. PMLR 2020;136:238–253. Available from: http://proceedings.mlr.press/v136/mohsenvand20a.html

Moon, Jang, Lee. Convolutional neural network approach for EEG-based emotion recognition using brain connectivity and its spatial information. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2018; doi: 10.1109/icassp.2018.8461315

Muhammad, Yeasin. Eigen-CAM: Class activation map using principal components. In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE; 2020; pp. 1–7; doi: 10.1109/IJCNN48605.2020.9206626

Müller, Mika, Tsuda, et al. An introduction to kernel-based learning algorithms. CRC Press eBooks; 2018; pp. 4–40; doi: 10.1201/9781315220413-4

Nicolas-Alonso, Gomez-Gil. Brain computer interfaces, a review. Sensors (Basel) 2012;12(2):1211–1279; doi: 10.3390/s120201211

Ozdenizci, Wang, Koike-Akino, et al. Learning invariant representations from EEG via adversarial inference. IEEE Access 2020;8:27074–27085; doi: 10.1109/access.2020.2971600

Pan, Cai, Huang, et al. Multiple scale convolutional few-shot learning networks for online P300-based brain–computer interface and its application to patients with disorder of consciousness. IEEE Trans Instrum Meas 2023;72:1–16; doi: 10.1109/tim.2023.3267367

Paszke, Gross, Massa, et al. PyTorch: An imperative style, high-performance deep learning library. 2019. Available from: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html

Polich. Updating P300: An integrative theory of P3a and P3b. Clin Neurophysiol 2007;118(10):2128–2148; doi: 10.1016/j.clinph.2007.04.019

Rivet, Souloumiac, Attina, et al. xDAWN algorithm to enhance evoked potentials: Application to brain–computer interface. IEEE Trans Biomed Eng 2009;56(8):2035–2043; doi: 10.1109/tbme.2009.2012869

Roy, Banville, Albuquerque, et al. Deep learning-based electroencephalography analysis: A systematic review. J Neural Eng 2019;16(5):e051001; doi: 10.1088/1741-2552/ab260c

Santamaría-Vázquez, Martínez-Cagigal, Vaquerizo-Villar. EEG-Inception: A novel deep convolutional neural network for assistive ERP-based brain-computer interfaces. In: IEEE Transactions on Neural Systems and Rehabilitation Engineering. IEEE; 2020; doi: 10.21227/6bdr-4w65

Santamaría-Vázquez, Martínez-Cagigal, Gomez-Pilar, et al. Deep learning architecture based on the combination of convolutional and recurrent layers for ERP-based brain-computer interfaces. IFMBE Proc 2019:1844–1852; doi: 10.1007/978-3-030-31635-8_224

Schirrmeister, Springenberg, Fiederer, et al. Deep learning with convolutional neural networks for EEG decoding and visualization. Hum Brain Mapp 2017;38(11):5391–5420; doi: 10.1002/hbm.23730

Selvaraju, Cogswell, Das, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. 2017. Available from: http://openaccess.thecvf.com/content_iccv_2017/html/Selvaraju_Grad-CAM_Visual_Explanations_ICCV_2017_paper.html

Shen, Liu, Hu, et al. Contrastive learning of subject-invariant EEG representations for cross-subject emotion recognition. IEEE Trans Affective Comput 2023;14(3):2496–2511; doi: 10.1109/taffc.2022.3164516

Song, Zheng, Liu, et al. EEG conformer: Convolutional transformer for EEG decoding and visualization. IEEE Trans Neural Syst Rehabil Eng 2023;31:710–719; doi: 10.1109/tnsre.2022.3230250

Tang, Liu, Zhu, et al. CGD: Multi-view clustering via cross-view graph diffusion. AAAI 2020;34(04):5924–5931; doi: 10.1609/aaai.v34i04.6052

Tang, Zheng, Zhang, et al. Unsupervised feature selection via multiple graph fusion and feature weight learning. Sci China Inf Sci 2023;66(5); doi: 10.1007/s11432-022-3579-1

Tao, Liu, Li, et al. Marginalized multiview ensemble clustering. IEEE Trans Neural Netw Learn Syst 2020;31(2):600–611; doi: 10.1109/tnnls.2019.2906867

Tian, Krishnan, Isola. Contrastive multiview coding. In: Lecture Notes in Computer Science. Springer: Cham; 2020; pp. 776–794; doi: 10.1007/978-3-030-58621-8_45

Wang, Tong, Heng. Phase-locking value based graph convolutional neural networks for emotion recognition. IEEE Access 2019;7:93711–93722; doi: 10.1109/access.2019.2927768

Ye, Xiao, Wang, et al. CoSleep: A multi-view representation learning framework for self-supervised learning of sleep stage classification. IEEE Signal Process Lett 2022;29:189–193; doi: 10.1109/lsp.2021.3130826

Zhang, Cui, Han, et al. Deep partial multi-view learning. IEEE Trans Pattern Anal Mach Intell 2020a;44(5):2402–2415; doi: 10.1109/tpami.2020.3037734

Zhang, Fu, Wang, et al. Tensorized multi-view subspace representation learning. Int J Comput Vis 2020b;128(8–9):2344–2361; doi: 10.1007/s11263-020-01307-0

Zhang, Cisse, Dauphin, et al. mixup: Beyond empirical risk minimization. ArXiv 2018. Available from: https://arxiv.org/abs/1710.09412

Zhao, Wu, Zhang, et al. Sleep stage classification via multi-view based self-supervised contrastive learning of EEG. IEEE J Biomed Health Inform 2024;28(12):7068–7077; doi: 10.1109/jbhi.2024.3432633

Zhi, Yu, Gu, et al. Supervised contrastive learning-based domain generalization network for cross-subject motor decoding. IEEE Trans Biomed Eng 2025;72(1):401–412; doi: 10.1109/tbme.2024.3432934

Zhou, Liu, Qiao, et al. Domain generalization: A survey. IEEE Trans Pattern Anal Mach Intell 2023;45(4):4396–4415; doi: 10.1109/tpami.2022.3195549

Zhu, Xu, Lu, et al. Distinct spatiotemporal patterns of syntactic and semantic processing in human inferior frontal gyrus. Nat Hum Behav 2022;6(8):1104–1111; doi: 10.1038/s41562-022-01334-6