Separate Reverse: A Gradient-Conflict-Free Training Framework for Multi-Exit Transformers

Abstract

Pretrained transformer models have demonstrated excellent performance on complex tasks. To improve their inference efficiency, recent studies have introduced the multi-exit mechanism, which enables early exiting through multiple intermediate classifiers. However, the deep architectures of pretrained transformers cause severe gradient conflicts during multi-exit fine-tuning, leading to degraded shallow-exit accuracy and reduced early-exit efficiency. To address this issue, we propose Separate Reverse, a multi-exit training strategy specifically designed for pretrained transformer models. The method combines reverse iterative optimization with hierarchical knowledge distillation from deeper to shallower exits. It maintains pretrained parameter integrity, enhances the representation capacity of shallow exits, and coordinates gradient updates across exits to achieve a balanced optimization between shallow and deep classifiers. Experiments on multiple GLUE benchmark datasets using BERT demonstrate that our method significantly improves shallow-exit accuracy, maintains main-exit performance, and accelerates inference for simple samples by a large margin.

Keywords

multi-exit transformer, gradient conflict, model training strategy, model optimization

1. Introduction

In recent years, pretrained transformer-based models have achieved remarkable breakthroughs in natural language processing, computer vision, and multimodal tasks (Cambria & White, 2014; Chen et al., 2024; Treviso et al., 2023). Owing to their powerful contextual modeling capabilities and scalability, transformer models have become core components of many intelligent systems (de Barcelos Silva et al., 2020; Xu et al., 2023). However, deploying these deep models in complex scenarios still faces significant challenges, including high computational cost and inference latency, which are particularly pronounced in industrial vision and monitoring tasks with strict real-time requirements (Aghajanyan et al., 2023; Yi et al., 2025). Therefore, reducing computational overhead while maintaining model performance has become a critical research direction in model optimization.

Recent studies show that input samples differ greatly in task difficulty: complex samples demand more computation, whereas simple ones incur redundant computation when passed through the full model (Laskaridis et al., 2021; Rahmath et al., 2024). To address this issue, the multi-exit mechanism introduces exit classifiers at different depths of the model, constructing a multi-exit transformer architecture that enables dynamic inference depending on the complexity of each sample (Schuster et al., 2022). As illustrated in Figure 1, a sample can exit early when the output confidence exceeds a predefined threshold, avoiding computation in subsequent layers. By adjusting the confidence threshold (a predefined value indicating sufficient prediction reliability for early exiting), computation can be dynamically controlled while maintaining accuracy, allowing adaptation to devices with varying computational capabilities without retraining. This mechanism (Chen et al., 2023; Xin et al., 2020; Xu et al., 2025) significantly improves inference efficiency while maintaining overall performance. In this work, we investigate multi-exit transformer models for efficient inference optimization.

Figure 1.

Multi-exit model inference.

However, applying multi-exit mechanisms to pretrained transformer models presents several challenges. First, since multiple exits share the same backbone network, the gradients from different exits are often inconsistent, leading to gradient conflicts that degrade shallow-exit performance and weaken the acceleration benefit of early exiting. Second, unlike conventional neural networks, transformer parameters exhibit strong structural consistency and semantic coherence derived from large-scale pretraining. Existing multi-exit training strategies for general networks (e.g., branch-wise, Huang et al., 2017, and separate, Lattanzi et al., 2023) can alleviate gradient conflicts but may disrupt this parameter coherence, causing shifts in intermediate feature distributions and a decline in overall model accuracy. To address this, many existing multi-exit transformer studies adopt a two-stage training strategy (Xin et al., 2020, 2021): first fine-tuning the backbone model, then freezing the backbone parameters and training only the exit classifiers. Although this approach avoids gradient conflicts, shallow exits—typically composed of a pooling and a linear layer—lack sufficient discriminative capacity. Moreover, the backbone layers of transformers are optimized to extract deep semantic representations to support the final classifier rather than serve intermediate exits, which further limits shallow exit performance (Ji et al., 2023).

In summary, while two-stage and separate strategies partially alleviate gradient conflicts, they fail to balance parameter integrity and exit capacity. Building upon these insights, we propose Separate Reverse, a multi-exit training strategy specifically designed for pretrained transformer models. The method maintains parameter integrity, enhances the representation capacity of shallow exits, and coordinates gradient updates across multiple exits to achieve balanced optimization between shallow and deep layers. Inspired by the two-stage and separate paradigms, Separate Reverse employs a reverse iterative training process from deep to shallow layers, where pretrained parameters are first fine-tuned as a whole and the branch exits are subsequently initialized and trained according to predefined exit positions (i.e., the transformer layers at which intermediate exit classifiers are placed). In each iteration, the previously trained model serves as a teacher, and hierarchical knowledge distillation is applied to mitigate gradient conflicts and catastrophic forgetting, thereby ensuring shallow exit performance while preserving the accuracy of the main exit. Experimental results demonstrate that this strategy significantly improves shallow exit accuracy, maintains main-exit stability, and accelerates inference on simple samples.

This paper makes the following key contributions:

We identify the limitations of existing multi-exit training strategies for pretrained transformer models, particularly their inability to preserve pretrained parameters and maintain balanced performance across exits.

We develop Separate Reverse, a new multi-exit training strategy that enhances shallow exit capacity while coordinating optimization between exits through hierarchical knowledge distillation to alleviate gradient conflicts and catastrophic forgetting.

We implement and evaluate our approach on transformer-based models, demonstrating significant improvements in shallow exit accuracy, stable main exit performance, and substantial inference speedup under various confidence thresholds.

2. Related Work

Recent studies on transformer inference optimization can be broadly categorized into model compression and architectural optimization. The former focuses on reducing model size and computation cost, while the latter modifies network structures to achieve adaptive computation.

2.1. Model Compression

Model compression aims to accelerate inference by reducing parameters and computation. Two mainstream approaches are parameter pruning and knowledge distillation.

Pruning removes redundant or less important parameters to create sparse transformer models. Liu et al. (2022) show that pruning overparameterized models often outperforms training small models from scratch. For example, oBERT (Kurtic et al., 2022) uses second-order information to guide unstructured pruning, which theoretically preserves accuracy with reduced computation (Liao et al., 2020). However, hardware inefficiency limits the speedup from unstructured pruning, driving research toward structured pruning, where parameters are removed in a layer- or head-wise manner. Michel et al. (2019) analyze the impact of removing entire attention heads on model accuracy.

Knowledge distillation, on the other hand, transfers knowledge from a large teacher to a smaller student model (Gou et al., 2021). DynaBERT (Hou et al., 2020) performs layer-wise distillation to derive flexible subnetworks, while Liu et al. (2022) align teacher–student representations across multiple semantic levels for richer supervision. However, these methods still compute all layers for every input and cannot dynamically adapt to varying computational budgets without retraining, leading to redundant computation across diverse devices.

2.2. Multi-Exit Mechanism

Architectural optimization focuses on enabling early exiting to adaptively reduce computation. Although deeper transformers extract richer features, many samples can be correctly classified with shallow representations (Rahmath et al., 2024). Multi-exit transformers (Xin et al., 2020) add lightweight classifiers after intermediate feed-forward network (FFN) layers, allowing early termination for simple inputs (Gao et al., 2023).

Schuster et al. (2022) design adaptive exit confidence criteria to mitigate accuracy loss from early termination, while Tang et al. (2023) exploit feature saturation to decouple encoder–decoder computation for further efficiency. Bajpai and Hanawal (2024) propose an online learning method to determine exit points dynamically, and BADGE (Zhu et al., 2023) introduces a block-wise bypass mechanism comparing consecutive exit predictions. FastBERT (Liu et al., 2020) further integrates self-distillation to balance accuracy and latency through adaptive inference delay.

Such models enable flexible tradeoffs between accuracy and efficiency by adjusting exit thresholds (Rahmath et al., 2024). However, gradient conflicts arise during fine-tuning—parameters are jointly optimized under multiple exit losses with inconsistent directions, degrading shallow classifier performance and diminishing the expected acceleration. Moreover, when few samples exit early, the overhead of shallow classifiers leads to redundant computation. Existing works rarely address these gradient conflicts, limiting the optimization potential of multi-exit transformers.

3. Challenges of Multi-Exit Transformers

In this section, we first describe the gradient conflict problem in multi-exit transformer models and analyze its impact on the accuracy of different exits. We then summarize existing training strategies and discuss their limitations in pretrained transformer models, which motivates the method proposed in the next section.

3.1. Execution Mechanism and Gradient Conflicts of Multi-Exit Transformers

The architecture of the multi-exit transformer, as illustrated in Figure 2, consists of a backbone transformer network and several branch classifiers. The backbone includes an embedding layer and $L$ transformer encoder layers. Each encoder layer comprises an FFN and a multihead attention submodule, followed by residual connections and layer normalization. Given an input sequence of length $n$ with token embedding dimension $d$, the input can be denoted as $x = (t_{1}, t_{2}, \dots, t_{n})$. The hidden representation at the $i$-th layer is formulated as:

\begin{aligned} h_{i} & = Encoder (h_{i - 1}; θ_{i}) \\ h_{0} & = Embedding (x; θ^{Emb}) \end{aligned}

(1)

where $θ^{Emb}$ and $θ_{i}$ represent the parameters of the embedding layer and the $i$-th encoder layer, respectively.

Figure 2.

Architecture of the multi-exit transformer.

Each branch module performs early classification through a pooling layer, a linear layer, and a softmax layer, forming multiple exit classifiers. The final output layer of the transformer is treated as the main exit. Assuming there are $H$ exits, the prediction at the $i$-th exit is:

{\hat{y}}_{i} = Exit (h_{j}; w_{i})

(2)

where $w_{i}$ denotes the parameters of the $i$-th exit classifier and $j$ is the encoder layer index where the exit is attached.

For classification tasks, following prior multi-exit studies, the exit confidence is defined as the maximum value of the softmax output. The inference follows a confidence-based rule:

\hat{y} = \begin{cases} {\hat{y}}_{1}, & \text{if } c_{1} > λ \\ {\hat{y}}_{2}, & \text{if } c_{2} > λ \\ ⋮ \\ {\hat{y}}_{H}, & \text{otherwise} \end{cases}

(3)

where $c_{i}$ is the confidence (maximum softmax value) of the $i$-th exit and $λ$ is the confidence threshold. The cases are checked in order, so a sample leaves at the shallowest exit whose confidence exceeds $λ$. The loss for each exit classifier is given by:

L_{(i)}^{CE} = CrossEntropy (y, {\hat{y}}_{i})

(4)

where $y$ denotes the label vector corresponding to the input text sequence $x$. The model is trained jointly across all exits, and the total objective is:

min_{θ^{Emb}, θ_{1}, \dots, θ_{L}, w_{1}, \dots, w_{H}} \sum_{i = 1}^{H} L_{(i)}^{CE}

(5)

However, this joint training strategy causes the shallow layers of the model to receive gradient signals from multiple exit classifiers. Since the gradient directions of different exits are not always consistent, this may lead to suboptimal optimization.

To verify this hypothesis, we constructed a multi-exit BERT (Devlin et al., 2019) by adding three intermediate exit classifiers at the second, fourth, and sixth layers of the original BERT model, as shown in Figure 2. Using the recognizing textual entailment (RTE) dataset from the GLUE benchmark (Wang et al., 2018), we trained the model jointly with a batch size of 8 and a learning rate of $2 \times 10^{-5}$. During training, we recorded the gradients of the FFN modules at layers 2, 4, and 6, and computed the mean cosine similarity between the gradients of any two exits. A similarity close to 1 indicates aligned gradient directions, while a similarity close to 0 indicates near-orthogonal directions. The results are shown in Figure 3. The horizontal and vertical axes represent different exit classifiers, and each value denotes the cosine similarity between the FFN gradients received from the corresponding pair of exits. Exit 4 corresponds to the original output classifier of the Transformer model.
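The measurement described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the per-exit gradients are assumed to have already been flattened into plain lists of floats over the same FFN parameters, and the names `cosine_similarity` and `pairwise_similarity` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(g_a, g_b):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for a zero gradient (assumption)
    return dot / (norm_a * norm_b)


def pairwise_similarity(grads_by_exit):
    """Map each pair of exits to the cosine similarity of their gradients.

    grads_by_exit: dict {exit_name: flattened gradient of the shared FFN
    parameters with respect to that exit's loss}; a toy stand-in for tensors.
    """
    return {
        (a, b): cosine_similarity(grads_by_exit[a], grads_by_exit[b])
        for a, b in combinations(sorted(grads_by_exit), 2)
    }


# Toy gradients: exit1 and exit2 are orthogonal, exit1 and exit3 are aligned.
toy_grads = {
    "exit1": [1.0, 0.0],
    "exit2": [0.0, 2.0],
    "exit3": [3.0, 0.0],
}
similarities = pairwise_similarity(toy_grads)
```

Averaging such pairwise values over training steps would give one entry per exit pair, which is the kind of matrix a heat map like Figure 3 displays.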

Figure 3.

Cosine similarity between gradients of different exit classifiers.

The results indicate that the gradient directions between shallow and deeper exits are largely inconsistent, with average cosine similarities approaching zero. To examine the impact of these gradient conflicts on model accuracy, we compared the joint training strategy with an Oracle training strategy on several GLUE datasets. The Oracle strategy trains truncated BERT models independently with 2, 4, 6, and 12 layers, avoiding gradient conflicts while preserving the full representational capacity of each exit. As shown in Table 1, the Oracle strategy outperforms joint training at the shallowest exit (layer 2) on all four datasets, with a relative accuracy gap of up to 16.0% on QNLI (63.0 vs. 73.1). These results demonstrate that gradient conflicts in joint training significantly degrade the performance of early exits.

Table 1.

Accuracy of Joint and Oracle Training Strategies Across Different Datasets.

		Accuracy
Dataset	Method	Exit Layer2	Exit Layer4	Exit Layer6	Exit Layer12
SST-2	Joint	83.8 $_{0.3}$	87.3 $_{0.7}$	89.9 $_{1.1}$	91.9 $_{0.5}$
	Oracle	85.1 $_{0.7}$	87.7 $_{0.2}$	90.6 $_{0.4}$	92.5 $_{0.4}$
QQP	Joint	82.8 $_{1.6}$	86.6 $_{0.9}$	90.1 $_{1.2}$	91.1 $_{1.1}$
	Oracle	85.1 $_{0.2}$	89.0 $_{0.2}$	89.9 $_{0.4}$	91.0 $_{0.7}$
QNLI	Joint	63.0 $_{0.1}$	85.1 $_{0.7}$	87.2 $_{0.7}$	90.9 $_{0.8}$
	Oracle	73.1 $_{0.7}$	85.6 $_{0.4}$	87.1 $_{0.2}$	90.0 $_{0.5}$
MNLI	Joint	69.1 $_{0.2}$	76.7 $_{0.8}$	80.6 $_{0.5}$	84.2 $_{1.3}$
	Oracle	71.1 $_{1.3}$	76.0 $_{0.2}$	78.4 $_{0.6}$	83.8 $_{0.1}$
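As a quick check of the gap quoted for QNLI, the layer-2 numbers in Table 1 can be compared directly. The sketch below copies the mean accuracies from the table; reading the 16.0% figure as a relative gap (Oracle minus Joint, divided by Joint) is our interpretation, and the helper name `layer2_gap` is hypothetical.

```python
# Mean accuracies copied from Table 1, ordered by exit layer: 2, 4, 6, 12.
TABLE_1 = {
    "SST-2": {"Joint": [83.8, 87.3, 89.9, 91.9], "Oracle": [85.1, 87.7, 90.6, 92.5]},
    "QQP": {"Joint": [82.8, 86.6, 90.1, 91.1], "Oracle": [85.1, 89.0, 89.9, 91.0]},
    "QNLI": {"Joint": [63.0, 85.1, 87.2, 90.9], "Oracle": [73.1, 85.6, 87.1, 90.0]},
    "MNLI": {"Joint": [69.1, 76.7, 80.6, 84.2], "Oracle": [71.1, 76.0, 78.4, 83.8]},
}


def layer2_gap(dataset):
    """Return (absolute gap in points, relative gap in %) at the layer-2 exit."""
    joint = TABLE_1[dataset]["Joint"][0]
    oracle = TABLE_1[dataset]["Oracle"][0]
    absolute = oracle - joint
    return absolute, 100.0 * absolute / joint


gaps = {name: layer2_gap(name) for name in TABLE_1}
for name, (absolute, relative) in gaps.items():
    print(f"{name}: +{absolute:.1f} points ({relative:.1f}% relative)")
```

Under this reading, the layer-2 gap is positive on every dataset in the table and largest on QNLI.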

3.2. Multi-Exit Model Training Strategies

To improve exit performance, several gradient-conflict-free training strategies have been proposed, as shown in Figure 4. Branch-wise (Huang et al., 2017) trains each exit sequentially from shallow to deep, freezing parameters shared with previous branches and updating only branch-specific parameters and the classifier. Separate (Lattanzi et al., 2023) is similar but does not freeze shared parameters, treating each exit as an independent submodel. While these methods are effective for training conventional deep networks from scratch, they disrupt the coordination among pretrained parameters in transformer-based models, altering intermediate feature distributions and fragmenting pretrained knowledge.

Figure 4.

Multi-exit training strategies.

Experimental results in Table 2 show that branch-wise achieves near-Oracle performance at the shallowest exit, as the input embeddings maintain their original distribution. However, it suffers substantial accuracy degradation in deeper exits, with minimal improvement as model depth increases; for SST-2 and QQP, accuracy plateaus between exit layers 4 and 6. Separate shows similar trends and suffers from catastrophic forgetting: for SST-2, the sixth exit reaches $85.8$ during training but drops to $51.3$ at the end; QQP exhibits the same issue.

Table 2.

Accuracy of Existing Training Strategies on Various Datasets.

		Accuracy
Dataset	Method	Exit Layer2	Exit Layer4	Exit Layer6	Exit Layer12
SST-2	Branch-wise	84.9 $_{0.7}$	85.0 $_{0.9}$	84.9 $_{0.6}$	88.6 $_{0.4}$
	Separate	84.7 $_{1.1}$	85.4 $_{0.6}$	51.3 $_{0.5}$	85.4 $_{0.3}$
	Two-stage	72.8 $_{0.2}$	77.2 $_{0.7}$	81.8 $_{0.6}$	92.4 $_{0.6}$
	Oracle	85.1 $_{0.7}$	87.7 $_{0.2}$	90.6 $_{0.4}$	92.5 $_{0.4}$
QQP	Branch-wise	84.8 $_{1.0}$	86.4 $_{1.1}$	86.5 $_{1.4}$	86.6 $_{1.0}$
	Separate	83.9 $_{0.1}$	49.4 $_{0.4}$	71.3 $_{0.8}$	88.8 $_{0.6}$
	Two-stage	70.0 $_{1.0}$	77.9 $_{0.9}$	82.7 $_{1.0}$	91.0 $_{0.5}$
	Oracle	85.1 $_{0.2}$	89.0 $_{0.2}$	89.9 $_{0.4}$	91.0 $_{0.7}$
QNLI	Branch-wise	69.4 $_{1.3}$	82.9 $_{1.0}$	83.6 $_{0.4}$	88.2 $_{0.7}$
	Separate	66.8 $_{0.9}$	84.2 $_{0.6}$	84.4 $_{0.1}$	84.4 $_{0.1}$
	Two-stage	59.6 $_{0.8}$	80.2 $_{0.3}$	85.0 $_{1.5}$	90.7 $_{1.0}$
	Oracle	73.1 $_{0.7}$	85.6 $_{0.4}$	87.1 $_{0.2}$	90.0 $_{0.5}$
MRPC	Branch-wise	69.9 $_{0.7}$	71.8 $_{1.7}$	73.3 $_{0.4}$	82.8 $_{0.9}$
	Separate	70.6 $_{0.4}$	74.5 $_{0.8}$	74.8 $_{0.8}$	75.0 $_{0.1}$
	Two-stage	68.4 $_{1.1}$	69.9 $_{0.9}$	73.3 $_{0.7}$	86.0 $_{0.7}$
	Oracle	71.3 $_{0.2}$	77.2 $_{0.5}$	82.1 $_{0.7}$	85.8 $_{0.4}$

Two-stage training (Xin et al., 2020, 2021) splits the process into two steps: first, the backbone transformer model is trained using pretrained parameters; then, intermediate exits are added, the backbone and original classifier are frozen, and only the new exits are trained. This approach effectively avoids gradient conflicts while leveraging pretrained knowledge and is widely adopted in multi-exit pretrained transformer architectures. However, as the exit modules in transformers typically consist of only a pooler and a linear layer, the shallow exits have limited representational capacity, resulting in accuracies lower than Oracle and sometimes even below Separate, as shown in Table 2.

The key challenge addressed in this work is how to train multi-exit transformer models without gradient conflicts, while maintaining pretrained parameter integrity and ensuring sufficient capacity in shallow exit classifiers.

4. A Multi-Exit Training Strategy for Pretrained Transformer Models Based on Separate Reverse

To address gradient conflicts, this section proposes a multi-exit training strategy for pretrained transformer models, integrating the advantages of the two-stage and separate strategies. The training proceeds iteratively: the base model is first fine-tuned to preserve pretrained parameters, after which exit classifiers are initialized and trained from deep to shallow at predefined layers. In this paper, an exit refers to a lightweight classifier attached to an intermediate layer, and the terms “exit” and “exit classifier” are used interchangeably. During each stage, the model from the previous iteration serves as a teacher, and layer-wise knowledge distillation guides updates to mitigate catastrophic forgetting. Unlike conventional separate training from shallow to deep, our method reverses the order and is thus termed Separate Reverse.

4.1. Model Fine-Tuning and Exit Configuration

To maintain parameter integrity, we first fine-tune the original pretrained transformer model, as in the two-stage strategy, ensuring optimal performance at the main exit. The main exit output is defined as:

{\hat{y}}_{H} = Exit (h_{L}; w_{H})

(6)

where $H$ denotes the number of exits, $w_{H}$ is the parameter set of the main exit, and $L$ represents the number of model layers. For classification tasks, the training objective is:

\begin{aligned} L_{H}^{CE} & = CrossEntropy (y, {\hat{y}}_{H}) \\ & min_{θ^{Emb}, θ_{1}, \dots, θ_{L}, w_{H}} L_{H}^{CE} \end{aligned}

(7)

After fine-tuning, multiple exit classifiers are inserted into intermediate layers, transforming the model into a multi-exit transformer structure. Each exit shares the same architecture as the main classifier, consisting of a pooling layer followed by a linear projection. In this work, three shallow exits are added at layers 2, 4, and 6, as illustrated in Figure 1.

4.2. Hierarchical Knowledge Distillation

To ensure the performance of intermediate transformer classifiers, we follow a separate training strategy, jointly training each classifier with its corresponding model layers while keeping the embedding layer frozen, as shown in Figure 5. Independent training ensures that gradients only affect the relevant layers, allowing shallow layers to focus on feature extraction without interference from multiple gradients. Compared to the two-stage strategy that trains only classifier parameters, this approach better enhances the capacity of shallow exits. However, separate training suffers from catastrophic forgetting, and after training the main exit, further training of intermediate exits can overwrite shallow layers, degrading the main exit’s accuracy.

Figure 5.

Illustration of Separate Reverse training.

Figure 6.

Speed-up ratios of multi-exit BERT under different thresholds.

Figure 7.

Proportion of samples exiting at each layer (threshold $=$ 0.6).

Figure 8.

Speed-up curves of multi-exit BERT under different training strategies.

To address this, hierarchical knowledge distillation is employed, using the model itself as a teacher and training the exit classifiers from deep to shallow. First, the main exit is trained, after which the model parameters contain only the knowledge learned from the main exit. Then, the model is copied as a teacher, and during training of the next shallow exit, the teacher supervises shallow layer updates to prevent forgetting main exit knowledge. Next, after training, the model parameters contain knowledge from both the main and current exit. The old teacher is discarded, and the current model is copied as the new teacher to supervise the following exit. Finally, this process is repeated for all intermediate exits from deep to shallow, ensuring that knowledge learned from each exit is preserved, avoiding catastrophic forgetting, improving shallow exit performance, and maintaining the accuracy of deeper exits.
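The deep-to-shallow ordering described above can be laid out as a simple schedule. The following is only a planning sketch under our own assumptions, not the authors' Algorithm 1: it lists which exit is trained at each stage, which encoder layers receive updates, and which earlier snapshot plays the teacher. The function name and the snapshot labels are hypothetical.

```python
def separate_reverse_schedule(exit_layers, num_layers):
    """List the training stages from the main exit down to the shallowest exit.

    exit_layers: layers holding intermediate exits, e.g. [2, 4, 6].
    num_layers:  depth L of the backbone; the main exit sits at layer L.
    Each stage records the encoder layers that are updated and the earlier
    snapshot used as teacher (None for the initial fine-tuning stage).
    """
    stages = [{
        "stage": 0,
        "exit_layer": num_layers,
        # Stage 0 is ordinary fine-tuning of the whole model (it also
        # updates the embedding layer, which is frozen afterwards).
        "trainable_layers": list(range(1, num_layers + 1)),
        "teacher": None,
    }]
    for k, layer in enumerate(sorted(exit_layers, reverse=True), start=1):
        stages.append({
            "stage": k,
            "exit_layer": layer,
            # Only the encoder layers below the exit receive gradients.
            "trainable_layers": list(range(1, layer + 1)),
            # Frozen copy of the model as it stood after the previous stage.
            "teacher": f"snapshot_after_stage_{k - 1}",
        })
    return stages


schedule = separate_reverse_schedule([2, 4, 6], num_layers=12)
for s in schedule:
    print(s["stage"], s["exit_layer"], s["teacher"])
```

In a real implementation the teacher would be a frozen copy of the model (for example via `copy.deepcopy`) taken at the end of each stage, with the previous copy discarded.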

Formally, during training of the $j$-th exit classifier, the same samples are propagated through both the student and teacher models, producing:

\begin{aligned} h_{i}^{j} & = Encoder (h_{i - 1}^{j}; θ_{i}^{j}) \\ h_{i}^{j + 1} & = Encoder (h_{i - 1}^{j + 1}; θ_{i}^{j + 1}) \\ \text{s.t.} \quad & i \in \{1, 2, \dots, p_{j}\}, \ j \in \{1, 2, \dots, H - 1\} \end{aligned}

(8)

where $θ_{i}^{j + 1}$ denotes the $i$-th layer parameters of the teacher model obtained after training the $(j + 1)$-th exit, $θ_{i}^{j}$ denotes the corresponding parameters of the student model, $p_{j}$ indicates the layer position of the $j$-th exit with $p_{j} \in [1, L - 1]$, and $H$ is the total number of exit classifiers. Since the embedding parameters are fixed, we have:

h_{0}^{j} = h_{0}^{j + 1} = h_{0} = Embedding (x; θ^{Emb})

(9)

The distillation loss for the $j$-th exit is defined as:

L_{j}^{distil} = \sum_{i = 1}^{p_{j}} MSE (h_{i}^{j}, h_{i}^{j + 1})

(10)

Considering that intermediate exits have limited capacity, forcing them to fully learn the sample distribution may negatively affect the overall model performance. We expect the performance of intermediate exits to be proportional to the number of model layers they contain. Accordingly, we define an exit capacity coefficient $δ_{j} = p_{j} / L$ and weight the classification loss as:

L_{j}^{CE} = δ_{j} CrossEntropy (y, {\hat{y}}_{j})

(11)

where ${\hat{y}}_{j}$ denotes the output of the $j$-th exit classifier. Consequently, the training objective of the $j$-th exit is formulated as:

min_{θ_{1}, \dots, θ_{p_{j}}, w_{j}} α L_{j}^{CE} + (1 - α) L_{j}^{distil}

(12)

where $α$ balances the performance between shallow and deep exits. Algorithm 1 illustrates the training procedure of the Separate Reverse strategy applied to pretrained transformer models.
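To make Equations (10) to (12) concrete, the per-exit objective can be computed on toy values as below. This is a minimal sketch, not the authors' implementation: hidden states are plain lists of floats, the MSE is assumed to be averaged over elements, the cross-entropy value is passed in as a precomputed number, and all function names are hypothetical.

```python
def mse(u, v):
    """Mean squared error between two equally sized vectors (element-wise mean)."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(student_hidden, teacher_hidden, exit_layer):
    """Eq. (10): sum of per-layer MSE over encoder layers 1..p_j.

    student_hidden / teacher_hidden: lists where index i-1 holds the
    flattened hidden state of encoder layer i.
    """
    return sum(
        mse(student_hidden[i], teacher_hidden[i]) for i in range(exit_layer)
    )


def exit_objective(ce_loss, student_hidden, teacher_hidden,
                   exit_layer, num_layers, alpha=0.5):
    """Eq. (11)-(12): alpha * delta_j * CE + (1 - alpha) * distillation."""
    delta = exit_layer / num_layers      # exit capacity coefficient p_j / L
    weighted_ce = delta * ce_loss        # Eq. (11)
    distil = distillation_loss(student_hidden, teacher_hidden, exit_layer)
    return alpha * weighted_ce + (1.0 - alpha) * distil


# Toy example: exit at layer 2 of a 12-layer model, 2-dimensional hidden states.
student = [[1.0, 2.0], [0.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 2.0]]
loss = exit_objective(ce_loss=0.6, student_hidden=student,
                      teacher_hidden=teacher, exit_layer=2, num_layers=12)
```

With these toy numbers the capacity coefficient is 2/12, so the cross-entropy term is scaled down strongly for the shallow exit while the distillation term keeps the updated layers close to the teacher.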

5. Evaluation

This section presents the experimental analysis of the proposed Separate Reverse training strategy for pretrained transformer models. We first introduce the datasets and evaluation metrics, followed by a description of the baselines and experimental settings. Finally, we report the results of applying the Separate Reverse strategy to the BERT model, compare them with existing studies, and further investigate the contributions of individual components through ablation studies and analysis of the performance balance coefficient.

5.1. Experimental Setup

5.1.1. Dataset and Metrics

We evaluate the performance of the BERT model on classification tasks, selecting six representative classification datasets from the GLUE Benchmark. Table 3 provides the statistical information for these datasets. Specifically:

Table 3.
Dataset Statistics.

Dataset	Labels	Train/Dev
RTE	2	2.5k/0.3k
SST-2	2	67k/0.9k
MRPC	2	3.7k/0.4k
QQP	2	364k/40k
MNLI	3	393k/9.8k
QNLI	2	105k/5.5k

Note. RTE = recognizing textual entailment.

RTE: Evaluates the model’s ability to understand entailment relationships between two text sequences.

SST-2: Tests sentiment classification, requiring the model to determine whether a given text expresses a positive or negative sentiment.

MRPC: Assesses whether two sentences have the same meaning, that is, whether they are paraphrases.

QQP: Determines whether two questions are semantically equivalent, that is, whether they ask the same thing.

MNLI: Judges whether a hypothesis is entailed by, contradicts, or is neutral with respect to a given premise.

QNLI: Evaluates the model’s ability to locate answers within a passage, reflecting performance on question-answering tasks.

To comprehensively assess both the prediction accuracy and inference efficiency of the multi-exit transformer models, two evaluation metrics are employed: Accuracy and Speed-up Ratio. Their definitions are as follows:

Accuracy: Since this work focuses on classification tasks, accuracy is adopted as the primary evaluation metric. It represents the proportion of correctly classified text sequences over the total number of samples, formulated as:

Acc = \frac{\sum_{i = 1}^{n} I ({\hat{y}}_{i} = y_{i})}{n}

(13)

where $n$ denotes the total number of samples, $I(\cdot)$ is the indicator function, ${\hat{y}}_{i}$ is the predicted label for the $i$-th sample, and $y_{i}$ is its corresponding ground truth label.

Speed-up Ratio: To evaluate the inference acceleration of the multi-exit transformer models, the speed-up ratio is used. It measures the ratio between the total number of layers that all samples would traverse in the full model and the actual number of layers traversed during inference, expressed as:

Speed-up Ratio = \frac{n L}{\sum_{i = 1}^{n} L_{i}}

(14)

where $L$ denotes the total number of model layers, and $L_{i}$ represents the number of layers traversed by the $i$-th sample before exiting from an intermediate classifier. This metric quantifies the computational efficiency gained during inference.
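The two metrics, together with the confidence-based exit rule of Equation (3), can be sketched on toy data as follows. This is an illustrative example under stated assumptions: the per-exit confidences are made-up values, the threshold of 0.6 is chosen only for illustration, and the function names are hypothetical.

```python
def exit_layer_for_sample(confidences, exit_layers, threshold):
    """Return the layer at which a sample leaves the model (rule of Eq. 3).

    confidences: confidence at each exit, ordered shallow to deep.
    exit_layers: layer index of each exit, ordered shallow to deep; the last
    entry is the main exit, which always accepts the sample.
    """
    for conf, layer in zip(confidences[:-1], exit_layers[:-1]):
        if conf > threshold:
            return layer
    return exit_layers[-1]


def accuracy(predictions, labels):
    """Eq. (13): fraction of correctly classified samples."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def speed_up_ratio(layers_used, num_layers):
    """Eq. (14): n * L divided by the total number of layers traversed."""
    return len(layers_used) * num_layers / sum(layers_used)


exit_layers = [2, 4, 6, 12]
# Hypothetical per-exit confidences for three samples.
batch = [
    [0.95, 0.97, 0.99, 0.99],  # confident at the first exit
    [0.55, 0.70, 0.90, 0.95],  # leaves at layer 4
    [0.40, 0.45, 0.50, 0.80],  # falls through to the main exit
]
layers_used = [exit_layer_for_sample(c, exit_layers, 0.6) for c in batch]
ratio = speed_up_ratio(layers_used, num_layers=12)
```

As a sanity check on the definition, if every sample left at layer 2 of a 12-layer model, the ratio would be 12 / 2 = 6.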

5.1.2. Baselines and Experimental Settings

To evaluate the effectiveness of the proposed Separate Reverse strategy, we compare it against a range of representative multi-exit BERT training methods, including both classical multi-exit strategies and recent BERT-specific adaptations. Although our experiments are conducted on BERT with GLUE tasks, the proposed training framework does not rely on BERT-specific components and can be readily extended to other transformer-based models with intermediate exits. The comparative baselines are summarized below:

Oracle: Independently trains each shallow BERT model, establishing the upper performance bound for the exit classifiers.

Branch-wise (Huang et al., 2017): Sequentially trains exit classifiers from shallow to deep while freezing shared parameters. This approach limits exit capacity and disrupts pretrained weights.

Separate (Lattanzi et al., 2023): Similar to branch-wise, but jointly updates all parameters used by each exit without freezing shared layers.

Joint (Liu et al., 2023): Optimizes all exits simultaneously by summing their weighted losses, though it is susceptible to gradient conflicts.

Two-stage (Xin et al., 2020): A common multi-exit BERT strategy that first fine-tunes the base BERT, then freezes the backbone layers and trains only the exit classifiers. This avoids conflicts but reduces exit capacity.

Joint-alt (Xin et al., 2021): Alternates between joint and two-stage training: odd iterations update the backbone and main exit, while even iterations update all exits.

Joint-Boost (Yu et al., 2022): Employs gradient boosting where each exit complements previous outputs. It scales backpropagated gradients to reduce deep-to-shallow interference.

SWEET (Rotem et al., 2023): Combines the strengths of Oracle and Joint by truncating each exit’s gradient propagation to prevent cross-exit interference.

RomeBERT (Geng et al., 2021): Utilizes joint training with self-distillation to enhance shallow exits and applies gradient regularization to mitigate conflicting gradient directions.

In our experiments, the multi-exit BERT model is constructed by placing exit classifiers at the second, fourth, and sixth layers of the BERT backbone. All experiments are conducted on an NVIDIA GeForce RTX 2080 Ti GPU server, with training performed on the Train Set and validation on the Dev Set. We adopt the Adam optimizer with a linear learning rate scheduler. To ensure fairness, all compared methods use identical hyperparameter settings, the pretrained BERT weights are obtained from Hugging Face, and the balance factor $α$ is consistently fixed at $0.5$ for all datasets. Detailed configurations are provided in Table 4.

Table 4.
Training Parameters for Different Datasets.

				Adam
Dataset	Epoch	Learning rate	Batch size	Beta	Epsilon
RTE	3.0	2 $\times 10^{- 5}$	16	(0.9, 0.999)	1 $\times 10^{- 8}$
SST-2	3.0	2 $\times 10^{- 5}$	32	(0.9, 0.999)	1 $\times 10^{- 8}$
MRPC	5.0	2 $\times 10^{- 5}$	16	(0.9, 0.999)	1 $\times 10^{- 8}$
QQP	3.0	2 $\times 10^{- 5}$	32	(0.9, 0.999)	1 $\times 10^{- 8}$
MNLI	3.0	2 $\times 10^{- 5}$	32	(0.9, 0.999)	1 $\times 10^{- 8}$
QNLI	3.0	2 $\times 10^{- 5}$	32	(0.9, 0.999)	1 $\times 10^{- 8}$

Note. RTE = recognizing textual entailment.

5.2. Results and Analysis

The multi-exit BERT models trained with different strategies are evaluated, and the accuracy of each exit is summarized in Tables 5 to 8. The bold values indicate the best accuracy for each exit on a given dataset, or the highest accuracy excluding the Oracle strategy.

Table 5.
Accuracy of BERT Model’s Layer 2 Exit Under Different Training Strategies.

SpeedupRatio	Method	RTE	SST-2	MRPC	QQP	MNLI	QNLI
	Oracle (BERT-2L)	54.2 $_{0.5}$	85.1 $_{0.7}$	71.3 $_{0.2}$	85.1 $_{0.2}$	72.3 $_{0.3}$	73.1 $_{0.7}$
	Branch-wise	53.5 $_{0.8}$	84.9 $_{0.7}$	69.9 $_{0.7}$	84.8 $_{1.0}$	72.0 $_{0.9}$	69.4 $_{1.3}$
	Separate	55.7 $_{0.2}$	84.7 $_{1.1}$	70.6 $_{0.4}$	83.9 $_{0.1}$	70.6 $_{0.8}$	66.8 $_{0.9}$
	Joint	54.5 $_{1.2}$	83.8 $_{0.3}$	71.1 $_{1.3}$	82.8 $_{1.6}$	69.1 $_{0.2}$	63.0 $_{1.1}$
$6 \times$	Two-stage	50.9 $_{0.5}$	72.8 $_{0.2}$	68.4 $_{1.1}$	70.0 $_{1.0}$	42.7 $_{1.1}$	59.6 $_{0.8}$
	Joint-alt	53.8 $_{1.1}$	82.8 $_{0.7}$	69.4 $_{0.9}$	79.2 $_{0.8}$	64.3 $_{0.8}$	62.5 $_{1.7}$
	Joint-Boost	53.1 $_{0.6}$	84.6 $_{0.3}$	69.6 $_{1.1}$	85.1 $_{1.2}$	71.5 $_{1.4}$	70.2 $_{0.5}$
	SWEET	53.1 $_{0.3}$	84.9 $_{0.9}$	71.3 $_{0.1}$	85.0 $_{0.8}$	72.0 $_{1.5}$	71.4 $_{0.3}$
	RomeBERT	55.2 $_{0.4}$	84.9 $_{0.6}$	69.6 $_{1.6}$	85.0 $_{1.1}$	71.2 $_{0.7}$	65.0 $_{1.2}$
	Separate Reverse	56.1 $_{0.7}$	85.0 $_{0.3}$	72.4 $_{1.3}$	85.1 $_{0.8}$	72.2 $_{1.1}$	75.5 $_{0.5}$

Note. RTE = recognizing textual entailment.

Table 6.

Accuracy of BERT Model’s Layer 4 Exit Under Different Training Strategies.

SpeedupRatio	Method	RTE	SST-2	MRPC	QQP	MNLI	QNLI
	Oracle (BERT-4L)	64.3 $_{0.4}$	87.7 $_{0.2}$	77.2 $_{0.5}$	89.0 $_{0.2}$	77.3 $_{0.1}$	85.6 $_{0.4}$
	Branch-wise	62.8 $_{1.0}$	85.0 $_{0.9}$	71.8 $_{1.7}$	86.4 $_{1.1}$	73.4 $_{0.7}$	82.9 $_{1.0}$
	Separate	61.1 $_{0.5}$	85.4 $_{0.6}$	74.5 $_{0.8}$	49.4 $_{0.4}$	75.7 $_{0.8}$	84.2 $_{0.6}$
	Joint	62.8 $_{0.9}$	87.3 $_{0.7}$	76.0 $_{0.2}$	88.6 $_{0.9}$	76.7 $_{0.8}$	85.1 $_{0.7}$
$3 \times$	Two-stage	55.6 $_{0.9}$	77.2 $_{0.7}$	69.9 $_{0.9}$	77.9 $_{0.9}$	55.7 $_{1.7}$	80.2 $_{0.3}$
	Joint-alt	62.1 $_{0.8}$	85.9 $_{1.3}$	72.3 $_{0.7}$	87.7 $_{1.7}$	74.8 $_{0.6}$	84.1 $_{1.4}$
	Joint-Boost	62.8 $_{0.1}$	88.1 $_{0.2}$	75.0 $_{0.1}$	88.1 $_{1.2}$	76.4 $_{1.5}$	85.4 $_{1.1}$
	SWEET	62.8 $_{0.1}$	87.3 $_{0.2}$	75.0 $_{0.5}$	87.1 $_{1.3}$	74.5 $_{0.8}$	84.1 $_{0.4}$
	RomeBERT	61.7 $_{0.3}$	88.2 $_{0.7}$	75.5 $_{1.7}$	88.5 $_{1.3}$	77.8 $_{0.7}$	85.6 $_{0.6}$
	Separate Reverse	63.2 $_{0.7}$	88.5 $_{0.4}$	74.8 $_{0.6}$	89.8 $_{1.0}$	77.9 $_{0.5}$	84.8 $_{1.1}$

Note. RTE = recognizing textual entailment.

Table 7.

Accuracy of BERT Model’s Layer 6 Exit Under Different Training Strategies.

SpeedupRatio	Method	RTE	SST-2	MRPC	QQP	MNLI	QNLI
	Oracle (BERT-6L)	66.4 $_{0.2}$	90.6 $_{0.4}$	82.1 $_{0.7}$	89.9 $_{0.4}$	81.0 $_{0.2}$	87.1 $_{0.2}$
	Branch-wise	63.2 $_{0.7}$	84.9 $_{0.6}$	73.3 $_{0.4}$	86.5 $_{1.4}$	73.7 $_{1.4}$	83.6 $_{0.4}$
	Separate	62.8 $_{0.3}$	51.3 $_{0.5}$	74.8 $_{0.8}$	71.3 $_{0.8}$	75.9 $_{1.5}$	84.4 $_{0.1}$
	Joint	63.2 $_{1.3}$	89.9 $_{1.1}$	78.4 $_{0.6}$	90.1 $_{1.2}$	80.6 $_{0.5}$	87.2 $_{0.7}$
$2 \times$	Two-stage	59.9 $_{0.2}$	81.8 $_{0.6}$	73.3 $_{0.7}$	82.7 $_{1.0}$	69.9 $_{1.5}$	85.0 $_{1.5}$
	Joint-alt	62.1 $_{1.0}$	88.9 $_{0.7}$	80.6 $_{0.6}$	89.3 $_{1.4}$	79.3 $_{1.7}$	86.5 $_{0.9}$
	Joint-Boost	63.5 $_{0.3}$	89.3 $_{0.3}$	76.5 $_{0.6}$	89.3 $_{1.6}$	79.9 $_{0.3}$	87.5 $_{1.1}$
	SWEET	62.8 $_{0.5}$	90.1 $_{0.4}$	75.2 $_{0.8}$	87.9 $_{1.5}$	76.8 $_{0.3}$	85.7 $_{0.7}$
	RomeBERT	63.1 $_{0.7}$	88.8 $_{0.4}$	80.6 $_{0.7}$	89.7 $_{0.3}$	81.0 $_{0.1}$	87.3 $_{1.2}$
	Separate Reverse	65.7 $_{0.3}$	88.6 $_{0.3}$	81.1 $_{1.1}$	90.1 $_{0.9}$	79.5 $_{0.9}$	88.3 $_{0.6}$

Note. RTE = recognizing textual entailment.

Table 8.

Accuracy of BERT Model’s Layer 12 Exit Under Different Training Strategies.

SpeedupRatio	Method	RTE	SST-2	MRPC	QQP	MNLI	QNLI
	Oracle (BERT-12L)	64.3 $_{0.5}$	92.5 $_{0.4}$	85.8 $_{0.4}$	91.0 $_{0.7}$	84.4 $_{0.3}$	90.0 $_{0.5}$
	Branch-wise	64.3 $_{0.7}$	88.6 $_{0.4}$	82.8 $_{0.9}$	86.6 $_{1.0}$	77.9 $_{0.1}$	88.2 $_{0.7}$
	Separate	65.0 $_{0.8}$	85.4 $_{0.3}$	75.0 $_{0.1}$	88.8 $_{0.6}$	75.9 $_{0.6}$	84.4 $_{0.1}$
	Joint	63.5 $_{0.4}$	91.9 $_{0.5}$	83.8 $_{0.1}$	91.1 $_{1.1}$	84.2 $_{1.3}$	90.9 $_{0.8}$
$1 \times$	Two-stage	65.3 $_{0.4}$	92.4 $_{0.6}$	86.0 $_{0.7}$	91.0 $_{0.5}$	84.5 $_{0.5}$	90.7 $_{1.0}$
	Joint-alt	64.6 $_{0.9}$	91.9 $_{0.5}$	83.3 $_{0.7}$	92.0 $_{1.2}$	84.4 $_{1.7}$	89.5 $_{1.3}$
	Joint-Boost	64.3 $_{0.5}$	92.1 $_{0.7}$	85.0 $_{0.5}$	90.7 $_{0.3}$	84.3 $_{1.0}$	89.6 $_{1.1}$
	SWEET	65.0 $_{0.8}$	90.7 $_{0.2}$	84.6 $_{0.6}$	89.5 $_{1.3}$	81.9 $_{0.9}$	88.8 $_{0.8}$
	RomeBERT	66.5 $_{0.4}$	91.4 $_{0.9}$	82.1 $_{1.2}$	90.9 $_{1.5}$	84.4 $_{1.0}$	90.2 $_{0.9}$
	Separate Reverse	66.1 $_{0.2}$	92.4 $_{0.3}$	84.6 $_{0.6}$	90.1 $_{0.6}$	83.1 $_{1.0}$	91.0 $_{1.0}$

Note. RTE = recognizing textual entailment.

As shown in Table 5, on every dataset the Separate Reverse training strategy achieves either the highest accuracy at the shallowest exit or the highest among the non-Oracle methods. This suggests that Separate Reverse effectively mitigates gradient conflicts and enables shallow exits to approach the performance of the Oracle model. Moreover, Separate Reverse allows shallow layers to retain useful high-level semantic knowledge from deeper layers, and on RTE, MRPC, and QNLI it even surpasses the Oracle model’s accuracy at this exit. Compared to the strongest non-Oracle baseline on each dataset, Separate Reverse matches or exceeds it at the second-layer exit, with gains ranging from marginal (0.1 points on SST-2) to over 4 percentage points (QNLI).
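The comparison against the strongest non-Oracle baseline can be reproduced from the mean accuracies in Table 5. The short Python sketch below is only a tabulation aid: it ignores the subscripted values reported next to each mean, and the variable names are hypothetical.

```python
DATASETS = ["RTE", "SST-2", "MRPC", "QQP", "MNLI", "QNLI"]

# Layer-2 mean accuracies of the non-Oracle baselines, transcribed from Table 5.
BASELINES = {
    "Branch-wise": [53.5, 84.9, 69.9, 84.8, 72.0, 69.4],
    "Separate":    [55.7, 84.7, 70.6, 83.9, 70.6, 66.8],
    "Joint":       [54.5, 83.8, 71.1, 82.8, 69.1, 63.0],
    "Two-stage":   [50.9, 72.8, 68.4, 70.0, 42.7, 59.6],
    "Joint-alt":   [53.8, 82.8, 69.4, 79.2, 64.3, 62.5],
    "Joint-Boost": [53.1, 84.6, 69.6, 85.1, 71.5, 70.2],
    "SWEET":       [53.1, 84.9, 71.3, 85.0, 72.0, 71.4],
    "RomeBERT":    [55.2, 84.9, 69.6, 85.0, 71.2, 65.0],
}
SEPARATE_REVERSE = [56.1, 85.0, 72.4, 85.1, 72.2, 75.5]

gains = {}
for i, name in enumerate(DATASETS):
    # Strongest competing mean accuracy on this dataset.
    best_baseline = max(scores[i] for scores in BASELINES.values())
    gains[name] = round(SEPARATE_REVERSE[i] - best_baseline, 1)

for name, gain in gains.items():
    print(f"{name:6s} {gain:+.1f}")
```

Under these numbers the margin is zero on QQP (a tie with Joint-Boost) and largest on QNLI.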

As shown in Tables 6 to 8, Separate Reverse also outperforms most baselines at the deeper exits, including the fourth- and sixth-layer exits with $3 \times$ and $2 \times$ speed-up ratios, respectively. Although training exit classifiers from deep to shallow may weaken the performance of deeper exits, the hierarchical knowledge distillation helps preserve feature extraction capabilities across layers and prevents the catastrophic forgetting observed with the Separate training strategy. With the hyperparameter $α = 0.5$, the accuracy gap between Separate Reverse and the best competing strategy (excluding Oracle) is less than 1.6, 1.9, and 2.1 percentage points at the 4th-, 6th-, and 12th-layer exits, respectively. These results indicate that Separate Reverse achieves a strong tradeoff: substantial gains at shallow exits while keeping main-exit degradation small (mostly under 2 percentage points) compared to stronger baselines such as RomeBERT and Joint.

To further evaluate the inference acceleration of Multi-exit BERT trained with Separate Reverse, we use the maximum logit at each exit as the confidence score. A sample exits once its maximum logit exceeds a predefined threshold. The threshold is varied from 0 to 1 with a step size of 0.005, and the corresponding speed-up ratios are measured. As shown in Figure 6, when the threshold is very small, nearly all samples satisfy the confidence requirement at the shallowest exit, leading to the highest speed-up ratios for all strategies. As the threshold increases, the exiting criterion becomes stricter, forcing more samples to pass through deeper layers, and the speed-up ratio gradually decreases. Compared with joint optimization strategies, Separate Reverse trains each exit individually in a deep-to-shallow manner, which strengthens intermediate classifiers and enables more samples to exit early under higher thresholds. Consequently, it maintains higher speed-up ratios, particularly in the medium-to-high threshold regime. Figure 7 further illustrates the proportion of samples exiting at each layer when the threshold is 0.6, where Separate Reverse yields a larger share of shallow exits. Figure 8 presents the accuracy under different speed-up ratios. Separate Reverse achieves a more favorable speedup-accuracy trade-off: under the same acceleration constraint, it consistently attains higher accuracy than competing strategies. This improvement stems from reduced gradient interference among exits, which enhances shallow-exit reliability, while hierarchical knowledge distillation preserves the performance of deeper exits. As a result, Separate Reverse maintains competitive performance even at low speed-up ratios.
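The threshold-based exiting rule described above can be sketched in a few lines of Python. Because the threshold ranges over [0, 1], the sketch assumes the confidence score is the maximum softmax probability at each exit; it also assumes the speed-up ratio is the number of layers in a full pass divided by the number of layers actually executed, which agrees with the $6 \times$, $3 \times$, $2 \times$, and $1 \times$ ratios listed in Tables 5 to 8. The function names and the sample logits are hypothetical.

```python
import math

EXIT_LAYERS = (2, 4, 6, 12)  # exit positions used in the experiments
NUM_LAYERS = 12              # depth of the BERT backbone


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit(exit_logits, threshold):
    """Return (exit_layer, predicted_class) for one sample.

    `exit_logits` maps each exit layer to that exit's logits. For brevity the
    logits are precomputed; a real model would stop the forward pass at the
    chosen exit. The last exit always answers.
    """
    for layer in EXIT_LAYERS:
        probs = softmax(exit_logits[layer])
        confidence = max(probs)  # assumed confidence score in [0, 1]
        if confidence > threshold or layer == EXIT_LAYERS[-1]:
            return layer, probs.index(confidence)


def speedup_ratio(exit_layers_taken):
    """Layers of full passes divided by layers actually executed (assumed definition)."""
    return NUM_LAYERS * len(exit_layers_taken) / sum(exit_layers_taken)


# Hypothetical logits for one two-class sample.
sample = {2: [0.1, 0.2], 4: [2.0, -1.0], 6: [3.0, -3.0], 12: [5.0, -5.0]}
print(early_exit(sample, threshold=0.6))   # exits at layer 4 with class 0
print(speedup_ratio([2, 2, 4, 12]))        # 48 layers / 20 layers = 2.4
```

Sweeping `threshold` from 0 to 1 and recording `speedup_ratio` over a dataset would yield one curve of the kind reported in Figure 6.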

5.3. Ablation Study

To evaluate the effectiveness of hierarchical knowledge distillation, we remove the distillation step and retrain the model. The ablation results are shown in Table 9. Knowledge distillation helps preserve information learned from the main output during intermediate-layer training, thereby maintaining the performance of deeper exits. When distillation is removed, the accuracy of the main exit drops on every dataset, and in some cases, such as MNLI and QNLI, the sixth-layer exit even outperforms the main one. This indicates that as shallow exits are trained, the model parameters gradually forget deeper knowledge, leading to degraded downstream performance. Introducing hierarchical distillation mitigates this issue by constraining parameter updates under the supervision of the previous iteration, preventing shallow layers from overfitting to intermediate features while forgetting learned semantic knowledge. Overall, according to Table 9, removing distillation lowers main-exit accuracy by 1.1 (RTE) to 14.6 (MNLI, QNLI) percentage points, about 8 points on average across the six datasets, underscoring the critical role of hierarchical knowledge transfer in preventing catastrophic forgetting.

To further examine the effect of the balance coefficient $α$, experiments are conducted with $α = 0.1$ and $α = 0.9$, as shown in Figure 9. When $α$ is small, the proportion of distillation loss increases, slightly improving the accuracy of the main exit by reducing performance degradation from shallow updates. However, the improvement saturates quickly: at $α = 0.5$, the main exit already achieves accuracy close to or even exceeding that of Oracle, so lowering $α$ further provides little benefit while significantly harming intermediate exits, leading to lower accuracy under high speed-up conditions. Conversely, an excessively large $α$ (e.g., 0.9) enhances the performance of high-speed exits but causes a notable decline at the main exit. Thus, a larger $α$ suits inference scenarios dominated by simple inputs, where the goal is to let easy samples exit early with both high accuracy and speed, while accuracy on complex samples is less of a priority.
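The role of the balance coefficient can be made concrete with a small Python sketch. It assumes the per-exit objective has the form $α \cdot L_{CE} + (1 - α) \cdot L_{KD}$, which is consistent with the statement that a smaller $α$ increases the share of the distillation loss; the exact loss is the one defined in the method section. The KL-divergence form of the distillation term, the temperature of 1, and all function names are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Task loss of the exit being trained against the ground-truth label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student), used here as an assumed distillation term."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def combined_loss(student_logits, teacher_logits, label, alpha=0.5):
    """Assumed form: alpha * cross-entropy + (1 - alpha) * distillation."""
    ce = cross_entropy(student_logits, label)
    kd = kl_divergence(teacher_logits, student_logits)
    return alpha * ce + (1.0 - alpha) * kd


# Hypothetical logits: `teacher` stands for the output kept from the previous stage.
student = [1.0, 0.0]
teacher = [3.0, -1.0]
for alpha in (0.1, 0.5, 0.9):
    print(alpha, round(combined_loss(student, teacher, label=0, alpha=alpha), 4))
```

With $α = 0.1$ the distillation term dominates the update, whereas $α = 0.9$ lets the label loss of the exit being trained dominate, matching the trend described above.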

Figure 9.

Comparison of Separate Reverse with different balance coefficients.

Table 9.

Ablation Study Results of Hierarchical Knowledge Distillation.

		Accuracy
Dataset	Method	Exit Layer2	Exit Layer4	Exit Layer6	Exit Layer12
RTE	Separate Reverse	56.1 $_{0.7}$	63.2 $_{0.7}$	65.7 $_{0.3}$	66.1 $_{0.2}$
	w/o Distill	56.6 $_{0.9}$	61.0 $_{0.5}$	66.1 $_{1.1}$	65.0 $_{1.5}$
SST-2	Separate Reverse	85.0 $_{0.3}$	88.5 $_{0.4}$	88.6 $_{0.3}$	92.4 $_{0.3}$
	w/o Distill	85.3 $_{0.8}$	85.4 $_{0.7}$	84.7 $_{0.4}$	89.7 $_{0.8}$
MRPC	Separate Reverse	72.4 $_{1.3}$	74.8 $_{0.6}$	81.1 $_{1.1}$	84.6 $_{0.6}$
	w/o Distill	73.2 $_{0.8}$	77.0 $_{0.7}$	76.4 $_{0.7}$	73.5 $_{0.9}$
QQP	Separate Reverse	85.1 $_{0.8}$	89.8 $_{1.0}$	90.1 $_{0.9}$	90.1 $_{0.6}$
	w/o Distill	85.3 $_{1.1}$	87.7 $_{0.5}$	87.7 $_{0.6}$	86.5 $_{1.3}$
MNLI	Separate Reverse	72.2 $_{1.1}$	77.9 $_{0.5}$	79.5 $_{0.9}$	83.1 $_{1.0}$
	w/o Distill	71.8 $_{0.3}$	74.1 $_{0.8}$	73.0 $_{1.2}$	68.5 $_{0.7}$
QNLI	Separate Reverse	75.5 $_{0.5}$	84.8 $_{1.1}$	88.3 $_{0.6}$	91.0 $_{1.0}$
	w/o Distill	71.3 $_{0.5}$	83.6 $_{1.1}$	82.4 $_{0.9}$	76.4 $_{1.4}$

Note. RTE = recognizing textual entailment.
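The main-exit effect of removing distillation can be read directly from the last column of Table 9. The following Python sketch tabulates the differences; it uses mean accuracies only, and the variable names are hypothetical.

```python
# Layer-12 mean accuracy transcribed from Table 9:
# (Separate Reverse, Separate Reverse without distillation)
MAIN_EXIT = {
    "RTE":   (66.1, 65.0),
    "SST-2": (92.4, 89.7),
    "MRPC":  (84.6, 73.5),
    "QQP":   (90.1, 86.5),
    "MNLI":  (83.1, 68.5),
    "QNLI":  (91.0, 76.4),
}

# Drop in percentage points when the distillation step is removed.
drops = {name: round(full - ablated, 1) for name, (full, ablated) in MAIN_EXIT.items()}
mean_drop = round(sum(drops.values()) / len(drops), 2)

for name, drop in drops.items():
    print(f"{name:6s} {drop:5.1f}")
print(f"mean   {mean_drop:5.2f}")
```

The drop is smallest on RTE and largest on MNLI and QNLI, the two datasets on which the sixth-layer exit overtakes the main exit once distillation is removed.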

Overall, comparing the speed-up curves for $α = 0.1$, $α = 0.5$, and $α = 0.9$ shows that an appropriate balance coefficient enables BERT to not only learn high-level semantic features but also acquire the ability to identify simple samples for early exiting, without significantly compromising the main exit’s performance. With $α = 0.5$, Separate Reverse generally lies on or above other strategies along the accuracy–speedup Pareto front, offering the most favorable efficiency–accuracy balance across a wide range of confidence thresholds.

6. Conclusion

In this paper, we propose a multi-exit training strategy for pretrained transformer models to mitigate gradient conflicts and improve early-exit performance. The strategy combines the strengths of existing methods, maintaining pretrained parameter integrity while balancing optimization across shallow and deep exits. By coordinating gradient updates with hierarchical knowledge distillation, it enhances the accuracy of shallow exits with only minor impact on the main exit. Experimental results on the GLUE benchmark demonstrate that our approach achieves an effective tradeoff between accuracy and inference speed, enabling efficient early exiting for simpler samples.

Footnotes

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the Research on Key Technologies of Efficient Cloud-Edge-End Collaborative Computing for Computer Vision in Electric Power System (No. J2024147).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Aghajanyan, Conneau, Hsu, W.-N., Hambardzumyan, Zhang, Roller, Goyal, Levy, & Zettlemoyer (2023). Scaling laws for generative mixed-modal language models. In Proceedings of the 40th international conference on machine learning (ICML'23) (Vol. 202, pp. 265–279). JMLR.org.

Bajpai, D. J., & Hanawal, M. K. (2024). CeeBERT: Cross-domain inference in early exit BERT. arXiv preprint arXiv:2405.15039.

Cambria & White (2014). Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine, 9(2), 48–57.

Chen, Fan, Huang, & Guddanti, K. P. (2024). Artificial intelligence/machine learning technology in power system applications. Pacific Northwest National Laboratory (PNNL).

Chen, Pan, Ding, & Zhou (2023). EE-LLM: Large-scale training and inference of early-exit large language models with 3D parallelism. arXiv preprint arXiv:2312.04916.

de Barcelos Silva, Gomes, M. M., Da Costa, C. A., Righi, Rosa, Barbosa, J. L. V., Pessin, De Doncker, & Federizzi (2020). Intelligent personal assistants: A systematic literature review. Expert Systems with Applications, 147, 113193.

Devlin, Chang, M.-W., Lee, & Toutanova (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies (NAACL-HLT) (Vol. 1, pp. 4171–4186). Association for Computational Linguistics.

Gao, Liu, Huang, & Hou (2023). PF-BERxiT: Early exiting for BERT with parameter-efficient fine-tuning and flexible early exiting strategy. Neurocomputing, 558, 126690.

Geng, Gao, & Zhang (2021). RomeBERT: Robust training of multi-exit BERT. arXiv preprint arXiv:2101.09755.

Gou, Maybank, S. J., & Tao (2021). Knowledge distillation: A survey. International Journal of Computer Vision, 129(6), 1789–1819.

Hou, Huang, Shang, Jiang, Chen, & Liu (2020). DynaBERT: Dynamic BERT with adaptive width and depth. Advances in Neural Information Processing Systems, 33, 9782–9793.

Huang, Chen, Van Der Maaten, & Weinberger, K. Q. (2017). Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844.

Wang, Chen, & Zhang (2023). Early exit with disentangled representation and equiangular tight frame. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 14128–14142). Association for Computational Linguistics.

Kurtic, Campos, D. F., Nguyen, Frantar, Kurtz, Fineran, Goin, & Alistarh (2022). The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259.

Laskaridis, Kouris, & Lane, N. D. (2021). Adaptive inference through early-exit networks: Design, challenges and directions. In Proceedings of the 5th International workshop on embedded and mobile deep learning (MobiSys ’21) (pp. 1–6). ACM.

Lattanzi, Contoli, & Freschi (2023). Do we need early exit networks in human activity recognition? Engineering Applications of Artificial Intelligence, 121, 106035.

Liao, Couillet, & Mahoney, M. W. (2020). Sparse quantized spectral clustering. arXiv preprint arXiv:2010.01376.

Liu, Hao, Liu, Weng, & Wang, F. L. (2023). OdeBERT: One-stage deep-supervised early-exiting BERT for fast inference in user intent classification. ACM Transactions on Asian and Low-Resource Language Information Processing, 22, 1–18.

Liu, Tao, Feng, & Zhao (2022). Multi-granularity structural knowledge distillation for language model compression. In Proceedings of the 60th annual meeting of the Association for Computational Linguistics (Volume 1: Long papers) (pp. 1001–1011). Association for Computational Linguistics.

Liu, Zhou, Zhao, Wang, & Deng (2020). FastBERT: A self-distilling BERT with adaptive inference time. arXiv preprint arXiv:2004.02178.

Liu, Zhu, & Belkin (2022). Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59, 85–116.

Michel, Levy, & Neubig (2019). Are sixteen heads really better than one? In Proceedings of the 33rd international conference on neural information processing systems (pp. 14037–14047). Curran Associates Inc.

Rahmath, Haseena, Srivastava, Chaurasia, Pacheco, R. G., & Couto, R. S. (2024). Early-exit deep neural network: A comprehensive survey. ACM Computing Surveys, 57(3), 1–37.

Rotem, Hassid, Mamou, & Schwartz (2023). Finding the SWEET spot: Analysis and improvement of adaptive inference in low resource settings. In Proceedings of the 61st annual meeting of the Association for Computational Linguistics (Vol. 1: Long papers, pp. 14836–14851). Association for Computational Linguistics.

Schuster, Fisch, Gupta, Dehghani, Bahri, Tran, V. Q., Tay, & Metzler (2022). Confident adaptive language modeling. arXiv preprint arXiv:2207.07061.

Tang, Wang, Kong, Zhang, Ding, Wang, & Liang (2023). You need multiple exiting: Dynamic early exiting for accelerating unified vision language model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10781–10791). IEEE.

Treviso, Lee, J-U., van Aken, Cao, Ciosici, M. R., Hassid, Heafield, Hooker, Raffel, Martins, P. H., Martins, A. F. T., Forde, J. Z., Milder, Simpson, Slonim, Dodge, Strubell, Balasubramanian, Derczynski, Gurevych, & Schwartz (2023). Efficient methods for natural language processing: A survey. Transactions of the Association for Computational Linguistics, 11, 826–860.

Wang, Singh, Michael, Hill, Levy, & Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP (pp. 353–355). Association for Computational Linguistics.

Xin, Tang, Lee, & Lin (2020). DeeBERT: Dynamic early exiting for accelerating BERT inference. arXiv preprint arXiv:2004.12993.

Xin, Tang, & Lin, J. J. (2021). BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th conference of the European Chapter of the Association for Computational Linguistics (EACL 2021) (pp. 91–104). Association for Computational Linguistics.

Pan, Zhou, Chen, Lian, & Dai (2025). Specee: Accelerating large language model inference with speculative early exiting. In Proceedings of the 52nd annual international symposium on computer architecture (ISCA) (pp. 467–481).

Yin, Jin, Zhang, Wei, & Liu (2023). LLMCAD: Fast and scalable on-device large language model inference. arXiv preprint arXiv:2309.04255.

Guo, Wei, Zhou, & Wang (2025). EdgeMoE: Empowering sparse large language models on mobile devices. arXiv preprint arXiv:2308.14352.

Hua, Huang, & Shi (2022). Boosted dynamic neural networks. arXiv preprint arXiv:2211.16726.

Zhu, Wang, Xie, & Wang (2023). BADGE: Speeding up BERT inference after deployment via block-wise bypasses and divergence-based early exiting. In Proceedings of the 61st annual meeting of the Association for Computational Linguistics (Volume 5: Industry track) (pp. 500–509). Association for Computational Linguistics.

