ArMT-TNN: Enhancing natural language understanding performance through hard parameter multitask learning in Arabic

Abstract

Multitask learning (MTL) is a machine learning paradigm in which a single model is trained to perform several tasks simultaneously. Despite the considerable amount of research on MTL, most of it has centered on English, while other languages such as Arabic have received far less attention. Most existing Arabic NLP techniques concentrate on single-task learning or on multitask learning that shares only two or three tasks. To address this gap, we present ArMT-TNN, an Arabic Multi-Task Learning using Transformer Neural Network, designed for Arabic natural language understanding (ANLU) tasks. Our approach shares learned information between eight ANLU tasks, allowing a single model to solve all of them. We achieve this by fine-tuning all tasks simultaneously, using multiple pre-trained bidirectional Transformer language models, like BERT, that are specifically designed for Arabic language processing. Additionally, we explore the effectiveness of various Arabic language models (LMs) that have been pre-trained on different types of Arabic text, such as Modern Standard Arabic (MSA) and Arabic dialects. Our approach outperformed all current models on four test sets within the ALUE benchmark, namely MQ2Q, OOLD, SVREG, and SEC, by margins of 3.9%, 3.8%, 10.1%, and 3.7%, respectively. Nonetheless, our approach did not perform as well on the remaining tasks due to negative transfer of knowledge. This finding highlights the importance of carefully selecting tasks when constructing a benchmark. Our experiments also show that LMs pretrained on text types that differ from those of the fine-tuning tasks can still perform well.

Keywords

Multitask learning, natural language understanding, Arabic language models, transfer learning, hard parameter sharing

1. Introduction

In recent years, significant progress has been made in the field of artificial intelligence (AI), particularly in Natural Language Processing (NLP), thanks to advancements in deep learning. Deep neural networks aim to optimize millions of parameters gradually to solve complex problems. Transfer learning has emerged as a leading approach in modern AI, involving learning to solve a general problem first and then transferring that knowledge to solve a specific task. In NLP, pre-trained language models that optimize general language tasks, such as Masked Language Model (MLM) and Next Sentence Prediction (NSP), are used to achieve this goal by training them on large amounts of unlabeled data.

The popularity of language models has grown in recent years, particularly with the introduction of Bidirectional Encoder Representations from Transformers (BERT) by Devlin et al. [1]. BERT has set the state of the art in most NLP problems, owing to its ability to capture contextualized meaning and richer representations. It is a pre-trained language model that utilizes transfer learning and has been trained on vast amounts of unlabeled data to optimize the MLM and NSP objectives. BERT can be used for feature extraction or fine-tuning: feature extraction uses the pre-trained BERT parameters without updating them, while fine-tuning updates any number of layers or parameters for a specific task using labeled data.

Traditional machine learning models focus on performing a single task, ignoring the fact that humans learn tasks by building upon related tasks. In NLP, many language tasks may share some degree of relatedness [2]. Multi-task learning (MTL) is a machine learning approach that involves learning multiple tasks simultaneously with the goal of improving generalization across all tasks by utilizing helpful information shared across related tasks [3]. In the context of language models, the general MTL architecture involves sharing layers/parameters, either fully or partially, of the language model across multiple tasks. This enables millions of parameters to receive signals during training, allowing them to learn about each task. The parameters can be shared through either hard parameter sharing or soft parameter sharing [4]. Hard parameter sharing involves sharing a single model’s weights across multiple tasks, with each task having its own task-specific layer/classifier, and the objective is to optimize multiple loss functions. Soft parameter sharing, on the other hand, refers to each task having its own model, with the distance between the models’ parameters being regularized.

Despite the many advantages of MTL, it remains an active area of research in NLP. Because MTL involves training multiple tasks simultaneously, it is possible to have some tasks that have conflicting knowledge, which can lead to negative transfer. This problem occurs when reducing the loss of one task increases the loss of another task. Various methods have been proposed to mitigate the impact of negative transfer, such as loss weighting, which involves balancing multiple loss functions into a scalar [5]. Another method is task scheduling, which aims to identify which tasks should be trained together at each training step [6]. Other methods include task relationship and knowledge distillation. For more details, readers can refer to [3]. However, dealing with negative transfer is a broad topic beyond the scope of this work.

Recently, there has been a shift in research towards MTL in English language understanding, thanks to the development of the General Language Understanding Evaluation (GLUE) benchmark [7]. However, MTL in other languages, such as Arabic, has been slow to progress due to the absence of well-developed benchmarks. To bridge this gap, a new Arabic Natural Language Understanding (NLU) benchmark called ALUE has been introduced [8].¹

¹ Leaderboard: https://www.alue.org/.

The ALUE benchmark is gaining popularity in the Arabic NLP community for evaluating NLU models. While our work was in development, MTL had not yet been explored in the ALUE benchmark, which motivated us to introduce this approach and inspire further research in this direction.

Our paper focuses on utilizing a BERT-based architecture with hard parameter sharing across multiple Arabic NLU tasks. We evaluate MTL performance only on the tasks offered in the ALUE benchmark and investigate the capabilities of different pre-trained Arabic LMs when fine-tuned on specific tasks with varying text types. The objectives of the research are presented below:

• To examine the effectiveness of the hard parameter sharing strategy in enhancing performance across various Arabic NLU tasks.

• To explore the applicability of different pretrained Arabic LMs, including those trained on text types different from the finetuned tasks.

• To demonstrate the impact of positive and negative transfer learning on the ALUE benchmark tasks.

The main research contributions can be outlined as follows:

• To the best of our knowledge, this is the first attempt to study MTL on a larger scale of Arabic NLU tasks, compared to prior work.

• Enhancing the performance of various Arabic NLU tasks in the ALUE benchmark.

• Conducting a thorough analysis of the results and offering recommendations and suggestions for future research.

The paper is organized as follows. In Section 2, we review related work on multitask learning in both English and Arabic. In Section 3, we provide details on the resources used in this study, including the Arabic language models and benchmark datasets. Section 4 presents our proposed ArMT-TNN, a multitask learning framework with hard parameter sharing for NLU tasks. In Section 5, we provide experimental setup and implementation details of this work, including training procedure and hyperparameters. Section 6 reports the results of the ALUE benchmark tasks and provides a detailed discussion on the obtained results, including the analysis of positive and negative transfer of knowledge. Finally, we conclude our study in Section 7, where we summarize the main contributions of this work and outline directions for future research.

2. Related work

This section discusses various studies and research works that have explored the benefits of using MTL for various English and Arabic NLP tasks. The section highlights the advantages of using MTL for tasks such as detecting hate speech and offensive language, sarcasm and sentiment analysis, cross-lingual abstractive summarization, Modern Standard Arabic (MSA) and dialect identification, fake news detection, and named entity recognition. The studies discussed in this section have shown promising results in using MTL for these tasks compared to other traditional approaches.

2.1 MTL in English NLP

In this section, we review the related work on MTL in NLP, specifically the research that employs Language Models (LM) based on the Transformer architecture [9] in both English and Arabic.

One of the earliest works that utilized BERT as a shared text encoder layer in MTL was presented in [10]. The lower layers of this MTL architecture are shared across all tasks, while the top layers are specific to each task. This method achieved state-of-the-art performance on eight out of nine GLUE tasks. In a similar vein, [11] investigated various fine-tuning strategies for text classification tasks, including the MTL paradigm, and concluded that MTL was helpful in some cases.

To improve the knowledge transfer in MTL, Liu et al. [12] and Clark et al. [13] utilized knowledge distillation. This technique aims to distill knowledge from a set of single-task models (teachers) into a single multitask model. Both works showed improvements over traditional MTL.

Recently, Aghajanyan et al. [14] and Aribandi et al. [15] proposed a new strategy to better exploit the benefits of MTL, called massive MTL. This approach involves extensively prefinetuning the model’s parameters on a large-scale dataset, including a massive amount of labeled data on a wide range of tasks. Both works reported noticeable improvements over the vanilla MTL approach across several tasks.

MTL has also been shown to improve the performance of various NLP applications. For instance, Kim et al. [16] and Li et al. [17] utilized MTL for knowledge graph completion and linkage between entities. MTL is an effective method to leverage knowledge in the semantic network. Additionally, Morio et al. [18] proposed an MTL framework for spoken language understanding, where the input is speech, and the MTL system produces outputs for several NLP tasks such as question answering, intent detection, and named entity recognition. Other NLP applications, such as argument mining [18], biomedical text mining [19], peer assessments [20], multimodal aspect sentiment analysis [21], and emotion intensity for detecting suicidal tendencies [22], have shown that MTL produces better representations of knowledge and is robust to errors.

In recent years, there has been growing interest in incorporating the prompt-based learning approach into the MTL paradigm. A few recent studies have demonstrated that prompt-based learning can facilitate the transfer of knowledge in language models with zero-shot and few-shot learning [23, 24, 25, 26, 27]. This is achieved through extensive multitask learning across a wide range of tasks utilizing prompt learning.

Based on the related work discussed above, it can be concluded that MTL has shown promising results in various English NLP tasks, such as general language understanding on GLUE, text classification, knowledge graph completion, spoken language understanding, argument mining, and biomedical text mining. The studies discussed in this section have utilized Language Models (LM) based on the Transformer architecture and have explored various MTL architectures and approaches such as knowledge distillation, massive MTL, and prompt learning.

2.2 MTL in Arabic NLP

Recently, there has been a growing interest in utilizing MTL for various Arabic NLP tasks. For instance, Farha and Magdy [28] investigated the effectiveness of different approaches such as BiLSTM, CNN, CNN-BiLSTM, and BERT MTL for detecting hate speech and offensive language. Other works, such as [29, 30, 31], also explored the advantages of using MTL in this area. Additionally, sarcasm and sentiment analysis have also shown improvements with MTL. Mahdaouy et al. [32] proposed an attention interaction layer on top of a BERT task-specific dense layer, which achieved promising results in both tasks. Similarly, Alharbi and Lee [33] introduced a method that incorporates three models: word embeddings, contextualized embeddings, and MTL, which achieved the best performance in sarcasm detection.

Another application of MTL in NLP is cross-lingual abstractive summarization, where Takase and Okazaki [34] proposed an MTL framework called Transum, which utilizes translation pairs and monolingual sentence summaries. This method achieved top ROUGE scores in both Chinese-English and Arabic-English abstractive summarization. Additionally, El Mekki et al. [35] applied MTL to MSA and dialect identification at both country and province levels, and the results showed that MTL outperformed the single-task model. Moreover, MTL approaches have outperformed other methods in fake news detection [36] and named entity recognition [37].

The above Arabic MTL studies have shown promising results in various NLP tasks, such as detecting hate speech and offensive language, sarcasm and sentiment analysis, cross-lingual abstractive summarization, MSA and dialect identification, fake news detection, and named entity recognition.

However, there are still some gaps and challenges in the existing studies. For example, there is a lack of large-scale labeled datasets in Arabic NLP, which limits the application of MTL in Arabic NLP. Additionally, most of the existing studies focus on a limited number of tasks, and there is a need for more comprehensive studies that cover a wide range of Arabic NLP tasks. Moreover, some studies do not compare their MTL approach with other traditional approaches, making it difficult to evaluate the effectiveness of MTL.

3. Arabic language resources

This section describes the Arabic language resources used in our study, including pretrained language models and datasets for natural language understanding tasks. Our resources mainly consist of Arabic pretrained language models and a collection of NLU datasets. Several Arabic pretrained language models have been made available for public use in recent years, and we have selected language models that focus on dialectal Arabic. For our NLU datasets, we use the ALUE, a recently proposed benchmark for NLU.

3.1 Arabic pretrained language models

This subsection presents an overview of the Arabic pretrained language models utilized in our study, with a focus on models developed for dialectal Arabic. The following Arabic pretrained language models are included in our resources, all of which are based on the BERT architecture and available on the Huggingface library.

1. AraBERT-v0.2: It is an extension of AraBERTv01 [38], pretrained on MSA with a size of 77GB and 8.6 billion words.
2. Multi-dialect-Arabic-BERT: This model [39] is an extension of ArabicBERT [40], and was pretrained on 10 million tweets.
3. MARBERT: This model [41] was pretrained on Arabic tweets with a total of 15.6 billion tokens.

Table 1
Statistics of train, dev, and test sets of tasks in the ALUE benchmark

Task	Train	Dev	Test	Text type	Task type
SEC	2.3k	600	1.5k	DIAL	Single-sentence classification – 11 labels
MDD	42k	5.2k	5.2k	DIAL	Single-sentence classification – 26 labels
FID	4k	–	1k	DIAL	Single-sentence classification – 2 labels
MQ2Q	12k	–	3.7k	MSA	Sentence-pair classification – 2 labels
XNLI	5k	–	2.5k	MSA	Sentence-pair classification – 3 labels
OHSD	7k	1k	2k	DIAL	Single-sentence classification – 2 labels
SVREG	900	100	700	DIAL	Single-sentence regression – (0–1)
OOLD	7k	1k	2k	DIAL	Single-sentence classification – 2 labels

3.2 Natural language understanding datasets

This subsection presents an overview of the ALUE benchmark, which comprises eight tasks that evaluate NLU performance. The tasks are described in detail below.

• MQ2Q: Pairwise semantic question similarity task with binary classification. F1-score evaluation [42].
• OOLD & OHSD: Binary classification tasks for detecting offensive and hate language respectively. F1-score evaluation [43].
• SEC: Emotion classification task with inputs belonging to one or more of eleven possible classes. Jaccard similarity score evaluation [44].
• SVREG: Sentiment intensity regression task determining valence intensity of inputs by a real-valued score between 0 and 1, where 0 is most negative and 1 is most positive. Pearson correlation coefficient evaluation [44].
• FID: Irony detection task classifying inputs into 1 if the text is ironic, and 0 otherwise. F1-score evaluation [45].
• XNLI: Classification task for textual entailment sentence pairs. Each pair is labeled as entailment, neutral, or contradiction. Accuracy evaluation [46].
• MDD: Arabic dialect identification task classifying each input into one of 26 labels corresponding to one city in the Arab world, including MSA. F1-score evaluation [47].

Table 1 presents the statistics of the train, development, and test sets of the eight tasks in the ALUE benchmark: SEC, MDD, FID, MQ2Q, XNLI, OHSD, SVREG, and OOLD. The text type of each task is either MSA or Arabic dialects (DIAL). For each task, the table gives the number of training, development, and test samples along with the task type, which is single-sentence classification with 2, 11, or 26 labels, sentence-pair classification with 2 or 3 labels, or single-sentence regression with scores between 0 and 1.

4. The proposed Arabic multi-task learning using transformer neural network system (ArMT-TNN)

This section describes the architecture and functionality of the ArMT-TNN system. It is designed as a Multitask Deep Neural Network System that enables the simultaneous learning of multiple tasks with the aim of positively transferring knowledge among related tasks. The proposed approach is a straightforward implementation of MTL that shares all encoder parameters across all tasks using the Bidirectional Encoder Representations from Transformers (BERT) pretrained language model. Fine-tuning BERT on downstream tasks has been shown to result in significant improvements for many NLP applications. The ArMT-TNN system’s architecture is illustrated in Fig. 1.

Figure 1.

Architecture of the proposed ArMT-TNN system.

We use Arabic BERT as our input encoder, and the output of the encoder is passed into a classification head based on the corresponding task. Each task has its own classification head. The input sequence of wordpieces $[w_{1},w_{2},w_{3},\ldots,w_{n}]$, where $n$ is the input length, is passed through a stack of encoders to learn the contextual representation of the wordpieces through the self-attention mechanism. The output of the encoder consists of $h_{[\textit{CLS}]}$, the classification embedding representing information of the inputs, and the contextualized wordpiece embeddings for each token, $H=[h_{1},h_{2},h_{3},\ldots,h_{n}]\in\mathbb{R}^{n\times d}$. Both $h_{[\textit{CLS}]}$ and $h_{i}$ have the same dimension size $d$. In the case of a single input sequence, the first token is the [CLS] special token, and the sequence ends with the [SEP] special token. In the case of a pair of inputs, we separate the inputs with the [SEP] special token.

The proposed ArMT-TNN system architecture, depicted in Fig. 1, follows Algorithm 1 for training. To support learning multiple tasks simultaneously, we use a task-specific classifier for each learned task and different loss functions for different tasks. The details of these classifiers and loss functions are described in the following subsections.

Algorithm 1: ArMT-TNN Training Algorithm
Require: Training datasets $D$ for each task $T$; Arabic pretrained LM: $\Theta$.
Initialize: Model parameters with Arabic LM. Set model hyper-parameters.
Prepare: For each task $T$, split dataset $d_{T}$ into mini-batches $b_{T}$. Combine all mini-batches $b_{T}$ into one batch $B$.
Training:
for epoch in $\textit{epoch}_{\max}$ do
    for $b_{T}$ in $B$ do
        Calculate loss $L(\Theta)$ using: Equation (3) for classification, Equation (4) for binary classification, Equation (5) for regression.
    end
    Sum all losses and compute the total loss using Eq. (6).
    Compute gradient $\nabla(\Theta)$ of the total loss.
    Update the model parameters: $\Theta=\Theta-\alpha\nabla(\Theta)$.
end

The ArMT-TNN training algorithm initializes its model parameters using a pretrained Arabic LM and sets specific hyper-parameters. For each task, the dataset is divided into mini-batches, which are then combined into a single batch B. During training, for each epoch, the algorithm iterates over each mini-batch in B, calculating the loss based on the task type (classification, binary classification, or regression). After summing all individual losses, the total loss is computed, its gradient is determined, and the model parameters are updated accordingly.
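
The control flow of this procedure can be illustrated with a deliberately small sketch. It is not the authors' implementation: the shared BERT encoder is replaced by a single shared scalar parameter, each task uses a squared-error loss, and all task weights are set to 1. The names `train_shared` and `tasks` are hypothetical. The sketch only shows the pattern of computing one loss per task, summing the losses, and taking a single gradient step on the shared parameter.

```python
def train_shared(tasks, theta=0.0, lr=0.1, epochs=50):
    """Toy version of the loop in Algorithm 1 with one shared scalar parameter.

    tasks maps a task name to a mini-batch of (x, y) pairs. Every task uses
    the squared error of theta * x against y (an assumption for illustration).
    """
    total_loss = 0.0
    for _ in range(epochs):
        total_loss = 0.0
        total_grad = 0.0
        for batch in tasks.values():
            # Per-task loss and its gradient with respect to the shared theta.
            loss = sum((theta * x - y) ** 2 for x, y in batch) / len(batch)
            grad = sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)
            total_loss += loss   # sum of all task losses (unit weights)
            total_grad += grad   # gradient of the summed loss
        theta -= lr * total_grad  # one update of the shared parameter
    return theta, total_loss


# Two hypothetical tasks that prefer different values of theta (2 and 4).
tasks = {"task_a": [(1.0, 2.0)], "task_b": [(1.0, 4.0)]}
theta, loss = train_shared(tasks)
print(theta, loss)
```

In this toy setting the two tasks disagree, so the shared parameter settles on a compromise that is optimal for neither task, which is a small-scale picture of how conflicting tasks can hurt each other under hard parameter sharing.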

4.1 Task-specific classifiers

For each learned task, we use a dedicated classifier. For single sentence classification tasks, such as OOLD for binary classification, SEC for multi-label classification, or SVREG for regression, the model calculates the probability of class $c$ appearing given the input $X$ represented by the $h_{[\textit{CLS}]}$ embedding as:

$\displaystyle P(c\mid X)=\textit{sigmoid}(W_{\textit{oold}}\cdot x)$ (1)

Here, $W_{\textit{oold}}$ represents the task-specific parameters for OOLD, and $x$ is the $h_{[\textit{CLS}]}$ embedding of the input $X$.

For single sentence multi-class single-label tasks, such as MDD, or pairwise text classification tasks, such as XNLI, we use softmax as:

$\displaystyle P(c\mid X)=\textit{softmax}(W_{\textit{MDD}}\cdot x)$ (2)
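
As a minimal sketch of Eqs. (1) and (2), the two kinds of heads can be written in plain Python. The embedding `x` and the weight values below are made-up numbers, and the helper names (`binary_head`, `multiclass_head`) are hypothetical; a real system would apply these heads to the $h_{[\textit{CLS}]}$ vector produced by the shared encoder.

```python
import math


def dot(w, x):
    """Inner product of a weight vector and an embedding."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def binary_head(w, x):
    """Eq. (1): one sigmoid output from a task-specific weight vector."""
    return sigmoid(dot(w, x))


def multiclass_head(W, x):
    """Eq. (2): softmax over one score per class (one weight row per class)."""
    return softmax([dot(row, x) for row in W])


x = [0.2, -0.1, 0.4]                       # stand-in for the shared embedding
p_offensive = binary_head([0.5, 0.3, -0.2], x)
p_classes = multiclass_head([[0.1, 0.0, 0.2], [0.0, 0.3, 0.1], [0.2, 0.2, 0.0]], x)
print(p_offensive, p_classes)
```

Both heads read the same embedding, so only the small weight vectors or matrices are specific to a task.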

4.2 Loss functions

Our dataset consists of four different types of tasks: multi-class classification, binary classification, multi-label classification, and regression. To train our model for each task, we use different loss functions tailored to each task’s specific requirements.

For multi-class classification tasks, such as XNLI and MDD, we use the cross-entropy loss function:

$\displaystyle\mathcal{L}_{\text{CE}}=-\sum_{i=1}^{c}y_{i}\cdot\log\hat{y}_{i}$ (3)

Here, $c$ is the number of classes, $y_{i}$ represents the ground-truth label of the $i^{\text{th}}$ class, and $\hat{y}_{i}$ represents the predicted probability of the $i^{\text{th}}$ class. This loss function is suitable for multi-class classification tasks where the classes are mutually exclusive.

For binary and multi-label classification tasks, such as OOLD and SEC, we use the binary cross-entropy loss function:

$\displaystyle\mathcal{L}_{\text{BCE}}=-\frac{1}{c}\sum_{i=1}^{c}\left[y_{i}\cdot\log\hat{y}_{i}+(1-y_{i})\cdot\log(1-\hat{y}_{i})\right]$ (4)

Here, $y_{i}$ is the ground-truth label of the $i^{\text{th}}$ class, and $\hat{y}_{i}$ is the predicted probability of the $i^{\text{th}}$ class. This loss function is suitable for binary and multi-label classification tasks where each sample may belong to more than one class.

For regression tasks, such as SVREG, we use the mean squared error (MSE) loss function:

$\displaystyle\mathcal{L}_{\text{MSE}}=\frac{1}{c}\sum_{i=1}^{c}(y_{i}-\hat{y}_{i})^{2}$ (5)

Here, $c$ denotes the number of samples, $y_{i}$ represents the ground-truth value of the $i^{\text{th}}$ sample, and $\hat{y}_{i}$ represents the predicted value. This loss function is suitable for regression tasks where the goal is to minimize the difference between the predicted and actual values.

To train our model for all tasks, we combine the loss functions for each task using a weighted sum:

$\displaystyle\mathcal{L}_{\text{MTL}}=\sum_{t=1}^{T}\lambda_{t}\mathcal{L}_{t}$ (6)

Here, $T$ is the total number of tasks, $\mathcal{L}_{t}$ is the loss function for task $t$ , and $\lambda_{t}$ is a hyperparameter that controls the weight of each task’s loss in the combined loss function.
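
The loss functions in Eqs. (3) to (6) can be written out as a short sketch in plain Python. The function names and the example values are ours, not the authors'; the binary cross-entropy is written in its standard two-term form, and the task weights passed to `multitask_loss` are placeholders, since the values of $\lambda_{t}$ are not reported here.

```python
import math


def cross_entropy(y_true, y_prob):
    """Eq. (3): y_true is one-hot, y_prob is a softmax distribution."""
    # Terms with y_i = 0 contribute nothing, so they are skipped.
    return -sum(y * math.log(p) for y, p in zip(y_true, y_prob) if y > 0)


def binary_cross_entropy(y_true, y_prob):
    """Eq. (4): mean two-term cross-entropy over independent labels."""
    terms = (y * math.log(p) + (1 - y) * math.log(1 - p)
             for y, p in zip(y_true, y_prob))
    return -sum(terms) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Eq. (5): mean squared difference over samples."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def multitask_loss(task_losses, task_weights):
    """Eq. (6): weighted sum of per-task losses."""
    return sum(task_weights[name] * loss for name, loss in task_losses.items())


losses = {
    "XNLI": cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]),
    "SEC": binary_cross_entropy([1, 0], [0.9, 0.2]),
    "SVREG": mean_squared_error([0.5], [0.7]),
}
weights = {"XNLI": 1.0, "SEC": 1.0, "SVREG": 1.0}  # placeholder weights
print(multitask_loss(losses, weights))
```

Because the three losses live on different scales, the choice of weights determines how strongly each task pulls on the shared parameters.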

5. Experimental setup and implementation

This section describes the methodology used in the experiments, including text preprocessing, data and models used, evaluation metrics, and implementation details such as cloud computing services, batch size, input sequence length, optimizer, and dropout rate.

5.1 Text preprocessing

We begin by performing basic Arabic text preprocessing, which includes removing diacritics, English letters, numbers, URLs, and emojis. For Twitter hashtags, we remove the hashtag and underscore symbols and replace them with whitespace. Additionally, we handle repeated characters, which are often used for emphasis or to convey strong emotions. We reduce words to their standard form, keeping at most two consecutive repeated letters.
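
The preprocessing steps above can be sketched with regular expressions. This is an illustration under assumptions rather than the exact pipeline: the Unicode ranges for diacritics, digits, and emojis are our choice, the order of the steps is not specified in the text, and the function name `preprocess` is hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DIACRITICS_RE = re.compile(r"[\u064B-\u0652]")              # Arabic short vowels, shadda, sukun (assumed range)
LATIN_DIGITS_RE = re.compile(r"[A-Za-z0-9\u0660-\u0669]")   # English letters, Western and Arabic-Indic digits
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # approximate emoji ranges (assumed)
REPEAT_RE = re.compile(r"(\S)\1{2,}")                       # three or more copies of the same character


def preprocess(text):
    text = URL_RE.sub(" ", text)                      # drop URLs before stripping Latin letters
    text = text.replace("#", " ").replace("_", " ")   # hashtag and underscore become whitespace
    text = DIACRITICS_RE.sub("", text)
    text = LATIN_DIGITS_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)               # keep at most two consecutive repeats
    return " ".join(text.split())                     # collapse leftover whitespace


print(preprocess("#\u062D\u0644\u0648_\u062C\u062F" + "\u0627" * 5 + " http://t.co/abc 123"))
```

URLs are removed first so that stripping English letters does not leave URL fragments such as punctuation behind.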

5.2 Data and models

Our training and testing data come from the ALUE benchmark [8], as described in Section 3.2. We include the validation set during training, if available. We use the pretrained language models mentioned in Section 3.1 for fine-tuning, and compare our models with state-of-the-art models listed on the ALUE leaderboard.²

² https://www.alue.org.

5.3 Evaluation metrics

We use four evaluation metrics to assess the performance of our models on the ALUE benchmark. F1-score is used for tasks such as MQ2Q, OOLD, OHSD, FID, and MDD, while Jaccard similarity score is used for SEC and Pearson correlation coefficient is used for SVREG. Accuracy is employed for XNLI.
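
For reference, three of these metrics can be computed as in the following plain-Python sketch. The function names are ours, the F1 shown is the binary F1 of the positive class (the averaging used for multi-class tasks such as MDD is not specified in this section), and the Jaccard score is averaged over samples, which is an assumption about how the multi-label score is aggregated.

```python
import math


def f1_binary(y_true, y_pred):
    """F1 of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def jaccard_multilabel(true_sets, pred_sets):
    """Mean over samples of |T intersect P| / |T union P| for label sets."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        t, p = set(t), set(p)
        union = t | p
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(f1_binary([1, 1, 0, 0], [1, 0, 1, 0]))
print(jaccard_multilabel([{"joy", "love"}, {"anger"}], [{"joy"}, {"anger"}]))
print(pearson([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]))
```

The emotion labels in the example are illustrative strings only; `pearson` assumes that neither input is constant.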

5.4 Implementation details

We conducted the finetuning phase on AWS cloud computing services with 4 NVIDIA Tesla T4 GPUs, totaling 192GB of memory. Our implementation is based on the Huggingface library.³

³ https://huggingface.co/.

We used a batch size of 32 and an input sequence length of 512 for all experiments, and fine-tuned our models for 10 epochs. Adamax was used as our optimizer with a learning rate of 3e-5. We also utilized a linear learning rate decay schedule with a warm-up period of 0.1. Furthermore, we set the dropout rate for all task-specific layers to 0.1, following the recommendation in [10].
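
One common reading of this schedule is sketched below: the learning rate rises linearly from zero over the first 10% of the training steps and then decays linearly back to zero. Interpreting the warm-up value of 0.1 as a fraction of the total steps is our assumption, and the function name `linear_warmup_decay` and the step counts are hypothetical.

```python
def linear_warmup_decay(step, total_steps, base_lr=3e-5, warmup_ratio=0.1):
    """Learning rate at a given step for linear warm-up followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warm-up.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps))
```

With these settings the peak rate of 3e-5 is reached at the end of warm-up and is halved midway through the decay phase.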

Table 2

Evaluation performance of MTL approach and state-of-the-art models on ALUE test set

Model	MQ2Q	OOLD	OHSD	SVREG	SEC	FID	XNLI	MDD	Avg
ARABIC-BERT	85.69	89.47	78.72	55.12	25.13	82.18	60.96	59.66	67.1
mBERT	83.24	80.33	70.54	33.85	14.02	81.61	63.09	61.26	61.0
JABER	93.1	91.4	79.6	70.9	31.7	85.3	73.4	64.1	73.7
SABER	93.3	93.4	84.1	79.2	38.8	86.5	76.3	66.5	77.3
AraBERT-v0.2 (ours)	96.96	96.89	68.81	82.61	37.73	81.86	72.85	50.02	73.4
ArabicBERT multi dialect (ours)	92.78	95.92	72.41	82.85	35.84	82.79	58.59	50.49	71.4
MARBERT (ours)	95.06	96.97	76.34	87.23	40.27	83.85	65.7	53.75	74.9

Figure 2.

Performance results of all tasks on test datasets over 10 epochs. The x-axis denotes the number of epochs, while the y-axis represents the performance.

6. Results and discussion

We evaluated our MTL approach using three Arabic pre-trained language models (PLMs), namely AraBERT-v0.2, ArabicBERT Multi Dialect, and MARBERT, against the state-of-the-art models found on the ALUE leaderboard. Our MTL approach outperformed all existing models on the test sets of four ALUE tasks, as shown in Table 2. The top part of Table 2 shows the models found on the ALUE leaderboard, and the bottom part shows our three models. The state-of-the-art results are shown in bold. The F1-score is used for MQ2Q, OOLD, OHSD, FID, and MDD; the Jaccard similarity score and the Pearson correlation coefficient are used for SEC and SVREG, respectively; and accuracy is used for XNLI. We have also visualized the performance of our models on the test datasets of all tasks over 10 epochs, and the corresponding plot is shown in Fig. 2.

Specifically, Table 2 shows that AraBERT-v0.2 improved MQ2Q by a relative margin of 3.9% over SABER, while MARBERT outperformed SABER on OOLD, SVREG, and SEC by relative margins of 3.8%, 10.1%, and 3.7%, respectively. However, SABER remained the state of the art on the other four tasks, OHSD, FID, XNLI, and MDD, outperforming our best model by 10.1%, 3.1%, 4.7%, and 23.7%, respectively.
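
The margins quoted above appear to be relative gains over the SABER score rather than absolute point differences. The following sketch recomputes them from the values in Table 2; the helper name `relative_gain` is ours, and small differences in the last digit may come from rounding or truncation.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Scores copied from Table 2.
saber = {"MQ2Q": 93.3, "OOLD": 93.4, "SVREG": 79.2, "SEC": 38.8}
best_ours = {
    "MQ2Q": 96.96,   # AraBERT-v0.2
    "OOLD": 96.97,   # MARBERT
    "SVREG": 87.23,  # MARBERT
    "SEC": 40.27,    # MARBERT
}

for task in saber:
    gain = relative_gain(best_ours[task], saber[task])
    print(f"{task}: {gain:.2f}% relative gain over SABER")
```

For example, the MQ2Q gain is (96.96 - 93.3) / 93.3, which is close to the reported 3.9%.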

Comparing our average score with SABER’s shows that, although our overall average is lower, some of our models perform exceptionally well on certain tasks. Specifically, our models show positive transfer and improved performance on MQ2Q, OOLD, SVREG, and SEC, as depicted in Fig. 2. However, OHSD, FID, XNLI, and MDD do not appear to benefit from knowledge sharing and instead exhibit negative transfer. This may be attributed to the fact that the learning tasks are not uniformly related, with some tasks being dominant over others.

Furthermore, we observed that the performance of some tasks, such as FID, OOLD, and SVREG, fluctuates across weight updates, even though these fluctuating tasks perform particularly well. We attribute this to their smaller loss values: their gradients are drowned out by those of tasks with larger losses, which dominate the optimization. However, this is not always the case, as MQ2Q does not fluctuate even though its loss is small. This may indicate that MQ2Q is highly related to the other tasks.

Interestingly, we also observed that language models pretrained on data types different from the tasks’ data type can still perform well on those tasks, as demonstrated by the success of MARBERT on MQ2Q and XNLI, both of which consist of MSA text. MARBERT outperformed ARABIC-BERT, mBERT, JABER, and SABER on MQ2Q, but only outperformed ARABIC-BERT and mBERT on XNLI. This suggests that MTL can overcome differences in data types, as long as there is some relatedness between the tasks. However, more tasks based on MSA text types should be added to ALUE for deeper insights.

Overall, our findings demonstrate the potential of our proposed model for improving performance on multiple related tasks, and highlight the importance of careful task selection for effective knowledge sharing.

7. Conclusion and future work

In conclusion, our study presented ArMT-TNN, a multitask learning framework for Arabic NLU tasks, which allows for knowledge sharing and transfer between tasks. We found that some tasks experienced positive transfer of knowledge, leading to improved performance, while others experienced negative transfer. Our results suggest that careful consideration of task relationships and loss scaling may mitigate the issue of negative transfer in future work. We also demonstrated the potential of using Arabic massive MTL to further improve the performance of our framework.

As for limitations, our study was limited to a set of eight Arabic NLU tasks, and further investigation on other tasks and domains may be needed to confirm our findings. Additionally, our study only utilized hard parameter sharing, and other types of parameter sharing mechanisms may have different impacts on knowledge transfer and performance.

For future work, we plan to further explore the issue of negative transfer and investigate more effective methods for mitigating it, such as dynamic weight allocation and task clustering. We also plan to extend our framework to handle more complex tasks, such as question answering and dialogue generation. Finally, we intend to explore the use of other parameter sharing mechanisms and investigate their effects on multitask learning performance.
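As a rough indication of what dynamic weight allocation could look like, the sketch below shows one simple heuristic: weighting each task inversely to its recent loss so that all tasks contribute equally to the combined objective. This is an illustrative assumption, not the method we have committed to; the function names, task names, and loss values are hypothetical.

```python
# Hypothetical sketch of one dynamic weighting heuristic (inverse-loss weighting).
# Task names and loss values are invented for illustration.

def inverse_loss_weights(recent_losses, eps=1e-8):
    """Weight each task by 1 / loss, normalized so the weights sum to the task count."""
    inv = {name: 1.0 / (loss + eps) for name, loss in recent_losses.items()}
    norm = len(inv) / sum(inv.values())
    return {name: value * norm for name, value in inv.items()}

def weighted_total(losses, weights):
    """Combine per-task losses into a single training objective."""
    return sum(weights[name] * losses[name] for name in losses)

losses = {"task_a": 2.0, "task_b": 0.5, "task_c": 0.1}
weights = inverse_loss_weights(losses)
contributions = {name: weights[name] * losses[name] for name in losses}

print(weights)
print(contributions)
print(weighted_total(losses, weights))
```

With these weights each task contributes roughly the same amount to the total loss, regardless of its raw magnitude. In an actual training loop the weights would normally be recomputed periodically from a running average of the losses and treated as constants during backpropagation.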

Footnotes

Acknowledgments

This project was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under grant no. (J: 13-611-1443). The authors, therefore, gratefully acknowledge DSR's technical and financial support.

¹ Leaderboard https://www.alue.org/.

² https://www.alue.org.

³ https://huggingface.co/.