Improved neural machine translation for low-resource English

Abstract

Language translation is essential for bringing the world closer together and plays a significant part in building community among people of different linguistic backgrounds. Machine translation helps remove the language barrier and allows easier communication among linguistically diverse communities. Owing to the unavailability of resources, most of the world's languages are considered low-resource languages, which makes automating translation among them a challenging task, yet one that would benefit indigenous speakers. This article investigates neural machine translation (NMT) for the resource-poor English–Assamese language pair by tackling the insufficient-data and out-of-vocabulary problems. We also propose a data augmentation-based NMT approach that exploits synthetic parallel data; it significantly improves translation accuracy for both English-to-Assamese and Assamese-to-English translation and obtains state-of-the-art results.

Keywords

English–Assamese, NMT, low-resource, transformer, RNN

1 Introduction

A natural language can be categorized as high-resource, medium-resource, or low-resource based on the availability of resources. These resources include the work of native speakers, online data resources, and computational data resources. A language is considered low-resource if it has few online resources [1, 2] or little computational data [3]. A low-resource scenario can also be defined in terms of the minimal data required for training a machine translation (MT) model [4]. The exact definition of a low-resource language pair, in particular in terms of the number of parallel sentence pairs, is itself a research question. Highly inflected languages further complicate the definition of the term “low-resource” because their many inflected word forms cause sparsity problems, so more bitext data is needed to achieve results equivalent to those for less inflected languages [5]. Conventionally, if the training data contains fewer than 1 million parallel sentences, the language pair can be considered low-resource [6]. By this measure, most of the world's languages fall under the low-resource category. The English-Assamese (En-As) pair is low-resource because of the insufficient resources available for it. Assamese is the official language of the Indian state of Assam. Its native speakers, also known as Assamese or Axomiya people, number about 13,010,478 [7]. It belongs to the Indo-Aryan language family, and its script originated from the Gupta script [8]. The script and word order of English and Assamese are very different from each other. The word order of Assamese is subject-object-verb (SOV), whereas English follows subject-verb-object (SVO), as shown in Fig. 1. Unlike English, Assamese is a morphologically rich language [9].

Fig. 1

Word order example of Assamese with English translation and transliteration.

We have chosen English with Assamese because English is a high-resource language that is widely accepted worldwide. Automatic translation of the English-Assamese pair is essential for establishing good communication at the national and international levels. Although Google covers automatic translation of 109 languages worldwide, Assamese is yet to be introduced. Owing to the lack of suitable datasets, English-Assamese MT is at an early stage [7, 11], which leaves scope for research. In this work, we investigate the following question: how can we effectively improve the neural machine translation (NMT) model in both translation directions for the low-resource English-Assamese pair? The contributions of this article are as follows:

Proposed an approach based on data augmentation and handled problems like insufficient data and out-of-vocabulary to enhance the performance of NMT for low-resource En-As pair.

Explored different NMT models and obtained state-of-the-art MT performance for both En-to-As and As-to-En directions.

The rest of the paper is organized as follows. Section 2 briefly discusses the background of machine translation and related work. The dataset description and baseline system are presented in Section 3. Sections 4 and 5 describe how we tackle the insufficient-data and out-of-vocabulary problems. Sections 6 and 7 present the proposed approach and the experimental setup. The results and analysis are presented in Section 8. Lastly, Section 9 concludes the paper with future work.

2 Background

In this section, we briefly discuss the fundamental concept of MT models and review the existing works related to our work.

2.1 Machine translation

MT addresses the problem of language barriers via automatic translation among natural languages. MT approaches fall into two broad categories: the rule-based or knowledge-based approach, which depends on a set of rules, and the corpus-based approach, also known as the data-driven approach, which depends on data. MT systems switched from the rule-based to the corpus-based approach, which eliminated the need for linguistic experts and for the hand-crafted language knowledge required in interlingual MT [12]. Statistical machine translation (SMT) and NMT are the two popular categories of corpus-based MT. SMT comprises a variety of techniques, including word-based, syntax-based, phrase-based, and hierarchical phrase-based models. Before NMT, phrase-based SMT was the state-of-the-art technique in MT [13]. In this work, we use phrase-based SMT to extract phrase pairs, which are then used to expand the parallel training corpus. The disadvantages of SMT, namely the long-term dependency problem, limited ability to analyse context, and system complexity, shifted attention towards NMT [14]. In a feed-forward neural network-based NMT system, the phrase-pair score is calculated by assuming phrases of fixed length. However, in real scenarios, the lengths of the source and target phrases are not fixed. To deal with variable-length phrases, recurrent neural network (RNN) based NMT was introduced [15, 16], in which sequence learning is achieved through an end-to-end approach. To learn long-term features, the RNN adopts long short-term memory (LSTM) units for encoding and decoding. However, the encoder fails to encode all the necessary information when the sentence is too long. To mitigate this issue, an attention mechanism was proposed [17]. The attention mechanism permits the decoder to focus on different segments of the source sequence at different decoding steps.
In [18], the attention mechanism is improved by combining a global approach, which attends to all the source words, with a local approach, which focuses on only a part of the source words. The encoder takes s₁, s₂ … s_n as the input sequence and converts it into a vector X. The decoder generates the output t₁, t₂ … t_m by calculating the conditional probability shown in Eq. (1).

$P (t ∣ s) = \prod_{j = 1}^{m} P (t_{j} ∣ t_{< j}, X)$ (1)

A series of hidden states is estimated at the source side (h_i) and the target side (h_o); these are then compared to produce an attention vector a_o, whose length equals the number of time steps in the input sequence, using Eq. (2).

$a_{o} (i) = \frac{exp (score (h_{o}, h_{i}))}{\sum_{i^{'}} exp (score (h_{o}, h_{i^{'}}))}$ (2)

In this work, we use the general form of the score function, given in Eq. (3).

$score (h_{o}, h_{i}) = h_{o}^{T} W_{a} h_{i}$ (3)

The context vector c_t is calculated as the weighted average of all the source-side hidden states, with the attention vector as the weights. The concatenation of c_t and h_o produces an attentional hidden vector using Eq. (4).

$h_{o}^{'} = tanh (W_{c} [c_{t}, h_{o}])$ (4)

The final step applies a softmax layer to the vector $h_{o}^{'}$ using Eq. (5) to obtain the translated sentence in the target language.

$P (t_{j} ∣ t_{< j}, X) = softmax (W_{s} h_{o}^{'})$ (5)

The drawback of the RNN is that input processing follows a strict temporal order: it computes context in one direction, based on the preceding words and not on future words, so it is unable to look ahead. The bidirectional RNN (BRNN) resolves this issue by using two distinct RNNs, one for the forward direction and another for the backward direction [19].
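
The attention step in Eqs. (2) to (4) can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function name `general_attention`, the toy dimensions, and the random weights are all hypothetical, and the context vector is taken to be the attention-weighted sum of the source hidden states.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def general_attention(h_o, H_i, W_a, W_c):
    """One decoding step of attention with the 'general' score function.

    h_o: target hidden state, shape (d,)
    H_i: source hidden states, one row per source step, shape (n, d)
    W_a: score matrix, shape (d, d)
    W_c: output projection, shape (d, 2 * d)
    """
    scores = H_i @ (W_a.T @ h_o)   # Eq. (3): h_o^T W_a h_i for every source step i
    a_o = softmax(scores)          # Eq. (2): one weight per source step
    c_t = a_o @ H_i                # weighted average of the source hidden states
    h_att = np.tanh(W_c @ np.concatenate([c_t, h_o]))  # Eq. (4)
    return a_o, c_t, h_att

# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
d, n = 4, 3
a_o, c_t, h_att = general_attention(
    rng.normal(size=d),
    rng.normal(size=(n, d)),
    rng.normal(size=(d, d)),
    rng.normal(size=(d, 2 * d)),
)
print(a_o.shape, c_t.shape, h_att.shape)
```

The attention vector has one entry per source time step and sums to one, which is why it can serve as the weights of the average that forms c_t.
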
Moreover, convolutional neural network (CNN) based NMT systems have been introduced [20, 21]; they leverage parallel operations and consider the relative positions of the tokens instead of the temporal dependency among the tokens of the sequence. However, CNN-based NMT lacks the features of the RNN that enhance the encoding of source sentences. A further demerit of CNN-based approaches is that they require many layers to capture long-term dependencies, which makes the network large and complex, often without success. To handle this issue, transformer-based NMT was introduced in [22]. The idea behind the transformer model is to encode each position and apply a self-attention mechanism to relate two different words, which can be parallelized to accelerate learning. Unlike the traditional attention mechanism, the self-attention mechanism calculates attention several times in parallel, which is known as multi-head attention. In the transformer model, the encoder and decoder each consist of six identical layers stacked on top of each other. The input sequence is embedded and position-encoded to retain the order of the words before it is fed into the network. Each encoder layer comprises a multi-head attention sub-layer and a position-wise fully connected feed-forward sub-layer. Each decoder layer comprises three sub-layers, two of which are identical to those of the encoder. The third is a multi-head attention sub-layer that attends to the output of the encoder. Attention in this model is computed by taking the dot product of the queries and keys and applying the softmax function to learn the weight of each word at a given position, as given in Eq. (6).

$Attn (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V$ (6)

Here, the input vectors, namely the query (Q), the key (K) with dimension d_k, and the value (V), are used to calculate the attention.
The advantage of multi-head (MH) attention over single-head attention in this model is that it captures different word representations at multiple positions. The number of parallel attention heads is h = 8, as shown in Eqs. (7) and (8).

$MH (Q, K, V) = Concat ({head}_{1}, \dots, {head}_{h}) W^{O}$ (7)

${head}_{i} = Attn (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})$ (8)

Here, the parameter matrices are $W_{i}^{Q} \in R^{d_{model} \times d_{k}}$, $W_{i}^{K} \in R^{d_{model} \times d_{k}}$, and $W_{i}^{V} \in R^{d_{model} \times d_{v}}$.
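
Eqs. (6) to (8) can be sketched in a few lines of NumPy. The sketch below is a hedged illustration rather than the toolkit's implementation: the function names, the small toy dimensions (d_model = 16 instead of a realistic size), and the random weights are assumptions, and the shape of the output matrix W^O (h·d_v × d_model) follows the standard transformer formulation rather than anything stated in the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (6)."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one row of weights per query
    return weights @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """Multi-head attention, Eqs. (7) and (8)."""
    heads = [
        attention(Q @ w_q, K @ w_k, V @ w_v)   # head_i, Eq. (8)
        for w_q, w_k, w_v in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O  # Eq. (7)

# Toy configuration: h = 8 heads as in the text, other sizes illustrative.
rng = np.random.default_rng(0)
d_model, h = 16, 8
d_k = d_v = d_model // h
x = rng.normal(size=(5, d_model))            # 5 token representations
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))
out = multi_head(x, x, x, W_Q, W_K, W_V, W_O)  # self-attention: Q = K = V = x
print(out.shape)
```

Passing the same matrix as Q, K, and V corresponds to self-attention; in the decoder's encoder-attending sub-layer, K and V would instead come from the encoder output.
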

NMT-based approaches achieve state-of-the-art results in both high- and low-resource pair translation because of their ability to handle long-term dependencies and analyse context [23–26]. Nevertheless, NMT requires a sufficiently large parallel corpus for training, which is a challenge in a low-resource scenario. In this work, RNN- and transformer-based NMT models are used for En-As pair translation.

2.2 Related work

The common challenges in low-resource NMT include insufficient data, out-of-vocabulary words, and rare words [27, 28]. Very little MT work has been performed on the En-As pair [7, 11]. In [7, 10], SMT-based systems were developed with very limited datasets. In contrast, [11] proposed a corpus and implemented two baseline systems using phrase-based SMT and RNN-based NMT. In this work, we use the En-As corpus of [11] and attempt to improve NMT by tackling the insufficient-data and out-of-vocabulary problems in both the En-to-As and As-to-En directions.

3 Dataset description and baseline system

The English-Assamese corpus EnAsCorp1.0 was developed in [11] and is used in this work. It comprises both parallel and monolingual data. For EnAsCorp1.0, various online sources were explored to prepare the parallel data, including the Bible, multilingual online dictionaries (Glosbe, Xobdo), SEBA multilingual question papers, PMIndia, and the Learn-Assamese website. The monolingual Assamese data was prepared by collecting text from various web pages, blogs, and holy books (Bible, Gita, Quran); the Assamese side of the parallel corpus was added to increase its size. The data statistics are presented in Table 1. Moreover, [11] used English monolingual data of about 3 million sentences from WMT16 to develop baseline systems for both En-to-As and As-to-En translation. Two baseline models were implemented for the English–Assamese pair [11]: phrase-based SMT (baseline-1) and RNN-based NMT (baseline-2). Their bilingual evaluation understudy (BLEU) [29] scores are shown in Table 2. In this work, we attempt to improve on the translation performance of [11]. For comparative analysis, we collected the predicted output of [11] and evaluated it with other metrics, namely translation error rate (TER) [30], rank-based intuitive bilingual evaluation score (RIBES) [31], metric for evaluation of translation with explicit ordering (METEOR) [32], and F-measure, as shown in Tables 3 and 4.

Table 1
Data statistics of EnAsCorp1.0

Type	Split	Sentences	Tokens (En)	Tokens (As)
Parallel	Train	203,315	2,414,172	1,986,270
	Validation	4,500	74,561	59,677
	Test Set	2,500	41,985	34,643
Monolingual		2,624,828	-	45,900,321

Table 2

BLEU scores of baseline systems [11]

Translation	System	BLEU
En-to-As	PBSMT	3.43
	NMT (RNN)	5.55
As-to-En	PBSMT	4.54
	NMT (RNN)	7.72

Table 3

TER and RIBES scores of baseline systems

Translation	System	TER	RIBES
En-to-As	PBSMT	106.14	0.121474
	NMT (RNN)	96.93	0.252370
As-to-En	PBSMT	93.12	0.185537
	NMT (RNN)	91.64	0.300727

Table 4

METEOR and F-measure scores of baseline systems

Translation	System	METEOR	F-measure
En-to-As	PBSMT	0.045206	0.090317
	NMT (RNN)	0.094548	0.228664
As-to-En	PBSMT	0.138945	0.405864
	NMT (RNN)	0.183634	0.482956

4 Tackling insufficient data problem

Without modifying the NMT model architecture, we tackle the insufficient-data issue in two ways: injecting phrase pairs and using synthetic parallel data. Phrase Pairs Injection: In [33], a phrase-based SMT system is trained on the source-target data using the Moses toolkit, and all the phrase pairs are extracted from the resulting phrase table. The main objective of phrase-pair extraction is to provide more word alignment information on source-target phrases by augmenting the original parallel data with these pairs. In Moses, GIZA++ [34] is used for word alignment, and the phrase table is created during the training process. Note that the language model (LM) does not affect the phrase table [33]. Because the extracted phrase pairs may contain many wrongly aligned phrases [35], we extract them at different thresholds of the translation probability of the target phrase given the source phrase. We have extracted English-Assamese phrase pairs following [33] and augmented the original parallel data with them. The statistics of the parallel training data after phrase pairs injection are shown in Table 5.

Table 5
Train data statistics after phrase pairs injection

Type	Sentences
T1: “original parallel corpus”	203,315
T2: “original parallel corpus” + Set_p≥0.5	639,787
T3: “original parallel corpus” + Set_p=1	445,957
T4: “original parallel corpus” + Set_all	1,433,121

Table 5 shows four types of train data that we investigate using different NMT models to examine the performance, which are reported in Section 8.

“original parallel corpus”: Only original parallel train data is considered in the NMT training models.

“original parallel corpus”+ Set_p≥0.5: The original parallel train data along with phrase pairs having translation probability p ≥ 0.5 are considered in the NMT training models.

“original parallel corpus”+Set_p=1: The original parallel train data along with phrase pairs having translation probability p = 1 are considered in the NMT training models.

“original parallel corpus”+Set_all: The original parallel train data along with all phrase pairs are considered in the NMT training models.

The authors of [33] combine the extracted phrase pairs with the original parallel training data. They maintain a fair ratio using Eq. (9) to prevent the trained models from becoming biased towards sequence length.

$Augmentation = N \times OPT + ExtractedPP$ (9)

Here, N = (number of extracted phrase pairs) / (number of original parallel train sentences), OPT = original parallel train sentences, and PP = phrase pairs.
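
As a worked example of Eq. (9), the sketch below plugs in the T2 figures from Table 5, where the number of extracted phrase pairs is the T2 total minus the original corpus size (639,787 − 203,315 = 436,472). The helper name `augmentation_size` is hypothetical, and rounding N to the nearest whole number is an assumption, since the text does not say how a fractional N is handled.

```python
def augmentation_size(num_original, num_phrase_pairs):
    """Return (N, size of the augmented corpus) following Eq. (9).

    Assumption: N is rounded to the nearest integer (at least 1), because
    the original corpus can only be replicated a whole number of times.
    """
    n = max(1, round(num_phrase_pairs / num_original))
    return n, n * num_original + num_phrase_pairs

# T2 from Table 5: original corpus plus phrase pairs with p >= 0.5.
num_original = 203_315
num_phrase_pairs = 639_787 - num_original  # 436,472 extracted phrase pairs
n, size = augmentation_size(num_original, num_phrase_pairs)
print(n, size)
```

With these figures the unrounded ratio is roughly 2.15, so the original sentences would be replicated twice before the phrase pairs are appended.
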

However, replicating the original data in this way introduces identical source-target sentences, which degrades corpus quality and can lower the performance of NMT [36, 37]. In this work, unlike [33], we augment the original training data with phrase pairs without applying the ratio. To examine the performance of the NMT models with phrase pairs, and whether the trained models are biased towards sequence length, we present a comparison in Section 8.

Synthetic parallel data: We use monolingual data to generate synthetic parallel data and expand the parallel corpus using the back-translation (BT) strategy [38]. However, because of the noise in the synthetic data, the translation performance of a model trained on the augmented data is lower than that of a model trained on the “original parallel corpus” alone. Therefore, to leverage synthetic parallel data in training, we use a two-step solution [39]: first, pre-train the model on the synthetic data together with the “original parallel corpus + phrase pairs + noun phrases”, and then fine-tune it using only the “original parallel corpus + phrase pairs + noun phrases.” A series of experiments with different ratios of parallel to synthetic data examines the effect of the synthetic parallel data; the results are reported in Section 8.

Algorithm 1 Dealing with <unk> to tackle out-of-vocabulary problem
Input: Source test sentences S
Output: Predicted target sentences without <unk>
i ← 1
while i < N + 1 do ⊳ N denotes the number of source sentences
    P_i ← M(S_i) ⊳ the source sentence S_i is passed to the trained NMT model M to obtain the predicted sentence P_i
    for P_i_j in P_i do
        if P_i_j is <unk> then
            if S_i_j is in Dictionary then
                P_i_j ← Dictionary(S_i_j) ⊳ source-to-target dictionary is used here
            else
                P_i_j ← S_i_j ⊳ replace <unk> by the original source word
            end if
        end if
    end for
    for P_i_j in P_i do
        if P_i_j belongs to source script then
            P_i_j ← Transliterate(S_i_j) ⊳ transliterate to target script
        end if
    end for
    i ← i + 1
end while

5 Tackling out-of-vocabulary problem

The out-of-vocabulary (OOV) problem arises from named entities, compounds, technical terms, and misspelled words [40]. There are two types of OOV: completely out-of-vocabulary (COOV) and sense out-of-vocabulary (SOOV). Words that are absent from the training data are COOV, whereas SOOV words are present in the training data but with a sense or usage different from that in the test data. NMT produces <unk> (unknown) tokens for OOV words. Moreover, NMT is weak at translating rare words because its fixed-size vocabulary forces it to produce <unk> [16, 41]. It has also been pointed out that sentences containing rare words are translated more poorly than sentences containing frequent words. We tackle the OOV issue for the English–Assamese pair by adding parallel nouns to the training data and by using Algorithm-1 in the translation process. We compare Algorithm-1 with the existing word segmentation technique of byte pair encoding (BPE) [42], which is commonly used to handle the OOV issue; the comparison is reported in Section 8. For BPE, we use 32k merge operations. Algorithm-1 consists of two main components: a bilingual dictionary and a transliteration module.

Bilingual Dictionary: We have created a bilingual English–Assamese dictionary of noun phrases. We extracted noun phrases from the English side of the parallel corpus and from the English monolingual corpus using the NLTK tool. The extracted noun phrases were manually translated into Assamese. The manual process took substantial human effort, and the translations were cross-verified by a linguistic expert. After removing duplicates, the bilingual dictionary contains 334,585 parallel noun phrases, which include single words, multi-word phrases, and named entities. This bilingual dictionary is used in two ways: first, to augment the training data; second, to replace the <unk> tokens with the appropriate target words for the corresponding source words, as shown in Algorithm-1.

Transliteration Module: This module is applied to source words that are not present in the bilingual dictionary. It is mainly used to tackle unseen named-entity words that produce <unk>. The module converts the source word into the target script in the predicted sentence. We have used indic-trans for both En-to-As and As-to-En transliteration, as shown in Algorithm-1.
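
The per-sentence logic of Algorithm-1 can be sketched in Python as follows. This is a minimal illustration under several assumptions: the one-entry dictionary and the `toy_transliterate` stand-in are hypothetical (a real system would call the bilingual dictionary and a transliteration tool), the script check `is_latin` only makes sense for the En-to-As direction, and predicted and source tokens are matched by position, as the algorithm's indexing suggests.

```python
UNK = "<unk>"

def is_latin(token):
    """Crude source-script check for En-to-As: ASCII letters only."""
    return token.isascii() and token.isalpha()

def replace_unk(source, predicted, dictionary, transliterate,
                in_source_script=is_latin):
    """Replace <unk> tokens in one predicted sentence.

    source, predicted: lists of tokens, matched by position (assumption).
    dictionary: source-to-target mapping of words or phrases.
    transliterate: callable converting a source-script word to target script.
    """
    output = []
    for j, token in enumerate(predicted):
        if token == UNK and j < len(source):
            # Dictionary lookup first; otherwise copy the source word.
            token = dictionary.get(source[j], source[j])
        if in_source_script(token):
            # Anything still in the source script is transliterated.
            token = transliterate(token)
        output.append(token)
    return output

# Hypothetical toy resources, for illustration only.
toy_dictionary = {"river": "নদী"}
toy_transliterate = lambda word: f"translit({word})"

result = replace_unk(["river", "Guwahati"], [UNK, UNK],
                     toy_dictionary, toy_transliterate)
print(result)
```

Because English and Assamese differ in word order, matching tokens purely by position is a simplification; the sketch inherits that limitation from the positional indexing it imitates.
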

Limitation of Algorithm-1: In the bilingual dictionary, a source word can have several target translations with different meanings, so the appropriate translation must be chosen based on context; Algorithm-1 does not yet do this. In addition, we target only nouns in handling the OOV problem in this work. Further, the transliteration module replaces source words that are not present in the bilingual dictionary with their target-script form. This is suitable for OOV words that are unseen named entities, but it is not a good choice when the source words are common nouns, verbs, adverbs, or pronouns. We will address these problems in future work.

6 Proposed approach

Our proposed approach augments the “original parallel corpus” with the phrase pairs discussed in Section 4. The experiments are carried out in three phases. In the first phase, we train different NMT models on “original parallel corpus + phrase pairs” to select the best type of phrase pairs (as given in Table 5) and the best-trained model, as reported in Section 8. We observe that the transformer model achieves the best translation performance, and that its performance for As-to-En is higher than for the reverse direction, En-to-As. Parallel noun phrases (as discussed in Section 5) are then added to the “original parallel corpus + phrase pairs”, and the NMT model (transformer) is trained on the result to obtain a better model, as reported in Section 8. In the second phase, we therefore use the best-trained As-to-En NMT model (transformer) to generate synthetic En sentences from the monolingual As sentences. For simplicity, we consider only Assamese sentences with a maximum length of 20 words when generating synthetic English sentences. Algorithm-1 is applied during this translation process to tackle the OOV problem. We remove blank lines and single-word sentences from the synthetic En data, along with the corresponding As sentences. In the third phase, we train the NMT model (transformer) on the augmented dataset of original and pseudo-parallel data. However, because of the noise in the synthetic data, the translation performance of the model trained on the augmented data is lower than that obtained with the “original parallel corpus + phrase pairs + noun phrases.” Therefore, to leverage the synthetic parallel data, we pre-train the model on the synthetic data together with the “original parallel corpus + phrase pairs + noun phrases” and then fine-tune it using only the “original parallel corpus + phrase pairs + noun phrases”, following the technique of [39].
Hence, the final model initializes its parameters from the pre-trained model, which improves training when the “original parallel corpus + phrase pairs + noun phrases” is used. Lastly, OOV problems are handled using Algorithm-1 during the translation process. Moreover, we leverage pre-trained word vectors in the NMT models by training GloVe [44] on the monolingual data. The proposed approach is illustrated in Fig. 2, and Algorithm-2 depicts its step-by-step working. In Section 8, we report results and analysis for various data combinations, which show that our system yields better translation performance than the baseline system for the low-resource En-As language pair.
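
The two filtering steps described for the second phase (keeping Assamese sentences of at most 20 words, then dropping blank and single-word synthetic English sentences together with their Assamese counterparts) could look roughly like the sketch below. The function names are hypothetical, whitespace tokenization is an assumption, and the placeholder strings merely stand in for real sentences.

```python
def filter_monolingual(sentences, max_words=20):
    """Keep non-empty sentences with at most max_words whitespace tokens."""
    return [s.strip() for s in sentences if 0 < len(s.split()) <= max_words]

def clean_synthetic_pairs(as_sentences, en_synthetic):
    """Drop pairs whose synthetic English side is blank or a single word.

    The corresponding Assamese sentence is removed with it, so the two
    sides stay aligned.
    """
    pairs = []
    for as_sent, en_sent in zip(as_sentences, en_synthetic):
        if len(en_sent.split()) > 1:
            pairs.append((as_sent.strip(), en_sent.strip()))
    return pairs

# Placeholder data: 'as1', 'as2', 'as3' stand in for Assamese sentences.
kept = filter_monolingual(["w1 w2 w3", "", " ".join(["w"] * 21)])
pairs = clean_synthetic_pairs(
    ["as1 as1", "as2 as2", "as3 as3"],
    ["a translated sentence", "", "word"],
)
print(kept, pairs)
```

Filtering on pairs rather than on each side separately is what keeps the pseudo-parallel corpus aligned after removal.
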

Fig. 2

Proposed approach for leveraging synthetic parallel data to train the NMT Model (transformer) and fine-tuned on the “original parallel corpus + phrase pairs + noun phrases.”

Algorithm 2 Leveraging synthetic parallel data to train the NMT model, followed by fine-tuning on the parallel data (“original parallel corpus + phrase pairs + noun phrases”)
Input: Parallel dataset P_d = {X, Y} (En-As), monolingual data M_d (En/As).
1: Pre-train embeddings: P_emb ← M_d
2: With Algorithm-1, generate E_synt from M_d using NMT_{m₁} (As-to-En) trained on P_emb, P_d. ⊳ NMT_{m₁} = NMT model₁, E_synt: synthetic En sentences
3: Create D_synt ← M_d + E_synt obtained from Step 2. ⊳ M_d = monolingual data of As, D_synt: pseudo-parallel data
4: D_aug ← P_d + D_synt ⊳ D_aug = augmented data
5: Pre-train: NMT_{m₂} ← D_aug, P_emb ⊳ NMT_{m₂} = NMT model₂
6: Fine-tune on P_d: NMT_{m₃} ← pre-trained NMT_{m₂}, P_emb ⊳ NMT_{m₃} = NMT model₃
7: With Algorithm-1, generate P_t ← S_t using NMT_{m₃}. ⊳ P_t = predicted target sentence, S_t = source sentence (test set)
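
Steps 3 and 4 of Algorithm 2, combined with the 1:k ratios of original to synthetic pairs explored in Section 8, could be sketched as below. The helper `build_augmented` is hypothetical, and taking the first k·|P_d| synthetic pairs is an assumption; the text does not state how the synthetic subset is selected.

```python
def build_augmented(parallel, synthetic, k):
    """Combine original and synthetic pairs at a 1:k ratio.

    parallel:  list of (source, target) pairs from the original data (P_d)
    synthetic: list of pseudo-parallel pairs (D_synt)
    k:         number of synthetic pairs per original pair
    Assumption: the first k * len(parallel) synthetic pairs are used.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    limit = k * len(parallel)
    return list(parallel) + list(synthetic[:limit])

# Toy data: 2 original pairs and 10 synthetic pairs.
toy_parallel = [("src", "tgt")] * 2
toy_synthetic = [("mono", "synt")] * 10
d_aug = build_augmented(toy_parallel, toy_synthetic, k=3)
print(len(d_aug))
```

The model would then be pre-trained on the returned list and fine-tuned on the original pairs alone, as in steps 5 and 6.
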

7 Experimental setup

We have employed NMT systems with different models, namely RNN, BRNN, and transformer models. The models are implemented using the freely available OpenNMT-py toolkit. There are four primary steps in our experiments: unsupervised pre-training, preprocessing, supervised training, and testing. The experiments are carried out separately for En-to-As and As-to-En translation. We obtain unsupervised pre-trained word vectors from the monolingual data using GloVe. The pre-training is run for up to 100 iterations with an embedding vector size of 200. The preprocessing step tokenizes the source and target sequences and builds the word vocabularies. It also indexes each word in the vocabulary and generates the index sequences used during training. The vocabulary size for both the source and target sentences is set to 50,000. In the training process, we use a 2-layer network of LSTM units with 512 nodes in each layer for the RNN and BRNN, with a drop-out value of 0.3. For the transformer model, the default 6 layers and a drop-out of 0.1 are used. The Adam optimizer with a default learning rate of 0.001 is used. The models are trained on a single NVIDIA Quadro P2000 GPU for up to 200,000 epochs. In the testing step, predicted sentences are generated for the test data using the best model obtained from the training process. We use beam search with the default beam size of 5 to find the best translations.

8 Result and analysis

We use automatic evaluation metrics, namely BLEU, TER, RIBES, METEOR, and F-measure, to evaluate the predicted translations quantitatively. Tables 6 and 7 present the BLEU scores of the various NMT models on the four phrase-pair-injected data types shown in Table 5. Tables 6 and 7 show that the transformer model (T2: “original parallel corpus” + Set_p≥0.5) with Algorithm-1 performs best in both the En-to-As and As-to-En directions. Therefore, we consider only the transformer model (T2: “original parallel corpus” + Set_p≥0.5) in Table 8. There, the transformer model trained without the ratio (N) performs better than the model trained with the ratio (N), because the identical source-target sentences added when the ratio (N) is applied degrade translation performance. Also, the trained models are not biased towards short sentences, since for T2: “original parallel corpus” + Set_p≥0.5 the original parallel train sentences and the extracted phrase pairs already stand in a fair ratio of about 1:2 (“original parallel corpus”: Set_p≥0.5), as shown in Table 5. We compare BPE and Algorithm-1 in Tables 6 and 7. BPE has difficulty disambiguating Assamese words because Assamese is a morphologically rich language with various affixes, which explains its lower translation performance compared with the word-based Algorithm-1. Furthermore, we have investigated the performance of the transformer model with parallel noun phrases; the results are reported in Table 9. The trained As-to-En model is used to generate the synthetic parallel data. Tables 10 and 11 consider ratios of “original parallel corpus” to synthetic parallel data from 1:1 to 1:7, with and without phrase pairs + noun phrases.
In each case, the model is first pre-trained on the augmented data and then fine-tuned on the “original parallel corpus + phrase pairs + noun phrases.” The transformer model achieves its highest BLEU score for As-to-En translation with 1:4 + phrase pairs + noun phrases (Table 10) and for En-to-As translation with 1:5 + phrase pairs + noun phrases (Table 11). Tables 12, 13, and 14 compare our system with the baseline NMT system (discussed in Section 3) in terms of BLEU, TER, RIBES, METEOR, and F-measure scores. Here, our system uses 1:4 + phrase pairs + noun phrases for As-to-En and 1:5 + phrase pairs + noun phrases for En-to-As translation. For BLEU, RIBES, METEOR, and F-measure, a higher score indicates better translation accuracy, whereas for TER a lower score is better. Tables 2, 3, 4, 12, 13, and 14 show that our system outperforms the baseline NMT and phrase-based SMT systems. We attribute the better translation performance of our system to handling the OOV problem with Algorithm-1 and to data augmentation of the training data.

Table 6
BLEU scores of En-to-As NMT models on four types of data, as given in Table 5

Model	BLEU
RNN	T1: 5.55
	T2: 6.67
	T3: 6.62
	T4: 5.23
RNN (with BPE)	T1: 4.64
	T2: 4.98
	T3: 4.92
	T4: 4.12
RNN (with Algorithm-1)	T1: 5.67
	T2: 6.78
	T3: 6.73
	T4: 5.32
BRNN	T1: 5.58
	T2: 6.69
	T3: 6.64
	T4: 5.25
BRNN (with BPE)	T1: 4.78
	T2: 5.02
	T3: 4.97
	T4: 4.24
BRNN (with Algorithm-1)	T1: 5.68
	T2: 6.81
	T3: 6.76
	T4: 5.34
Transformer	T1: 5.66
	T2: 7.32
	T3: 7.26
	T4: 6.68
Transformer (with BPE)	T1: 5.17
	T2: 5.78
	T3: 5.48
	T4: 4.86
Transformer (with Algorithm-1)	T1: 5.72
	T2: 7.46
	T3: 7.34
	T4: 6.76

Table 7

BLEU scores of As-to-En NMT models on four types of data, as given in Table 5

Model	BLEU
RNN	T1: 7.72
	T2: 7.76
	T3: 7.73
	T4: 6.87
RNN (with BPE)	T1: 5.54
	T2: 5.87
	T3: 5.48
	T4: 4.92
RNN (with Algorithm-1)	T1: 7.84
	T2: 7.93
	T3: 7.88
	T4: 6.94
BRNN	T1: 7.74
	T2: 7.78
	T3: 7.77
	T4: 6.94
BRNN (with BPE)	T1: 5.68
	T2: 6.08
	T3: 5.98
	T4: 4.97
BRNN (with Algorithm-1)	T1: 7.82
	T2: 7.95
	T3: 7.86
	T4: 6.98
Transformer	T1: 8.69
	T2: 10.23
	T3: 10.16
	T4: 8.67
Transformer (with BPE)	T1: 6.96
	T2: 7.09
	T3: 6.98
	T4: 5.88
Transformer (with Algorithm-1)	T1: 8.78
	T2: 10.68
	T3: 10.22
	T4: 8.74
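To make the effect of the OOV handling easier to read off Tables 6 and 7, the following sketch computes, for the transformer rows only, the BLEU difference of each variant against the plain transformer. The scores are copied from the two tables; the dictionary layout and the helper name `deltas` are illustrative choices.

```python
# BLEU scores copied from Tables 6 and 7 (transformer rows, T1..T4).
DATASETS = ["T1", "T2", "T3", "T4"]
BLEU = {
    "En-to-As": {
        "plain": [5.66, 7.32, 7.26, 6.68],
        "bpe": [5.17, 5.78, 5.48, 4.86],
        "algorithm1": [5.72, 7.46, 7.34, 6.76],
    },
    "As-to-En": {
        "plain": [8.69, 10.23, 10.16, 8.67],
        "bpe": [6.96, 7.09, 6.98, 5.88],
        "algorithm1": [8.78, 10.68, 10.22, 8.74],
    },
}


def deltas(direction: str, variant: str) -> dict:
    """BLEU difference (variant minus plain transformer) for each train set."""
    base = BLEU[direction]["plain"]
    other = BLEU[direction][variant]
    return {d: round(o - b, 2) for d, b, o in zip(DATASETS, base, other)}


if __name__ == "__main__":
    for direction in BLEU:
        for variant in ("bpe", "algorithm1"):
            print(direction, variant, deltas(direction, variant))
```

With these numbers, every BPE difference is negative and every Algorithm-1 difference is positive in both directions.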

Table 8

BLEU scores of transformer model (T2: “original parallel corpus” + Set_p≥0.5) with Algorithm-1 on three groups of test data. EA: En-to-As and AE: As-to-En. Here, N denotes the ratio of the number of extracted phrase pairs to the number of original parallel train sentences

Sentence Group	Sentences	Transformer-EA (without N)	Transformer-EA (with N)	Transformer-AE (without N)	Transformer-AE (with N)
1-5	183	6.11	6.03	7.10	7.06
6-10	595	7.76	7.64	11.14	11.09
11-80	1722	7.54	7.51	11.09	11.06

Table 9

BLEU scores of transformer model with and without noun phrases (T2: “original parallel corpus” + Set_p≥0.5)

Translation	Parallel Corpus	BLEU
As-to-En	T2	10.68
	T2+Noun Phrases	11.34
En-to-As	T2	7.46
	T2+Noun Phrases	7.54

Table 10

BLEU scores for different ratios using the As-to-En NMT (Transformer) model (PP: phrase pairs, NP: noun phrases)

Ratio	Pre-train + Fine-tune
1:1	8.84
1:1 + PP + NP	11.06
1:2	8.88
1:2 + PP + NP	11.12
1:3	8.96
1:3 + PP + NP	11.16
1:4	9.06
1:4 + PP + NP	13.02
1:5	8.98
1:5 + PP + NP	11.02
1:6	8.86
1:6 + PP + NP	10.97
1:7	8.76
1:7 + PP + NP	10.92

Table 11

BLEU scores for different ratios using the En-to-As NMT (Transformer) model (PP: phrase pairs, NP: noun phrases)

Ratio	Pre-train + Fine-tune
1:1	4.88
1:1 + PP + NP	7.48
1:2	4.92
1:2 + PP + NP	7.50
1:3	4.96
1:3 + PP + NP	7.34
1:4	5.08
1:4 + PP + NP	7.52
1:5	5.14
1:5 + PP + NP	8.06
1:6	4.94
1:6 + PP + NP	7.54
1:7	4.78
1:7 + PP + NP	7.47
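The selection of the best ratio from Tables 10 and 11 amounts to an arg-max over the “+ PP + NP” rows. The sketch below copies those rows from the two tables; the names `RATIOS`, `AS_TO_EN`, `EN_TO_AS` and `best_ratio` are illustrative.

```python
# BLEU of the "+ PP + NP" rows, copied from Table 10 (As-to-En) and Table 11 (En-to-As).
RATIOS = ["1:1", "1:2", "1:3", "1:4", "1:5", "1:6", "1:7"]
AS_TO_EN = [11.06, 11.12, 11.16, 13.02, 11.02, 10.97, 10.92]
EN_TO_AS = [7.48, 7.50, 7.34, 7.52, 8.06, 7.54, 7.47]


def best_ratio(scores):
    """Return (ratio, BLEU) for the highest-scoring ratio."""
    index = max(range(len(scores)), key=lambda i: scores[i])
    return RATIOS[index], scores[index]


if __name__ == "__main__":
    print("As-to-En:", best_ratio(AS_TO_EN))
    print("En-to-As:", best_ratio(EN_TO_AS))
```

This picks 1:4 for As-to-En and 1:5 for En-to-As, matching the configurations used for our system in Table 12.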

Table 12

Comparison of BLEU scores between our system and the baseline

Translation	System	BLEU
As-to-En	Our System	13.02
	Baseline (RNN) [11]	7.72
En-to-As	Our System	8.06
	Baseline (RNN) [11]	5.55

Table 13

Comparison of TER and RIBES scores between our system and the baseline

Translation	System	TER	RIBES
As-to-En	Our System	84.64	0.496646
	Baseline (RNN)	91.64	0.300727
En-to-As	Our System	94.68	0.262370
	Baseline (RNN)	96.93	0.252370

Table 14

Comparison of METEOR and F-measure scores between our system and the baseline

Translation	System	METEOR	F-measure
As-to-En	Our System	0.224786	0.493848
	Baseline (RNN)	0.183634	0.482956
En-to-As	Our System	0.096528	0.229843
	Baseline (RNN)	0.094548	0.228664
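Because TER is an error rate while the other four metrics are accuracy-style scores, comparing systems across Tables 12, 13 and 14 requires flipping the sign for TER. The sketch below does this with the scores copied from the three tables; the names `HIGHER_IS_BETTER`, `SCORES` and `improvements` are illustrative.

```python
# Metric direction: True means a higher score is better.
HIGHER_IS_BETTER = {
    "BLEU": True, "TER": False, "RIBES": True, "METEOR": True, "F-measure": True,
}

# Scores copied from Tables 12, 13 and 14.
SCORES = {
    "As-to-En": {
        "ours": {"BLEU": 13.02, "TER": 84.64, "RIBES": 0.496646,
                 "METEOR": 0.224786, "F-measure": 0.493848},
        "baseline": {"BLEU": 7.72, "TER": 91.64, "RIBES": 0.300727,
                     "METEOR": 0.183634, "F-measure": 0.482956},
    },
    "En-to-As": {
        "ours": {"BLEU": 8.06, "TER": 94.68, "RIBES": 0.262370,
                 "METEOR": 0.096528, "F-measure": 0.229843},
        "baseline": {"BLEU": 5.55, "TER": 96.93, "RIBES": 0.252370,
                     "METEOR": 0.094548, "F-measure": 0.228664},
    },
}


def improvements(direction: str) -> dict:
    """Signed gain of our system over the baseline; positive always means better."""
    ours = SCORES[direction]["ours"]
    base = SCORES[direction]["baseline"]
    gains = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        diff = ours[metric] - base[metric]
        gains[metric] = round(diff if higher else -diff, 6)
    return gains


if __name__ == "__main__":
    for direction in SCORES:
        print(direction, improvements(direction))
```

Every gain is positive in both directions, with the largest margins on As-to-En (5.30 BLEU and 7.00 TER points).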

9 Conclusion and future work

This paper explores different NMT models for the low-resource En-As pair in both translation directions. We have tackled the insufficient data and OOV issues for this low-resource pair. We propose an augmentation-based NMT approach that incorporates phrase pairs, leverages large-scale synthetic parallel data, and handles the OOV problem, improving translation performance over the baseline system. In the future, we will increase the corpus size and investigate multilingual transfer learning to further address the insufficient data problem.

Acknowledgment

The authors are thankful to the Department of Computer Science and Engineering and Center for Natural Language Processing (CNLP) at the National Institute of Technology, Silchar for providing infrastructure to execute this work.

References

1. Megerdoomian and Parvaz, Low-Density Language Bootstrapping: the Case of Tajiki Persian, in: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), European Language Resources Association (ELRA), Marrakech, Morocco, (2008).
2. Probst, Brown, Carbonell, Lavie, Levin L.S. and Peterson, Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages, (2001).
3. Hogan, OCR for minority languages, in: Symposium on Document Image Understanding Technology, (1999).
4. , Hassan, Devlin and Li V.O.K., Universal Neural Machine Translation for Extremely Low Resource Languages, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT USA, June 1-6, 2018, Volume 1 (Long Papers), M.A. Walker, H. Ji and A. Stent, eds, Association for Computational Linguistics, New Orleans, Louisiana, (2018), 344–354.
5. Denkowski and Neubig, Stronger Baselines for Trustable Results in Neural Machine Translation, in: Proceedings of the First Workshop on Neural Machine Translation, Association for Computational Linguistics, Vancouver, (2017), 18–27.
6. Kocmi, Exploring Benefits of Transfer Learning in Neural Machine Translation, (2020).
7. Barman, Sarmah and Sarma, Assamese WordNet based Quality Enhancement of Bilingual Machine Translation System, in: Proceedings of the Seventh Global Wordnet Conference, University of Tartu Press, Tartu, Estonia, (2014), 256–261.
8. Dutta, Assamese Orthography: An Introduction and Some Applications for Literacy Development, in: Handbook of Literacy in Akshara Orthography, Springer International Publishing, Cham, (2019), 181–194. ISBN 978-3-030-05977-4.
9. Saharia, Das, Sharma and Kalita, Part of Speech Tagger for Assamese Text, in: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Association for Computational Linguistics, Suntec, Singapore, (2009), 33–36.

10. Kalyanee Kanchan Baruah Pranjal Das A.H. and Sarma S.K., Assamese-English Bilingual Machine Translation, International Journal on Natural Language Computing (IJNLC) 3 (2014).
11. Laskar S.R., Khilji A.F.U.R., Pakray and Bandyopadhyay, EnAsCorp1.0: English-Assamese Corpus, in: Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, Association for Computational Linguistics, Suzhou, China, (2020), 62–68.
12. Dave, Parikh and Bhattacharyya, Interlingua-based English-Hindi Machine Translation and Language Divergence, Machine Translation 16 (2001), 251–304.
13. Koehn, Statistical Machine Translation, 1st edn, Cambridge University Press, USA, 2010. ISBN 0521874157.
14. Devlin, Zbib, Huang, Lamar, Schwartz and Makhoul, Fast and Robust Neural Network Joint Models for Statistical Machine Translation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Baltimore, Maryland, (2014), 1370–1380.
15. Cho, van Merriënboer, Gulcehre C., Bahdanau, Bougares, Schwenk and Bengio, Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, (2014), 1724–1734.
16. Sutskever, Vinyals and Le Q.V., Sequence to Sequence Learning with Neural Networks, in: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, MIT Press, Cambridge, MA, USA, (2014), 3104–3112.
17. Bahdanau, Cho and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, in: 3rd International Conference on Learning Representations, ICLR 2015, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, eds, arXiv, San Diego, CA, USA, (2015), 1–15.
18. Luong, Pham and Manning C.D., Effective Approaches to Attention-based Neural Machine Translation, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, (2015), 1412–1421.
19. Ramesh S.H. and Sankaranarayanan K.P., Neural Machine Translation for Low Resource Languages using Bilingual Lexicon Induced from Comparable Corpora, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, New Orleans, Louisiana, USA, (2018), 112–119.
20. Kalchbrenner and Blunsom, Recurrent Continuous Translation Models, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Seattle, Washington, USA, (2013), 1700–1709.
21. Gehring, Auli, Grangier and Dauphin, A Convolutional Encoder Model for Neural Machine Translation, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, (2017), 123–135.
22. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł. and Polosukhin, Attention is All you Need, in: Advances in Neural Information Processing Systems 30, I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett, eds, Curran Associates, Inc., (2017), 5998–6008.
23. Ott, Edunov, Grangier and Auli, Scaling Neural Machine Translation, in: Proceedings of the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics, Brussels, Belgium, (2018), 1–9.
24. Pathak, Pakray and Bentham, English–Mizo Machine Translation using neural and statistical approaches, Neural Computing and Applications 30 (2018), 1–17.
25. Lalrempuii and Soni, Attention-Based English to Mizo Neural Machine Translation, in: Machine Learning, Image Processing, Network Security and Data Sciences, Springer Singapore, Singapore, (2020), 193–203. ISBN 978-981-15-6318-8.
26. Pathak and Pakray, Neural Machine Translation for Indian Languages, Journal of Intelligent Systems (2018), 1–13.
27. Ngo T.-V., Ha T.-L., Nguyen P.-T. and Nguyen L.-M., Overcoming the Rare Word Problem for low-resource language pairs in Neural Machine Translation, in: Proceedings of the 6th Workshop on Asian Translation, Association for Computational Linguistics, Hong Kong, China, (2019), 207–214.
28. Koehn and Knowles, Six Challenges for Neural Machine Translation, in: Proceedings of the First Workshop on Neural Machine Translation, Association for Computational Linguistics, Vancouver, (2017), 28–39.
29. Papineni, Roukos, Ward and Zhu W.-J., BLEU: A Method for Automatic Evaluation of Machine Translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Association for Computational Linguistics, Stroudsburg, PA, USA, (2002), 311–318.
30. Snover, Dorr, Schwartz, Micciulla and Makhoul, A study of translation edit rate with targeted human annotation, in: Proceedings of Association for Machine Translation in the Americas, (2006), 223–231.
31. Isozaki, Hirao, Duh, Sudoh and Tsukada, Automatic Evaluation of Translation Quality for Distant Language Pairs, in: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Cambridge, MA, (2010), 944–952.
32. Lavie and Denkowski M.J., The Meteor Metric for Automatic Evaluation of Machine Translation, Machine Translation 23(2–3) (2009), 105–115.
33. Sen, Hasanuzzaman, Ekbal, Bhattacharyya and Way, Neural Machine Translation of Low-resource Languages using SMT Phrase Pair Injection (2018).
34. Och F.J. and Ney, A Systematic Comparison of Various Statistical Alignment Models, Computational Linguistics 29(1) (2003), 19–51.
35. Koehn, Och F.J. and Marcu, Statistical Phrase-Based Translation, in: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, (2003), 127–133.
36. , Li, Yang and Dong, A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation, Inf 11(5) (2020), 255.
37. Rikters, Impact of Corpora Quality on Neural Machine Translation, in: Human Language Technologies - The Baltic Perspective - Proceedings of the Eighth International Conference Baltic HLT 2018, Tartu, Estonia, 27-29 September 2018, Frontiers in Artificial Intelligence and Applications, Vol. 307, IOS Press, (2018), 126–133.
38. Sennrich, Haddow and Birch, Improving Neural Machine Translation Models with Monolingual Data, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, (2016), 86–96.
39. Abdulmumin, Galadanci B.S. and Garba, Tag-less backtranslation, arXiv preprint arXiv:1912.10514 (2019).
40. Aminian, Ghoneim and Diab, Handling OOV Words in Dialectal Arabic to English Machine Translation, in: Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants, Association for Computational Linguistics, Doha, Qatar, (2014), 99–108.
41. Luong, Sutskever, Le, Vinyals and Zaremba, Addressing the Rare Word Problem in Neural Machine Translation, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, (2015), 11–19.
42. Sennrich, Haddow and Birch, Neural Machine Translation of Rare Words with Subword Units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, (2016), 1715–1725.
43. Bhat I.A., Mujadia, Tammewar, Bhat R.A. and Shrivastava, IIIT-H System Submission for FIRE2014 Shared Task on Transliterated Search, in: Proceedings of the Forum for Information Retrieval Evaluation, FIRE ’14, Association for Computing Machinery, New York, NY, USA, (2014), 48–53. ISBN 9781450337557.
44. Pennington, Socher and Manning C.D., Glove: Global Vectors for Word Representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, A meeting of SIGDAT, a Special Interest Group of the ACL, A. Moschitti, B. Pang and W. Daelemans, eds, ACL, Doha, Qatar, (2014), 1532–1543.

Table 1

Data statistics of EnAsCorp1.0

Type	Set	Sentences	Tokens (En)	Tokens (As)
Parallel	Train	203,315	2,414,172	1,986,270
	Validation	4,500	74,561	59,677
	Test Set	2,500	41,985	34,643
Monolingual		2,624,828	-	45,900,321

Table 5

Train data statistics after phrase pairs injection

Type	Sentences
T1: “original parallel corpus”	203,315
T2: “original parallel corpus” + Set_p≥0.5	639,787
T3: “original parallel corpus” + Set_p=1	445,957
T4: “original parallel corpus” + Set_all	1,433,121