A more cost-efficient Chinese Named Entity Recognition based on trigger and matching network

Abstract

The lack of training data in new domains is a typical problem for named entity recognition (NER). Researchers have recently introduced the “entity trigger” to improve the cost-effectiveness of NER models. However, this approach still requires annotators to attach additional trigger labels, which increases their workload. Moreover, entity triggers have so far been applied only to English text and have not been studied for other languages. To address these problems, we propose a more cost-effective trigger tagging method and matching network. The approach not only tags entity triggers automatically based on the characteristics of Chinese text, but also adds a Mogrifier LSTM to the matching network to reduce the context-free representation of input tokens. Experiments on two public datasets show that our automatic triggers are effective, and that with automatic triggers our model outperforms other state-of-the-art methods (F1 scores increase by 1∼4 points).

Keywords

Chinese NER; entity trigger; Mogrifier LSTM; TMN; m-TMN

1 Introduction

Named entity recognition (NER) is a basic information extraction task. It extracts entities from unstructured text and classifies them into predefined types (such as person, location, and organization names) [1]. Recent progress in NER centers on neural network models such as BiLSTM-CRF [2] and Lattice_LSTM [3]. When the label set is fixed and enough training data is available, most neural NER models perform well [4–9]. However, as NER is deployed more widely, for example in task-oriented systems, it is challenged by the diversity and rapid change of application domains [4]. Training an NER model requires a large amount of manually annotated data. If we follow the usual training procedure in each new domain, we must annotate a large amount of new training data, which is not economical. How to train a neural model with less labeled data has therefore become a significant research question in NER.

To address this problem, Lin et al. proposed the concept of the entity trigger and the trigger matching network (TMN) framework, based on the characteristics of English text [5]. Using trigger-enhanced training data together with manual entity annotations, their method reaches with 20% of the labeled data the results that the traditional method achieves with 70% of the training data. The data augmentation method of Zeng et al. [6], based on matching entities with their contextual content, is another excellent contribution to this problem and offers guidance for related future research.

However, texts in different languages have different characteristics. Lin et al. conducted experiments only on English text, and whether the approach is equally effective for other languages (such as Chinese) requires further experiments. Moreover, trigger tagging still has to be done manually. This raises a concern: within a sentence, annotators judge quite subjectively whether a word is a trigger, and the resulting data may affect the experimental results. To address this concern and make trigger annotation more efficient, can we adopt a fixed-format and equally effective trigger tagging method based on the grammatical characteristics of the language?

Therefore, we make the following improvements. Based on our study of Chinese grammar and the NER causality model [6], we assume that an entity trigger in Chinese text can simply be taken to be the context adjacent to the entity, so annotators need not spend extra effort analyzing sentence triggers. For example, in the sentence shown in Fig. 7, the location entity (Fig. 8) is highly correlated with its context (Fig. 9). However, because the structure of the training data is complex and variable, this mechanical trigger tagging method introduces some non-trigger words, and these extra words degrade the performance of TMN. To address this problem, we propose m-TMN. It uses the Mogrifier LSTM [25] structure, which conditions the input embedding on the recurrent state, to mitigate the context-independent representation of input tokens.

In summary, our work has the following contributions:

We propose a cost-efficient entity trigger tagging method for Chinese text data. This method eliminates manual tagging, improves efficiency, and avoids the subjectivity of manual tagging.

Based on the above trigger annotation method, we propose an m-TMN model that strengthens semantic matching and generalization. Experiments on two public Chinese NER datasets show that, on the same trigger-enhanced training data, our model performs better than the TMN model.

2 Trigger tagging method

The questions we consider are how to label entity triggers quickly and verify their effectiveness, and then how to enhance the original model according to the characteristics of these triggers.

In this section, we will introduce basic concepts and symbols, describe the traditional NER data annotation process and its way to annotate English entity triggers. On this basis, we propose an annotation method for the Chinese entity trigger. Finally, we present a task of training a Chinese NER model based on Chinese trigger-enhanced data.

In traditional supervised learning with neural networks, c = [c⁽¹⁾ ; c⁽²⁾ ; … ; c⁽ⁿ⁾] denotes a sentence in the labeled data set D_L. Each sentence is paired with an NER tag sequence s = [s⁽¹⁾ ; s⁽²⁾ ; … ; s⁽ⁿ⁾], where $s^{(i)} \in S$ and the tag set $S$ takes a form such as {O, B-PER, I-PER, E-LOC, ...}. The tagging scheme may be BIO or BIOES. We thus have the labeled training corpus D_L = { (c_i, s_i) } and the unlabeled corpus D_U = { c_i }. We use $T(c, s)$ to denote the set of entity triggers annotated for a sentence. Each trigger $t \in T(c, s)$ associates an entity index e with a set of word indexes {v_i}.

For the English data set, following the manual tagging method of Lin et al. [5], a trigger is written $t = (\{v_{1}, v_{2}, \dots\} \to e)$, where e and the v_i are integers in the range [1, |c|]. In the example shown in Fig. 1(a), “I had” can be represented as the trigger t₁ = ({ 3, 4 } → 8) of the entity “Dongbei Restaurant”, because it consists of the words at indexes 3 and 4 and points to the entity starting at index 8. The entity in this sentence therefore has the trigger set $T(c, s) = \{t_{1}, t_{2}\}$.

Fig. 1

Trigger tagging for English text and Chinese text. (a) “I had” and “at” are the triggers of the entity “Dongbei Restaurant”; they must be tagged manually. (b) The characters shown in Fig. 10 are the triggers of the entity shown in Fig. 11. They are distributed around the entity.

For the Chinese data set, inspired by Chinese text expression and recent NER data augmentation work [6], we use the index of the first character of the entity as the entity index and the two characters on each side of the entity as its triggers. The concrete tagging method is as follows. For an entity in the middle of a sentence, we choose the two characters on each side of the entity as entity triggers. For an entity at the beginning or end of a sentence, only the two characters on the available side of the entity are selected. If there are fewer than two characters on one side of the entity, only the one available character is selected as the entity trigger.

In the example shown in Fig. 1(b), the characters shown in Fig. 12 can be represented as the trigger t₁ = ({ 3, 4 } → 5) of the entity shown in Fig. 13, because this trigger consists of the characters at indexes 3 and 4 and points to the entity starting at index 5. The entity in this sentence therefore has the trigger set $T(c, s) = \{t_{1}, t_{2}\}$. Fig. 1 shows that, compared with English triggers, Chinese triggers are concentrated in the characters around the entity. We therefore propose to label the two characters on each side of every entity in the Chinese data set as its triggers programmatically.
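The tagging rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `tag_triggers`, the span format (1-based, inclusive character positions) and the example sentence are assumptions made for this illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]            # (start, end), 1-based and inclusive
Trigger = Tuple[List[int], int]   # (trigger character indexes, entity index)


def tag_triggers(sentence: str, entities: List[Span], window: int = 2) -> List[Trigger]:
    """Label up to `window` characters on each side of every entity as a trigger.

    The entity index is the index of the entity's first character.
    A side with no characters (entity at the sentence boundary) yields no trigger.
    """
    n = len(sentence)
    triggers: List[Trigger] = []
    for start, end in entities:
        left = list(range(max(1, start - window), start))
        right = list(range(end + 1, min(n, end + window) + 1))
        for side in (left, right):
            if side:
                triggers.append((side, start))
    return triggers


# Hypothetical sentence (not taken from the datasets); the entity spans characters 5..8.
example = "我今天在东北饭店吃饭"
print(tag_triggers(example, [(5, 8)]))
```

For the hypothetical sentence above, the left trigger covers indexes 3 and 4 and the right trigger covers indexes 9 and 10, both pointing to entity index 5. The `window` parameter is exposed only so that other context sizes can be tried.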

Adding triggers by the above method yields a new data format, $D_{T} = \{(c_{i}, s_{i}, T(c_{i}, s_{i}))\}$. The first research goal of this paper is to annotate Chinese text triggers with this method to obtain $D_{T}$, and then to verify the effectiveness of the triggers by comparison with a traditional model trained on D_L. The second research goal is to improve TMN based on the characteristics of these triggers, and then to compare the performance of the TMN model and the new model on the same $D_{T}$.

3 Model

In this section, we first introduce the Mogrifier LSTM structure and then propose m-TMN, which is mainly composed of a trigger encoder, a semantic matching module and a sequence tagger. Finally, we briefly introduce the classical BiLSTM-CRF model [9].

3.1 Mogrifier LSTM

Mogrifier LSTM is an improvement on the LSTM. The standard LSTM update [21] is as follows, where m and n denote the dimensions of the input and the state, respectively: $\mathrm{LSTM} : \mathbb{R}^{m} \times \mathbb{R}^{n} \times \mathbb{R}^{n} \to \mathbb{R}^{n} \times \mathbb{R}^{n}$ (1) $\mathrm{LSTM}(x, c_{prev}, h_{prev}) = (c, h)$ (2)

At each time step t (t ∈ [1, …, m]), the standard LSTM calculates the current hidden vector h^(t) based on a memory cell c^(t). In particular, the input gate i^(t), output gate o^(t), forget gate f^(t) and candidate cell ${\tilde{c}}^{(t)}$ are calculated from the previous hidden vector h^(t-1) and the current input x^(t) as follows: $\begin{bmatrix} i^{(t)} \\ o^{(t)} \\ f^{(t)} \\ {\tilde{c}}^{(t)} \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W [h^{(t - 1)}; x^{(t)}] + b \right)$ (3) $c^{(t)} = i^{(t)} \odot {\tilde{c}}^{(t)} + f^{(t)} \odot c^{(t - 1)}$ (4)

Here W and b are the weight matrix and the bias, σ denotes the logistic sigmoid function, and ⊙ denotes elementwise multiplication.

Mogrifier LSTM extends the LSTM with gates that scale the columns of all its weight matrices W^** in a context-dependent manner. Its two inputs x and h_prev modulate each other in an alternating manner before the usual LSTM calculation takes place (see Fig. 2). That is, $\mathrm{Mogrify}(x, c_{prev}, h_{prev}) = \mathrm{LSTM}(x^{\uparrow}, c_{prev}, h_{prev}^{\uparrow})$, where the modulated inputs $x^{\uparrow}$ and $h_{prev}^{\uparrow}$ are defined as the highest-indexed $x^{i}$ and $h_{prev}^{i}$, respectively: $x^{i} = 2 \sigma (Q^{i} h_{prev}^{i - 1}) \odot x^{i - 2}, \quad \text{for odd } i \in [1 \dots r]$ (5)

Fig. 2

A Mogrifier with 5 rounds of updates. The previous state h⁰ = h_prev is fed through a sigmoid and gates x^-1 = x in an elementwise manner, producing x¹. Then x¹ gates h⁰, producing h². After several repetitions of this mutual gating cycle, the final values of the h^* and x^* sequences are fed to the LSTM cell.

$h_{prev}^{i} = 2 \sigma (R^{i} x^{i - 1}) \odot h_{prev}^{i - 2}, \quad \text{for even } i \in [1 \dots r]$ (6) with $x^{-1} = x$ and $h_{prev}^{0} = h_{prev}$. The only hyperparameter is the number of “rounds”, $r \in \mathbb{N}$. The matrices are factorized as $Q^{i} = Q_{left}^{i} Q_{right}^{i}$, with $Q^{i} \in \mathbb{R}^{m \times n}$, $Q_{left}^{i} \in \mathbb{R}^{m \times k}$, $Q_{right}^{i} \in \mathbb{R}^{k \times n}$ and k < min(m, n).
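The alternating gating of Eqs. (5) and (6) can be illustrated with a small pure-Python sketch. It is a simplified illustration under stated assumptions rather than the model's actual code: the names `mogrify`, `Q` and `R` are chosen here, the matrices are stored unfactorized in dictionaries keyed by the round index, and the LSTM cell that consumes the modulated inputs is omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def matvec(a: Matrix, v: Sequence[float]) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in a]


def mogrify(x: Sequence[float], h_prev: Sequence[float],
            Q: Dict[int, Matrix], R: Dict[int, Matrix],
            rounds: int) -> Tuple[Vector, Vector]:
    """Mutually gate x (length m) and h_prev (length n) for `rounds` rounds.

    Q[i] has shape m x n and is used in odd rounds (Eq. 5);
    R[i] has shape n x m and is used in even rounds (Eq. 6).
    """
    x_cur, h_cur = list(x), list(h_prev)
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            gate = matvec(Q[i], h_cur)
            x_cur = [2.0 * sigmoid(g) * xi for g, xi in zip(gate, x_cur)]
        else:
            gate = matvec(R[i], x_cur)
            h_cur = [2.0 * sigmoid(g) * hi for g, hi in zip(gate, h_cur)]
    return x_cur, h_cur


# With all-zero matrices every gate equals 2 * sigmoid(0) = 1, so the inputs pass through unchanged.
zero_q = {i: [[0.0] * 3 for _ in range(2)] for i in (1, 3, 5)}
zero_r = {i: [[0.0] * 2 for _ in range(3)] for i in (2, 4)}
print(mogrify([1.0, -2.0], [0.5, 0.0, 3.0], zero_q, zero_r, rounds=5))
```

The factor 2 in front of the sigmoid means that a gate with zero pre-activation leaves its input unchanged, so the modulation starts near the identity when the gating matrices are small.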

3.2 Mogrifier trigger matching networks

The Mogrifier Trigger Matching Network (m-TMN) is our improvement of TMN. Based on the characteristics of Chinese text and of the cost-efficient Chinese triggers, it modifies the original trigger encoding method and incorporates the Mogrifier LSTM structure to mitigate the context-independent representation of input tokens.

3.2.1 Trigger encoder and semantic matching module

Recall the trigger example discussed in Section 2 (Fig. 1(b)): it shows that attention-based matching between entity triggers and sentences is necessary. Learning trigger representations and semantic matching are therefore two inseparable tasks. The trigger vector should capture the trigger's semantics in an embedding space shared with the token hidden states. At this stage, the TMN uses a shared embedding space to jointly train the trigger encoder and the attention-based trigger matching module.

Specifically, in a sentence c containing multiple entities {e₁, e₂, …}, we assume that each entity e_i has a set of triggers $T_{i} = \{t_{1}^{(i)}, t_{2}^{(i)}, \dots\}$. For more efficient training, we construct the data set $D_{T}$ from the trigger annotations. We then create training instances by pairing each entity with each of its triggers, denoted $(c, e_{i}, t_{j}^{(i)})$.

Because the triggers of the English data set are manually labeled and match their entities closely, the public TMN uses an LSTM to encode the training instances. Triggers tagged programmatically on Chinese data sets do not match as closely, so plain LSTM encoding is less appropriate. Given the strengths of the Mogrifier LSTM described above, we first apply a Mogrifier LSTM to the training instance (c, e, t) to obtain a sequence of hidden states. We use H to denote the matrix containing the hidden vectors of all tokens in the sentence, and Z to denote the matrix containing the hidden vectors of all tokens in the trigger t. We call the new model m-TMN.

Then, we follow the self-attention method of Lin et al. [23] to learn trigger and sentence representations, as follows: ${\vec{a}}_{sent} = \mathrm{SoftMax}(W_{2} \tanh (W_{1} H^{T}))$ (7) $g_{s} = {\vec{a}}_{sent} H$ (8) ${\vec{a}}_{trig} = \mathrm{SoftMax}(W_{2} \tanh (W_{1} Z^{T}))$ (9) $g_{t} = {\vec{a}}_{trig} Z$ (10)

W₁ and W₂ are two trainable parameters for computing the self-attention score vectors ${\vec{a}}_{sent}$ and ${\vec{a}}_{trig}$. The sentence vector g_s is the weighted sum of the token vectors in the entire sentence. Similarly, the final trigger vector g_t is the weighted sum of the token vectors in the trigger.
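As a concrete reading of Eqs. (7) to (10), the following pure-Python sketch pools a matrix of token hidden states into one vector. The function name `self_attention_pool` and the shapes (W₁ as a d_a × d matrix, W₂ as a vector of length d_a) are assumptions made for this illustration; the same function would be applied to H to obtain g_s and to Z to obtain g_t.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_pool(H: Matrix, W1: Matrix, W2: Sequence[float]) -> Tuple[Vector, Vector]:
    """Return (attention weights over tokens, weighted sum of token vectors).

    H is n x d (one row per token), W1 is d_a x d, W2 has length d_a.
    The score of token i is W2 . tanh(W1 h_i), which is column i of W2 tanh(W1 H^T).
    """
    scores = []
    for token in H:
        hidden = [math.tanh(sum(w * t for w, t in zip(row, token))) for row in W1]
        scores.append(sum(w * u for w, u in zip(W2, hidden)))
    weights = softmax(scores)
    dim = len(H[0])
    pooled = [sum(a * token[j] for a, token in zip(weights, H)) for j in range(dim)]
    return weights, pooled


# Toy hidden states for a three-token sequence (made-up numbers).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, g = self_attention_pool(H, W1=[[0.5, -0.5]], W2=[1.0])
print(weights, g)
```

Because the weights come from a softmax, they are positive and sum to one, so the pooled vector always lies inside the convex hull of the token vectors.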

m-TMN uses the type of the entity associated with a trigger as supervision to guide the trigger representation. The trigger vector g_t is therefore fed to a multi-class classifier that predicts the type of its corresponding entity e. The trigger classification loss is: $L_{TC} = - \sum \log P(\mathrm{type}(e) \mid g_{t}; \theta_{TC})$ (11)

Next, we assume that a trigger and its matched sentence should have similar vector representations. We therefore use a contrastive loss [24] to learn to match triggers and sentences. The matching module is trained as follows. We first randomly mix triggers and sentences, which gives two kinds of training instances (matches and mismatches). Both kinds are then fed to the semantic matching module. The contrastive loss is defined as follows, where l_matched is 1 for a matching instance and 0 otherwise, and m is the margin: $d = {\lVert g_{s} - g_{t} \rVert}_{2}$ (12) $L_{SM} = l_{matched} \frac{1}{2} d^{2} + (1 - l_{matched}) \frac{1}{2} \{\max (0, m - d)\}^{2}$ (13)
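The matching objective can be sketched as follows. The sketch follows the stated intent that a trigger and its matched sentence should have similar representations: matched pairs are pulled together, and mismatched pairs are pushed at least a margin apart. The function name and the default margin of 1.0 are assumptions made for this illustration, not values reported in the paper.

```python
import math
from typing import Sequence


def contrastive_loss(g_s: Sequence[float], g_t: Sequence[float],
                     matched: bool, margin: float = 1.0) -> float:
    """Contrastive loss on the L2 distance between a sentence vector and a trigger vector."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_s, g_t)))
    if matched:
        # Matched pair: penalize any distance.
        return 0.5 * d ** 2
    # Mismatched pair: penalize only if closer than the margin.
    return 0.5 * max(0.0, margin - d) ** 2


print(contrastive_loss([0.0, 0.0], [3.0, 4.0], matched=True))               # distance 5
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], matched=False, margin=6.0))  # 1 inside the margin
```

A mismatched pair that is already farther apart than the margin contributes zero loss, so training effort concentrates on the pairs that are still confusable.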

Thus, the joint loss of the first stage is L = L_TC + λL_SM, where λ is a hyper-parameter. Figure 3 shows this joint training process.

Fig. 3

Jointly train the Trigger encoder (via trigger classification) and the semantic matching module (via contrastive loss).

3.2.2 Trigger-enhanced sequence tagging and inference on unlabeled sentences

After obtaining the mean trigger vector ${\hat{g}}_{t}$ as the query, we follow the conventional attention method [26] to create a sequence of attention-based token representations $H^{'}$: $\vec{a} = \mathrm{SoftMax}(v^{T} \tanh {(U_{1} H^{T} + U_{2} {\hat{g}}_{t}^{T})}^{T})$ (14) $H^{'} = \vec{a} H$ (15)

U₁, U₂ and v are trainable parameters for calculating the trigger-enhanced attention scores. We then concatenate the original token representation H with the trigger-enhanced representation H′ as the input to the CRF tagger.

When inferring tags for unlabeled sentences, we use trigger matching to calculate the similarity between the sentence and each trigger, and select the best trigger as the additional input of the sequence tagger. Specifically, we obtain a trigger dictionary from the training data, T = {t | (·, ·, t) ∈ D_T}. For an unlabeled sentence, we first use the trained matching module to calculate its self-attended vector g_s, retrieve the best-matching trigger from the dictionary, and obtain ${\hat{g}}_{t}$. We then use ${\hat{g}}_{t}$ as the attention query of the sequence tagger. In this way, the model can tag unlabeled sentences with trigger enhancement.
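The retrieval step at inference time can be sketched as a nearest-neighbor search over the trigger dictionary. This is an illustrative sketch under assumptions: the function name `retrieve_mean_trigger`, the use of L2 distance as the similarity measure, and the parameter `k` are chosen here, with `k = 1` corresponding to selecting the single best trigger and larger values averaging several retrieved triggers into the mean vector.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve_mean_trigger(g_s: Sequence[float],
                          trigger_vectors: Sequence[Sequence[float]],
                          k: int = 1) -> Vector:
    """Average the k trigger vectors closest to the sentence vector g_s."""
    ranked = sorted(trigger_vectors, key=lambda g_t: l2_distance(g_s, g_t))
    nearest = ranked[:k]
    dim = len(g_s)
    return [sum(v[j] for v in nearest) / len(nearest) for j in range(dim)]


# Toy trigger dictionary with made-up vectors.
dictionary = [[1.0, 0.0], [5.0, 5.0], [0.0, 3.0]]
print(retrieve_mean_trigger([0.0, 0.0], dictionary, k=1))
print(retrieve_mean_trigger([0.0, 0.0], dictionary, k=2))
```

The returned vector plays the role of the attention query for the sequence tagger; the trigger dictionary is assumed to be non-empty.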

3.3 BiLSTM-CRF model

The BiLSTM-CRF model [9] is one of the classic models for the NER task. It consists of an embedding layer, a BiLSTM layer, a tanh layer and a CRF layer. First, the embedding layer represents the sentence as a sequence of vectors S = (s₁, …, s_t, …, s_n). In the BiLSTM layer, the forward LSTM calculates the representation $\overrightarrow{h}_{t}$, and the backward LSTM calculates the representation $\overleftarrow{h}_{t}$ of the same sequence. The encoding of a word is therefore $h_{t} = [\overrightarrow{h}_{t}; \overleftarrow{h}_{t}]$. The output is then sent to the tanh layer to predict confidence scores for the word, that is, a score for each possible tag of the word. Finally, the CRF layer decodes the best tag sequence among all possible tag sequences.
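Decoding in a linear-chain CRF layer is commonly done with the Viterbi algorithm. The sketch below shows that algorithm on toy scores; the function name, the score layout (`emissions[t][j]` for tag j at position t, `transitions[i][j]` for moving from tag i to tag j) and the numbers are assumptions made for this illustration, and start and end transitions are left out for brevity.

```python
from typing import List

Matrix = List[List[float]]


def viterbi(emissions: Matrix, transitions: Matrix) -> List[int]:
    """Return the highest-scoring tag index sequence for one sentence."""
    k = len(emissions[0])
    score = list(emissions[0])          # best score of any path ending in each tag
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        prev_best: List[int] = []
        new_score: List[float] = []
        for j in range(k):
            # Best previous tag for reaching tag j at this position.
            i_best = max(range(k), key=lambda i: score[i] + transitions[i][j])
            prev_best.append(i_best)
            new_score.append(score[i_best] + transitions[i_best][j] + emit[j])
        backpointers.append(prev_best)
        score = new_score
    # Trace the best path backwards from the best final tag.
    path = [max(range(k), key=lambda j: score[j])]
    for prev_best in reversed(backpointers):
        path.append(prev_best[path[-1]])
    path.reverse()
    return path


emissions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(viterbi(emissions, [[0.0, 0.0], [0.0, 0.0]]))     # no transition preference
print(viterbi(emissions, [[0.0, -5.0], [-5.0, 0.0]]))   # switching tags is penalized
```

With zero transition scores the decoder simply follows the per-token maxima; with a strong penalty on switching tags it prefers a consistent sequence even where a single token score disagrees.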

4 Experiments

In this section, we first discuss how to collect the triggers in the Chinese text data set, and then verify the effect of the triggers annotated with this method and the strength of our new framework m-TMN.

4.1 Tagging of trigger

For Chinese NER, we use two public data sets: Resume NER [10], from the resume domain, and Weibo NER [11], from the social media domain. Table 1 shows the statistics of these two data sets. Both have been widely studied with NER models such as BiLSTM-CRF.

Table 1
Statistics of datasets

Datasets	Type	Train	Dev	Test
Resume	Sentence	3.8k	0.46k	0.48k
	Char	124.1k	13.9k	15.1k
Weibo	Sentence	1.4k	0.27k	0.27k
	Char	73.8k	14.5k	14.8k

To collect triggers for the Chinese data sets efficiently, we studied Chinese text expression and ran extensive experiments. We finally decided to programmatically label the two characters on each side of an entity as its triggers (experiments with other numbers of characters performed poorly). The details are given in Section 2. For Chinese NER, our trigger tagging method can be carried out by a program instead of by hand, which eliminates the labor cost. The effectiveness of these triggers is demonstrated in the following experiments.

4.2 Set up

Trigger efficiency verification. We need a base model to compare with the public TMN model in order to verify whether the triggers labeled by our method are effective. We choose BiLSTM-CRF [9] as the base model because of its popularity in neural network and application research. In addition, the BiLSTM and CRF hyperparameters of the two models are identical, which ensures a fair comparison between the base model and the TMN model. We consider the triggers effective if the TMN model can match the results of the traditional model while using less training data. During training, we therefore split the data set at different fixed percentages (5%∼100%) and train the two models on the same splits.

Performance verification of the m-TMN model. We compare the performance of our m-TMN model with the standard TMN model using the same amount of training data and the same triggers. The data set is split in the same way as above. The experimental method and the data set splits follow the experimental settings of Lin et al. [5].

4.3 NER evaluation

The evaluation metrics generally used for NER are precision (P), recall (R), and the F1 value. Precision is the ratio of the number of completely correct entities predicted by the model to the number of all entities predicted by the model: $\mathrm{Precision} = \frac{TP}{TP + FP}$ (16)

Here TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. Recall is the ratio of the number of completely correct entities predicted by the model to the number of real entities in the sample: $\mathrm{Recall} = \frac{TP}{TP + FN}$ (17)

It should be emphasized that NER labels words (characters) in order to find entities. Precision and recall are therefore calculated over predicted entities rather than over individual labels. The F1 value is the harmonic mean of precision and recall: $F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (18)
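Entity-level scoring can be sketched as set operations over predicted and gold entity spans, with F1 computed as the harmonic mean 2PR/(P + R). The function name `entity_prf` and the span format (start, end, type) are assumptions made for this illustration; an entity counts as correct only if its boundaries and type both match exactly.

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, str]  # (start index, end index, entity type)


def entity_prf(gold: Iterable[Entity], predicted: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exact-match entities."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)                      # completely correct entities
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [(0, 1, "PER"), (3, 5, "LOC"), (7, 8, "ORG")]
pred = [(0, 1, "PER"), (3, 4, "LOC")]   # second span has the wrong boundary
print(entity_prf(gold, pred))
```

As a consistency check on the harmonic-mean formula, the first Resume row of Table 2 (P = 73.57, R = 76.69) gives 2 · 73.57 · 76.69 / (73.57 + 76.69) ≈ 75.10, which agrees with the reported F1.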

4.4 Word embeddings

To examine how m-TMN interacts with pre-trained word embeddings, we ran two sets of experiments, one with randomly initialized word embeddings and one with Chinese pre-trained word embeddings. We chose the Chinese pre-trained word vectors published by Tencent AI Lab. This data contains 100,000 Chinese words, each represented as a 200-dimensional vector. The vectors are trained on Tencent’s large-scale, multi-source corpus, so they cover a variety of domains. The training algorithm is the Directional Skip-Gram (DSG) algorithm developed by Tencent AI Lab [29]. DSG builds on Skip-Gram (SG) [28]: in addition to the co-occurrence of word pairs, it considers their relative positions, which improves the accuracy of the semantic representation of the word vectors.

4.5 Result

The experimental results are shown in Tables 2 and 3.

Table 2
Results on datasets: Resume and Weibo with random word embedding. “sent.” means the percentage of the sentences (labeled only with entity tags). “trig” means the percentage of the sentences (labeled with both entity tags and trigger tags)

Resume
	BiLSTM-CRF	TMN	m-TMN
Sent.	P	R	F1	trig.	P	R	F1	P	R	F1
5%	73.57	76.69	75.10	5%	79.37	81.28	80.32	84.21	85.09	84.65
7%	81.29	81.04	81.17	7%	80.56	86.01	83.20	84.22	85.10	84.66
10%	83.41	84.23	83.82	10%	83.88	86.20	85.02	88.11	88.65	88.38
13%	84.46	87.36	85.89	13%	86.03	87.24	86.63	88.25	89.82	89.02
20%	86.89	89.08	87.97	15%	86.19	89.65	87.88	89.56	90.43	90.04
30%	88.00	90.92	89.44	17%	87.76	90.18	88.96	89.63	91.23	90.42
40%	89.90	91.17	90.53	20%	89.52	90.12	89.52	89.81	91.35	90.57
Weibo
	BiLSTM-CRF	TMN	m-TMN
sent.	P	R	F1	trig.	P	R	F1	P	R	F1
5%	37.18	20.81	26.69	5%	33.74	26.51	26.69	41.09	27.23	32.75
10%	41.90	28.47	33.90	10%	42.30	30.86	35.68	39.19	34.69	36.80
20%	45.99	31.58	37.45	13%	38.03	34.21	36.02	38.85	35.41	37.05
30%	46.15	34.45	39.45	15%	41.62	33.25	36.97	41.23	33.73	37.11
40%	43.94	39.00	41.32	17%	39.11	37.80	38.44	48.08	32.78	38.65
50%	45.53	39.00	42.01	20%	48.64	34.21	40.17	44.62	39.71	42.03

Table 3

Results on datasets: Resume and Weibo with Tencent word embedding. “sent.” means the percentage of the sentences (labeled only with entity tags). “trig” means the percentage of the sentences (labeled with both entity tags and trigger tags)

Resume
	BiLSTM-CRF			TMN			m-TMN
Sent.	P	R	F1	trig.	P	R	F1	P	R	F1
5%	79.57	77.42	78.48	5%	82.53	84.91	83.70	86.66	87.32	85.97
7%	82.78	82.58	82.68	7%	85.40	87.11	86.24	87.92	89.26	88.58
10%	85.71	82.76	84.21	10%	85.80	88.22	86.99	88.84	90.80	89.81
20%	87.92	89.26	88.58	15%	88.90	90.43	89.66	89.28	90.98	90.12
30%	90.04	91.53	90.78	17%	89.28	90.80	89.81	89.70	90.86	90.28
40%	90.39	91.72	91.05	20%	90.09	90.37	90.23	90.72	91.72	91.21
Weibo
	BiLSTM-CRF			TMN			m-TMN
sent.	P	R	F1	trig.	P	R	F1	P	R	F1
5%	23.40	18.42	20.62	5%	38.62	22.89	28.74	37.64	23.74	29.12
10%	47.14	25.60	33.18	10%	37.99	25.42	30.46	44.81	29.95	35.17
20%	58.62	28.47	38.33	13%	38.85	25.90	31.08	47.00	35.65	40.54
30%	56.02	32.30	40.97	15%	48.06	29.67	36.69	47.38	36.84	41.45
40%	59.48	33.01	42.46	20%	48.70	31.34	38.14	48.07	38.76	42.91

Tables 2 and 3 report the results with randomly initialized word embeddings and with the Tencent pre-trained word vectors, respectively. TMN is the model proposed by Lin et al. for English triggers [5]; we train it on the Chinese datasets using the Chinese trigger tagging method of this paper. m-TMN is our improved TMN model based on the characteristics of Chinese data; it is trained on the same Chinese datasets with the same trigger tagging method. After repeated experiments, we selected 5%∼40% of the training data of the Resume NER dataset and 5%∼50% of the training data of the Weibo NER dataset for the comparison. The comparison between the BiLSTM-CRF columns and the TMN columns shows that the triggers collected by our method are effective: the TMN model based on these triggers improves considerably on the BiLSTM-CRF model. The comparison between the TMN columns and the m-TMN columns shows that the m-TMN model, which we designed for Chinese NER data and trigger characteristics, also achieves good results.

Random word embedding: On the Resume NER dataset, with 5%∼20% of the training data, the TMN model based on cost-efficient triggers achieves good results. Compared with the traditional BiLSTM-CRF method, the F1 value obtained by TMN with the same small amount of data is higher by about 1∼5 points, and the less training data is used, the greater the improvement.

Compared with TMN, our m-TMN improves the F1 value by about 1∼4 points with the same amount of training data and the same triggers. Compared with the traditional BiLSTM-CRF model, m-TMN with our triggers improves the F1 value by 2∼9 points on the same training data. With only 20% of the training data, it matches the traditional model trained on 40% of the training data.

Fig. 4

The architectures of BiLSTM-CRF model (a) and our trigger-based global attention BiLSTM-CRF model (b).

Fig. 5

Normalized loss (Y axis) on the validation set over training epochs (X axis), trained with 20% labeled data. (a) Resume NER dataset; (b) Weibo NER dataset.

Similarly, on the Weibo NER data set, with smaller amounts of training data (below 20%), the F1 value of our m-TMN model trained with cost-efficient triggers is about 5∼6 points higher than that of the traditional BiLSTM-CRF model. With the same training data, the F1 value of the m-TMN model increases by 5–8 points, and with only 20% of the training data the m-TMN model matches the traditional model trained on 50% of the training data.

Tencent word embedding: On the Resume NER data set, the scores improve by 1∼2 points compared with random word embedding. With up to 20% of the training data, the F1 value of the m-TMN model is 2∼7 points higher than that of the BiLSTM-CRF model and 1∼2 points higher than that of the TMN model.

On the Weibo data set, with the same small amount of data, the F1 value of the m-TMN model is 3–9 points higher than that of the BiLSTM-CRF model and 1–4 points higher than that of the TMN model.

Fig. 6

The influence of the number of status updates r on the F1 value of the experimental results. (a) is Resume NER dataset, (b) is Weibo NER dataset.

To show that our m-TMN model is superior to the public TMN on Chinese NER, we plot the validation loss curves of m-TMN and TMN trained with 20% of the training data in Fig. 5. As the figure shows, after a few epochs of training the loss of the m-TMN model drops faster than that of the TMN model. As training continues, the loss of m-TMN also converges earlier and to a smaller value than that of TMN. This shows that, on Chinese NER tasks with our triggers, m-TMN improves markedly on TMN. Moreover, the strengths of m-TMN are reflected not only in the F1 metric but also in a significantly shorter training time.

Additionally, to test the impact of the hyperparameter r (the number of status updates) in the Mogrifier LSTM, we compare different values. We set r to each value from 1 to 9 and tested model performance for each. The training data is 10% of the Resume NER dataset and 10% of the Weibo NER dataset. The experimental results are shown in Fig. 6. The model performs best with 5 status updates; too many status updates have a negative impact on model performance.

To illustrate that the correlation between Chinese entities and their surrounding words is much stronger than in other languages, we apply the proposed trigger labeling method to English datasets and compare the results with those on the Chinese datasets. Unlike Chinese text, which is character-based, English text is word-based, so we choose one word on each side of the entity as its trigger. The experimental results are shown in Table 4. On the English datasets, using the words around an entity as its triggers does not improve model performance. These triggers are less effective than the manual triggers of Lin et al., and they even make the model perform worse than BiLSTM-CRF. For example, on the CONLL 2003 dataset, with the same 20% of the training data, our proposed triggers make the F1 value of the model about 10 points lower than with the triggers of Lin et al. and about 5 points lower than that of BiLSTM-CRF. On the Resume dataset, in contrast, our proposed triggers make the F1 value of the model higher than that of BiLSTM-CRF. This supports the view that the correlation between Chinese entities and their surrounding words is much stronger than in other languages.

Finally, a note about the cost-effectiveness of the model. Lin et al. show that even in the extreme case where tagging triggers doubles the human effort, TMN is still significantly more labor-efficient in terms of F1 scores [5]. The m-TMN approach proposed in this paper tags triggers automatically, so no manual trigger tagging is needed. Although the model complexity increases, the results improve significantly without additional manual labor, which makes the m-TMN model more cost-effective.

5 Discussion

In this section, we first review our research results and then try to answer some questions that help in understanding our work more deeply. We then point out some limitations of the method found so far to guide future research. We hope that these limitations will help readers understand our method better.

Table 4
Comparison of triggers in different languages: The training dataset selection ratio is 20%

Data	Languages	BiLSTM-	Trigger type	TMN	m-TMN
Resume	Chinese	87.97	Lin et al.2020	–	–
proposed	89.52	90.57
Weibo	Chinese	37.45	Lin et al.2020	–	–
proposed	40.17	42.03
CONLL 2003	English	81.3	Lin et al.2020	86.01	86.2
proposed	76.2	75.4
BC5CDR	English	69.92	Lin et al.2020	73.97	74.07
proposed	61.23	62.14
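The differences discussed around Table 4 can be recomputed directly from its entries. The short sketch below copies the F1 values for the proposed trigger from the table and reports the change relative to BiLSTM-CRF; the dictionary layout and the helper name `gain_over_baseline` are assumptions made for illustration.

```python
# F1 values copied from Table 4 (proposed trigger, 20% of the training data).
table4 = {
    "Resume": {"language": "Chinese", "bilstm_crf": 87.97, "tmn": 89.52, "m_tmn": 90.57},
    "Weibo": {"language": "Chinese", "bilstm_crf": 37.45, "tmn": 40.17, "m_tmn": 42.03},
    "CONLL 2003": {"language": "English", "bilstm_crf": 81.3, "tmn": 76.2, "m_tmn": 75.4},
    "BC5CDR": {"language": "English", "bilstm_crf": 69.92, "tmn": 61.23, "m_tmn": 62.14},
}


def gain_over_baseline(row, model):
    """F1 difference between `model` and BiLSTM-CRF, rounded to two decimals."""
    return round(row[model] - row["bilstm_crf"], 2)


for name, row in table4.items():
    print(
        name,
        row["language"],
        "TMN:", gain_over_baseline(row, "tmn"),
        "m-TMN:", gain_over_baseline(row, "m_tmn"),
    )
```

Positive values mean that the automatically tagged trigger helps; with these entries the differences are positive for the two Chinese datasets and negative for the two English ones.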

5.1 Analysis

Our method achieves significant improvements on two public datasets, but some questions remain. Q1: Why is it enough to set the trigger of Chinese text to the entity context to obtain good results, while this does not work for other languages? Q2: Why does this method perform well with a smaller amount of training data? Q3: Why are the experimental results better after Mogrifier LSTM is used to improve the LSTM encoding layer for trigger matching?

Answer for Q1: In essence, an entity trigger is the context in the text that is strongly related to the entity. According to our study of the written expression and grammar of Chinese texts, the correlation between Chinese entities and their surrounding words is much greater than in other languages. Therefore, we tag triggers with a program. Although this method is less accurate than manual tagging, it is highly efficient and eliminates the subjectivity of manual tagging. The experimental results also show that this kind of trigger performs well.

Answer for Q2: In essence, the trigger-enhanced attention method encodes entities and triggers separately and then matches them semantically. The research of Zeng et al. shows that NER models pay more attention to the entity than to the context [6]. Agarwal et al. also found that entity representation contributes more to model performance than context representation [12]. Therefore, to a certain extent, context representation may carry more spurious correlations between the input features and the output labels. Semantic matching of the entity and its context reduces the spurious correlation between the context representation and the output label.

Answer for Q3: Automatically tagged triggers may be partly mismatched, and these mismatched triggers affect the performance of the trigger matching model. Mogrifier LSTM is designed to address the context-free representation of the input token by conditioning the input embedding on the recurrent state. This design is better suited to character-level tasks, and the Chinese NER task is sometimes treated as a character-level task [10]. We therefore add the Mogrifier LSTM structure to the semantic matching module. Experiments show that this design improves the performance of the model.
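The conditioning step mentioned in the answer to Q3 can be sketched in a few lines. In the Mogrifier LSTM of Melis et al., the input vector x and the previous hidden state h gate each other for a small number of rounds before the ordinary LSTM update: odd rounds rescale x by 2σ(Qh), and even rounds rescale h by 2σ(Rx). The sketch below is plain Python with illustrative dimensions; the names `mogrify`, `Q`, `R` and the random initialisation are assumptions for the example, and it is not the m-TMN implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def mogrify(x, h, Q, R):
    """Let x and h gate each other in alternating rounds.

    Q[k] maps h to the size of x, and R[k] maps x to the size of h.
    The number of rounds is len(Q) + len(R); len(Q) must equal
    len(R) or len(R) + 1 so that the rounds alternate correctly.
    """
    rounds = len(Q) + len(R)
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            # Odd round: x <- 2 * sigmoid(Q h) * x
            gate = matvec(Q[i // 2], h)
            x = [2.0 * sigmoid(g) * x_j for g, x_j in zip(gate, x)]
        else:
            # Even round: h <- 2 * sigmoid(R x) * h
            gate = matvec(R[i // 2 - 1], x)
            h = [2.0 * sigmoid(g) * h_j for g, h_j in zip(gate, h)]
    return x, h


# Illustrative usage with 4 rounds (two Q matrices and two R matrices).
random.seed(0)
d_x, d_h = 4, 3
x = [random.uniform(-1, 1) for _ in range(d_x)]
h = [random.uniform(-1, 1) for _ in range(d_h)]
Q = [[[random.gauss(0, 0.1) for _ in range(d_h)] for _ in range(d_x)] for _ in range(2)]
R = [[[random.gauss(0, 0.1) for _ in range(d_x)] for _ in range(d_h)] for _ in range(2)]
x_new, h_new = mogrify(x, h, Q, R)
print(x_new, h_new)
```

When all entries of Q and R are zero, every gate equals 2σ(0) = 1, so x and h pass through unchanged and the cell reduces to a standard LSTM step. The modulated x and h would then be fed to the usual LSTM gates.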

5.2 Limitations and future work

Our work realizes the automatic collection of triggers and improves the model according to the characteristics of this kind of trigger. However, the automatically tagged trigger still has a relatively fixed form and currently applies only to Chinese text data. Our future work directions include: 1) developing a more intelligent trigger annotation method and improving the model for the new method; 2) further research on automatic trigger tagging methods for low-resource languages.

6 Related work

When only a small amount of training data is available, the most direct method is data enhancement, that is, selecting high-quality samples to expand the training data. Sample selection is the core module of data-enhanced NER. It uses certain measurement criteria to select samples with high confidence and a large amount of information for training. A typical idea is active learning sampling. For example, Shen et al. use an “uncertainty” criterion to improve data quality by mining the intrinsic information of the entity [13]. This method focuses on instance sampling and manual annotation, and it requires annotators to label the most useful instances first. However, a recent study finds that data annotated in this way is of little help for training new models [14].

NER based on feature transformation is another important approach to this problem. It transfers features between domains, or maps the data features of the source and target domains into a shared feature space [15], to reduce the differences between domains during learning. Daume et al. preprocess the feature space to combine target-domain and source-domain features [16]. In a task with only two domains, the feature space R^F is expanded to R^(3F); for the corresponding K-domain problem, it is expanded to R^((K+1)F). Qu et al. start from domain and label differences: they first train on large-scale source-domain data, then measure the correlation between source-domain and target-domain entity types, and finally fine-tune by model transfer [17].

Another way to solve this problem is NER based on knowledge links. Structured resources such as ontologies and knowledge bases are used to label data heuristically, and the structural relationships of the data are shared to help solve the target NER task. This is essentially a learning method based on distant supervision, which uses external knowledge bases and ontology libraries to supplement annotated entities. Richman et al. use Wikipedia knowledge to design an NER system [18]. Their method uses Wikipedia category links to associate phrases with category sets and then determines the type of each phrase. Similarly, Pan et al. use a series of knowledge base mining methods to develop a cross-lingual name tagging and linking framework for more than 200 languages [19]. Although these methods greatly reduce the workload of manual annotation, the quality of the matched sentences depends largely on the coverage of the dictionary and the quality of the corpus, and the learned models tend to favor entities similar to those in the dictionary.

Some works instead redefine NER as a different problem to reduce the need for manually labeled training data. For example, the chain rule of Safranchik et al. recognizes entities by voting on whether adjacent elements in a sequence belong to the same class [20]. Unlike work that aims to get rid of training data or manual annotation, Lin et al. proposed a new and effective proxy for human explanation, the “entity trigger”, to promote effective learning of NER models [5]. However, problems remain, such as how to generate triggers automatically, how to transfer the approach to low-resource languages, and how to improve the modeling method.
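The feature-space expansion described above for the feature-transformation approach, from R^F to R^(3F) for two domains and to R^((K+1)F) for K domains, can be illustrated with a small sketch. It follows the commonly described augmentation scheme in which every example keeps one shared copy of its features plus one copy in the block of its own domain, with zeros elsewhere. The function name `augment` and the zero-based domain index are assumptions made for the example.

```python
from typing import List


def augment(features: List[float], domain: int, num_domains: int) -> List[float]:
    """Map an F-dimensional vector to (num_domains + 1) * F dimensions.

    Block 0 holds the shared copy, block domain + 1 holds the
    domain-specific copy, and all other blocks stay zero.
    """
    if not 0 <= domain < num_domains:
        raise ValueError("domain index out of range")
    f = len(features)
    out = [0.0] * ((num_domains + 1) * f)
    out[0:f] = features                      # shared block
    offset = (domain + 1) * f
    out[offset:offset + f] = features        # block of the example's own domain
    return out


# Two domains (0 = source, 1 = target): F = 2 becomes 3F = 6.
print(augment([1.0, 2.0], domain=0, num_domains=2))
print(augment([1.0, 2.0], domain=1, num_domains=2))
```

A linear model trained on the augmented vectors can put weight on the shared block for behavior common to all domains and on a domain block for behavior specific to that domain.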

7 Conclusion

In this paper, we propose a cost-efficient trigger tagging method for Chinese text, building on the trigger-enhanced NER model. Then, in view of the characteristics of this trigger, we improve the standard TMN framework. Experiments on two public datasets show that the new trigger tagging method is both economical and effective, and that our improved m-TMN framework performs better than the previous model. We therefore believe that our work is of research significance for the Chinese NER task with limited training data. It is also an exploration of the application of Mogrifier LSTM.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (No. 61871234).

References

1.

David Nadeau and Satoshi Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30(1) (2007), 3–26.

2.

Zhiheng Huang, Wei Xu and Kai Yu, Bidirectional lstm-crf models for sequence tagging, arXiv preprint arXiv:1508.01991, 2015.

3.

Yue Zhang and Jie Yang, Chinese ner using lattice lstm, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 1554–1564, 2018.

4.

Hou, Zhou, Liu, Wang, Che, Liu and Liu, Few-shot sequence labeling with label dependency transfer and pair-wise embedding, arXiv preprint arXiv:1906.08711, 2019.

5.

Bill Yuchen Lin, Dong-Ho Lee, Ming Shen, Ryan Moreno, Xiao Huang, Prashant Shiralkar and Xiang Ren, Triggerner: Learning with entity triggers as explanations for named entity recognition, In Proceedings of ACL, 2020.

6.

Zeng, Li, Zhai and Zhang, Counterfactual Generator: A Weakly-Supervised Method for Named Entity Recognition, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

7.

Hasim Sak, Andrew Senior and Françoise Beaufays, Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition, CoRR, abs/1402.1128, 2014. URL http://arxiv.org/abs/1402.1128

8.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou and Yoshua Bengio, A structured self-attentive sentence embedding, In Proc. of ICLR, 2017.

9.

Xuezhe Ma and Eduard Hovy, End-to-end sequence labeling via bi-directional lstm-cnns-crf, In Proc. of ACL, 2016.

10.

Yue Zhang and Jie Yang, Chinese ner using lattice lstm, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 1554–1564, 2018.

11.

Nanyun Peng and Mark Dredze, Improving named entity recognition for chinese social media with word segmentation representation learning, In ACL, page 149, 2016.

12.

Oshin Agarwal, Yinfei Yang, Byron Wallace and Ani Nenkova, Interpretability analysis for named entity recognition to understand system predictions and how they can improve, CoRR, abs/2004.04564, 2020.

13.

Shen, Yun, Lipton Z.C., et al., Deep active learning for named entity recognition, arXiv preprint arXiv:1707.05928, 2017.

14.

Zachary Chase Lipton and Byron Wallace, Practical obstacles to deploying active learning, In EMNLP/IJCNLP, 2018.

15.

Sinno J.P., Ivor W.T., James T.K., et al., Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22(2) (2010), 199–210.

16.

Young B.K., Karl, Rruhi, et al., New transfer learning techniques for disparate label sets, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing 1 (2015), 473–482.

17.

Lizhen, Gabriela, Liyuan, et al., Named Entity Recognition for Novel Types by Transfer Learning, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, 899–905.

18.

Alexander E.R. and Patrick, Mining Wiki Resources for Multilingual Named Entity Recognition, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2008, 1–9.

19.

Xiao M.P., Bo L.Z., Jonathan, et al., Cross-lingual name tagging and linking for 282 languages, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, 1946–1958.

20.

Esteban Safranchik, Shiying Luo and Stephen Bach, Weakly supervised sequence tagging from noisy rules, In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, pages 5570–5578, 2020.

21.

Hasim Sak, Andrew Senior and Françoise Beaufays, Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition, CoRR, abs/1402.1128, 2014. URL http://arxiv.org/abs/1402.1128

22.

Nancy Chinchor and Beth Sundheim, MUC-5 evaluation metrics, In Fifth Message Understanding Conference (MUC-5): Proceedings of a Conference Held in Baltimore, Maryland, August 25–27, 1993, 1993.

23.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou and Yoshua Bengio, A structured self-attentive sentence embedding, In Proc. of ICLR, 2017.

24.

Raia Hadsell, Sumit Chopra and Yann LeCun, Dimensionality reduction by learning an invariant mapping, In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), 2, pages 1735–1742, IEEE, 2006.

25.

Gábor Melis, Tomáš Kočiský and Phil Blunsom, Mogrifier lstm, 2019.

26.

Minh-Thang Luong, Hieu Pham and Christopher Manning, Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025, 2015.

27.

Nancy Chinchor and Beth Sundheim, MUC-5 evaluation metrics, In Fifth Message Understanding Conference (MUC-5): Proceedings of a Conference Held in Baltimore, Maryland, August 25–27, 1993, 1993.

28.

Shuming Shi, Huibin Zhang, Xiaojie Yuan and Ji-Rong Wen, Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches, COLING, 2010.

29.

Yan Song, Shuming Shi, Jing Li and Haisong Zhang, Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings.

30.

A Fuzzy Logic Model for Hourly Electrical Power Demand Modeling, Electronics 10(4) (2021), 448.

31.

SOFMLS: online self-organizing fuzzy modified least-squares network, IEEE Transactions on Fuzzy Systems 17(6) (2009), 1296–1309.

32.

Wavelet-Based EEG Processing for Epilepsy Detection Using Fuzzy Entropy and Associative Petri Net, IEEE Access 7 (2019), 103255–103262.

33.

Stability Analysis of the Modified Levenberg-Marquardt Algorithm for the Artificial Neural Network Training, IEEE Transactions on Neural Networks and Learning Systems, 2020. DOI: 10.1109/TNNLS.2020.3015200

34.

On the Estimation and Control of Nonlinear Systems With Parametric Uncertainties and Noisy Outputs, IEEE Access 6 (2018), 31968–31973.

35.

CNN based detectors on planetary environments: a performance evaluation, Frontiers in Neurorobotics 14 (2020), 85.