Sentiment analysis of text based on three-way decisions

Abstract

In recent years, affective computing has received much attention in natural language processing and artificial intelligence. Recognizing the sentiment orientation of text is one of the important parts of affective computing. This paper proposes a method based on three-way decisions for recognizing the multi-label sentiment orientations of Chinese text. First, the sentiment orientation and intensity of the sentiment words in a text are identified using sentiment lexicons, Tongyi Cilin and HowNet. Next, the sentiment orientation of each text is assigned to one of three domains (positive, negative or boundary) according to its sentiment intensity and appropriate decision-making thresholds. Finally, the sentiment orientations of texts in the boundary domain are resolved according to the sentiment characteristics of the sentences in those texts. Experimental results show that the proposed method is effective for identifying the sentiment orientations of texts.

Keywords

Sentiment analysis, three-way decisions, sentiment lexicon, multi-label sentiment recognition

1 Introduction

The dramatic development of Internet technology is rapidly changing everyday interpersonal communication. Personal blogs, microblogs, product reviews, news comments and similar channels have generated a large amount of online information that carries personal subjective feelings [1–3]. Most of this online text conveys users’ personal views, attitudes and emotions, including pleasure, anger, sorrow and joy, and reflects people’s sentimental characteristics and changes [4–7]. Text is therefore no longer used merely to describe objective facts; it increasingly expresses private opinions and feelings. As a result, text sentiment analysis has become a research hotspot in artificial intelligence and natural language processing.

Sentiment analysis of text aims to judge the sentiment orientation of words, sentences and texts by mining and analyzing the opinions, views, emotions and other subjective information expressed in the text. It is an important part of affective computing [8] and poses a new challenge to natural language processing. According to the object analyzed, sentiment analysis can be divided into three research levels, from low to high: words and phrases, sentences, and texts [9]. This paper carries out sentiment analysis of Chinese text at these three levels, based on a sentiment lexicon and on the sentiment orientation of sentences with topic features, to identify the multi-label sentiment orientation of Chinese texts.

Sentiment analysis of text is a higher-level task in affective computing, for which there are two major research approaches: supervised learning and unsupervised learning [10–14]. Unsupervised methods judge the sentiment category of a text from the sentiment information of the words or phrases it contains. Turney [15] introduced an unsupervised method based on semantic orientation and classified reviews according to the commendatory or derogatory tendency of their words. Supervised methods use machine learning to assign texts to sentiment categories. Pang et al. [16] first applied machine learning to the sentiment classification of text and compared three classification models: Naive Bayes (NB), maximum entropy (ME) and support vector machines (SVM). The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem. The ME principle states that the probability distribution which best represents the current state of knowledge is the one with the largest entropy. SVM is a supervised learning model that can also perform non-linear classification efficiently. Serrano-Guerrero et al. [17] reviewed and compared several freely accessible web services, analyzing their ability to classify and score pieces of text with respect to the sentiments they express.

In China, Xu Linhong et al. [18] proposed an automatic identification mechanism for Chinese text polarity that combines semantic features and machine learning. Xu Jun et al. [19] studied the sentiment classification of news and comments using the Naive Bayes and maximum entropy methods, and summarized the strengths and weaknesses of each method through a series of experiments. Wang Suge [20] proposed a text vector representation model that carries the strength of sentiment orientation, using the data representation model of rough set theory to construct a weighted rough membership function, and applied it to the sentiment classification of Chinese text. Fuji Ren et al. [21–23] proposed several methods that use sentiment topic features to recognize the sentiment orientation of Chinese text at different levels, such as words, sentences and documents.

Against this background, this paper presents a method for identifying the sentiment orientations of Chinese texts based on three-way decisions. The sentiment orientations of texts are identified in two stages. First, texts are classified into acceptance, rejection and deferral according to the affective characteristics of their words. Next, the sentiment orientations of the deferred texts are further identified from the affective characteristics of their sentences. The method thus makes full use of the affective features of both words and sentences and reflects the connections among the three levels (words, sentences and texts). The experimental results show that sentiment analysis of text based on three-way decisions performs satisfactorily.

The rest of this paper is organized as follows: Section 2 briefly introduces the related theory; Section 3 describes in detail the sentiment orientation analysis of text based on three-way decisions; Section 4 presents the experimental process and the analysis of results; Section 5 summarizes the paper.

2 Related theory

2.1 Rough set theory

In 1982, the Polish scholar Pawlak put forward rough set theory [24]. In this theory, knowledge is represented and processed in the form of information tables, and the research objects are described by their attributes and attribute values.

Definition 1. (Approximation Space) Let U be a nonempty finite set and R an equivalence relation defined on U. Then K = (U, R) is called an approximation space.

Definition 2. (Binary Indiscernibility Relation) An information system is defined as IS = 〈 U, A, V, f 〉, where U is a nonempty finite set of objects, A is a nonempty set of attributes, V is the set of attribute values and f : U × A → V is an information function. For any attribute subset R ⊆ A (R ≠ ∅), the binary indiscernibility relation IND (R) is defined as follows:

$IND (R) = \{(x, y) \in U \times U \mid \forall a \in R, f_{a} (x) = f_{a} (y)\}$ (1)

IND (R) is an equivalence relation; it induces a partition of U, denoted U/IND (R).

Definition 3. (Positive, Negative and Boundary Domains) Given an information system IS = 〈 U, A, V, f 〉, let X ⊆ U, R ⊆ A and R ≠ ∅, and let ${\underline{X}}_{R}$ and ${\bar{X}}_{R}$ denote the lower and upper approximations of X with respect to R. The positive, negative and boundary domains of X are defined respectively as follows:

${POS}_{R} (X) = {\underline{X}}_{R}$ (2)

${NEG}_{R} (X) = U - {\bar{X}}_{R}$ (3)

${BND}_{R} (X) = {\bar{X}}_{R} - {\underline{X}}_{R}$ (4)

The positive domain POS_R (X) is the set of objects that definitely belong to X according to knowledge R. The negative domain NEG_R (X) is the set of objects that definitely do not belong to X according to knowledge R. The boundary domain BND_R (X) is the uncertain domain: it contains the objects that, according to knowledge R, can be classified neither as definitely inside X nor as definitely outside X.
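The three domains of Definition 3 can be illustrated with a minimal Python sketch. The toy information table, the attribute names `has_pos_word` and `has_neg_word`, and the helper names `partition` and `regions` are all hypothetical and serve only to make the definitions concrete; they are not taken from the paper.

```python
from collections import defaultdict

def partition(universe, table, attrs):
    """Group objects that agree on every attribute in attrs, i.e. U/IND(R)."""
    blocks = defaultdict(set)
    for x in universe:
        key = tuple(table[x][a] for a in attrs)
        blocks[key].add(x)
    return list(blocks.values())

def regions(universe, table, attrs, target):
    """Return (POS, NEG, BND) of the set target under the attribute subset attrs."""
    lower, upper = set(), set()
    for block in partition(universe, table, attrs):
        if block <= target:      # block lies entirely inside X: lower approximation
            lower |= block
        if block & target:       # block overlaps X: upper approximation
            upper |= block
    return lower, set(universe) - upper, upper - lower

# Hypothetical toy table: five texts described by two binary attributes.
table = {
    "d1": {"has_pos_word": 1, "has_neg_word": 0},
    "d2": {"has_pos_word": 1, "has_neg_word": 0},
    "d3": {"has_pos_word": 1, "has_neg_word": 1},
    "d4": {"has_pos_word": 1, "has_neg_word": 1},
    "d5": {"has_pos_word": 0, "has_neg_word": 1},
}
universe = set(table)
X = {"d1", "d2", "d3"}           # texts assumed to carry the target sentiment
pos, neg, bnd = regions(universe, table, ["has_pos_word", "has_neg_word"], X)
print(sorted(pos), sorted(neg), sorted(bnd))
```

In this toy table, d3 and d4 are indiscernible but only d3 belongs to X, so both fall into the boundary domain.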

2.2 Three-way decisions model

Based on Pawlak’s classical rough set theory, Yao et al. [25, 26] proposed the decision-theoretic rough set model, gave it a semantic interpretation based on the Bayesian minimum-risk decision rule, characterized the probabilistic domains by two thresholds and provided a practical method for calculating the thresholds. In addition to acceptance and rejection, the three-way decisions model introduces a deferral decision, thereby avoiding the losses incurred when a decision is forced to be either acceptance or rejection.

Any object x ∈ U is in one of two states: it either satisfies the given conditions or it does not. The set of objects U can thus be divided into two subsets, $U = X \cup \bar{X}$, where X is the set of objects that satisfy the conditions and $\bar{X}$ is the set of objects that do not. The action set is A = {a_A, a_R, a_N}, whose elements represent acceptance, rejection and deferral respectively. Different actions incur different losses: λ_AP, λ_RP and λ_NP denote the losses of taking actions a_A, a_R and a_N when the object belongs to X, and λ_AN, λ_RN and λ_NN denote the losses of taking the same actions when the object does not belong to X. The losses meet the following two requirements:

Requirement 1: $0 \leq λ_{AP} \leq λ_{NP} \leq λ_{RP}, 0 \leq λ_{RN} \leq λ_{NN} \leq λ_{AN}$

Requirement 2: $\frac{λ_{RP} - λ_{NP}}{λ_{NN} - λ_{RN}} > \frac{λ_{NP} - λ_{AP}}{λ_{AN} - λ_{NN}}$

For any object x ∈ U, with [x] denoting the equivalence class that contains x, the three-way decision rules are as follows:

Rule A (Acceptance): If P (X| [x]) ≥ α, then x ∈ POS (X).

Rule R (Rejection): If P (X| [x]) ≤ β, then x ∈ NEG (X).

Rule N (Deferral): If β < P (X| [x]) < α, then x ∈ BND (X).

The threshold parameters α and β are calculated as follows:

$α = \frac{λ_{AN} - λ_{NN}}{(λ_{AN} - λ_{NN}) + (λ_{NP} - λ_{AP})}$ (5)

$β = \frac{λ_{NN} - λ_{RN}}{(λ_{NN} - λ_{RN}) + (λ_{RP} - λ_{NP})}$ (6)

Under Rule A, when the probability that x ∈ X is at least α, x is accepted and placed in the positive domain of X. Under Rule R, when the probability that x ∈ X is at most β, x is rejected and placed in the negative domain of X. Under Rule N, when the probability that x ∈ X lies strictly between β and α, x is placed in the boundary domain of X; deferral means that no decision is made for the time being.

Three-way decisions are made at minimum cost. By calculating the probability that an object belongs to a category and comparing it with the thresholds, objects are assigned to the positive, negative and boundary domains, which correspond to acceptance, rejection and deferral respectively. Because it reduces wrong decisions when processing and classifying data, the three-way decisions model may increase the accuracy of classification.
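As a minimal sketch of Formulas (5) and (6) and Rules A, R and N, the following Python code computes the two thresholds from six loss values and applies the rules to a probability. The function names `thresholds` and `decide` are hypothetical, and the loss values are the ones listed later in Table 1 with the unit u set to 1.

```python
def thresholds(l_ap, l_np, l_rp, l_an, l_nn, l_rn):
    """Compute (alpha, beta) from the six losses, following Formulas (5) and (6)."""
    alpha = (l_an - l_nn) / ((l_an - l_nn) + (l_np - l_ap))
    beta = (l_nn - l_rn) / ((l_nn - l_rn) + (l_rp - l_np))
    return alpha, beta

def decide(p, alpha, beta):
    """Apply Rules A, R and N to the conditional probability p = P(X | [x])."""
    if p >= alpha:
        return "accept"   # x goes to POS(X)
    if p <= beta:
        return "reject"   # x goes to NEG(X)
    return "defer"        # x goes to BND(X)

# Losses of Table 1 with u = 1.
alpha, beta = thresholds(l_ap=0, l_np=3, l_rp=8, l_an=7, l_nn=2, l_rn=0)
print(alpha, round(beta, 3))
print(decide(0.70, alpha, beta), decide(0.40, alpha, beta), decide(0.20, alpha, beta))
```

With these losses, Requirement 2 holds (2.5 > 0.6), which guarantees α > β and therefore a non-empty deferral interval.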

3 Sentiment orientation analysis of text based on three-way decisions

The sentiment orientation of a text is determined by the sentiment characteristics of its basic elements; in other words, the sentiment information of words and sentences decides the sentiment orientation of the text. The core idea of this paper is therefore to recognize the sentiment orientation of a text at two levels, words and sentences, to which the three-way decisions model is applied. The detailed procedure is described below.

3.1 Sentiment orientation analysis of words based on semantic similarity

Identifying the sentiment orientations of words and their intensity is fundamental to analyzing the sentiment orientations of texts. This paper uses a method based on Tongyi Cilin [27] and HowNet [28] that determines the correlation between unknown words and seed words from their synonymy relations and semantic similarity, in order to identify the sentiment orientation and intensity of the unknown words.

3.1.1 Establishment of sentiment lexicon

In this paper, the Ren_CECps Chinese sentiment corpus [29, 30] is adopted for the experiments. Ren_CECps was built by processing and labeling 1,487 Chinese blogs and contains 11,255 paragraphs, 35,096 sentences and 878,164 words.

In Ren_CECps, all language information of Chinese texts associated with sentiment expression is labeled by hand at three levels: texts, sentences and words. The word-level sentiment labels are essential for annotating the whole corpus. Specifically, the orientation and intensity of sentiment as well as the part of speech are labeled for words and phrases.

All sentiments are divided into eight most basic categories, including surprise, sorrow, love, joy, hate, expect, anxiety and anger. Sentiment types and intensity of texts, sentences and words are represented by an 8-dimension sentiment vector as follows: $\vec{e} = (e^{1}, e^{2}, e^{3}, e^{4}, e^{5}, e^{6}, e^{7}, e^{8})$ (7)

The value of eⁱ ranges from 0.1 to 1.0 and indicates the intensity of the ith of the eight basic sentiment categories mentioned above. In this paper, sentiment words are extracted from the training set of Ren_CECps to build a multi-label sentiment lexicon.

3.1.2 Calculation of semantic similarities

To analyze the sentiment orientations of words, the synonymy relations between unknown words and the seed words of the sentiment lexicon are first determined using Tongyi Cilin. If an unknown word is not included in Tongyi Cilin, the semantic similarity between it and the seed words is measured using HowNet. If the unknown word appears in neither Tongyi Cilin nor HowNet, its sentiment orientation is identified by the naive Bayes method introduced below.

All words in Tongyi Cilin are arranged in a tree-shaped hierarchical structure. In this dictionary, the vocabulary is categorized at three levels: 12 divisions, 97 groups and 1,400 classes. The structure of Tongyi Cilin is shown in Fig. 1.

Fig. 1

Structure of Tongyi Cilin.

Definition 4. (Similarity) The path length of the common parent node of w₁ and w₂ in the hierarchical system of Tongyi Cilin is labeled as Spd (w₁, w₂).

Definition 5. (Dissimilarity) Starting from w₁ and w₂, move upwards along their respective parent nodes in the hierarchical system of Tongyi Cilin until a common parent node is reached. The length of the path traversed, which is the shortest path between the two words, is labeled as Dsd (w₁, w₂).

Based on the above definitions, the semantic similarity of w₁ and w₂ is given by Formula (8): $SimC = \frac{2 \times Spd (w_{1}, w_{2})}{Dsd (w_{1}, w_{2}) + 2 Spd (w_{1}, w_{2})}$ (8)
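Formula (8) can be sketched in Python under one possible reading of Definitions 4 and 5: each word is represented by its path of category labels from the root of the hierarchy, Spd is taken as the number of leading levels the two paths share, and Dsd as the number of edges climbed from both words to the common parent. This reading, the tuple representation and the labels used below are assumptions made for illustration; they are not the actual Tongyi Cilin coding scheme.

```python
def common_depth(path1, path2):
    """Spd under the assumed reading: number of leading levels shared by both paths."""
    depth = 0
    for a, b in zip(path1, path2):
        if a != b:
            break
        depth += 1
    return depth

def sim_c(path1, path2):
    """Formula (8): 2 * Spd / (Dsd + 2 * Spd)."""
    spd = common_depth(path1, path2)
    # Dsd under the assumed reading: edges climbed from each word to the common parent.
    dsd = (len(path1) - spd) + (len(path2) - spd)
    if spd == 0:
        return 0.0  # no shared category; also avoids 0/0 for two empty paths
    return 2 * spd / (dsd + 2 * spd)

# Hypothetical three-level paths (division, group, class).
same_group = sim_c(("A", "a", "01"), ("A", "a", "02"))
identical = sim_c(("A", "a", "01"), ("A", "a", "01"))
unrelated = sim_c(("A", "a", "01"), ("B", "b", "03"))
print(round(same_group, 3), identical, unrelated)
```

Under this reading, the similarity grows with the depth of the common parent and shrinks with the distance that separates the two words from it.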

If two words w₁ and w₂ belong to a common category, i.e. l (w₁) = l (w₂), their semantic similarity may be further represented by Formula (9):

${SimC}^{'} = \frac{2 \times Spd (w_{1}, w_{2}) \times α}{Dsd (w_{1}, w_{2}) + 2 Spd (w_{1}, w_{2}) \times α + Dnd (w_{1}, w_{2})}$ (9)

$Dnd (w_{1}, w_{2}) = \frac{Wdis (w_{1}, w_{2})}{Wt (Spd (w_{1}, w_{2}) + 1)} \times Cld (Spd (w_{1}, w_{2}) + 1)$ (10)

where α is a control parameter (distinct from the decision threshold α of Section 2.2), Dnd (w₁, w₂) measures the difference between w₁ and w₂, and Cld (Spd (w₁, w₂) +1) is an empirical parameter that indicates the similarity of meaning items between w₁ and w₂ (i.e. Cld (2) > Cld (3) > Cld (4) > Cld (5)).

When w₁ and w₂ do not belong to a common category, or one of them is not included in Tongyi Cilin, their semantic similarity cannot be calculated by Formula (9); instead, it is calculated by the HowNet-based semantic computation method shown in Formula (11), where C_{1i} (i = 1, …, m) and C_{2j} (j = 1, …, n) denote the concepts of w₁ and w₂ in HowNet. $Sim (w_{1}, w_{2}) = max_{i = 1 \dots m, j = 1 \dots n} Sim (C_{1 i}, C_{2 j})$ (11)
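Formula (11) takes the maximum similarity over all pairs of concepts of the two words. The sketch below shows only this maximization step. The concept-level similarity is passed in as a function, and the Jaccard overlap of sememe sets used in the example is a simple stand-in chosen for illustration; it is not HowNet's actual concept similarity measure, and the sememe names are invented.

```python
def word_similarity(concepts1, concepts2, concept_sim):
    """Formula (11): maximum concept similarity over all concept pairs of two words."""
    return max(
        (concept_sim(c1, c2) for c1 in concepts1 for c2 in concepts2),
        default=0.0,  # a word with no known concept gets similarity 0
    )

def jaccard(c1, c2):
    """Stand-in concept similarity: overlap of two sememe sets (an assumption)."""
    union = c1 | c2
    return len(c1 & c2) / len(union) if union else 0.0

# Hypothetical concepts, each described by a set of invented sememes.
w1_concepts = [frozenset({"emotion", "happy"}), frozenset({"weather", "bright"})]
w2_concepts = [frozenset({"emotion", "happy", "strong"})]
sim = word_similarity(w1_concepts, w2_concepts, jaccard)
print(round(sim, 3))
```

Taking the maximum means that two polysemous words are judged similar as soon as any one pair of their senses is similar.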

3.1.3 Bayesian classification

The sentiment orientation of a word may be determined by calculating the semantic similarity between the unknown word and the seed words. If an unknown word is included in neither Tongyi Cilin nor HowNet, its sentiment orientation and intensity cannot be judged by semantic similarity. In this situation, the sentiment orientation of the unknown word is identified by a Bayes classifier.

Suppose a word w is a sequence of characters (c₁, c₂, ⋯ , c_n) and each character is treated as a feature of the word. The probability that w has a given sentiment orientation can then be computed from the probabilities that its character features occur with that orientation.

Definition 6. The prior probability of a character feature is defined as the probability of its appearance under a given sentiment orientation, and is calculated as follows: $P (c_{i} = 1 | e = k) = \frac{count (c_{i} = 1, e = k) + 1}{count (e = k) + 1}$ (12)

Definition 7. For an unknown word w, the probability that it belongs to each sentiment orientation is the posterior probability obtained by combining its character features: $P (e = k | w) = \frac{P (e = k) \prod_{i} P (c_{i} = 1 | e = k)}{\sum_{k'} P (e = k') \prod_{i} P (c_{i} = 1 | e = k')}$ (13)

where P (e = k) is calculated as follows: $P (e = k) = \frac{count (e = k)}{\sum_{k'} count (e = k')}$ (14)

Thus, the naive Bayes classifier is expressed as follows: $y = arg max_{k} \frac{P (e = k) \prod_{i} P (c_{i} = 1 | e = k)}{\sum_{k'} P (e = k') \prod_{i} P (c_{i} = 1 | e = k')}$ (15)
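Formulas (12) to (15) can be sketched as a small character-level naive Bayes classifier. The sketch assumes that count (c_i = 1, e = k) is the number of lexicon words of orientation k that contain character c_i, and it works in log space for numerical stability. The four-word toy lexicon and the function names are hypothetical and are not drawn from Ren_CECps.

```python
import math
from collections import Counter, defaultdict

def train(lexicon):
    """Count labels and, per label, the words that contain each character."""
    label_counts = Counter()
    char_counts = defaultdict(Counter)
    for word, label in lexicon:
        label_counts[label] += 1
        for ch in set(word):          # presence of a character in the word
            char_counts[label][ch] += 1
    return label_counts, char_counts

def posterior(word, label_counts, char_counts):
    """Formula (13): normalized posterior over sentiment orientations."""
    total = sum(label_counts.values())
    log_scores = {}
    for k, n_k in label_counts.items():
        score = math.log(n_k / total)                                  # Formula (14)
        for ch in word:
            score += math.log((char_counts[k][ch] + 1) / (n_k + 1))   # Formula (12)
        log_scores[k] = score
    top = max(log_scores.values())
    weights = {k: math.exp(s - top) for k, s in log_scores.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}

def classify(word, label_counts, char_counts):
    """Formula (15): orientation with the largest posterior."""
    post = posterior(word, label_counts, char_counts)
    return max(post, key=post.get)

# Hypothetical seed lexicon with single labels.
lexicon = [("喜悦", "joy"), ("欢喜", "joy"), ("悲伤", "sorrow"), ("伤心", "sorrow")]
label_counts, char_counts = train(lexicon)
post = posterior("欢悦", label_counts, char_counts)
print(classify("欢悦", label_counts, char_counts), round(post["joy"], 3))
```

The unseen word 欢悦 shares one character with each of the two joy seeds and none with the sorrow seeds, so the smoothed estimates favour joy.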

3.2 Sentiment orientation analysis of sentence with topic features

3.2.1 Model for multi-labeled emotion topic

A further study of the Ren_CECps Chinese sentiment corpus shows that the sentiments of sentences are closely related to the topic features of words. Based on this relationship, a multi-labeled emotion topic model (MLETM) [31–35] is used to identify the sentiment orientation of sentences. This model is shown in Fig. 2.

Fig. 2

Multi-label emotion topic model (MLETM).

In Fig. 2, each node denotes a random variable, such as the word node w, and each directed edge describes the conditional dependence between two nodes, such as the edge z → w. The graphical model contains three types of variables: categorical variables, proportional variables and observable variables. E, z and w are categorical variables. To identify the K classes of emotion of a sentence in a text, we define K binary random variables E_dsk, each indicating whether Sentence s in Text d has the kth emotion. w_dsi, the ith word of Sentence s in Text d, follows a distribution governed by φ and is also affected by Topic z and Emotion E. θ, η and φ are proportional variables, which denote the prior probabilities of z, E and w respectively. θ_d is a J-dimensional vector, and each θ_dj is the prior probability of the jth topic in Text d. η is a K-dimensional vector that describes the prior probabilities of the different emotions. φ is a K × J × N-dimensional array that describes the prior probabilities of words: $φ_{kjt}^{1}$ is the prior probability of Word w under Topic j with Emotion k, and $φ_{kjt}^{0}$ is the prior probability of Word w under Topic j without Emotion k. α, β and γ are the three observable variables, whose values are obtained from the training set.

3.2.2 Probability assumption of MLETM

According to the definition of MLETM, the directed edges in the model describe the conditional dependence between random variables. Following these dependences, we make the assumptions below about the probability distributions.

For each sentence in the text, let there be K emotion classifiers E_dsk, and suppose that the E_dsk are mutually independent and, together with Topic z_di, affect the probability distribution of Word w. Suppose Word w follows a categorical distribution with parameter φ and is conditionally dependent on Emotion E and Topic z:

$w_{dsi} | E_{dsk}^{1}, z \sim Categorical (φ_{E_{dsk} z_{di}}^{1})$ (16)

$w_{dsi} | E_{dsk}^{0}, z \sim Categorical (φ_{E_{dsk} z_{di}}^{0})$ (17)

Suppose Topic z follows a categorical distribution with parameter θ: $z_{di} \sim Categorical (θ_{d})$ (18)

Since the K emotion classifiers E_dsk are independent of each other, suppose each E_dsk follows a Bernoulli distribution with parameter η: $E_{dsk} \sim Bernoulli (η_{k})$ (19)

For the random variable φ, let $φ_{kjt}^{1}$ and $φ_{kjt}^{0}$ describe a word with and without a given emotion respectively. Both follow a Dirichlet distribution with parameter β:

$φ_{kjt}^{1} \sim Dirichlet (β_{kjt}^{1})$ (20)

$φ_{kjt}^{0} \sim Dirichlet (β_{kjt}^{0})$ (21)

The K-dimensional random variable η is the prior probability of the binary emotion classifiers E_dsk. Suppose η follows a Beta distribution, which is the conjugate prior of the Bernoulli distribution with parameter η: $η_{k} \sim Beta (γ_{k}^{1}, γ_{k}^{0})$ (22)
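The practical benefit of the Beta-Bernoulli conjugacy in Formulas (19) and (22) can be made explicit with a short derivation. It assumes that $n_{k}^{1}$ and $n_{k}^{0}$ count the sentences observed with and without Emotion k; this reading of the counts is an assumption, since they are not defined explicitly in the text.

```latex
\begin{aligned}
p(\eta_k \mid E) &\propto
  \underbrace{\eta_k^{\,n_k^1}(1-\eta_k)^{\,n_k^0}}_{\text{Bernoulli likelihood}}
  \times
  \underbrace{\eta_k^{\,\gamma_k^1-1}(1-\eta_k)^{\,\gamma_k^0-1}}_{\text{Beta prior}}
  = \eta_k^{\,n_k^1+\gamma_k^1-1}(1-\eta_k)^{\,n_k^0+\gamma_k^0-1}, \\
\eta_k \mid E &\sim \mathrm{Beta}\bigl(n_k^1+\gamma_k^1,\; n_k^0+\gamma_k^0\bigr), \\
\mathbb{E}[\eta_k \mid E] &= \frac{n_k^1+\gamma_k^1}{n_k^0+n_k^1+\gamma_k^0+\gamma_k^1}.
\end{aligned}
```

Under this assumption, the posterior mean has the same form as the first factor of Formula (23), so γ acts as a pair of pseudo-counts that smooth the observed emotion frequencies.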

3.2.3 Inference

According to graphical model theory, latent random variables describe the hidden characteristics to be predicted; their values can be inferred from the assumed probability distributions and the observed variables. In MLETM, the value of E_dsk must be predicted for each sentence. It describes the probability that Sentence s in Text d has Emotion k and is conditionally dependent on the other variables. The inference formulas are as follows:

$p (E_{dsk} = 1 | w, z, E_{- dsk}; α, β, γ) \propto \frac{n_{k}^{1} + γ_{k}^{1}}{n_{k}^{0} + n_{k}^{1} + γ_{k}^{0} + γ_{k}^{1}} \times exp \left( \sum_{i \in W_{ds}} log \frac{n_{k z_{di} w_{dsi}}^{1} + β_{k z_{di} w_{dsi}}^{1}}{\sum_{t} (n_{k z_{di} t}^{1} + β_{k z_{di} t}^{1})} \right)$ (23)

$p (E_{dsk} = 0 | w, z, E_{- dsk}; α, β, γ) \propto \frac{n_{k}^{0} + γ_{k}^{0}}{n_{k}^{0} + n_{k}^{1} + γ_{k}^{0} + γ_{k}^{1}} \times exp \left( \sum_{i \in W_{ds}} log \frac{n_{k z_{di} w_{dsi}}^{0} + β_{k z_{di} w_{dsi}}^{0}}{\sum_{t} (n_{k z_{di} t}^{0} + β_{k z_{di} t}^{0})} \right)$ (24)
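Formulas (23) and (24) give two unnormalized scores for one emotion k; normalizing them yields the probability that the sentence has that emotion. The sketch below evaluates both scores in log space. It makes several simplifying assumptions that are not in the paper: β is a single symmetric scalar rather than one value per (k, j, t), the count tables are plain dictionaries that already exclude the current sentence, and all names and toy counts are hypothetical.

```python
import math

def emotion_probability(words_topics, n_sent, gamma, n_word, beta, vocab):
    """Normalized p(E_dsk = 1 | ...) for one emotion k, from Formulas (23) and (24).

    words_topics: list of (word, topic) pairs of the sentence
    n_sent, gamma: dicts {1: ..., 0: ...} of sentence counts and Beta parameters
    n_word: {1: {(topic, word): count}, 0: {(topic, word): count}}
    beta: symmetric Dirichlet parameter (an assumption); vocab: list of all words
    """
    prior_norm = n_sent[0] + n_sent[1] + gamma[0] + gamma[1]
    log_score = {}
    for e in (1, 0):
        score = math.log((n_sent[e] + gamma[e]) / prior_norm)
        for word, topic in words_topics:
            num = n_word[e].get((topic, word), 0) + beta
            den = sum(n_word[e].get((topic, t), 0) + beta for t in vocab)
            score += math.log(num / den)
        log_score[e] = score
    top = max(log_score.values())
    p1 = math.exp(log_score[1] - top)
    p0 = math.exp(log_score[0] - top)
    return p1 / (p1 + p0)

# Hypothetical counts for one emotion, one topic and a two-word vocabulary.
vocab = ["happy", "sad"]
n_word = {
    1: {(0, "happy"): 8, (0, "sad"): 2},   # counts in sentences with the emotion
    0: {(0, "happy"): 2, (0, "sad"): 8},   # counts in sentences without it
}
p = emotion_probability(
    [("happy", 0)], n_sent={1: 5, 0: 5}, gamma={1: 1, 0: 1},
    n_word=n_word, beta=1, vocab=vocab,
)
print(round(p, 3))
```

In a full sampler this probability would be used to resample E_dsk repeatedly together with the topic assignments; the sketch covers a single evaluation only.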

The topic z_di of the ith word in Text d is inferred from its conditional distribution as follows:

$p (z_{di} | w, z_{- di}, E; α, β, γ) \propto \frac{n_{d z_{di}} + α_{z_{di}}}{W_{d} + α^{*}} \times \prod_{k \in K_{d}^{1}} \frac{n_{k z_{di} w_{di}}^{1} + β_{k z_{di} w_{di}}^{1}}{\sum_{t} (n_{k z_{di} t}^{1} + β_{k z_{di} t}^{1})} \times \prod_{k \in K_{d}^{0}} \frac{n_{k z_{di} w_{di}}^{0} + β_{k z_{di} w_{di}}^{0}}{\sum_{t} (n_{k z_{di} t}^{0} + β_{k z_{di} t}^{0})}$ (25)

3.3 Sentiment orientation of text

According to the “bag of words” hypothesis, a text is regarded as a set of sentiment words and phrases, which are weighted to identify its sentiments. The vector space model of a text may be expressed as D = {w₁, w₂, …, w_n}, where n is the number of sentiment words and phrases and w_i is the ith sentiment word or phrase. Each w_i is described by an 8-dimension sentiment vector $e_{i} = (e_{i}^{1}, e_{i}^{2}, e_{i}^{3}, e_{i}^{4}, e_{i}^{5}, e_{i}^{6}, e_{i}^{7}, e_{i}^{8})$, so the vector space model of sentiment for a text may be further expressed as follows: $D = \{w_{1}, w_{2}, \dots, w_{n}, e_{1}, e_{2}, \dots, e_{n}\}$ (26)

In order to facilitate weighting for identifying sentiment category of texts, Formula (26) is rewritten into Formula (27), which may be used for initially identifying sentiment category of text. $D = (\sum_{i = 1}^{n} e_{i}^{1} / n, \sum_{i = 1}^{n} e_{i}^{2} / n, \dots, \sum_{i = 1}^{n} e_{i}^{7} / n, \sum_{i = 1}^{n} e_{i}^{8} / n)$ (27)

The problem of distinguishing the multi-label sentiment orientations of texts is converted into several binary sentiment identification problems. Under three-way decisions, an object d is judged, for each sentiment category, either to belong to that category or not. The state set is therefore defined as Ω = {E_k, ¬ E_k}, where E_k and ¬E_k mean that d belongs to or does not belong to E_k respectively. The action set is defined as A = {a_A, a_R, a_N}, whose elements represent acceptance, rejection and deferral respectively. Based on experience, the loss functions are set as shown in Table 1.

Table 1

Loss functions for decisions of two states

Actions	Objective State
	P (d belongs to E_k)	N (d does not belong to E_k)
Acceptance: A	λ_AP: 0	λ_AN: 7u
Rejection: R	λ_RP: 8u	λ_RN: 0
Deferral: N	λ_NP: 3u	λ_NN: 2u

According to Table 1 and Formulas (5) and (6), the pair of thresholds is calculated as α = 5u / (5u + 3u) = 0.625 and β = 2u / (2u + 5u) ≈ 0.286. The decision rules for object d are therefore as follows:

Rule A: If P (E_k|d) ≥ α, d ∈ POS (E_k).

Rule R: If P (E_k|d) ≤ β, d ∈ NEG (E_k).

Rule N: If β < P (E_k|d) < α, d ∈ BND (E_k).

P (E_k|d) is calculated as follows: $P (E_{k} | d) = \sum_{i = 1}^{n} e_{i}^{k} / n$ (28)
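The word-level stage, i.e. Formulas (27) and (28) followed by Rules A, R and N, can be sketched as follows. The ordering of the eight emotions inside the vector, the function names and the two toy word vectors are assumptions made for illustration; the thresholds are the values α = 0.625 and β = 0.286 derived from Table 1.

```python
EMOTIONS = ("surprise", "sorrow", "love", "joy", "hate", "expect", "anxiety", "anger")

def text_probabilities(word_vectors):
    """Formulas (27)/(28): average the 8-dimension vectors of the sentiment words."""
    if not word_vectors:
        raise ValueError("at least one sentiment word vector is required")
    n = len(word_vectors)
    return [sum(vec[k] for vec in word_vectors) / n for k in range(len(EMOTIONS))]

def three_way_labels(word_vectors, alpha=0.625, beta=0.286):
    """Assign each emotion to the POS, NEG or BND domain for one text."""
    decisions = {}
    for name, p in zip(EMOTIONS, text_probabilities(word_vectors)):
        if p >= alpha:
            decisions[name] = "POS"   # Rule A: accept
        elif p <= beta:
            decisions[name] = "NEG"   # Rule R: reject
        else:
            decisions[name] = "BND"   # Rule N: defer to the sentence level
    return decisions

# Hypothetical text with two sentiment words.
words = [
    [0.0, 0.0, 0.9, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3, 0.0, 0.0, 0.1, 0.0],
]
labels = three_way_labels(words)
print(labels["love"], labels["joy"], labels["anxiety"])
```

Here love is accepted, anxiety is rejected, and joy (average intensity 0.4) falls between the thresholds and is deferred.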

When the deferral rule applies to object d, it is impossible to decide directly whether d has emotion E_k by weighting sentiment words alone. In this case, the sentiment information of the sentences in the text is acquired by MLETM, and the sentiment orientation of the text is further determined from the sentiment characteristics of its sentences.

For an object d whose decision is deferred, a threshold θ is set on the sentiment characteristics of its sentences, and d is handled as follows:

If the proportion of sentences in Text d that fall in the equivalence class of Emotion E_k is greater than or equal to θ, we judge that Text d has Emotion E_k.

If the proportion of sentences in Text d that fall in the equivalence class of Emotion E_k is less than θ, we judge that Text d does not have Emotion E_k.
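The second stage for deferred texts can be sketched as a single function. It assumes that the proportion is the fraction of sentences predicted (for example by MLETM) to carry Emotion E_k, and it uses θ = 0.5, the value reported in Section 4.3. The function name and the handling of a text without sentences are assumptions.

```python
def resolve_deferred(sentence_has_emotion, theta=0.5):
    """Decide a deferred text from per-sentence predictions for one emotion.

    sentence_has_emotion: list of booleans, one per sentence of the text.
    Returns True if the text is judged to have the emotion.
    """
    if not sentence_has_emotion:
        return False  # assumption: a text without sentences carries no emotion
    proportion = sum(sentence_has_emotion) / len(sentence_has_emotion)
    return proportion >= theta

print(resolve_deferred([True, True, False, False]))   # proportion 0.5
print(resolve_deferred([True, False, False]))         # proportion 1/3
```

Because every deferred text is forced to one side of θ, each text ends up with a definite accept or reject decision for every emotion after the two stages.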

The multi-label sentiment analysis framework for Chinese texts is illustrated in Fig. 3. The training process is on the left and the testing process is on the right. The sentiment lexicon, the MLETM model and the three-way decisions model are used together to identify the multi-label sentiment orientation of Chinese texts. The judgment of the sentiment orientation of a text consists of 6 steps, described in more detail as follows:

Fig. 3

Multi-label text emotion analysis framework.

Step 1. Select 1,000 files from the Ren_CECps Chinese emotion corpus as experimental data, to constitute the training corpus and the test corpus;

Step 2. Pre-process the training corpus and the test corpus respectively: eliminate the few sentences without any emotion, remove the stop words according to the stop word lexicon, and establish a sentiment lexicon;

Step 3. Based on the training data set, train MLETM to acquire all the parameters it needs, and compute the emotion categories of the sentences in the training data set;

Step 4. Use the sentiment lexicon built from the training data set to identify the emotion polarity of the texts in the test data set;

Step 5. If the emotion of a text cannot be identified clearly, re-identify its multi-label sentiment polarity using the three-way decisions method and MLETM;

Step 6. Evaluate the recognition result.

4 Experiment and analysis

4.1 Experiment data

In this experiment, 1,000 blogs are randomly selected from Ren_CECps as the experimental dataset, where each blog is tagged with a subset of 8 categories of sentiment (surprise, sorrow, love, joy, hate, expect, anxiety and anger). The distribution of sentiment orientations of texts is shown in Table 2.

Table 2
Distribution of sentiment orientations for texts

Sentiment Orientation	Number of Texts	Percent (%)
Surprise	70	7.0
Expect	392	39.2
Joy	356	35.6
Sorrow	427	42.7
Hate	191	19.1
Anxiety	456	45.6
Love	564	56.4
Anger	120	12.0

Pre-processing of the data set: 1) remove the few sentences without any emotion from the data set; 2) remove the stop words from all sentences; 3) use 800 documents of the data set as the training set and the remaining 200 documents as the testing set.

4.2 Standard of experiment evaluation

The experiment in this paper aims at recognizing the multi-label sentiment orientations of texts, and the results are evaluated with a label-based evaluation method [36, 37]. For a single label k, a measure M (tp_k, fp_k, tn_k, fn_k) is used to evaluate the classification result, where tp_k is the number of texts correctly identified as having emotion label k, tn_k is the number of texts correctly identified as not having emotion label k, fp_k is the number of texts incorrectly identified as having emotion label k, and fn_k is the number of texts incorrectly identified as not having emotion label k. The macro-average and micro-average of multi-label classification are defined as follows:

$M_{macro} = \frac{1}{| K |} \sum_{k = 1}^{| K |} M ({tp}_{k}, {fp}_{k}, {tn}_{k}, {fn}_{k})$ (29)

$M_{micro} = M (\sum_{k = 1}^{| K |} {tp}_{k}, \sum_{k = 1}^{| K |} {fp}_{k}, \sum_{k = 1}^{| K |} {tn}_{k}, \sum_{k = 1}^{| K |} {fn}_{k})$ (30)
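Formulas (29) and (30) differ only in where the aggregation happens: the macro-average applies the measure M per label and then averages, while the micro-average sums the counts over labels and applies M once. The sketch below shows this for precision and accuracy. The two per-label count tuples are invented for illustration and are unrelated to the results reported in this paper.

```python
def precision(tp, fp, tn, fn):
    """Precision as one possible choice of the measure M."""
    return tp / (tp + fp) if tp + fp else 0.0

def accuracy(tp, fp, tn, fn):
    """Accuracy as another possible choice of the measure M."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0

def macro_average(metric, counts):
    """Formula (29): average of the per-label values of the measure."""
    return sum(metric(*c) for c in counts) / len(counts)

def micro_average(metric, counts):
    """Formula (30): the measure applied to the counts summed over all labels."""
    totals = [sum(column) for column in zip(*counts)]
    return metric(*totals)

# Hypothetical (tp, fp, tn, fn) counts for two labels on 100 test texts.
counts = [(8, 2, 85, 5), (1, 1, 90, 8)]
print(macro_average(precision, counts), micro_average(precision, counts))
print(macro_average(accuracy, counts), micro_average(accuracy, counts))
```

In this toy case the macro-average precision is pulled down by the rare second label, whereas the micro-average is dominated by the frequent first label.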

4.3 Experiment result

A method is developed for identifying the sentiment orientations of texts according to the affective features of words and sentences, in combination with the idea of three-way decisions. The parameters of the three-way decisions model are set to α = 0.625, β = 0.286 and θ = 0.5. All the above parameters are empirical values obtained from the training data set.

In the multi-label sentiment recognition experiment, the method based on three-way decisions is compared with the Naive Bayes method and the sentiment lexicon method. Table 3 lists the macro-average and micro-average values of the three methods. The method based on three-way decisions achieves the highest value on all four measures; for example, its micro-average accuracy is 0.778, compared with 0.554 for Naive Bayes and 0.636 for the sentiment lexicon method.

Table 3
Comparison of multi-label sentiment recognition

	Naive Bayes Method	Sentiment Lexicon Method	Three-way Decisions
Precision of Macro-average	0.655	0.509	0.712
Accuracy of Macro-average	0.521	0.613	0.751
Precision of Micro-average	0.655	0.508	0.728
Accuracy of Micro-average	0.554	0.636	0.778

As shown in Fig. 4, the accuracy for identifying six basic sentiment orientations (love, sorrow, anxiety, surprise, anger and expect) is higher than the accuracy for identifying joy and hate. The accuracy for identifying hate is low because few texts with the sentiment orientation of hate are collected in the corpus. As a consequence, the models are not adequately trained for this category, which lowers the accuracy of identifying hate.

Fig. 4

A comparison of the 8 basic sentiment orientations.

The 8 categories of basic sentiment orientations are identified by the three methods, and the corresponding experimental results are also shown in Fig. 4. The three-way decisions method is advantageous in identifying the majority of sentiment orientations.

4.4 Discussion

This section evaluates the results of our experiments and discusses the factors that influence the multi-label sentiment recognition of Chinese texts.

The experimental results show that our method recognizes the multi-label sentiment orientation of texts better than the Naive Bayes and sentiment lexicon methods (Table 3), and that the accuracy for each of the 8 basic sentiment categories is over 50 percent (Fig. 4). However, the results also show that human emotion is complicated and that the performance still has room for improvement. This motivates us to study new methods and to find more meaningful sentiment information to improve our model. Another important factor is that the sentiment distribution in the corpus is unbalanced, which affects the sentiment recognition of texts, so that some sentiment orientations cannot be identified precisely, as shown in Fig. 4. Addressing this imbalance remains a difficult task for future work.

5 Conclusion

We studied the problem of multi-label sentiment analysis and proposed a three-level (word, sentence and text) multi-label sentiment analysis method based on the three-way decisions model. The Ren_CECps Chinese emotion corpus was adopted for the experiments in this paper.

The proposed method for analyzing the sentiment orientations of Chinese texts based on three-way decisions takes the affective features of both words and sentences into account to jointly discriminate the orientations, which addresses the loss of sentiment information between levels. The experimental results indicate that this method outperforms the two baseline methods. Because risk cost is introduced into the decision making, appropriate thresholds must be sought so that different decision-making rules can be applied to different texts.
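
The link between risk cost and decision thresholds can be illustrated with the standard decision-theoretic formulation of three-way decisions, in which the two thresholds follow from six loss values. The sketch below assumes a sentiment score normalized to [0, 1] and treated as a conditional probability. The class `Losses`, the function names and the loss values are hypothetical and are not the settings used in the experiments.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Losses:
    """Hypothetical risk costs of the three actions.

    First letter: action (p = accept, b = defer, n = reject).
    Second letter: true state (p = in the class, n = not in the class).
    """
    pp: float
    bp: float
    np_: float
    pn: float
    bn: float
    nn: float


def thresholds(l: Losses) -> Tuple[float, float]:
    """Derive (alpha, beta) from the losses by minimizing expected risk."""
    if not (l.pp <= l.bp < l.np_ and l.nn <= l.bn < l.pn):
        raise ValueError("losses must satisfy pp <= bp < np and nn <= bn < pn")
    alpha = (l.pn - l.bn) / ((l.pn - l.bn) + (l.bp - l.pp))
    beta = (l.bn - l.nn) / ((l.bn - l.nn) + (l.np_ - l.bp))
    if not alpha > beta:
        raise ValueError("losses do not yield a boundary region (alpha <= beta)")
    return alpha, beta


def decide(score: float, alpha: float, beta: float) -> str:
    """Assign a text to the positive, negative or boundary domain."""
    if score >= alpha:
        return "positive"
    if score <= beta:
        return "negative"
    return "boundary"  # deferred to further, sentence-level analysis


# Illustrative loss values only.
alpha, beta = thresholds(Losses(pp=0, bp=2, np_=6, pn=8, bn=2, nn=0))
for score in (0.8, 0.5, 0.2):
    print(score, decide(score, alpha, beta))
```

With these illustrative losses, alpha is 0.75 and beta is 1/3, so a wrong acceptance being costlier than a wrong rejection widens the range of scores that are deferred rather than accepted.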

At present, WeChat has become one of the most common online social media platforms, with great value for research on and application of sentiment orientation identification. WeChat texts are mostly concise, with few words and sentences, but they carry rich sentiments, have flexible structures and often contain Internet-specific sentiment symbols. Therefore, it is challenging to identify the diversified and complex sentiment orientations of WeChat texts by efficiently and accurately extracting affective features from them.

Acknowledgments

This research has been partially supported by the National Natural Science Foundation of China under Grant No. 61432004, and by JSPS KAKENHI Grant Number 15H01712.

Table 2

Distribution of sentiment orientations for texts

Sentiment Orientation	Number of Texts	Percent (%)
Surprise	70	7.0
Expect	392	39.2
Joy	356	35.6
Sorrow	427	42.7
Hate	191	19.1
Anxiety	456	45.6
Love	564	56.4
Anger	120	12.0