Abstract
It is challenging for machines as well as humans to detect the presence of emotions such as sadness or disgust in a sentence without adequate knowledge of the context. Contextual emotion detection is a challenging problem in natural language processing. As the use of digital agents has increased in text messaging applications, it is essential for these agents to provide sensible responses to their users. The present work demonstrates the effectiveness of the Gaussian Process in detecting contextual emotions present in a sentence. The results obtained are compared with those of a Decision Tree and of ensemble models such as Random Forest, AdaBoost and Gradient Boost. Out of the five models built on a small dataset with class imbalance, the Gaussian Process classifier is found to predict emotions better than the other classifiers. The Gaussian Process classifier performs better by taking predictive variance into account.
Introduction
People use facial expressions, vocal intonation, body language, physiological responses and written text to convey their emotions. In online communication, emotional information is most often encoded in text. In the absence of non-verbal cues, writers adapt to the medium by infusing messages with emotion cues, either explicitly or implicitly, to allow for more natural communication. With the increase of emotional content on the Web, particularly on social media and microblogs, automatic emotion detection in text is gaining significant attention from researchers and business people who explore how emotions affect decision making, behavior, and quality of life.
Automatic emotion detection requires natural language processing techniques to find emotions expressed in written discourse. Designing computers that can find the emotion expressed in text is an application of computational linguistics. Research in sentiment analysis provides a promising direction for fine-grained analysis of subjective content. Most sentiment analysis research, however, operates at a coarser level: it is mostly aimed at recognizing the subjectivity or semantic orientation of a unit of text rather than a specific emotion. Often, finding precisely how a person reacts emotionally to a specific stimulus does matter. For example, while fear and sadness are both negative emotions, distinguishing between them can be crucial. In the event of a disaster, fear may be used to detect the onset of the disaster, whereas sadness may be linked with later stages.
In real business applications, automatic emotion detectors can offer good insights into how a particular audience feels about a product, person, event or topic. Businesses try innovative methods for evaluating user-generated content to learn about consumers' emotional responses toward their products, events and services. For example, automatic emotion detection systems can use online product reviews to identify and track emotional responses toward products and services. Automatic anger detection systems applied to customer service emails can help customer service representatives identify angry customers quickly, so that necessary actions can be taken immediately to reduce customer attrition. In market consumer analytics, automatic emotion detection systems offer businesses non-invasive ways to sell and advertise their offerings better to their customers.
There is an increasing demand for better emotion-sensitive systems that can recognize and express emotions to improve human-computer interaction [45, 4, 1]. An automatic emotion detection system is an important module in developing expressive conversational agents [64], textual emotion sensing systems [24, 28, 30, 36] and intelligent user interfaces [32]. Since different languages are used to express emotions, detecting emotions in text is challenging due to the complexity of language. Our knowledge of the emotion cues present in text may be restricted by our cultural backgrounds. Automatic emotion detection systems, on the other hand, could be trained to find clear emotion cues that are widely used as well as less evident cues used by different individuals, groups or cultures.
Emotion detection in text focuses on analyzing how people express their emotions through text. Emotions can be categorized as surprise, happiness, sadness, fear, anger, disgust and so on. In many cases, emotions remain hidden behind the text, even when the text represents them vividly. Extracting emotions from text based on keyword spotting, text mining, machine learning, semantics-based methods, and corpus-based methods is an active research area. However, the major limitation of current systems is that they still lack the ability to learn and infer emotions from text based on contextual information [18, 52].
The context of ongoing dialogue can completely change the emotion for an utterance as compared to emotion perceived when the utterance is evaluated alone [40, 15]. Table 1 shows a few such examples. In the first example, the last turn “Try to do that once” is very likely to be perceived as ‘neutral’ emotion. However, a majority will judge it as ‘angry’ with the given context. Similarly, in the second example, “I started crying” will be perceived as ‘sad’ by a majority; however, considering it in context, it turns out to be a ‘happy’ emotion.
We demonstrate the effectiveness of the Gaussian Process (GP) in detecting contextual emotions present in sentences. The results achieved by GP are compared against the results obtained using decision trees and ensemble models such as random forest, AdaBoost and gradient boosting. We found that a Gaussian Process classifier built on even a small dataset with class imbalance predicts emotions better than the other classifiers. A GP classifier does not require a large dataset to give good performance [11]. A GP classifier also handles class imbalance in the dataset by taking predictive variance into account [60].
Related work
Emotion analysis on social media is attracting more research attention from industry and academia. Commercial applications such as product recommendation, online retailing, and marketing are turning their interest from traditional sentiment analysis to emotion analysis [33, 50, 51]. There are several machine learning techniques that can be used for emotion intensity prediction, which include Artificial Neural Network (ANN) [35, 29, 7, 49], Random Forests, Support Vector Machine (SVM)[6], Naive Bayes (NB), Multi-Kernel Gaussian Process (MKGP) [8, 9], and Deep Learning (DL)[47, 27, 21].
Table 1: Examples showing influence of context in determining emotion
Many existing automatic emotion detection systems rely only on recognizing single words contained in an emotion lexicon. The assumption that emotion is expressed using emotion words works well, to a certain extent, in formal English text [58, 54]. However, considerably less is known about how emotions are expressed in microblog text. To boost the performance of automatic emotion detection systems on microblog text, researchers started focusing on recognizing emotions expressed on Twitter. Twitter, a microblogging site, is rich with tweets describing how users feel about events, entities and topics discussed publicly on a global level. The text on Twitter can be analyzed in a non-invasive manner to gain insights into users' perceptions, behaviors, and social communications between people of diverse interests.
Attention to analyzing emotions on Twitter is shown by studies on how emotions expressed on microblogs affect stock market trends [13], speak about fluctuations in social and economic indicators [12], serve as a measure of the population’s level of joy [19, 48], give situational awareness for the authorities and the public in the event of disasters [59], and reflect clinical depression [43]. With the increase in number of tweets a day, automatic emotion detectors would significantly augment our ability to analyze and understand emotive content. It is infeasible to distinguish emotions expressed in millions of tweets through human effort because it is very labor-intensive and costly. Existing automatic methods can be classified into five main categories – lexicon-based methods, learning-based methods, manually constructed rules, knowledge-based methods and hybrid methods.
The lexicon-based method is a simple method that uses a lexicon to detect emotions in text. It is based on the assumption that individual words bear emotional coloring [41], and that emotions articulated in text can be sufficiently represented at the word level. This method is the earliest approach used for automatic emotion detection in text. Learning-based methods can be divided into two groups: supervised and unsupervised. The supervised learning approach uses training data marked up with pre-defined labels. The unsupervised learning approach uses similarity between data points to determine whether they can be characterized as belonging to a cluster. The ability to take contextual information into account and to capture emotional cues in segments longer than a word makes learning-based methods appealing for handling text with more nuanced emotional coloring.
Supervised machine learning approaches are more common than unsupervised approaches for automatic emotion detection in text. A human-annotated corpus is needed first to train and evaluate a machine learning model. With the help of the corpus, the machine learning algorithm learns patterns associated with different emotion categories. Text data is usually segmented into sentences, but the size of a text segment depends on the unit of analysis chosen by the researchers. In binary classification, a text segment is classified as either a positive or a negative example of an emotion category. Determining whether a text segment is emotional or non-emotional is an example of binary classification [3]. For sentences containing more than one emotion, researchers have either included them in a separate category labeled “mixed emotions” [5] or allowed multiple labels to be assigned to each sentence (i.e., a multi-label classification problem) [3]. Bag-of-words (BoW) and word n-grams are popular features for emotion detection in text. BoW has proven to be a successful feature set in sentiment analysis [42, 53, 31]. Among machine learning algorithms, Support Vector Machines (SVMs) are popular for this problem space as they can scale to a large number of features and can outperform other classifiers in text classification [61]. Chaffar and Inkpen [14] showed that SVMs performed and generalized well on unseen data in emotion classification.
Unsupervised learning methods have been used recently to detect emotions that are expressed implicitly in text. One such famous unsupervised learning method in this problem domain is Latent Semantic Analysis (LSA). Strapparava and Mihalcea [55] assessed the semantic similarity among the terms in a given text and emotion concepts using a variation of Latent Semantic Analysis (LSA). LSA allows vectors containing emotion words, their synonyms or synsets and document vectors containing generic terms to be mapped into a concept. Of the five approaches tried by Strapparava and Mihalcea [55], LSA approach achieved relatively higher recall and F-score than lexicon-based and supervised learning-based approaches but the worst precision. Zhang [64] also used LSA to perform emotion processing of an intelligent agent in a role-playing virtual drama application. LSA has also been employed to detect emotion in Amazon customer reviews by Ahmad and Laroche [2].
Manually built rule-based method uses rules to decide if a text segment contains an emotion or not. Initially, rules are created manually from an initial data set. Researchers have to scrutinize sample text to look for grammatical patterns connected with each emotion category or derive patterns based on a theoretical framework. These patterns are manually converted into a list of rules which act as the basis for a rule engine or inference engine. Rules need not be limited to lexical cues (e.g., keywords) in text, but can also deal with more complex syntactic and semantic structures of a sentence. Syntactic and semantic information is obtained by examining texts through a parser. Zhe and Boucouvalas [63] constructed syntactic rules to include only emotion words expressed in first person form, took into account present continuous and perfect continuous tense as an indicator of emotion intensity, and excluded conditional sentences in an Internet chat environment. Donath et al. [20] set up rules to detect phrases in all capital letters, excessive punctuations, and profanities to find the anger present in a conversation. In processing news titles, Chaumartin [16] used syntactic rules to find the subject of the news title, as well as to find differences and accentuations between good news and bad news. Liu et al. [32] framed four rules to represent affective commonsense sentences from the Open Mind Commonsense Corpus. Neviarouskaya et al. [38] proposed a rule-based approach that can process sentences in five stages according to the different unit of analysis. Symbols and abbreviations were processed first, and then word, phrase, and sentence-level analyses were done. The strong point of the manually defined rules lies in its more transparent representation of emotion patterns in text, at least for relatively small rule sets. Explanations can be created for most instances captured by the rules because each rule pattern is clearly defined. 
However, it is impossible to find instances of emotions not defined by any of the rules. Most often, only a limited number of rules are defined to capture the obvious and non-ambiguous patterns. Lack of generalizability of rules is also a cause for concern.
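The surface rules attributed above to Donath et al. [20] (phrases in all capital letters, excessive punctuation, profanity) can be illustrated with a very small rule engine. This is only a sketch under stated assumptions, not the cited authors' implementation: the function name `anger_cues`, the thresholds and the two-word profanity list are hypothetical and chosen purely for the example.

```python
import re

# Hypothetical, deliberately tiny word list used only for illustration.
PROFANITY = {"damn", "hell"}

def anger_cues(text, min_caps_len=3, min_punct_run=3):
    """Return the names of the surface rules that fire on `text`."""
    fired = []
    words = re.findall(r"[A-Za-z']+", text)
    # Rule 1: a word written entirely in capital letters.
    if any(w.isupper() and len(w) >= min_caps_len for w in words):
        fired.append("all_caps")
    # Rule 2: a run of repeated '!' or '?' characters.
    if re.search(r"[!?]{%d,}" % min_punct_run, text):
        fired.append("excessive_punctuation")
    # Rule 3: a word from the profanity list.
    if any(w.lower() in PROFANITY for w in words):
        fired.append("profanity")
    return fired
```

Because every fired rule is reported by name, each detection can be explained, which is the transparency advantage noted above; sentences that match none of the rules simply return an empty list, which is the coverage limitation.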
The ontology-based method is centered on the generation of a machine-readable formal representation of human emotions. Ontology is an explicit specification of conceptualization for a particular domain [26]. This structural representation includes a domain vocabulary, descriptions of concepts and attributes, as well as the relations between concepts. Unlike lexicons, ontologies do not operate on a word-level. Rather, they are defined in terms of high-level concepts. Concepts are connected through taxonomic relations and semantic relations. Motivation for researchers to implement this method mainly stemmed from the lack of agreement in how emotion is defined in the research community. Proponents of the ontology-based approach aim to define a standard set of descriptors that can help reduce the ambiguity in interpretation of emotion expressed in text. Ontology-based methods are concerned with the creation, modification, and testing of emotion ontologies.
One of the earliest attempts to build an emotion ontology came from Grassi [25] who defined only high-level emotion concepts and properties in the Human Emotions Ontology (HEO). Shivhare and Khethawat [56] proposed a simple emotion ontology based on Parrot’s emotion word hierarchy [44]. Emotion ontologies can also be built based on common sense knowledge. Grounded on appraisal theories, Balahur et al. [10] modeled situations as “action chains” and their corresponding emotion using an ontology representation. Although ontologies provide some form of consistency on the knowledge of emotion, extensive efforts are needed to build a consistent one.
The hybrid method works by combining at least two of the four main methods used for emotion detection in text: lexicon-based, learning-based, manually constructed rules, and ontology-based. A hybrid approach aims to strategically leverage the strengths of the selected methods in an integrative framework. Ma et al. [34] combined keyword spotting for emotion estimation of words with a set of rules for emotion estimation of sentences to build a textual emotion prediction system for chatting with an animated agent. In 2011, i2b2/VA/Cincinnati conducted the Medical Natural Language Processing Challenge to assign emotions to suicide notes. Many of the proposed systems were designed using hybrid methods [46]. In 2012, Yang et al. designed a voting-based system to pick emotions for each sentence based on outputs from a mixture of keyword spotting, Conditional Random Field, and supervised machine learning methods. To solve the same problem, Nikfarjam et al. [39] first used rules to filter out sentences with obvious emotional cues and passed the uncertain cases to a supervised machine learning model for a final decision. Sohn et al. [57] concluded that the combination of manually constructed rules and supervised machine learning approaches resulted in better performance than using rules or machine learning alone. Hybrid methods provide a good solution by using the strengths of one approach to overcome the weaknesses of another, and can thus yield more effective and efficient automatic emotion detectors. Finding out which combination of approaches works best together remains a challenge for the research community.
Although the broad topic of emotion has been studied in different fields for decades, study of contextual emotion detection in text is in its early stages. A semantic network for contextual emotion detection was developed by Chuang [18], but the size of the corpus is too small to support the results.
Gaussian process
Gaussian Process (GP) is a supervised learning method used for solving regression and classification problems [60]. Gaussian process has the following advantages:
For regular kernels, prediction interpolates the observations. Prediction is probabilistic so that one can compute empirical confidence intervals, and based on it refitting can be done. Gaussian Process is versatile since different kernels can be specified. Common kernels are provided, but it is also possible to specify custom kernels. GP works well for small datasets also.
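A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution. A Gaussian process is fully specified by its mean function $m(x)$ and covariance function $k(x, x')$:
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big]$$
The process is written as
$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big)$$
and read as “the function $f$ is distributed as a GP with mean function $m$ and covariance function $k$”.
A Gaussian Process Classifier implements Gaussian processes (GP) for probabilistic classification, where test predictions take the form of class probabilities. The Gaussian Process Classifier places a GP prior on a latent function, which is then squashed through a link function to obtain the probabilistic classification. The latent function is a function whose values are not observed and are not relevant by themselves. Its purpose is to allow a convenient formulation of the model, and it is removed during prediction. The Gaussian Process Classifier implements the logistic link function. In Gaussian Process Classification (GPC), we place a GP prior over a latent function $f(x)$ and squash it through the logistic function $\sigma$ to obtain a prior on the class probability $\pi(x) = p(y = +1 \mid x) = \sigma(f(x))$.
Inference is divided into two steps: first, we compute the distribution of the latent variable corresponding to a test case $x_*$,
$$p(f_* \mid X, \mathbf{y}, x_*) = \int p(f_* \mid X, x_*, \mathbf{f})\, p(\mathbf{f} \mid X, \mathbf{y})\, d\mathbf{f}$$
where $p(\mathbf{f} \mid X, \mathbf{y}) = p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X) / p(\mathbf{y} \mid X)$ is the posterior over the latent variables; second, we use this distribution to produce a probabilistic prediction,
$$\bar{\pi}_* = p(y_* = +1 \mid X, \mathbf{y}, x_*) = \int \sigma(f_*)\, p(f_* \mid X, \mathbf{y}, x_*)\, df_*$$
In classification, the non-Gaussian likelihood makes the first integral analytically intractable. Therefore, we need an analytical approximation of the integrals. We can approximate the non-Gaussian joint posterior with a Gaussian one, using the Expectation Propagation (EP) method. EP, however, uses the probit likelihood given by Eq. (4)
$$p(y_i \mid f_i) = \Phi(f_i y_i) \tag{4}$$
which makes the posterior analytically intractable. To overcome this hurdle in the EP framework, the likelihood is approximated locally in the form of an unnormalized Gaussian function in the latent variable $f_i$:
$$p(y_i \mid f_i) \simeq t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i\, \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$$
The posterior $p(\mathbf{f} \mid X, \mathbf{y})$ is then approximated by
$$q(\mathbf{f} \mid X, \mathbf{y}) = \frac{1}{Z_{EP}}\, p(\mathbf{f} \mid X) \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$
with $\boldsymbol{\mu} = \Sigma \tilde{\Sigma}^{-1} \tilde{\boldsymbol{\mu}}$ and $\Sigma = (K^{-1} + \tilde{\Sigma}^{-1})^{-1}$, where $K$ is the covariance matrix of the training inputs and $\tilde{\Sigma}$ is the diagonal matrix with entries $\tilde{\sigma}_i^2$.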
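The construction described above, a GP prior on a latent function that is squashed through the logistic function to give class probabilities, can be illustrated numerically. The following is a minimal sketch of the prior only; it performs no inference and does not implement EP. It assumes NumPy, a squared-exponential kernel with unit length scale, and a zero mean function, none of which are specified in the text; the names `rbf_kernel` and `sigmoid` are hypothetical helpers.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential covariance k(x, x') for 1-D inputs (assumed kernel)."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def sigmoid(z):
    """Logistic link function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 25)
# Covariance matrix of the latent values; small jitter keeps it positive definite.
K = rbf_kernel(x, x) + 1e-6 * np.eye(x.size)
# Draw one latent function f ~ GP(0, K) evaluated at the inputs x.
L = np.linalg.cholesky(K)
f = L @ rng.standard_normal(x.size)
# Squash the latent values into class probabilities pi(x) = sigma(f(x)).
pi = sigmoid(f)
```

The latent values `f` are unbounded, while every entry of `pi` lies strictly between 0 and 1, which is why the latent function is only a modelling device and the squashed values are what the classifier reports.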
A practical implementation of Gaussian Process Classification (GPC) for binary class [60] is outlined in Algorithm 3.1.1:
Algorithm 3.1.1: Gaussian Process Classification for the binary case [60]
Decision Tree (DT) is a non-parametric supervised learning method used for classification and regression. A model is created that predicts the value of a target variable by learning simple decision rules inferred from the data features. A DT is constructed by a recursive partition of the instance space. The decision tree consists of nodes that form a rooted tree. In a DT, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values as given in Algorithm 3.2.
Algorithm 3.2: Find the best split for a decision tree [21]
In the simplest and most frequent case, each test considers a single attribute such that the instance space is partitioned according to the attribute’s value. In the case of numeric attributes, the condition refers to a range. Each leaf is assigned to one class representing the most probable target value. Alternatively, the leaf may hold a probability distribution over the target attribute. Instances are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path.
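To make the single-attribute split search concrete, the sketch below scores every candidate threshold by the weighted impurity of the two resulting sub-spaces and keeps the best one. It is an illustration under assumptions rather than a reproduction of Algorithm 3.2: Gini impurity is assumed as the impurity measure (the text does not say which one is used), attributes are assumed numeric, and the names `gini` and `best_split` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels (assumed measure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (attribute index, threshold, weighted impurity) of the best binary split."""
    best = (None, None, gini(labels))  # no split: impurity of the whole node
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue  # a split must produce two non-empty sub-spaces
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best
```

Applying `best_split` recursively to each sub-space, and stopping when a node is pure or no split lowers the impurity, yields the rooted tree described above.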
Random Forest Classifier is an ensemble model, and has been found to outperform several other methods. Yet, this model comes at the cost of increased model complexity. It is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and to control over-fitting. As the algorithm is very flexible and easy to use, it can be used for classification and regression. The most important feature can be found easily using this algorithm. Random Forests split the given dataset into random subsets of data samples. A decision tree is built on each random data sample separately. Each decision tree gives its prediction and the best solution is selected by voting or averaging. As the number of trees increases, the robustness of the random forest increases.
Algorithm 3.3: Build a forest of decision trees on bootstrap samples and ensemble them [21]
For i = 1 to Y: create a bootstrap sample.
The random forest algorithm has low bias because it combines multiple trees, each trained on a different subset of the data; since it relies on many data samples rather than on any single one, the overall bias of the algorithm is reduced. When a new example is introduced into the dataset, it is selected by only some of the trees, so it affects only those trees and the overall RF is affected only slightly. The algorithm works well for both categorical and numerical features, and it performs well even if there are missing values in the given dataset. The steps used to create a Random Forest model are depicted in Algorithm 3.3.
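The two ingredients described for the random forest, one bootstrap sample per tree and a vote over the trees' predictions, can be sketched in a few lines. This is a simplified illustration, not the procedure of Algorithm 3.3: a one-level threshold stump stands in for a full decision tree, random feature sub-sampling is omitted, and the names `fit_stump`, `fit_forest` and `predict_forest` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Tiny stand-in base learner: best single-attribute threshold by training accuracy."""
    best = None
    for j in range(len(rows[0])):
        for t in {r[j] for r in rows}:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            hits = sum(y == l_lab for y in left) + sum(y == r_lab for y in right)
            if best is None or hits > best[0]:
                best = (hits, j, t, l_lab, r_lab)
    _, j, t, l_lab, r_lab = best
    return lambda r: l_lab if r[j] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit one base learner per bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest

def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]
```

A single unusual training example appears in only some bootstrap samples, so it can change only the trees fitted on those samples; the vote over all trees dampens its effect.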
AdaBoost Classifier
Like Random Forest, AdaBoost (Adaptive Boosting) is another ensemble classifier. It combines a set of weak classifiers to form a strong classifier, by assigning appropriate weights to each weak classifier.
Algorithm 3.4: Adaptive_Boosting_Classifier [23]
The weight given to each classifier depends upon the accuracy achieved. Each weak classifier is trained using a sample set of training data. Each sample has a weight, and the weights of all samples are adjusted iteratively. The main objective is to give more importance to the instances that are hard to classify. At first, all instances are assigned the same weight. In each iteration, weights of all incorrectly classified instances are increased and those of correctly classified instances are decreased.
The AdaBoost algorithm is given in Algorithm 3.4.
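The reweighting scheme described above (equal initial weights, larger weights for misclassified instances, and a classifier weight that grows with accuracy) can be sketched as follows. The sketch uses the usual discrete AdaBoost update and may differ in detail from Algorithm 3.4; it assumes binary labels coded as +1/-1, one-dimensional inputs and threshold stumps as weak classifiers, and the names `fit_weighted_stump`, `adaboost` and `predict` are hypothetical.

```python
import math

def fit_weighted_stump(xs, ys, w):
    """Best threshold stump on 1-D inputs under sample weights w; labels are +1/-1."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x <= t else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=10):
    n = len(xs)
    w = [1.0 / n] * n  # all instances start with the same weight
    ensemble = []
    for _ in range(rounds):
        err, t, sign = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate stumps get larger weight
        ensemble.append((alpha, t, sign))
        # Increase weights of misclassified instances, decrease the others.
        w = [wi * math.exp(-alpha * y * (sign if x <= t else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote of all weak classifiers."""
    score = sum(a * (s if x <= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1
```

On a pattern such as labels `[1, 1, -1, -1, 1, 1]` over inputs `1..6`, no single stump is correct everywhere, but the weighted vote of a few stumps can be, because each round concentrates on the instances the previous stumps got wrong.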
Gradient boost classifier
The gradient-descent-based boosting methods were derived by Friedman et al. within a proper statistical framework [37, 22]. This formulation of boosting methods and the corresponding models are called gradient boosting machines.
Algorithm: Gradient boosting
Input: Data points
This framework also provides essential justification of the model’s hyper-parameters. In this boosting approach, each new model is fit by learning again on misclassified instances. The new base learners are learned so as to be maximally correlated with the negative gradient of the loss function of the entire ensemble. If the error function is the classic squared-error loss, the learning procedure results in consecutive error fitting. The loss function and the base learner models can be chosen arbitrarily, based on the requirement. For some specific loss functions, however, the solution to the parameter estimates is difficult to find.
Instead, we can select the new function (the boost increment) to be the one most correlated with the negative gradient.
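The remark that squared-error loss leads to consecutive error fitting can be seen directly in code: for this loss the negative gradient is the residual, so each new base learner is fitted to what the current ensemble still gets wrong. The sketch below is a minimal regression illustration under assumptions (one-dimensional inputs, regression stumps as base learners, a fixed learning rate of 0.1); it is not the paper's classifier, and the names `fit_stump`, `gradient_boost` and `predict` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on 1-D inputs: (threshold, left mean, right mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def gradient_boost(xs, ys, rounds=50, learning_rate=0.1):
    f0 = sum(ys) / len(ys)  # initial constant model
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared-error loss the negative gradient is the residual y - f(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        lm, rm = learning_rate * lm, learning_rate * rm  # shrink the increment
        stumps.append((t, lm, rm))
        preds = [p + (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return f0, stumps

def predict(model, x):
    """Initial constant plus the sum of all boost increments."""
    f0, stumps = model
    return f0 + sum(lm if x <= t else rm for t, lm, rm in stumps)
```

With other loss functions only the line computing `residuals` changes: it would be replaced by the negative gradient of that loss at the current predictions.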
The dataset of SemEval-2019 Task 3 (EmoContext) was used for the experiments. The dataset consists of 30160 training samples and 2755 development samples [17]. Each example consists of a conversation id, three turns of conversation, and one of the contextual emotions such as happy, sad, angry, and neutral as the label. Out of the 30160 training examples, 700 examples were used for building the model and 300 examples were used to test it. Only 700 examples were taken because we intended to study how well GP works for small datasets. A Gaussian process does not require a large dataset to perform well; it can provide good performance even with a small dataset [60]. Experiments on joint emotion analysis via GP also demonstrate the performance of GP with a small dataset [11].
Algorithm: Data extraction and preprocessing
Input: Input dataset.
Output: Tokenized words and their parts of speech.
1. Split the labels and sentences.
2. Perform tokenization using the word_tokenize function of the NLTK toolkit.
3. Perform parts-of-speech tagging using the pos_tag function of the NLTK toolkit.
4. Return the tokenized words and their parts of speech as inputs to rule-based feature selection.
The system is composed of data extraction, pre-processing, rule-based feature selection, feature vector generation using Bag of Words, and learning the models. The steps involved in data extraction and data preprocessing are outlined in Algorithm 4. Algorithm 4 lists out rule based feature selection and feature vector generation. The output obtained in Algorithm 4 is given as input to the GP, DT, RF, AB and GB classifiers to learn the model.
Algorithm: Rule based feature selection and feature vector generation
Input: Tokenized words and their parts of speech.
Output: Feature vector.
1. For each tokenized word falling under one of the categories listed in Table 2:
   a. Lemmatize the word using the WordNet Lemmatizer from the NLTK toolkit.
   b. Insert the lemmatized word into the dictionary.
2. Represent each sentence as a feature vector using one-hot encoding by looking up the dictionary.
3. Return the feature vector generated as the input to build the model.
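The feature selection and vector generation steps can be sketched in plain Python, starting from sentences that have already been tokenized and tagged. Several parts are assumptions made for illustration: the tag prefixes in `KEPT_TAG_PREFIXES` are a hypothetical stand-in for the categories of Table 2, `lemmatize` is a lowercase placeholder for NLTK's WordNet lemmatizer so that the example runs without downloading corpora, and the function names are not taken from the paper.

```python
# Hypothetical stand-in for the categories of Table 2 (Penn Treebank tag prefixes).
KEPT_TAG_PREFIXES = ("JJ", "NN", "RB", "VB")

def lemmatize(word):
    """Placeholder for a real lemmatizer (the paper uses NLTK's WordNet lemmatizer)."""
    return word.lower()

def select_features(tagged_sentence):
    """Keep the words whose POS tag falls in one of the selected categories."""
    return [lemmatize(w) for w, tag in tagged_sentence
            if tag.startswith(KEPT_TAG_PREFIXES)]

def build_dictionary(tagged_sentences):
    """Map each selected word to a column index, in order of first appearance."""
    vocab = {}
    for sent in tagged_sentences:
        for word in select_features(sent):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(tagged_sentence, vocab):
    """Binary bag-of-words vector: 1 if the dictionary word occurs in the sentence."""
    vec = [0] * len(vocab)
    for word in select_features(tagged_sentence):
        if word in vocab:  # words unseen at dictionary-building time are ignored
            vec[vocab[word]] = 1
    return vec
```

For instance, building the dictionary from the tagged turns “I started crying” and “Try to do that once” keeps only the verbs and the adverb under the assumed categories, and each turn then becomes a 0/1 vector over that dictionary.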
Performance evaluation and discussion
We evaluated the system using GP, and the performance of GP was compared with that of contextual emotion detection using DT, RF, AB and GB [6]. The results obtained using the GP, DT, RF, AB and GB classifiers are tabulated in Table 3, which shows the accuracy, precision, recall, F1-score, specificity and Balanced Class Rate (BCR). From Table 3, we can infer that the GP classifier predicts contextual emotion better than the other four classifiers: it has better accuracy, precision, recall, F1-score and BCR than the other models. We can also see that, though GP and GB have similar BCR values, GP has better precision, recall and F1-score. Out of the three ensemble models, GB performs the best. All five models perform comparably well in terms of specificity. However, GP has better BCR since it takes the predictive variance into account.
Table 2: Parts of speech categories
Table 3: Performance comparison
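Accuracy and F1-score are defined in Eqs (8) and (9) respectively.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
where $TP$, $TN$, $FP$ and $FN$ denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9}$$
where $\text{Precision} = TP / (TP + FP)$ and $\text{Recall} = TP / (TP + FN)$.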
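The reported measures can all be computed from the four confusion counts of one class against the rest. The helper below is a small sketch with a hypothetical name, `binary_metrics`; it takes BCR to be the mean of recall and specificity, which is the common definition but is an assumption here because the text does not define it.

```python
def binary_metrics(tp, fp, tn, fn):
    """Evaluation measures for one class (one-vs-rest) from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Assumed definition: balanced class rate = mean of recall and specificity.
    bcr = 0.5 * (recall + specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "bcr": bcr,
    }
```

For a multi-class problem such as this one, the counts would be gathered per emotion class and the resulting measures averaged; unlike accuracy, BCR weights the positive and negative classes equally, which makes it informative under class imbalance.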
The need for contextual emotion detection was highlighted in this paper. Contextual emotion detection was implemented using GP, DT, RF, AB and GB models. We used rule-based feature selection and one-hot encoding to generate input feature vectors for building the models. The results obtained using the GP classifier were compared to those obtained using the DT classifier and the ensemble classifiers. The GP classifier built with a small data sample outperformed the other classifiers in terms of accuracy, precision, recall, F1-score and BCR. It was observed that GP was balanced in predicting all the classes by taking into account the effect of uncertainty. Although the GB classifier matches GP in BCR, GP is better than GB in terms of accuracy, precision, recall and F1-score. Overall, GP was found to be good at handling datasets that are small and have class imbalance. The system can be further improved by using other feature selection methods and by incorporating sentiment lexicons.
