Abstract
Social media provides unprecedented opportunities for people to disseminate information and share their opinions and views online. Extracting events from social media platforms such as Twitter could help in understanding what is being discussed. However, event extraction from social text streams poses huge challenges due to the noisy nature of social media posts and the dynamic evolution of language. We propose a generic unsupervised framework for exploring events on Twitter which consists of four major steps: filtering, pre-processing, extraction and categorization, and post-processing. Tweets published in a certain time period are aggregated, and noisy tweets which do not contain newsworthy events are removed in the filtering step. The remaining tweets are pre-processed by temporal resolution, part-of-speech tagging and named entity recognition in order to identify the key elements of events. An unsupervised Bayesian model is proposed to automatically extract the structured representations of events in the form of quadruples
Introduction
Newsworthy events describe what has happened around the world and might directly or indirectly affect everyone in the world. With the fast development of social media platforms, newsworthy events are widely scattered not only on traditional news media but also on social media. For example, Twitter, one of the most widely adopted social media platforms, appears to cover nearly all newswire events. As has been previously reported in [18], even 1% of the public stream of Twitter contains around 95% of all the events reported in the newswire. Furthermore, social networking sites allow people to share their thoughts and opinions on a wide range of events. Hence, it is possible to understand the general public’s reactions to events from social media stream data, which can facilitate downstream applications such as tracking the public’s viewpoints.
Therefore, it is crucial to extract events from social streams such as tweets. Events have been represented in different ways for different purposes. In the Automated Content Extraction (ACE) program, an event is represented as a ‘6W’ tuple (Who did What to Whom, Where and When, through What methods and Why) with a varying number of components depending on the system task (
Previous research in event extraction focused largely on news articles. Event extraction techniques typically rely on the detection of event “triggers” with their arguments for slot filling in event frames. Classical approaches to event extraction can be roughly divided into three classes: pattern-based [26], machine-learning-based [19] and hybrid approaches combining the previous two [11]. Compared to newswire text, social stream data such as tweets have the following characteristics:
Noisy and informal writing styles. Social media messages are often short, contain a large number of irregular and ill-formed words, and evolve rapidly over time. Compared to news articles, such fragmented and noisy messages are more challenging to process. Also, most social media messages are not event-related.
Unknown event types. Social media data are produced continuously by a large and uncontrolled number of users. As such, it is not possible to know the event types a priori, which makes it hard to apply existing event extraction approaches. These approaches either rely on manually defined linguistic patterns representing expert knowledge to extract events, or make use of corpora annotated with event-specific information such as actors, date, place, etc., to learn event extraction patterns.
Redundancy. For most newsworthy events, there may be a high volume of redundant messages referring to the same event.
The aforementioned characteristics of social stream data pose new challenges but also provide opportunities to employ unsupervised approaches for event extraction and categorization based on the redundancy property of event-related tweets. Recently there has been much interest in event extraction from Twitter. Ritter et al. [22] presented a system called TwiCal to extract and categorize events from Twitter. They relied on a sequence labeler trained from annotated data to extract event phrases from Twitter. In [1], a system called EvenTweet was constructed to extract localized events from a stream of tweets in real-time. The extracted events are described by start time, location and a number of related keywords. Instead of employing annotated data for event extraction, we have proposed an unsupervised Bayesian model called Latent Event & Category Model (LECM) for event extraction and categorization [28]. It is assumed that in the model, each tweet message
Examples of event-related tweets without temporal information
However, a careful examination of the tweet data reveals that most tweets do not include a temporal expression (the date/time when the event occurred). Taking the tweets collected in December 2010 as an example, out of a total of 706,815 tweets after filtering, only 133,031 (less than 20%) contain temporal expressions. Table 1 provides some examples of tweets without temporal expressions. Simply using a tweet’s publishing date as the event date could result in one event being assigned multiple dates and hence cause ambiguities. Table 3 shows some examples of tweets mentioning the same event but with different publishing dates.
Definition of notations
Examples of tweets mentioning the same event but published in different dates
To tackle this problem, we propose to modify our LECM model by dropping the date element, and call the modified model LECM-d. The event date information is inferred more accurately by combining heuristic rules with the outputs generated by LECM-d. As will be discussed in the Experiments section, our results on a large dataset consisting of over 60 million tweets show that our modified model significantly improves upon TwiCal by nearly 13.6 percentage points in precision and outperforms the original LECM model by 9.75 percentage points in precision. Moreover, events are also clustered into coherent groups with an automatically assigned event type label, with an accuracy of 42.57%.
Our work is related to two lines of research, event extraction and event detection. Here, an event refers to something that happens at a certain time and place. We distinguish between the two as follows: event extraction aims at extracting structured information from text, while event detection focuses on discovering new or previously unidentified events. In the following, we present a brief survey of related work in event extraction and event detection.
Event extraction
Event extraction has been largely studied on news articles. Methods proposed include machine learning, pattern-based and a hybrid of both. In [16], event extraction is considered as a clustering problem and two novel distance metrics are employed on heterogeneous news sources. In [8], specific patterns are designed and employed for biomedical event extraction. In [11], a combination of pattern matching and statistical modeling techniques is used. Two types of patterns are constructed including the sequence of constituent heads separating anchor and its arguments and a predicate argument subgraph of the sentence connecting anchor to all the event arguments.
In recent years, event extraction from tweets has received increasing interest. Focusing on entertainment events, Benson et al. [6] proposed a structured graphical model which simultaneously analyzes individual messages, clusters them according to event, and induces a canonical value for each event property. The method yields up to a 63% recall against the city table and up to 85% precision evaluated manually. Popescu [20] focused on detecting events involving known entities from Twitter. Experimental results showed that events centered on specific entities can be extracted with 70% precision and 64% recall. Liu et al. [15] worked on social event extraction for social network construction using a factor graph that harvests the redundancy in tweets. Experiments were conducted on a human-annotated data set and results showed that the proposed method achieved an absolute gain of 21% in F-measure. Li et al. [14] paid attention to major personal life events such as weddings and graduations. A pipeline-based system was constructed to extract a fine-grained description of users’ life events based on their published tweets.
Ritter et al. [22] presented a system called TwiCal to extract and categorize events from Twitter. The strength of association between each named entity and date, based on the number of tweets they co-occur in, is measured to determine whether the extracted event is significant. The approach achieved an increase in maximum F-measure over a supervised baseline. Anantharam et al. [3] focused on extracting and understanding city events. The problem is formulated as a sequence labeling problem. Evaluation was carried out on a real-world dataset consisting of event reports and tweets collected over four months from the San Francisco Bay Area. Sandeep et al. [17] proposed algorithms to extract attribute-value pairs and map such pairs to manually generated schemas for natural disaster events. Evaluation was carried out on 58,000 tweets for 20 events, and the system can fill such event schemas with an F-measure of 60%.
Our work is similar to TwiCal in the sense that we also focus on the extraction and categorization of structured representation of events from Twitter. However, TwiCal relies on a supervised sequence labeler trained on tweets annotated with event mentions for the identification of event-related phrases. We propose a simple Bayesian modelling approach which is able to directly extract event-related keywords from tweets without supervised learning. TwiCal uses
Event detection
Instead of extracting structured representations of events, event detection aims to discover new or previously unidentified events. Event detection has long been addressed in the Topic Detection and Tracking (TDT) program sponsored by the Defense Advanced Research Projects Agency. The concept of event in event detection [2] is defined as real-world occurrence
Methodologies
Our proposed framework consists of four main steps: filtering, pre-processing, event extraction and categorization, and post-processing, as illustrated in Fig. 1. Table 2 lists the notations used in this paper. Given a raw Twitter stream, irrelevant or noisy tweets are first filtered out. Only tweets which are likely to describe events are kept and processed by temporal resolution, part-of-speech (POS) tagging and named entity recognition in the pre-processing step. Afterwards, a Bayesian model is proposed and employed for event extraction and categorization. Here, an event is represented as a tuple
The proposed framework for exploring event from Twitter.
Two approaches have been explored for filtering tweets. The first approach is through lexicon matching. By collecting news articles published around the same period as tweets, a lexicon is constructed by extracting keywords from these articles based on a measure such as TF-IDF (term frequency-inverse document frequency). Then, only the tweets containing words that can be found in the lexicon are kept.
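The lexicon-matching filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the function names (`build_lexicon`, `filter_tweets`), the `top_k` cutoff and the choice to rank each word by its highest TF-IDF score in any article are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_lexicon(articles, top_k):
    """Return the top_k words ranked by their best TF-IDF score in any article."""
    docs = [Counter(tokenize(a)) for a in articles]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())
    best = {}
    for counts in docs:
        total = sum(counts.values())
        for word, freq in counts.items():
            # Plain TF-IDF: relative term frequency times log inverse document frequency.
            score = (freq / total) * math.log(n_docs / doc_freq[word])
            best[word] = max(best.get(word, 0.0), score)
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return set(ranked[:top_k])

def filter_tweets(tweets, lexicon):
    """Keep only the tweets that contain at least one lexicon word."""
    return [t for t in tweets if lexicon.intersection(tokenize(t))]

# Hypothetical toy data for illustration only.
articles = [
    "the senate approved the nuclear treaty",
    "the earthquake damaged the coastal city",
    "the president visited the flooded region",
]
lexicon = build_lexicon(articles, top_k=12)
kept = filter_tweets(
    ["Senate vote on the treaty today", "the coffee here is great"], lexicon
)
```

In this toy corpus the word "the" occurs in every article, so its inverse document frequency is zero and it stays out of the lexicon; the second tweet is therefore dropped.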
Apart from the keyword-based approach, we have employed another feature based approach, which casts tweet filtering as a binary classification problem. Given a set of tweets
Binary word features. We select words that occur frequently in event-related tweets but rarely in non-event tweets as highly class-indicative features to build our feature set. The importance score of a word is defined as TFP/TFN, where TFP is the term frequency in the event-related tweets while TFN is the term frequency in non-event tweets. We sort the words by their importance scores and only select the top
Other event-related features. We notice that tweets containing information related to authoritative news agencies such as CNN or BBC and some phrases such as “breaking news” most likely describe real-world events. As such, we also include binary features indicating the presence of news agencies and some manually selected indicative phrases. Furthermore, we add other binary features [25] which consist of time-related phrases, opinionated words, currency and percentage signs, URLs, replies to other users such as “@username”, etc.
Event elements. As an event is described as “something that happens at a given place and time”, the presence of named entity, location, and time information could be potentially useful to detect the occurrence of an event in text. Hence, they are also used as features to train a binary classifier.
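The TFP/TFN importance score used for the binary word features can be illustrated with a short sketch. The additive `smoothing` term is an assumption added here to avoid division by zero for words that never occur in non-event tweets; the source defines the score simply as TFP/TFN. Whitespace tokenization and all function names are likewise illustrative.

```python
from collections import Counter

def importance_scores(event_tweets, non_event_tweets, smoothing=1.0):
    """Score each word by TFP / (TFN + smoothing)."""
    tfp = Counter(w for t in event_tweets for w in t.lower().split())
    tfn = Counter(w for t in non_event_tweets for w in t.lower().split())
    # Counter returns 0 for unseen words, so the smoothing term keeps the ratio finite.
    return {w: tfp[w] / (tfn[w] + smoothing) for w in tfp}

def select_features(scores, top_n):
    """Return the top_n words by score, ties broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

def binary_vector(tweet, features):
    """Encode a tweet as 0/1 indicators for the selected words."""
    words = set(tweet.lower().split())
    return [1 if f in words else 0 for f in features]

# Hypothetical toy data for illustration only.
event_tweets = ["earthquake hits city", "earthquake kills dozens", "breaking earthquake news"]
non_event_tweets = ["i love this city", "news about my cat"]
scores = importance_scores(event_tweets, non_event_tweets)
features = select_features(scores, top_n=2)
vector = binary_vector("Breaking news from Tokyo", features)
```

The other feature groups (news agency mentions, indicative phrases, event elements) would be appended to this vector as further binary indicators before training the classifier.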
In the proposed framework, an event is represented as a tuple of named entities, date, location, and event-related keywords. Therefore, it is crucial to identify date, location and named entities in Tweets. As Twitter users might represent the same date in various forms, SUTime (
We have proposed an unsupervised latent variable model called LECM to extract and cluster event instances [28]. It is assumed that in the model, each tweet message
However, after a close examination of the collected tweet data, which will be discussed in more detail in Section 4, we found that very few tweets contain temporal expressions. Out of a total of 706,815 tweets after filtering, only 133,031 (less than 20%) contain temporal expressions. As such, for most tweets, their publish timestamps have been used to set the “date” element in the LECM model. However, tweets discussing a certain event could be published one or two days after the event actually happened. Also, the LECM model allows event instances with similar keywords but different dates to be clustered into the same event. This results in the same event being associated with multiple dates. To tackle this problem, we propose to modify LECM by dropping the “date” element and call the modified model LECM-d. The graphical model of LECM-d is shown in Fig. 2. In addition, we propose to combine some heuristic rules with the outputs generated by LECM-d to infer the date information of the extracted events more accurately:
LECM-d: A latent variable model for event extraction and categorization.
Split tweets into bins where each bin corresponds to a specific date. To infer event dates more reliably, we consider both tweets’ publish timestamps and the temporal expressions found in tweets. For tweets without temporal expression, the events discussed in tweets are assumed to happen on the same day
Extract events separately for each bin. The LECM-d model is used to extract events and event types from tweets in each bin. Here, the events extracted do not have the date information.
Infer the date information for each extracted event. We assume that the earliest date when the event is mentioned on Twitter is the date when it happened. Therefore, for each extracted event, the date information is assigned based on the following merging step. First, we compare the events extracted from tweets in nearby bins. Events with overlapping entities and similar keywords are considered to be the same event and are merged. The merged events are then assigned the date when the event was first mentioned.
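The three heuristic steps above can be sketched in a few lines. This is an illustrative approximation under stated assumptions: tweets without a resolved temporal expression are binned by their publication date, "nearby bins" is taken to mean at most `max_gap_days` apart, and "similar keywords" is measured by Jaccard similarity with a threshold `min_keyword_sim`. None of these parameters, nor the dictionary layout, come from the source, and the LECM-d extraction step itself is not shown.

```python
from collections import defaultdict
from datetime import date

def bin_tweets(tweets):
    """Group tweets by candidate event date."""
    bins = defaultdict(list)
    for tweet in tweets:
        # Assumption: fall back to the publication date when no
        # temporal expression was resolved for the tweet.
        day = tweet.get("resolved_date") or tweet["published"]
        bins[day].append(tweet)
    return dict(bins)

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def same_event(e1, e2, min_keyword_sim):
    """Events match if they share an entity and have similar keywords."""
    shares_entity = bool(e1["entities"] & e2["entities"])
    return shares_entity and jaccard(e1["keywords"], e2["keywords"]) >= min_keyword_sim

def merge_events(events_by_date, max_gap_days=2, min_keyword_sim=0.5):
    """Merge matching events from nearby bins and keep the earliest date."""
    merged = []
    for day in sorted(events_by_date):  # earliest bins first
        for event in events_by_date[day]:
            for existing in merged:
                close = (day - existing["last_seen"]).days <= max_gap_days
                if close and same_event(existing, event, min_keyword_sim):
                    existing["entities"] |= event["entities"]
                    existing["keywords"] |= event["keywords"]
                    existing["last_seen"] = day
                    break
            else:
                # No match found: start a new event dated at its first mention.
                merged.append({
                    "entities": set(event["entities"]),
                    "keywords": set(event["keywords"]),
                    "date": day,
                    "last_seen": day,
                })
    return merged
```

Because bins are visited in chronological order, the `date` field of a merged event is always the day of its first mention, which matches the assumption stated in the third step.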
The generative process of LECM-d is shown below.
Draw the event distribution
Draw the event type distribution
For each event
For each event type
For each tweet
    Choose an event
    For each named entity occurring in the tweet
    For each location occurring in the tweet
    For other word positions, choose a word
For each event
    Choose an event type
    For each named entity occurring in the event
    For each keyword in the event
Letting
Here, depending on the word type at each word position
Taking the product of marginal probabilities of tweets in a corpus gives us the probability of the corpus.
We use collapsed Gibbs sampling [10] to infer the parameters of the model and the latent class assignments for events and categories, given observed data
Letting the subscript
where
Letting the subscript
where
Once the class assignments for all events are known, we can easily estimate the model parameters
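To make the inference procedure more concrete, the following sketch shows collapsed Gibbs sampling for a deliberately simplified model. It is not LECM-d: here each tweet is assigned to a single latent event and all of its words are drawn from one event-specific multinomial with a symmetric Dirichlet prior, whereas LECM-d treats named entities, locations and keywords separately and also samples event types. The hyperparameters `alpha` and `beta`, the function name and the data layout are assumptions made for the example.

```python
import math
import random
from collections import Counter

def gibbs_cluster(docs, n_events, alpha=1.0, beta=0.1, n_iter=50, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet-multinomial mixture over tweets."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    doc_counts = [Counter(d) for d in docs]
    z = [rng.randrange(n_events) for _ in docs]          # event assignment per tweet
    tweets_in = [0] * n_events                           # number of tweets per event
    words_in = [0] * n_events                            # number of tokens per event
    word_counts = [Counter() for _ in range(n_events)]   # per-event word counts
    for d, k in enumerate(z):
        tweets_in[k] += 1
        words_in[k] += len(docs[d])
        word_counts[k].update(doc_counts[d])

    for _ in range(n_iter):
        for d, counts in enumerate(doc_counts):
            # Remove the tweet's contribution from its current event.
            k_old = z[d]
            tweets_in[k_old] -= 1
            words_in[k_old] -= len(docs[d])
            word_counts[k_old].subtract(counts)
            # Conditional log-probability of each event, with parameters integrated out.
            log_p = []
            for k in range(n_events):
                lp = math.log(tweets_in[k] + alpha)
                for w, c in counts.items():
                    for j in range(c):
                        lp += math.log(word_counts[k][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(words_in[k] + vocab_size * beta + i)
                log_p.append(lp)
            # Sample a new event from the normalized conditional distribution.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            k_new = rng.choices(range(n_events), weights=weights)[0]
            z[d] = k_new
            tweets_in[k_new] += 1
            words_in[k_new] += len(docs[d])
            word_counts[k_new].update(counts)
    return z

# Hypothetical toy data for illustration only.
docs = [["quake", "tokyo"], ["quake", "japan"], ["treaty", "senate"], ["treaty", "obama"]]
assignments = gibbs_cluster(docs, n_events=2, n_iter=20)
```

Once the assignments have stabilized, smoothed estimates of the event and word distributions can be read directly from the count tables, which mirrors the parameter estimation step mentioned above.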
To improve the precision of event extraction and categorization, we remove the least confident event element from the 4-tuple in LECM and LECM-d using the following rules.
If If
Here,
Our model automatically groups events into different event clusters. For each event cluster, the most prominent semantic class obtained based on the event entities in the cluster is used as the event type label.
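The labelling rule can be sketched as a majority vote over the semantic classes of the entities in a cluster. The `semantic_class` lookup table is hypothetical; the source does not specify in this passage where the semantic classes come from, so a plain dictionary stands in for that resource.

```python
from collections import Counter

def event_type_label(cluster, semantic_class):
    """Return the most frequent semantic class among a cluster's entities."""
    counts = Counter(
        semantic_class[entity]
        for event in cluster
        for entity in event["entities"]
        if entity in semantic_class  # skip entities with no known class
    )
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical toy data for illustration only.
semantic_class = {
    "Barack Obama": "politician",
    "Dmitry Medvedev": "politician",
    "Senate": "organization",
}
cluster = [
    {"entities": ["Barack Obama", "Senate"]},
    {"entities": ["Dmitry Medvedev"]},
]
label = event_type_label(cluster, semantic_class)
```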
In this section, we first describe the datasets used in our experiments and then introduce the baseline system for comparison. Experimental results on filtering, extraction and categorization are subsequently presented. Finally, an error analysis is conducted to give insights into the proposed framework.
Setup
Two datasets are constructed by collecting tweets in the month of December in 2010. Dataset I contains tweets which are manually annotated as event-related or not for the training of a binary classifier in the filtering step. Tweets are annotated as event-related if relevant news articles can be found in the one-week window before and after the tweets’ publication dates. We argue that this is a reasonable choice since newsworthy events would be more interesting than others. In total, we have 2,891 event-related and 26,000 non-event-related tweets in Dataset I. Dataset II (
The baseline we chose is TwiCal [22], the state-of-the-art open event extraction system on tweets. Each event extracted by the baseline is represented as a 3-tuple
The evaluation is conducted in three aspects: filtering, extraction and categorization.
Tweet filtering. As most tweets in Datasets I and II are not event-related, we only report the performance of classifying event-related tweets. Precision is defined as the proportion of correctly identified event-related tweets among the tweets the system returns as event-related. Recall is defined as the proportion of true event-related tweets that are correctly identified.
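The two measures for the event-related class can be written down directly. In this small sketch, `predicted` and `gold` are assumed to be sets of tweet identifiers labelled event-related by the system and by the annotators respectively; the function name is illustrative.

```python
def precision_recall(predicted, gold):
    """Precision and recall for the event-related class."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical toy data: 4 tweets returned, 5 truly event-related, 2 in common.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5, 6, 7})
```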
Event extraction. Due to the large volume of tweets in Dataset II, it is almost impossible to know the exact number of events it contains. Therefore, we only report the precision of our event extraction results. For the 4-tuple
Event categorization. The performance is evaluated in two ways: considering only the correctly extracted events, and using all the extracted events.
As discussed in Section 3.1, we have explored both keyword-based and classifier-based approaches for tweet filtering. For the classifier-based approach, we use Weka [12] to train an SVM with default parameters on Dataset I and perform 3-fold cross-validation. For the keyword-based approach, news articles were collected from the GDELT Event Database (
The performance of tweet filtering on Dataset I
Since most tweets in Dataset I are not event-related, it makes sense to only report the results on the event-related class. It can be observed that the SVM-based approach achieves higher precision but a much lower recall. This might be attributed to the highly imbalanced training data in Dataset I, where only about 10% of tweets are event-related. We also tested both keyword-based and SVM-based approaches on Dataset II. Due to the large size of Dataset II, it is impossible to measure the actual performance of both approaches. We instead randomly selected 1,000 tweets identified as event-related by each approach and manually checked the accuracy. We found that the keyword-based approach gives higher precision than the SVM-based approach. As such, we chose to use the keyword-based approach for tweet filtering in all the subsequent experiments.
To further understand the effect of our filtering step, examples of the events extracted using our proposed framework with and without filtering are presented in Table 5. It can be observed that without filtering, some extracted events are not really newsworthy events although they also contain named entities and meaningful keywords. For example, there are many tweets talking about watching the movie “Harry Potter”. However, this is not considered a newsworthy event.
Examples of the extracted events with or without filtering
After filtering and pre-processing, fewer than 250,000 tweets in Dataset II are kept. Figure 3 shows the number of tweets remaining after filtering for each day. It can be observed that the number of tweets varies significantly, with a minimum of 220 and a maximum of 7,253. Compared to 200,000 tweets per day in the original data, the filtering step removes the vast majority of non-event-related tweets before subsequent processing.
The number of tweets in each day after filtering.
These tweets are fed into LECM and LECM-d for event extraction and categorization. For LECM-d, we need to group tweets by their potential event dates as mentioned in Section 3.3. Since for each potential event date
The number of extracted events for each day using LECM-d.
The processing time on each day’s data using LECM-d.
The event extraction precisions of TwiCal, LECM and LECM-d are presented in Table 6. As TwiCal outputs a list of events ranked by confidence from high to low, the number of events to be extracted for TwiCal is set to 315 for fair comparison. It can be observed that the filtering step is crucial to event extraction. By filtering out non-event-related tweets, the precision of our event extraction component increases dramatically from 28.33% to 68.25%. Our proposed framework using LECM-d has the best performance, with a precision of 78.01%, and every event is assigned a date.
Comparison of the performance of event extraction on Dataset II
When compared against the baseline approach, TwiCal, it can be observed from Table 6 that LECM significantly outperforms the baseline, with an improvement of nearly 3.8 percentage points in precision. Moreover, LECM-d further improves upon LECM by 9.75 percentage points and outperforms TwiCal by 13.6 percentage points. The precision of event extraction for each day using LECM-d is shown in Fig. 6. It can be observed that for some days, the precision of event extraction even reaches 100%.
The precision of event extraction in each day using LECM-d.
The significant improvement over TwiCal can be attributed to two main reasons. One is that in a large scale Twitter dataset such as Dataset II, tweets with temporal keywords are rare and many event-related tweets have no date information. As such, TwiCal which relies on the association between named entities and dates for event extraction fails to handle tweets with no date information. The other reason is that TwiCal assumes that one event has only one named entity, which is not true in some cases. For example, in the tweet “Russian President Dmitry Medvedev on Thursday congratulated President Barack Obama on the Senate’s approval of a new nuclear arms control treaty between the countries”, both “Dmitry Medvedev” and “Barack Obama” are involved. Our proposed approach does not impose such a constraint.
To further understand the clustering effect of the proposed LECM-d, we analyze the number of tweets related to each extracted event. The statistics are shown in Fig. 7. It can be observed that most events are mentioned in fewer than 300 tweets. Only 9 events are mentioned in more than 600 tweets.
The number of tweets versus the number of events.
To see the impact of the event number chosen in LECM-d, we report the extracted results with different number of events
The number of extracted events versus the number of correctly extracted events using different 
We have compared the events extracted by LECM and LECM-d and found that 160 events, about 74.4% of the events correctly extracted by LECM, are also correctly extracted by LECM-d. However, 131 events correctly extracted by LECM-d are not discovered by LECM. This shows that our proposed method for inferring date information from tweets could potentially help in improving the recall of the system. Table 7 presents some examples of errors that arise from simply using the publishing date as the event date and that are corrected by our proposed method.
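The 74.4% overlap figure can be checked against the precision reported earlier. The sketch below assumes that LECM extracted 315 events in total, which is inferred from the cutoff chosen for TwiCal "for fair comparison" and is not stated explicitly in the text.

```python
# Assumption: LECM's total number of extracted events equals the TwiCal cutoff.
lecm_extracted = 315
lecm_precision = 0.6825   # reported precision of LECM with filtering
shared_correct = 160      # events correctly extracted by both LECM and LECM-d

# Number of events LECM extracted correctly, rounded to a whole event.
lecm_correct = round(lecm_extracted * lecm_precision)

# Share of LECM's correct events that LECM-d also extracts correctly.
overlap = shared_correct / lecm_correct
print(lecm_correct, f"{overlap:.1%}")
```

Under that assumption, LECM's correct events number 215, and 160 of them is about 74.4%, in agreement with the figure quoted above.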
Examples of extraction errors caused by using publishing date as event date. These errors are corrected by the proposed approach
The event extraction and categorization component automatically clusters events into different event types. We empirically set the number of event types to 25 in both LECM and LECM-d. Some example event categorization results generated by LECM-d are presented in Table 8. It can be observed from the results that our event categorization component does group similar events together. We evaluate the precision of event categorization on the correctly extracted events and also on all extracted events. We found that when evaluated on the correctly extracted events, LECM and LECM-d give similar precision results of 43.87% and 42.57% respectively. However, when evaluated on all extracted events, LECM-d achieves a precision of 38.3% on event categorization whereas LECM only gives a precision of 29.5%.
Examples of event categorization results. The event type labels are automatically assigned using the most frequent semantic class for entities in each event cluster
To further investigate the performance of the proposed framework, we conduct an analysis of the extraction errors, which can be categorized into five types:
Filtering errors (30%): Some non-event-related tweets are not removed by the filtering step.
Temporal information errors (10%): Although we have reduced the temporal resolution errors with the pre-processing step and LECM-d, there are still some errors incurred by wrongly recognised event dates.
NER errors (10%): Some extraction errors are caused by NER errors. For example, “Red” might denote a color or the name of a person, and it might be wrongly extracted as a named entity from tweets.
Keyword errors (20%): Event-related keywords might be wrongly identified for some events. For example, for the event “Amy Winehouse died”, words such as “fans” are wrongly identified as event-related keywords.
Other errors (30%): The model clusters tweets with the same named entity, location, date and keywords as describing the same event. However, some different events might have the same date and location, and even share similar keywords. For example, the two events “Car bomb explodes in Oslo, Norway” and “gunman opens fire in youth camp in Norway” happened on the same day and in the same country. They might even share the same keywords, such as “fire”, in some tweets. The LECM model could therefore wrongly extract a single event from two tweets that actually mention two different events.
In this paper, we have proposed an unsupervised framework for event exploration on Twitter. A pipeline process consisting of filtering, extraction and categorization is introduced. All the steps here are fully unsupervised, which makes our proposed framework particularly suitable for analyzing events in large-scale social stream data. A new method of combining heuristic rules with the outputs generated by the modified LECM model has been proposed to infer event dates more accurately. The proposed framework has been evaluated on a large Twitter dataset consisting of 60 million tweets and has achieved a precision of 78.01%, outperforming the TwiCal baseline by nearly 13.6 percentage points. It also outperforms the previously proposed LECM model by 9.75 percentage points. Moreover, events are also clustered into coherent groups with an automatically assigned event type label, with an accuracy of 42.57%. Our current model handles tweets from different dates separately. It is possible to explore a dynamic version of our proposed Bayesian model which can take into account the date dependencies to improve the event extraction performance.
Acknowledgments
This work was funded by the National Natural Science Foundation of China (61528302), the Natural Science Foundation of Jiangsu Province of China (BK20161430), the Innovate UK under the grant number 101779 and the Collaborative Innovation Center of Wireless Communications Technology.
