The Challenges of “More Data” for Protest Event Analysis

1. Introduction

We thank the editor for the opportunity to engage in scholarly dialogue, and we are grateful to the commentators for their insightful and constructive critiques. We agree in many ways with our colleagues. Conceptually, we agree with Hutter’s observation (this volume, pp. 58–63) that our use of the term collective action events is more restrictive than current usage of the term in sociology. As we continue this project, we are interested in expanding CASM to include a wider range of action forms, such as Internet activism and collective petitioning. Operationally, Oliver’s comment that humans are imperfect at identifying protests absolutely resonates with our experience in having research assistants code training and validation data. We spent a great deal of time training our human coders and improving our coding rules for constructing the second-stage training data and the validation data set, but despite high intercoder reliability, there were always ambiguous “edge” cases. Here, we believe that conducting out-of-sample validation plays a crucial role in assessing the extent to which an automated approach minimizes the blurriness of what constitutes an event. We also appreciate Oliver’s helpful comments (this volume, pp. 63–68) about how to incrementally improve the China data. We have explored several of these areas, and results (e.g., on keyword set size) can be seen in the Appendix.¹ Methodologically, we agree with Steinert-Threlkeld (this volume, pp. 68–75) on the value of image data as well as multimodal data. Finally, all three commentaries encourage us to expand CASM to cover more characteristics (which Hutter refers to as “subdimensions”) of collective action events, for example, size, action form, claims/issues, targets, organizers, and violence. We wholeheartedly agree, and we are actively pursuing this area now.

Another commonality between the three commentaries is the idea that more data—more subdimensions of events (Hutter), more modes of data (Steinert-Threlkeld), more media sources (Oliver)—will improve our ability to correctly identify events. More data can absolutely improve our understanding of events, but it may harm the precision and recall of event detection when automated systems of event identification are used to integrate these data. We want to spend the main portion of this rejoinder discussing the challenges of more data in machine classification of events. Additional methodological work is needed to effectively incorporate additional dimensions and sources of data for automated methods of event identification.

Hutter writes that measuring more subdimensions of protest events (for example, action form, claims/issues, targets, organizers) can increase precision of event identification by reducing duplication and minimizing false positives. Steinert-Threlkeld notes that using multimodal data is one way of overcoming the limited diversity of events that can be detected in images. Steinert-Threlkeld recommends using text and image data, as we have done, as well as adding in metadata such as screen names, biographic descriptions, and image captions to expand the diversity of events identified through social media.

Currently, we group posts located in the same county² and on the same date into one collective action event. The location either comes from geolocation metadata or the text of the post. The date is taken from the post’s metadata. Now, imagine that we add two more subdimensions: target and protest size. Suppose we identify the target from the text of the post and identify protest size from the image because, as Steinert-Threlkeld notes, image data increase the precision of measuring crowd size. It is easy to imagine how adding these two additional subdimensions would improve event identification. For example, we might see that some posts made on the same day in the same county are about a large-scale protest targeting a polluting factory and other posts are about a medium-sized protest targeting a government bureau. In this case, by adding the target and protest size subdimensions to protest location and date, we improve the precision and recall of event identification.³
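The grouping rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CASM implementation: the posts, the field names (`county`, `date`, `target`), and the helper `group_posts` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical posts; the field names and values are illustrative only.
POSTS = [
    {"id": 1, "county": "County A", "date": "2016-03-01", "target": "factory"},
    {"id": 2, "county": "County A", "date": "2016-03-01", "target": "factory"},
    {"id": 3, "county": "County A", "date": "2016-03-01", "target": "government bureau"},
    {"id": 4, "county": "County B", "date": "2016-03-01", "target": "factory"},
]


def group_posts(posts, keys):
    """Group post ids into events by exact match on the given subdimensions."""
    events = defaultdict(list)
    for post in posts:
        events[tuple(post[key] for key in keys)].append(post["id"])
    return dict(events)


by_place_date = group_posts(POSTS, ("county", "date"))
by_place_date_target = group_posts(POSTS, ("county", "date", "target"))

print(len(by_place_date), "events from county and date")
print(len(by_place_date_target), "events once target is added")
```

With county and date alone, posts 1 through 3 fall into a single event; adding the target subdimension separates the factory protest from the government bureau protest, provided every post is coded completely and correctly.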

However, to incorporate the target and protest size subdimensions, or any subdimension for that matter, we would have to extract subdimension information from the text, image, or metadata. And, as with any classification system (especially machine-based), there will be missing data (some posts will not contain information about target or protest size) and incorrect classification (the target or protest size will be incorrectly identified). As more subdimensions are added for event identification, missing data can lead to underreporting the number of events, and error in coding subdimensions can lead to overreporting the number of events and misattribution of posts to events. These problems are especially significant if we use exact matching methods, as in our article.

2. Missingness

If we use exact matching on $k$ subdimensions (e.g., location, time, target, size), each post in $T_{protest}$ will have $k$ subdimensions, and we consider posts the same event if all $k$ subdimensions have the same value $m$. However, not every post will have a value for all $k$ subdimensions, and this is especially true of social media data, which can be short and idiosyncratic. Assume that for each subdimension, there is a proportion of posts $p_{i}$, where $i = 1, \dots, k$, that we cannot code. If an algorithm requires every attribute be matched in order to group posts into events, posts whose subdimensions are missing must be dropped. The proportion of posts with complete coded data would then be

$$\prod_{i = 1}^{k} (1 - p_{i}).$$

Imagine a simple case where we can code 80 percent of posts on each subdimension, so that 20 percent cannot be coded: $p_{1} = p_{2} = \dots = p_{k} = 0.2$. This is reasonable for some subdimensions and likely optimistic for others.⁴ If we have $k = 2$ subdimensions (location and date), the proportion of posts that are not missing on either attribute is $0.8^{2} = 0.64$, which means we keep 64 percent of the data and drop 36 percent. If we increase the number of subdimensions to $k = 4$ (location, date, target, size), the proportion of posts that are not missing on any attribute is $0.8^{4} \approx 0.41$, which means we keep 41 percent of the data and drop 59 percent. The amount of data we throw away keeps increasing with the number of subdimensions, and the number of events we identify decreases rapidly.
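The arithmetic above can be reproduced with a short Python sketch. It assumes, as the product formula implicitly does, that missingness is independent across subdimensions; the function name `share_complete` is hypothetical, and the rates passed in are the share of posts that cannot be coded on each subdimension (20 percent here, so that 80 percent can be coded).

```python
import math


def share_complete(missing_rates):
    """Share of posts coded on every subdimension, assuming that
    missingness is independent across subdimensions."""
    return math.prod(1 - p for p in missing_rates)


# 20 percent of posts cannot be coded on each subdimension.
for k in range(1, 7):
    kept = share_complete([0.2] * k)
    print(f"k={k}: keep {kept:.2f}, drop {1 - kept:.2f}")
```

Because the function takes a list, it can also be called with a different missingness rate for each subdimension, for instance a low rate for date and a higher one for target.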

Of course, this scenario may be too extreme. In practice, we would develop methods to utilize posts with missing subdimensions. For example, we may require matching only on a set of subdimensions smaller than $k$ . Moving away from exact matching, however, can make event grouping more susceptible to error. Posts may be incorrectly grouped into an event because of missingness.⁵ In the presence of missing data, adding additional subdimensions does not automatically improve event identification. It can lead to serious underreporting, which degrades recall (false negatives increase) if we do not carefully consider how to address missing subdimensional information in event grouping.
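To see how relaxing exact matching can produce ambiguous groupings, consider one possible relaxed rule, sketched below in Python: a missing subdimension is treated as compatible with any value. This rule, the function `compatible`, and the example values are assumptions made for illustration; they are not a method we have adopted.

```python
def compatible(post, event_key):
    """A post is compatible with an event if every non-missing
    subdimension of the post agrees with the event's value."""
    return all(value is None or value == expected
               for value, expected in zip(post, event_key))


# Two hypothetical events in the same county on the same date.
EVENTS = [
    ("County A", "2016-03-01", "factory"),
    ("County A", "2016-03-01", "government bureau"),
]

# A post whose target could not be coded (None marks a missing value).
post = ("County A", "2016-03-01", None)

candidates = [event for event in EVENTS if compatible(post, event)]
print(len(candidates), "candidate events for this post")
```

The post is compatible with both events, so any deterministic tie-breaking rule risks assigning it to the wrong one.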

3. Error in Subdimension Coding

Even if we do not have the problem of missing subdimensions, some subdimensions may be incorrectly classified. Suppose again that each post in $T_{protest}$ will have $k$ subdimensions. For a particular event $e$, let us assume that the correctly coded $k$ subdimensions are $\langle m_{1c}, m_{2c}, \dots, m_{kc} \rangle$, where $c$ denotes correct classification. If the second subdimension ($i = 2$) were incorrectly coded for a particular post $a$, we would observe a different set of subdimension values: $\langle m_{1c}, m_{2w}, \dots, m_{kc} \rangle$, where $w$ denotes incorrect classification. In a full data set, we would know there is incorrect classification, but we would not know which subdimensions or posts were incorrectly classified. As a result, $\langle m_{1c}, m_{2c}, \dots, m_{kc} \rangle$ and $\langle m_{1c}, m_{2w}, \dots, m_{kc} \rangle$ would be treated as two different events because their values differ on the second subdimension. This one error in subdimension coding would lead to the creation of an additional event that does not actually exist (false positive).
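A small Python sketch makes the mechanism concrete. The codings below are hypothetical: four posts describe one real event, and the target of the third post is miscoded. Under exact matching, every distinct tuple of subdimension values counts as its own event.

```python
from collections import Counter

# Hypothetical codings for four posts about one real event. The third
# post's target was miscoded ("school" instead of "factory").
TRUE_KEY = ("County A", "2016-03-01", "factory")
CODED_POSTS = [
    TRUE_KEY,
    TRUE_KEY,
    ("County A", "2016-03-01", "school"),
    TRUE_KEY,
]


def count_events(coded_posts):
    """Exact matching: each distinct tuple of values is treated as an event."""
    return Counter(coded_posts)


events = count_events(CODED_POSTS)
spurious = len(events) - 1  # one real event, so any others are spurious
print(len(events), "events identified,", spurious, "of them spurious")
```

A single miscoded value is enough to turn one event into two, and nothing in the coded data indicates which of the two is the artifact.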

More generally, imagine that $n$ posts are associated with an event. If every subdimension is correctly classified, we can recover this event from $n$ posts. This time, let us use $p_{i}$ to denote the proportion of posts for which subdimension $i = 1, \dots, k$ is correctly classified. In the presence of classification errors described previously, the expected number of additional incorrect events we would identify is

$$(n - 1) \left( 1 - \prod_{i = 1}^{k} p_{i} \right).$$

Assuming we have four subdimensions and $p_{i}$ is 0.9 for all $i = 1, \dots, k$ (that is, a 10 percent misclassification rate, which is relatively low for machine learning algorithms), then the expected number of additional incorrect events we identify is

$$(n - 1) \cdot (1 - 0.9^{4}) \approx 0.34 (n - 1).$$

This number increases with the number of total posts ( $n$ ) associated with each event. It increases as classification error increases (as $p_{i}$ decreases), and it increases as more subdimensions are added. When subdimension coding errors are present, adding more subdimensions will generate events that do not exist in reality and overestimate the number of events.
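One way to check the direction of these effects is a small simulation, sketched below in Python. It is an illustration under strong assumptions rather than an analysis of our data: each subdimension is coded correctly with the same probability `accuracy`, errors are independent, and a miscoded subdimension takes one of many wrong values (`levels`), so that two miscoded posts rarely coincide. The function name and all parameter values are hypothetical.

```python
import random


def spurious_events(n, k, accuracy, levels=1000, seed=0):
    """Simulate one real event with n posts and k subdimensions.

    The correct value of every subdimension is coded as 0. Each value is
    kept with probability `accuracy`; otherwise it is replaced by a wrong
    value drawn uniformly from 1..levels-1. Returns the number of extra
    events created by exact matching (distinct coded tuples minus one).
    """
    rng = random.Random(seed)
    coded = set()
    for _ in range(n):
        post = tuple(
            0 if rng.random() < accuracy else rng.randint(1, levels - 1)
            for _ in range(k)
        )
        coded.add(post)
    return len(coded) - 1


for k in (2, 4, 6):
    for accuracy in (0.95, 0.9, 0.8):
        extra = spurious_events(n=200, k=k, accuracy=accuracy)
        print(f"k={k}, accuracy={accuracy}: {extra} spurious events")
```

Under these assumptions one would expect the count of spurious events to rise as `k` grows and as `accuracy` falls, and to be zero when `accuracy` is 1. With fewer possible wrong values, miscoded posts coincide more often and the count is lower.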

Errors in subdimension coding could also misattribute posts to the wrong event, which has implications for our understanding of the characteristics of protest events. Continue with our previous example, in which post $a$ relates to event $e$ and should have subdimension coding $\langle m_{1c}, m_{2c}, \dots, m_{kc} \rangle$ but is instead coded $\langle m_{1c}, m_{2w}, \dots, m_{kc} \rangle$. If $\langle m_{1c}, m_{2w}, \dots, m_{kc} \rangle$ does not generate a new event, it is because $\langle m_{1c}, m_{2w}, \dots, m_{kc} \rangle$ is the correct subdimension coding for another event $e'$. Incorrect subdimension coding would lead us to wrongly assign post $a$ to event $e'$. No additional event is created, but post $a$ is misattributed to event $e'$ when it should have been grouped with event $e$. This misattribution can affect our understanding of protest events. For example, if we want to understand what types of protests garner more social media attention (as measured by retweets) and our misattributed post $a$ has a large number of retweets, this misattribution would bias our estimate of the quantity of interest.

The relative probability that incorrect subdimension coding would lead to misattribution as opposed to generation of events that do not exist depends on the number of values (or levels) any subdimension takes on (which may decrease $p_{i}$ ) and the number of subdimensions $k$ . As the number of values of any subdimension or the number of subdimensions increases, incorrect coding of subdimensions is more likely to result in overreporting of events as opposed to misattribution. For example, suppose we have three subdimensions: location, date, and violence. If the violence subdimension only takes two values (violent or not), then the chance that incorrect coding of the post’s violence subdimension will result in the post being attributed to another event is higher than if the violence subdimension is a continuous variable. If the violence subdimension is a continuous variable, incorrect classification of this dimension is more likely to result in generation of a new event. Or, if we have six subdimensions, holding the levels of the overlapping subdimensions constant, incorrect coding of any subdimension will more likely result in overreporting of events (false positives) than in misattribution of posts to actual events.

More broadly, the problems of missingness and incorrect classification may also hinder attempts to merge events from different sources. Oliver writes that “the ideal would be to develop protocols that allow events collected in different ways from different sources to be merged.” If such protocols were automated, the same problems of underestimation due to missingness and overestimation and misattribution due to incorrect classification would be present. Because of missingness in one source or another, only a subset of events would be identified through merging. If there are similar errors in coding subdimensions in two data sets being merged, these errors could lead to overestimation in the number of events or misattribution of posts to events. As Schrodt (2015:6) notes, in automated event coding systems, “as the number of sources (and hence texts) increases, we see diminishing returns on the likelihood of a correct coding, but a linear increase in the number of incorrectly coded events.”

We have illustrated why more data does not automatically improve event grouping (or de-duplication) in machine coding. Hutter’s suggested remedy of adding more subdimensions and Oliver’s recommendation of a protocol for merging events would work well with human coding, but the scale of data we are working with does not allow us to extract subdimensions of events and merge events with other protest event data sets entirely by hand. Potential solutions include moving away from exact matching and deterministic algorithms to probabilistic algorithms, such as probabilistic record linkage and its extensions in large-scale data settings (Enamorado, Fifield, and Imai 2019; Fellegi and Sunter 1969; Xiao et al. 2011). In our article, we rely on two subdimensions (location and date) to group posts to an event because we do not yet have a highly reliable way of coding additional subdimensions. Similarly, the relatively small number of levels we use for location is driven by considerations of precision in coding this subdimension. Using more fine-grained levels for location (e.g., township, village, landmarks) meant more missing data and errors in coding. We experimented with different grouping methods and found that relying on two subdimensions generated the best results under our current exact matching methods. Looking forward, more work is needed to expand CASM to include more action forms; more accurately extract subdimensions of events from text, image, and other multimodal data; and develop methods to better utilize additional subdimensions to improve the precision and recall of event identification and merge different protest event data sets.
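To give a sense of what a probabilistic alternative might look like, the Python sketch below scores a pair of posts in the spirit of Fellegi and Sunter (1969): each subdimension contributes a log likelihood ratio depending on whether the two posts agree on it. Everything specific here is an assumption for illustration: the m- and u-probabilities are invented rather than estimated, the field names are hypothetical, and giving missing values a weight of zero is a simplifying choice.

```python
import math

# Hypothetical parameters for each subdimension:
# m = P(values agree | same event), u = P(values agree | different events).
PARAMS = {
    "county": (0.95, 0.01),
    "date": (0.95, 0.05),
    "target": (0.80, 0.10),
}


def match_weight(post_a, post_b, params=PARAMS):
    """Sum of log2 likelihood ratios across subdimensions.

    Agreement adds log2(m / u); disagreement adds log2((1 - m) / (1 - u)).
    A subdimension missing in either post adds nothing (an assumption).
    """
    weight = 0.0
    for field, (m, u) in params.items():
        a, b = post_a.get(field), post_b.get(field)
        if a is None or b is None:
            continue
        if a == b:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


base = {"county": "County A", "date": "2016-03-01", "target": "factory"}
same = dict(base)
no_target = dict(base, target=None)
other_target = dict(base, target="school")

for label, post in [("agree", same), ("missing", no_target), ("differ", other_target)]:
    print(label, round(match_weight(base, post), 2))
```

In the Fellegi and Sunter framework, the total weight is compared with an upper and a lower threshold to declare a link, a possible link, or a non-link. Unlike exact matching, a single missing or miscoded subdimension lowers the score without forcing the pair apart.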

Notes

Author Biographies

The author biographies can be found on page 57 of this volume.

References

Enamorado, Ted, Benjamin Fifield, and Kosuke Imai. 2019. “Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records.” American Political Science Review 113:353–71.

Fellegi, Ivan P., and Alan B. Sunter. 1969. “A Theory for Record Linkage.” Journal of the American Statistical Association 64:1183–210.

Schrodt, Philip A. 2015. “Comparison Metrics for Large Scale Political Event Data Sets.” Presented at the European Political Science Association Annual Meeting, Vienna, June 25.

Xiao, Chuan, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. 2011. “Efficient Similarity Joins for Near-Duplicate Detection.” ACM Transactions on Database Systems (TODS) 36:15.