Introduction
From user groups and online forums to Facebook, Twitter, Instagram, and YouTube, social media platforms have become ubiquitous. Their use is particularly prevalent during emergencies. For instance, the Federal Emergency Management Agency (FEMA) wrote in its 2013 National Preparedness Report (Maron, 2013) that during and immediately following Hurricane Sandy in 2012, “users sent more than 20 million Sandy-related Twitter posts, or tweets, despite the loss of cell phone service during the peak of the storm.” Such huge amounts of user-generated data contributed by disaster-affected communities have become an important source of big crisis data for disaster response (Castillo, 2016; Reuter & Kaufhold, 2018), and at the same time have been used by the public at large to make sense of unfolding events (Stieglitz, Bunker, Mirbabaie, & Ehnis, 2018). Many research and practical studies have demonstrated the value of social media data in disseminating warning and response information, enhancing situational awareness, facilitating the allocation of resources, informing disaster risk reduction strategies and risk assessments (Watson, Finn, & Wadhwa, 2017; Reuter, Hughes, & Kaufhold, 2018; National Research Council, 2013), and fostering community resilience (Zhang, Drake, Li, Zobel, & Cowell, 2015). Despite these benefits, the sheer volume of the data still precludes large emergency organizations from using it routinely (Meier, 2013).
Manually sifting through voluminous streaming data to filter useful information in real time is infeasible. Machine learning techniques show promising results in automating the identification of useful, relevant, and trustworthy information in big crisis data (Qadir et al., 2016), despite many practical challenges (Mendoza, Poblete, & Castillo, 2010). Many works have successfully used supervised learning algorithms to automatically classify tweets (Caragea, Squicciarini, Stehle, Neppalli, & Tapia, 2014; Imran, Elbassuoni, Castillo, Diaz, & Meier, 2013). Supervised algorithms require labeled training data to learn classifiers, which can then be used to label new data of the same type (the test data). The labels generated for the test data are usually accurate when the training and test data are drawn from the same distribution.
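To make the supervised setup concrete, the following minimal sketch (not drawn from any of the cited works) trains a multinomial Naive Bayes text classifier on a handful of hypothetical labeled tweets and then labels a new, unseen tweet. The tweet texts and the "relevant"/"irrelevant" labels are invented for illustration:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial Naive Bayes classifier.
    docs: list of token lists; labels: parallel list of class labels."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for toks, y in zip(docs, labels):
        counts[y].update(toks)
    vocab = {t for toks in docs for t in toks}
    return priors, counts, vocab

def predict_nb(model, toks):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    priors, counts, vocab = model
    best, best_lp = None, float("-inf")
    for c, prior in priors.items():
        total = sum(counts[c].values())
        lp = math.log(prior)
        for t in toks:
            if t in vocab:  # ignore out-of-vocabulary tokens
                lp += math.log((counts[c][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Hypothetical labeled tweets from a "source" disaster.
train = [("flood water rising downtown".split(), "relevant"),
         ("bridge closed due to flood".split(), "relevant"),
         ("great coffee this morning".split(), "irrelevant"),
         ("watching a movie tonight".split(), "irrelevant")]
model = train_nb([d for d, _ in train], [y for _, y in train])
print(predict_nb(model, "flood water near the bridge".split()))  # prints "relevant"
```

Because the test tweet shares vocabulary with the training tweets (i.e., it comes from the same distribution), the classifier labels it correctly; the next paragraph describes what happens when that assumption breaks.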
The requirements above give rise to two main challenges when machine learning algorithms are used to classify user-generated tweets about emerging disasters such as floods, hurricanes, and terrorist attacks. First, labeled data is not readily available for an emerging “target” disaster for which a classifier is needed to help disaster response teams identify relevant tweets, and ultimately information useful for situational awareness. Labeling data is an expensive and time-consuming process, and thus does not provide a real-time solution for disaster response. Labeled data from a prior “source” disaster can potentially be used to learn a supervised classifier for the target disaster (Starbird, Palen, Hughes, & Vieweg, 2010). However, a second challenge is posed by the fact that data from the source disaster and data from the target disaster may not share the same distribution (or characteristics), so a classifier learned on the source may not perform well on the target.
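One simple, illustrative symptom of this source/target mismatch (a heuristic sketch, not a method from the literature cited here) is low overlap between the token vocabularies of the two disasters' tweets: a classifier trained on source-disaster features will then encounter mostly unseen features on the target. The tweet texts below are invented:

```python
def vocab_overlap(source_docs, target_docs):
    """Jaccard overlap between the token vocabularies of two tweet sets.
    A low value hints that a classifier trained on the source will see
    mostly out-of-vocabulary tokens on the target."""
    src = {t for d in source_docs for t in d.lower().split()}
    tgt = {t for d in target_docs for t in d.lower().split()}
    return len(src & tgt) / len(src | tgt)

# Hypothetical tweets: source = a flood, target = an earthquake.
source = ["flood water rising fast", "evacuate before the flood"]
target = ["earthquake shook the city", "aftershock felt downtown"]
print(round(vocab_overlap(source, target), 2))  # prints 0.08
```

In practice the distribution shift goes well beyond surface vocabulary (topics, named entities, hashtag conventions all differ), but mismatched features are the most immediate reason a source-trained classifier degrades on the target.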