The Panoply: tracking conflicts using large-scale social media data

Constantin Barbu, 24 February 2023
TL;DR

We collect large volumes of media & social media data about ongoing conflicts, then determine the main topics using machine learning and narrate them using GPT.

The problem

Conflicts generate a very large volume of media & social media conversation: they are now documented and discussed in real time, with millions of people taking part in these conversations. But the sheer volume of conversation a conflict generates can be overwhelming.

It can be challenging to sift through the vast amount of data, identify relevant content and extract meaningful insights. Additionally, social media conversation about conflicts can be noisy, with a significant amount of irrelevant content, making it difficult to analyze and interpret the data.

Traditionally, analysts have relied on keyword-based methods: searching a corpus of text for specific words or phrases to identify patterns or insights. While this approach can be effective for small datasets, it has significant limitations when applied to large volumes of text.

One of the main difficulties with keyword-based methods is that they rely on the researcher's ability to anticipate every relevant word and phrase in advance, which is more often than not impossible. The choice of keywords can also imprint bias on the results, since the keyword list reflects the analyzed corpus only to a superficial extent.
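To make the limitation concrete, here is a toy sketch of how a keyword filter behaves (the keyword list and posts below are invented for illustration): anything outside the hand-picked word list is silently dropped, however relevant it may be.

```python
# Toy illustration of keyword-based filtering. The keyword list is fixed
# up front by the analyst, so relevant posts phrased differently are missed.
keywords = {"ceasefire", "sanctions", "evacuation"}  # hand-picked, not learned

def is_relevant(text: str) -> bool:
    # Naive match: does the post contain any of the chosen keywords?
    return bool(set(text.lower().split()) & keywords)

posts = [
    "Talks about a possible ceasefire resume tomorrow",
    "Grain shipments blocked at the port again",  # relevant, but no keyword hits
]

print([p for p in posts if is_relevant(p)])  # only the first post survives
```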

The solution

Machine learning-based topic modeling can help make sense of very large text datasets by automatically identifying the most relevant topics present in the data. Rather than relying on manual keyword-based methods, topic modeling algorithms use statistical models and machine learning techniques to identify patterns and relationships within the data. This enables researchers to extract meaningful insights from very large text datasets more efficiently and accurately.

Topic modeling algorithms typically use unsupervised learning techniques to identify clusters of words or phrases that frequently occur together in the data. These clusters are then labeled as "topics", representing themes or subjects that are present in the data. By analyzing these topics, researchers can gain a better understanding of the most prevalent themes and patterns within the data.
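As a rough sketch of what such a pipeline looks like, the snippet below runs scikit-learn's LDA over a tiny placeholder corpus. The post does not say which topic modeling algorithm The Panoply actually uses, so LDA and the example texts here are stand-ins only.

```python
# Minimal unsupervised topic-modeling sketch using scikit-learn's LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus; in practice this would be the collected tweet texts.
tweets = [
    "shelling reported near the power plant overnight",
    "new sanctions package announced by the european union",
    "convoy of humanitarian aid crossed the border today",
    "officials discuss further sanctions on energy exports",
    "residents evacuated after shelling damaged the plant",
    "aid organisations request safe corridors for civilians",
]

# Bag-of-words representation of the corpus.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(tweets)

# Unsupervised model that groups frequently co-occurring words into topics.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

# Each topic is summarised by its most probable words, which is the raw
# output a researcher would otherwise have to interpret by hand.
words = vectorizer.get_feature_names_out()
for i, component in enumerate(lda.components_):
    top = [words[j] for j in component.argsort()[::-1][:8]]
    print(f"Topic {i}: {', '.join(top)}")
```

The printed word lists are exactly the kind of raw topic representation discussed next.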

Even with state-of-the-art topic modeling, the raw outputs can be confusing, since each topic is represented only by a list of its most characteristic words. To extract insights, each topic still needs to be manually reviewed and given a more descriptive label.

Enter GPT: by combining the results of a topic modeling algorithm with GPT, we can generate summaries of the most relevant topics in the data, written in natural language that is easy to understand. These summaries capture the nuances and complexity of natural language much better than word lists, while being much easier to evaluate and filter. We use OpenAI's Curie GPT model for this task, with excellent results.
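For orientation, labelling a single topic might look roughly like the snippet below, which calls the legacy OpenAI completions endpoint with a Curie-class model. The actual prompt, parameters, and post-processing used by The Panoply are not public, so everything here is an illustrative assumption.

```python
# Sketch: turn one topic's representative words into a readable summary
# using the legacy OpenAI completions API (openai-python < 1.0).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder credential

# Representative words for one topic, e.g. as produced by the model above.
topic_words = ["sanctions", "energy", "exports", "european", "union", "package"]

prompt = (
    "The following words describe one topic found in tweets about an ongoing "
    f"conflict: {', '.join(topic_words)}.\n"
    "Write a one-sentence, neutral summary of what this topic is about."
)

response = openai.Completion.create(
    model="text-curie-001",  # Curie completion model; assumed, not confirmed
    prompt=prompt,
    max_tokens=60,
    temperature=0.3,
)

print(response["choices"][0]["text"].strip())
```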

The results

The first conflict we'll be analyzing on The Panoply is the Russian invasion of Ukraine. Since the first days of the war, we've been collecting as many tweets as possible matching keywords such as "Ukraine" / "Ukrainian" or "Russia" / "Russian". In the first days of the war, that meant more than 1 million unique tweets per day, with the conversation later stabilizing at around 100,000 tweets per day.
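The post does not describe the collection tooling; as one plausible sketch, keyword-based collection could be done with Tweepy against the Twitter API v2 recent-search endpoint, as below. The query string and field choices are assumptions, not the project's actual configuration.

```python
# Sketch: collect recent tweets matching the conflict keywords via Tweepy.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

# Keyword query covering the terms mentioned in the post, retweets excluded.
query = "(Ukraine OR Ukrainian OR Russia OR Russian) -is:retweet lang:en"

# Page through recent matching tweets, 100 per request.
for page in tweepy.Paginator(
    client.search_recent_tweets,
    query=query,
    tweet_fields=["created_at", "lang"],
    max_results=100,
    limit=10,
):
    for tweet in page.data or []:
        print(tweet.created_at, tweet.text[:80])
```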

You can find the first results of our analysis in the Explore section. Please note that topics related to partisan politics, overly hateful topics, and topics generated by probable propaganda sources (from all conflict participants) have been excluded and will be treated separately.