The COVID-19: The First Public Coronavirus Twitter Dataset consists of tweets gathered from Twitter using search criteria built from a set of hashtags that capture variations of the terms coronavirus, covid-19, wuhan and china, epidemic/pandemic and outbreak. It also follows tweets from certain accounts such as the World Health Organisation (WHO) and the Centers for Disease Control and Prevention (CDC). Tweets in ten different languages (European and non-European) are included, starting from the evening of the 21st January 2020, and the dataset continues to be updated to this day. We aim to keep up to date with the latest version of the dataset, but this is not always possible, as processing data and retraining models takes time and sometimes requires altering our analysis. This dataset is an enormous source of information for us as data scientists, and we hope you will find our analysis as insightful as we did when exploring the data.
The purpose of this report is primarily to assess how much Twitter data is available for our national and regional analyses. The most obvious measure of this is the number of tweets available over time. This also seems to be a reasonable proxy for identifying changes in the response of Twitter users to the COVID-19 pandemic. As such, we also attempt to identify key trends in Twitter usage at a global, national and regional level for our nations of interest: the United Kingdom, France and Italy.
During our data exploration, we very quickly discovered that there were strange patterns in the data. Some words appeared very frequently that didn’t seem to fit in with the expected theme of the Coronavirus outbreak. We also discovered bizarre peaks in tweet usage that didn’t correspond to any known Coronavirus outbreak events. After careful investigation of the data, these strange artifacts seemed to come from three different types of Twitter users:
These users are not necessarily malicious, and their tweets could be considered legitimate Twitter content. However, they skewed our data in an undesirable and confusing way. Only a small fraction of users seemed to post duplicate content, so we decided to identify these users and remove them from the dataset. We could have removed only the duplicate tweets, but we chose to exclude the users entirely for two reasons. First, our method for identifying duplicates is not very sophisticated, partly because the dataset is very large and scaling more sophisticated methods is beyond the scope of our COVID-19 efforts. Second, users that posted duplicates fell into one of the three categories we identified most of the time.
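The following is a minimal sketch of what this kind of user-level filtering could look like. It is not the method used for this report: it assumes that a "duplicate" means the same user posting the same text more than once after lower-casing and collapsing whitespace, and the field names `user_id` and `text`, the function names and the `min_repeats` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def users_with_duplicates(tweets, min_repeats=2):
    """Return the ids of users who posted the same normalised text
    at least `min_repeats` times (hypothetical threshold)."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        counts[tweet["user_id"]][normalise(tweet["text"])] += 1
    return {
        user
        for user, texts in counts.items()
        if any(n >= min_repeats for n in texts.values())
    }


def drop_duplicate_posters(tweets, min_repeats=2):
    """Remove every tweet written by a flagged user, not just the duplicates."""
    flagged = users_with_duplicates(tweets, min_repeats)
    return [t for t in tweets if t["user_id"] not in flagged]
```

Dropping the whole user rather than the repeated tweets keeps the rule simple and cheap to apply to a very large dataset, at the cost of discarding some legitimate tweets from the flagged accounts.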
Before starting to analyze the dataset in detail in subsequent reports, we would first like to gain a general sense of the broad structure of our dataset and the availability of data. To begin with, you can see the distribution of tweets over time for the entire dataset that we currently have. All the time series on this page are smoothed via a statistical method called the Moving Average. More specifically, we used a moving average window of size three, since we didn’t want to smooth the peaks too much using a larger window size. As a consequence of this, there is a little lag between a known event that results in a change in tweet usage and how that is shown in the time series.
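To make the smoothing and the resulting lag concrete, here is a small sketch of a moving average with a window of three. It assumes a trailing window (each point is the mean of that day and the two days before it), which is one common choice and is consistent with the lag described above; the report does not state which variant was actually used, and the function name is hypothetical.

```python
def moving_average(values, window=3):
    """Trailing moving average: each output is the mean of the current value
    and up to `window - 1` preceding values (fewer at the very start)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# A single-day spike of 9 tweets on day 2 ...
daily_counts = [0, 0, 9, 0, 0]
# ... is spread over days 2 to 4 after smoothing.
print(moving_average(daily_counts))  # [0.0, 0.0, 3.0, 3.0, 3.0]
```

With a trailing window, a one-day spike is flattened and stretched over the following days, so the visual centre of a peak sits slightly after the event that caused it.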
Here is a list of key global events for the spread of the Coronavirus. Given what we already know about the timeline of the pandemic, it’s natural to try and find corresponding changes in Twitter usage. We can’t be exhaustive in the list of key events we consider as there are many. Instead, we constructed these events based on our experience of working on various kinds of analysis with the dataset.
At Sony CSL Paris, we pride ourselves on having a diverse range of talents, nationalities and personalities. We wanted to bring this diversity to our research on COVID-19 by exploring and comparing various European countries that we are personally familiar with and that were very strongly affected by the outbreak of COVID-19. In particular: the United Kingdom, Italy and France. This type of international comparison on textual data is not typical, as it requires multilingual knowledge and familiarity with different national contexts. However, we took on this challenge and believe that it offers valuable insights into how different nations responded to the COVID-19 pandemic.
Identifying regional information isn’t straightforward from tweets. A simple start is to remove tweets that aren’t written in the main language spoken in the countries we are interested in exploring (the United Kingdom, France and Italy). Below, you can find the time series corresponding to the three languages. Interestingly, on the 22nd of February there were far more tweets in Italian than in English, which is notable given that Twitter is not widely used in Italy and that Italian is spoken far less widely around the world than English.
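As an illustration of this first filtering step, the sketch below counts tweets per day for each of the three languages. It assumes each tweet is a dictionary with a `lang` language code and a `date` string; these field names, the `LANGUAGES` set and the function name are assumptions made for the example, not a description of our pipeline.

```python
from collections import Counter

# Language codes for English, French and Italian (assumed to be ISO 639-1).
LANGUAGES = {"en", "fr", "it"}


def daily_counts_by_language(tweets, languages=LANGUAGES):
    """Return {language: Counter({date: number_of_tweets})},
    ignoring tweets written in any other language."""
    counts = {lang: Counter() for lang in languages}
    for tweet in tweets:
        lang = tweet.get("lang")
        if lang in counts:
            counts[lang][tweet["date"]] += 1
    return counts
```

Each resulting counter can then be turned into a daily time series and smoothed in the same way as the global series.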
Language alone is clearly not sufficient for identifying our desired regions. In particular, English is spoken across many countries and is a very common second language. To overcome this issue, we use Twitter users’ self-reported locations. This greatly reduces the data available, as this is optional information, but we found that a large number of people still report their location. There is, of course, the issue that these locations cannot be verified and some are clearly false - you would be surprised how many people live in Antarctica or on the moon! However, we think it’s reasonable to assume that the vast majority of the locations that refer to ‘real’ places in the world are accurate.
To identify countries and the regions within them, we matched a hand-curated list of cities, counties and regions against the location text. We also used negative examples: terms that cause a tweet to be ignored if they are found in the location text, which rules out locations that should not be included. For example, in the context of the United Kingdom, “New York” should not be considered, yet it could be conflated with York (a city in England). We built the list manually by including the most populated cities in each region as well as the names of the regions themselves. This required some local knowledge, such as knowing that Wales in the United Kingdom is also referred to as Cymru. We added negative examples iteratively, inspecting the resulting matches until we were confident that we were not including inappropriate locations. The final proportions of our dataset can be found below. The final number of tweets for each language is smaller than in the original language dataset, because many users’ locations were unclear and not everyone reports their location. However, there was sufficient data to work with.
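The matching described above can be sketched roughly as follows. The term lists here are tiny illustrative stand-ins rather than the curated lists used for the report, and the names `UK_REGIONS`, `UK_NEGATIVES` and `match_region` are hypothetical. The sketch also assumes whole-word, case-insensitive matching, which is one reasonable choice among several.

```python
import re

# Illustrative positive examples per region (not the real curated list).
UK_REGIONS = {
    "England": ["england", "london", "york"],
    "Wales": ["wales", "cymru", "cardiff"],
}

# Illustrative negative examples: if any is present, the location is ignored.
UK_NEGATIVES = ["new york", "new south wales"]


def _contains(term, text):
    """True if `term` occurs in `text` as a whole word or phrase."""
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None


def match_region(location, regions=UK_REGIONS, negatives=UK_NEGATIVES):
    """Return the first region whose terms appear in the free-text location,
    or None if nothing matches or a negative example is present."""
    text = location.lower()
    if any(_contains(term, text) for term in negatives):
        return None
    for region, terms in regions.items():
        if any(_contains(term, text) for term in terms):
            return region
    return None


print(match_region("York, UK"))         # England
print(match_region("New York, NY"))     # None
print(match_region("Caerdydd, Cymru"))  # Wales
```

Checking the negative examples first is what keeps “New York” from being counted as York, and growing that list iteratively while inspecting the matches mirrors the manual process described above.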
Though the national level seems like a reasonable level of granularity, we decided to explore an even finer spatial resolution: the regional level. We separated the tweets coming from different regions for each of the three countries of interest. We did this because the Coronavirus struck harder in some regions than others, and we felt this disparity might be reflected in the way Twitter users tweet. Here is a general overview of tweet usage in the different regions of each country. These curves are normalised so that we can compare how the overall distribution of tweets varies between regions. As before, the time series are smoothed with a moving average with a window of three, since a larger window would smooth the peaks too much. This is why the peaks do not line up exactly with major events: most of the time they appear three days late on the plots. We go into more detail on each country in our other reports, but this overview already gives us some insight into how countries expressed the COVID-19 crisis differently.
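The report does not spell out the normalisation, so the sketch below shows one plausible version as an assumption: each region's daily counts are divided by that region's total, so every curve sums to one and regions with very different tweet volumes can be compared on the same axis. The function name is hypothetical.

```python
def normalise_series(counts):
    """Scale a series of daily counts so that it sums to 1.
    A region with no tweets at all yields a flat line of zeros."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [count / total for count in counts]


# Two regions with the same shape but very different volumes
# end up with identical normalised curves.
print(normalise_series([1, 2, 1]))        # [0.25, 0.5, 0.25]
print(normalise_series([100, 200, 100]))  # [0.25, 0.5, 0.25]
```

Under this assumption the plots show when each region tweeted rather than how much, which is the comparison the regional overview is after.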
Here are key events in the United Kingdom for the spread of the Coronavirus.
In the United Kingdom time series we have two major peaks. The first is around the 31st of January, when the first confirmed cases in England were reported, affecting two Chinese tourists in York. It was at this date that authorities raised the risk level for the United Kingdom from low to moderate. The regions most concerned about it are Northern Ireland and Wales, which have a big peak between the 29th of January and the 3rd of February. Surprisingly, this is not the case for England, even though the first cases were detected in York.
Over the period from the 27th of February to the 6th of March, we have quite important peaks for all the regions, following the first confirmed cases in Wales, Northern Ireland and Scotland. All the regions reached their peaks around the 3rd of March, following Boris Johnson’s announcement that the Coronavirus was a “four-level incident”. It’s quite interesting that between the 6th and 10th of March, Northern Ireland has proportionally more tweets than the other regions. This could be related to the cancellation of Belfast’s St Patrick’s Day parade.
Surprisingly enough, the declaration of the mandatory lockdown on the 23rd of March by Boris Johnson didn’t correspond to a rise in tweets. Maybe this is because the nation was already in a voluntary lockdown respected by the majority of the population.
Around the 5th of April there are minor peaks in all regions, possibly following the announcement of Boris Johnson’s admission to hospital, and later to the intensive care unit, due to a Coronavirus infection.
On the 10th of May, other minor peaks followed Boris Johnson’s announcement, which was perceived as “divisive, confusing and vague” according to the Guardian.
Here are key events in France for the spread of the Coronavirus.
In the France time series, minor peaks are present around the 8th of February, when cases started to rise and all travel to China was forbidden. The same happens around the 15th of February, when the first death in Europe occurred in Paris. We can also notice an increase in tweet activity about COVID-19 between February 24th and March 22nd, which spans the period in which initial cases started rising and all major lockdown measures were adopted. The peak around March 5th could have been due to the Mulhouse COVID-19 cases coming to light.
As with the time series of the United Kingdom, it’s interesting that the lockdown itself does not have a great impact on Twitter usage. In fact, we can observe just minor peaks around the 16th of March, when French President Emmanuel Macron announced mandatory home confinement. Perhaps this is because France, like the United Kingdom, had already seen the lockdown coming, since Italy was far ahead in terms of cases and deaths. By that time it was a fairly likely measure. The announcement of a possible easing of the lockdown also creates minor peaks around the 13th of April, when Emmanuel Macron announced that the containment could be lifted on May 11, at least partially. Once again, tweets about the Coronavirus do not increase sharply, possibly because the deconfinement was also widely expected, since other countries, such as Italy and Austria, had already announced theirs.
Overall, we see that tweet trends are fairly similar across France. This is different from the United Kingdom, where Wales and England had very similar patterns in usage, while Scotland and Northern Ireland deviated from them in some periods.
Here are key events in Italy for the spread of the Coronavirus.
Unlike France, Italy has two very prominent peaks, and all regions of Italy share very similar patterns. These two peaks correspond to the initial lockdown in parts of northern Italy on February 21st, followed by the nationwide lockdown on March 8th. Unlike in the other countries, Twitter usage seems to be dominated by the lockdown announcements. The start of the end of the lockdown also affected Twitter usage: we can see minor peaks around the 4th of May, the start of the so-called phase two.
*** Disclaimer: the Italian regions ‘Valle d’Aosta’ and ‘Molise’ are not shown in the graphs because the dataset did not contain tweets from these regions for every date, resulting in a curve with value 0 most of the time.
This report only served as a high-level overview of the structure of the data, and the availability of data on the countries and regions that we wish to explore. However, even with this very coarse exploration of the data we uncovered some interesting insights.
It is remarkable how clearly Twitter usage in Italy was focused on the lockdown, and how this wasn’t such a feature in the United Kingdom or France. This may be because Italy was the front-runner in the disease outbreak in Europe, so the United Kingdom and France had time to come to expect similar actions in their own countries. In particular, in France, though there was a slight increase in Twitter usage during March, there weren’t any particularly high peaks.
We have seen that it is not always easy to match known key dates in the progression of the Coronavirus outbreak to changes in Twitter behavior, at least at the level of Twitter usage. There are, however, some clear correspondences some of the time, so this information is still useful. Our intuitions about how Twitter usage responds to key milestones in the outbreak are only a reasonable starting point.