Dataset Overview

Michael Anslow, Martina Galletti
SONY Computer Science Lab - Paris

The COVID-19: The First Public Coronavirus Twitter Dataset consists of tweets gathered from Twitter using a set of hashtags and keywords that are variations of the terms coronavirus, covid-19, wuhan and china, epidemic/pandemic and outbreak. It also follows tweets from certain accounts such as the World Health Organisation (WHO) and the Centers for Disease Control and Prevention (CDC). Tweets in ten different languages (European and non-European) are featured, starting from the evening of the 21st January 2020, and the dataset continues to be updated to this day. We aim to keep up to date with the latest version of the dataset, but this is not always possible: processing data and retraining models takes time and sometimes requires altering our analysis. This dataset is an enormous source of information for us as data scientists, and we hope you will find our analysis as insightful as we found exploring the data.

Purpose of Analysis

The purpose of this report is primarily to explore the availability of Twitter data for our national and regional data exploration. The most obvious measure of this is the number of tweets available over time, which also seems to be a reasonable proxy for identifying changes in the response of Twitter users to the COVID-19 pandemic. As such, we also attempt to identify key trends in Twitter usage at a global, national and regional level for our nations of interest: the United Kingdom, France and Italy.

Bots, activists and repeated content

During our data exploration, we very quickly discovered strange patterns in the data. Some words appeared very frequently that didn’t seem to fit in with the expected theme of the Coronavirus outbreak. We also discovered bizarre peaks in tweet usage that didn’t correspond to any known Coronavirus outbreak events. After careful investigation of the data, these strange artifacts seemed to come from three different types of Twitter users:

News aggregation bots
Advertising
Activists

These users are not necessarily malicious and, as such, their tweets could be considered legitimate Twitter content. However, they skewed our data in an undesirable and confusing way. Only a small fraction of users seemed to post duplicate content, so we decided to identify these users and remove them from the dataset. We could have removed just the duplicate tweets, but we decided to exclude these users entirely for two reasons. First, we did not use a very sophisticated method for identifying duplicates, partly because the dataset is very large and scaling sophisticated methods is beyond the scope of our COVID-19 efforts. Second, users that posted duplicates fell into one of the three categories above most of the time.
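The user-level filtering described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used for this report: tweets are assumed to be dictionaries with hypothetical `user_id` and `text` fields, a "duplicate" is assumed to mean identical text after light normalisation, and the `max_duplicates` threshold is invented for the example.

```python
import re
from collections import Counter, defaultdict


def normalise(text):
    """Lower-case, drop URLs and collapse whitespace so near-identical tweets compare equal."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return " ".join(text.split())


def users_posting_duplicates(tweets, max_duplicates=1):
    """Return ids of users who posted the same normalised text more than max_duplicates times."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        counts[tweet["user_id"]][normalise(tweet["text"])] += 1
    return {
        user for user, texts in counts.items()
        if any(n > max_duplicates for n in texts.values())
    }


def remove_users(tweets, excluded):
    """Drop every tweet written by an excluded user, not only the duplicated ones."""
    return [t for t in tweets if t["user_id"] not in excluded]


# Made-up example data (field names are assumptions).
tweets = [
    {"user_id": "a", "text": "Buy masks now http://example.com/1"},
    {"user_id": "a", "text": "Buy masks now http://example.com/2"},
    {"user_id": "a", "text": "Something else entirely"},
    {"user_id": "b", "text": "Stay safe everyone"},
]
excluded = users_posting_duplicates(tweets)
cleaned = remove_users(tweets, excluded)
```

Excluding the whole user, rather than only the repeated tweets, means that a crude duplicate test is enough: it only has to flag the account once.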

Global tweets

Before starting to analyze the dataset in detail in subsequent reports, we would first like to gain a general sense of the broad structure of our dataset and the availability of data. To begin with, you can see the distribution of tweets over time for the entire dataset that we currently have. All the time series on this page are smoothed via a statistical method called the Moving Average. More specifically, we used a moving average window of size three, since we didn’t want to smooth the peaks too much using a larger window size. As a consequence of this, there is a little lag between a known event that results in a change in tweet usage and how that is shown in the time series.
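As a rough illustration of this smoothing, here is a minimal sketch of a moving average with a window of three. It assumes a trailing window (each day is averaged with the two days before it); the report does not state whether the window is trailing or centred, and the counts are made up.

```python
def moving_average(values, window=3):
    """Trailing moving average: mean of each point and the previous window-1 points."""
    smoothed = []
    for i in range(len(values)):
        # Near the start of the series the window is shorter than `window`.
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


daily_counts = [10, 10, 10, 40, 10, 10, 10]  # made-up counts with a one-day spike
smoothed = moving_average(daily_counts)
# smoothed == [10.0, 10.0, 10.0, 20.0, 20.0, 20.0, 10.0]
```

With a trailing window, the one-day spike is flattened and stays visible for the two following days, which is the kind of lag mentioned above.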

Key Events

Here is a list of key global events for the spread of the Coronavirus. Given what we already know about the timeline of the pandemic, it’s natural to try and find corresponding changes in Twitter usage. We can’t be exhaustive in the list of key events we consider as there are many. Instead, we constructed these events based on our experience of working on various kinds of analysis with the dataset.

Date	Event
31st December	Wuhan Municipal Health Commission in Hubei Province, China, reports a cluster of pneumonia cases.
1st January	Chinese doctors warn of a new unknown virus they call 2019-nCoV.
4th January	WHO reports on social media the existence of a cluster of pneumonia cases - with no deaths - in Wuhan, Hubei province.
11th January	First reported death in China; China publicly discloses the genetic sequence of the COVID-19 virus.
13th January	Authorities confirm one case of COVID-19 in Thailand, the first case reported outside China
23rd January	Wuhan Lockdown.
30th January	The WHO declares a global emergency.
2nd February	First reported death outside of China.
5th February	The Diamond Princess cruise ship quarantined in Yokohama, Japan.
11th February	Coronavirus disease named COVID-19, an acronym of COronaVIrus Disease 2019.
15th February	First COVID-19 death in Europe.
21st February	Initial lockdown of eleven municipalities of the province of Lodi in Lombardy.
29th February	First death in the United States.
3rd March	Sharp increase in number of cases in Spain.
9th March	Italy Nationwide Lockdown.
11th March	WHO declares the outbreak a pandemic.
12th March	United States announces that it will ban travellers who have visited the Schengen area from its territory.
13th March	WHO says that “Europe is now at the epicentre of the pandemic”.
17th March	The Schengen area closes its borders to foreigners for one month
27th March	The United States, whose epicentre is New York with more than 500 deaths, exceeds 100,000 reported cases.
3rd April	The number of infected people in the world is over one million.
8th April	The Chinese authorities lift the closure of the city of Wuhan, the focus of the pandemic, after two months of confinement.
11th April	The United States becomes the country with the highest death toll from the pandemic, with more than 20,000 deaths recorded and more than 500,000 documented cases. The day before, the country was the first in the world to exceed 2,000 deaths in 24 hours.
14th April	Trump announces he is halting funding to the WHO while a review is conducted, saying the review will cover the WHO’s “role in severely mismanaging and covering up the spread of coronavirus”.
20th April	Anti-lockdown demonstrations follow Trump’s calls to “liberate” states governed by Democrats.
1st May	The US Food and Drug Administration issues an emergency-use authorization for drug remdesivir in hospitalized patients with severe Covid-19. FDA Commissioner Stephen Hahn says remdesivir is the first authorized therapy drug for
13th May	Dr. Mike Ryan, executive director of the WHO’s health emergencies program, warns that the coronavirus may never go away.
23rd May	China reports no new symptomatic coronavirus cases.
27th May	Data collected by Johns Hopkins University reports that the coronavirus has killed more than 100,000 people across the US.
2nd June	Wuhan’s Health Commission announces that it has completed coronavirus tests on 9.9 million of its residents with no new confirmed cases found.

Analysis

The first peak on the 31st of January seems to correspond to the declaration of the Coronavirus outbreak as a global emergency.
The peak on March the 1st could correspond to the first death in the United States, as US Twitter users make up the lion’s share of the dataset, but without looking deeper into the data it is not possible to know.
The same goes for the peaks on March the 5th and March the 11th, which may correspond respectively to the sharp increase in cases in Spain and to the lockdown in Italy, although it’s unclear why these would be globally significant events.
The April 3rd peak could correspond to the announcement that the number of infected people in the world was over one million.
In general, this shows that simply knowing something about the timeline of events doesn’t necessarily translate well into explaining changes in Twitter behaviour. The global Twitter trends are fundamentally noisy, given that they are a mixture of all Twitter users. As such, it’s hard to discern the reason for particular patterns without taking a closer look at the data.
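The kind of comparison made in this analysis, lining up peaks with events that happened shortly before them, can be sketched in a few lines. This is only a hypothetical illustration: the daily counts are made up, a "peak" is assumed to be a day higher than both of its neighbours, and `max_lag_days` is an assumed tolerance rather than a value used in the report.

```python
from datetime import date, timedelta


def find_peaks(series):
    """Return the dates whose count is strictly higher than both neighbouring days."""
    days = sorted(series)
    return [
        d for prev, d, nxt in zip(days, days[1:], days[2:])
        if series[d] > series[prev] and series[d] > series[nxt]
    ]


def candidate_events(peak, events, max_lag_days=3):
    """List events that happened on the peak day or up to max_lag_days before it."""
    return [
        name for day, name in events.items()
        if timedelta(0) <= peak - day <= timedelta(days=max_lag_days)
    ]


events = {
    date(2020, 1, 30): "The WHO declares a global emergency",
    date(2020, 2, 29): "First death in the United States",
}
series = {  # made-up daily tweet counts
    date(2020, 1, 29): 100,
    date(2020, 1, 30): 180,
    date(2020, 1, 31): 250,
    date(2020, 2, 1): 160,
}
peaks = find_peaks(series)
matches = {peak: candidate_events(peak, events) for peak in peaks}
```

A match found this way is only a candidate explanation; as noted above, confirming it requires looking at the content of the tweets themselves.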

Languages and National information

At Sony CSL Paris, we pride ourselves on having a diverse range of talents, nationalities and personalities. We wanted to bring this diversity to our research on COVID-19 by exploring and comparing various European countries that we are personally familiar with and that were very strongly affected by the outbreak: the United Kingdom, Italy and France. This type of international comparison on textual data is not typical, as it requires multilingual knowledge and familiarity with different national contexts. However, we took on this challenge and believe that it offers valuable insights into how different nations responded to the COVID-19 pandemic.

Identifying regional information from tweets isn’t straightforward. A simple first step is to remove tweets that aren’t written in the main language of the countries we are interested in exploring (the United Kingdom, France and Italy). Below, you can find the time series corresponding to the three languages. Interestingly, on February the 22nd there were far more tweets in Italian than in English, which is striking considering that Twitter is not widely used in Italy and that Italian is far less widely spoken than English around the world.

This classification is clearly not appropriate for identifying our desired regions. In particular, English is spoken across many countries and is a very common second language. To overcome this issue, we use Twitter users’ self-identified locations. This greatly reduces the data available, as this is optional information, but we found that a large number of people still self-report their location. There is, of course, the issue that these locations cannot be verified and some are clearly false - you would be surprised how many people live in Antarctica or on the moon! However, we think it’s reasonable to assume that the vast majority of the locations that refer to ‘real’ places in the world are accurate.

Regional information

To identify countries and the regions within them, we used a hand-curated list of cities, counties and regions and matched them against the location text. We also used negative examples, which cause a tweet to be ignored if they are found in the location text, to rule out locations that should not be included. For example, in the context of the United Kingdom, “New York” should not be considered, yet it might be conflated with York (a city in England). We built an effective list manually by including the most populated cities in each region as well as the names of the regions themselves. This required some local context, such as knowing that Wales in the United Kingdom is also referred to as Cymru. For the negative examples, we iteratively added new entries and observed the resulting matches until we were sure that we were not including inappropriate locations. The final proportions of our dataset can be found below. The final number of tweets for each language is lower than in the original language dataset, because users’ locations were not always clear and not everyone reports their location. However, there was sufficient data to work with.
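The matching with positive and negative examples can be illustrated with a short sketch. The term lists below are hypothetical and heavily shortened (the real lists were curated by hand and are not reproduced here), and whole-word matching on lower-cased text is an assumption about how the comparison might be done.

```python
import re

# Hypothetical, heavily shortened lists for the United Kingdom.
UK_POSITIVE = ["united kingdom", "england", "wales", "cymru", "london", "york"]
UK_NEGATIVE = ["new york", "new south wales"]


def contains_term(text, term):
    """True if term occurs in text as a whole word or phrase."""
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None


def matches_country(location, positive, negative):
    """Accept a self-reported location only if a positive term matches and no negative term does."""
    text = location.lower()
    if any(contains_term(text, term) for term in negative):
        return False
    return any(contains_term(text, term) for term in positive)


in_uk = matches_country("York, England", UK_POSITIVE, UK_NEGATIVE)   # accepted
not_uk = matches_country("New York, NY", UK_POSITIVE, UK_NEGATIVE)   # rejected by a negative term
```

Checking the negative terms first mirrors the iterative process described above: whenever an inappropriate location slips through, a new negative entry is added and the matches are inspected again.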

Though the national level seems like a reasonable level of granularity, we decided to explore an even finer-grained level of spatial resolution: the regional level. We distinguished the tweets coming from different regions for each of the three countries of interest. We decided to do this because the Coronavirus struck harder in some regions than others, and we felt this disparity might be reflected in the way Twitter users tweet. Here is a general overview of the tweet usage of the different regions for each country. These curves are normalised so that we can compare how the overall distribution of tweets varies between regions. Moreover, the time series are smoothed with a moving average. More specifically, we used a moving average window of size three, since we didn’t want to smooth the peaks too much using a larger window size. This is why the peaks on the plots do not coincide exactly with major events but appear, most of the time, three days late. We go into more detail on each country in our other reports, but this overview already gives us some insight into how countries reacted to the COVID-19 crisis differently.
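One simple way to normalise such curves is sketched below. It assumes that normalising means dividing each region's daily counts by that region's total, so that every curve sums to one; the report does not specify the exact normalisation, and the counts are made up.

```python
def normalise_counts(daily_counts):
    """Scale a region's daily counts so that they sum to 1."""
    total = sum(daily_counts)
    if total == 0:
        # A region without any tweets stays flat at zero.
        return [0.0 for _ in daily_counts]
    return [count / total for count in daily_counts]


regions = {  # made-up daily tweet counts
    "England": [800, 1600, 1600],
    "Wales": [20, 40, 40],
}
normalised = {name: normalise_counts(counts) for name, counts in regions.items()}
```

In this toy example England has forty times more tweets than Wales, yet both normalised curves are identical, which is what makes the shapes of regions with very different populations comparable.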

The United Kingdom

Key Events

Here are key events in the United Kingdom for the spread of the Coronavirus. Given what we already know about the timeline of the pandemic, it’s natural to try and find corresponding changes in Twitter usage. We can’t be exhaustive in the list of key events we consider as there are many. Instead, we constructed these events based on our experience of working on various kinds of analysis with the dataset.

Date	Event
30th January	Risk raised from low to moderate.
31st January	First confirmed cases in England.
4th February	UK tells all Britons to leave China ‘if they can’
12th February	First diagnosis in London.
28th February	First confirmed case in Wales.
29th February	First UK death and first confirmed case in Northern Ireland.
1st March	First confirmed case in Scotland.
3rd March	Boris Johnson announces the Coronavirus is a “level 4 incident”. This is one level below level 5, which corresponds to a collapse of the health care system.
6th March	St Patrick’s Day suspended.
16th March	Boris Johnson advises against non-essential work and social gatherings.
19th March	Queen Elizabeth II retires to Windsor Castle “as a precautionary measure”.
20th March	All leisure establishments close such as pubs and restaurants.
23rd March	UK goes into non-voluntary lockdown enforceable by the police.
27th March	Prime Minister Boris Johnson announced that he had tested positive for the coronavirus.
2nd April	Government announced the aim of conducting 100,000 tests a day by the end of April.
5th April	Prime Minister Boris Johnson is admitted to St Thomas’ Hospital.
6th April	Prime Minister Boris Johnson is moved to intensive care in the same hospital.
8th April	The First Minister of Wales confirms that the Welsh Government will extend the lockdown in Wales beyond the initial three-week period.
16th April	Lockdown extended for at least another three weeks.
22nd April	The Financial Times reported that the actual number of deaths due to the epidemic is more than 41,000, more than twice the official number.
28th April	Kawasaki Syndrome was reported in children.
29th April	The number of people who have died with coronavirus in the UK passed 26,000, as official figures include deaths in the community, such as in care homes, for the first time.
30th April	Boris Johnson said the country was “past the peak of this disease”.
5th May	The UK death toll became the highest in Europe and second highest in the world.
7th May	Lockdown in Wales extended with some slight relaxations.
10th May	Prime Minister Johnson asked those who could not work from home to go to work. In his address he changed the ‘Stay at Home’ slogan to ‘Stay Alert’.
28th May	Easing of the lockdown in Scotland.

Analysis

In the United Kingdom time series we have two major periods of peaks. The first is around the 31st of January, when the first cases in England were confirmed: two Chinese tourists in York. Around this date, the authorities also raised the risk level for the United Kingdom from low to moderate. The regions most concerned appear to be Northern Ireland and Wales, which show a big peak between the 29th of January and the 3rd of February. Surprisingly, this is not the case for England, even though the first cases were detected in York.

Over the period from the 27th of February to the 6th of March we see quite important peaks for all the regions, following the first confirmed cases in Wales, Northern Ireland and Scotland. All the regions reached their peaks around the 3rd of March, following Boris Johnson’s announcement that the Coronavirus was a “level 4 incident”. It’s quite interesting that Northern Ireland has proportionally more tweets between roughly the 6th and the 10th of March. This could be related to the cancellation of Belfast’s St Patrick’s Day parade.

Surprisingly, the declaration of the non-voluntary lockdown by Boris Johnson on the 23rd of March didn’t correspond to a rise in tweets. Maybe this is because the nation was already in a voluntary lockdown that was respected by the majority of the population.

Around the 5th of April there are minor peaks in all regions, possibly following the announcement of Boris Johnson’s admission to hospital and, later, to the intensive care unit due to a Coronavirus infection.

On the 10th of May, other minor peaks followed Boris Johnson’s address, which was perceived as “divisive, confusing and vague” according to the Guardian.

France

Key Events

Here are key events in France for the spread of the Coronavirus. Given what we already know about the timeline of the pandemic, it’s natural to try and find corresponding changes in Twitter usage. We can’t be exhaustive in the list of key events we consider as there are many. Instead, we constructed these events based on our experience of working on various kinds of analysis with the dataset.

Date	Event
24th January	First reported case. Earlier cases were later found going back to the 27th of December and even one at the end of November 2019.
9th February	Five new cases in France, travel to China not recommended “unless imperative reason”.
15th February	First death in Europe occurs in Paris.
17-24th February	Large Christian Open Door Church event in Mulhouse where half of participants are believed to have contracted the virus.
18th February	Death toll approaches 1,900 in China; increase in cases detected on the “Diamond Princess”.
29th February	Stage 2 is triggered when 100 people have been infected with the virus and two have died. Lockdown in the Oise region: gatherings are forbidden, residents are asked to limit their movements and schools in the affected communes are closed.
1st March	First cases in the overseas (outre-mer) departments.
9th March	Gatherings of more than 1,000 people are now banned throughout France.
11th March	The Minister of Health announces that from now on all visits to nursing homes are banned.
12th March	French President Emmanuel Macron’s first televised address to the nation. Closure of nurseries, schools, colleges, high schools and universities. Working from home encouraged.
13th March	Prime Minister Édouard Philippe ordered the closure of all non-essential public places.
15th March	The first round of municipal elections took place.
16th March	French President Emmanuel Macron announced mandatory home confinement while the Paris Stock Exchange collapses and experiences its worst sessions, surpassing the subprime crisis of 2008.
17th March	Ban on all travel except related to professional activity, to buy food and to exercise.
23rd March	Prime Minister Édouard Philippe announced the closure of the open-air markets, unless exemptions were granted by the prefects. Sporting outings or “to take your children for a walk” are now limited to a radius of 1 km and a maximum of one hour per day.
24th March	A curfew was decreed over the entire territory of Mayotte, and then over French Guiana, French Polynesia, Guadeloupe and Martinique.
27th March	Prime Minister Édouard Philippe extends the national containment until at least April 15.
7th April	Ban on individual sports activities in the capital between 10 a.m. and 7 p.m.
13th April	Emmanuel Macron announced that the containment could be lifted on May 11, at least partially. The precise modalities would be announced at a later date by monitoring epidemiological indicators.
28th April	Prime Minister presented to the National Assembly the conditions for the end of the lockdown, including a staggered start to the school year, the non-resumption of face-to-face classes in higher education, the continued closure of bars, cafés and restaurants and a ban on gatherings.
11th May	The emblematic measures of the end of the lockdown, effective on 11 May, are the abolition of the exit permit, the obligation to wear a mask on public transport, the reopening of shops (with the exception of restaurants), the very gradual return to school and a restriction on travel more than 100 kilometres from one’s home.
28th May	The Prime Minister announces Phase 2 of deconfinement beginning June 2.

Analysis

In the France time series, minor peaks are present around the 8th of February, when cases started to rise and travel to China was advised against. The same happens around the 15th of February, when the first death in Europe occurred in Paris. We can also notice an increase in tweet activity about COVID-19 between February 24th and March 22nd, which spans the period in which initial cases started rising and all major lockdown measures were adopted. The peak around March 5th could have been due to the Mulhouse COVID-19 cases coming to light.

As with the time series of the United Kingdom, it’s interesting that the lockdown itself does not have a great impact on Twitter usage. In fact, we can observe just minor peaks around the 16th of March, when French President Emmanuel Macron announced mandatory home confinement. Perhaps this is because, like the United Kingdom, France had already seen the lockdown coming: Italy was far ahead of it in terms of cases and deaths, so by that time a lockdown was a fairly likely measure. The announcement of a possible easing of the lockdown also creates minor peaks around the 13th of April, exactly when Emmanuel Macron announced that the containment could be lifted on May 11, at least partially. Once again, tweets about the Coronavirus do not increase hugely. The reason could again be that the deconfinement was largely expected, since other countries, such as Italy and Austria, had already announced theirs.

Overall we see that tweet trends are fairly similar across France. This is different from the United Kingdom where, although Wales and England had very similar usage patterns, Scotland and Northern Ireland deviated from them in some periods.

Italy

Key Events

Here are key events in Italy for the spread of the Coronavirus.

Date	Event
30th January	Two Chinese tourists are the first confirmed cases in Rome, Italy.
31st January	All the flights to and from China were suspended by the Italian Government.
11th February	Coronavirus disease named COVID-19.
14th February	Beginning of the Lombardy outbreak, which came to light when a 38-year-old Italian tested positive in Codogno.
19th February	“Game Zero”: the match between Atalanta and Valencia at the San Siro in Milan had an attendance of over 40,000 people.
21st February	Initial lockdown of eleven municipalities of the province of Lodi in Lombardy and 16 cases detected.
22nd February	60 cases detected in Lombardy and suspension of all sporting events in the regions of Lombardy and Veneto.
6th March	President of the Republic Sergio Mattarella speaks to the Nation.
8th March	Expansion of lockdown to various northern provinces and the whole region of Lombardy.
10th March	Nationwide Lockdown.
11th March	Tightening of the lockdown. All non-essential commercial and retail businesses closed.
19th March	The army was deployed in Bergamo to transport the large number of dead residents.
20th March	Ban of all outdoor exercise in Lombardy.
21st March	Further restrictions within the nationwide lockdown, by halting all non-essential production, industries and businesses.
1st April	Lockdown extended to 13th April.
10th April	Lockdown extended to 3rd of May.
26th April	Phase two announced starting 4th May.
4th May	Starting of the so-called “phase two”.
18th May	Most businesses could reopen, and free movement was granted to all citizens within their Region; movement across Regions was still banned for non-essential motives.
25th May	Swimming pools and gyms could also reopen, and on 15 June theatres and cinemas.

Analysis

Unlike France, Italy has two very prominent peaks, and all regions of Italy share very similar patterns. These two peaks correspond to the initial lockdown in parts of northern Italy on February 21st, followed by the expansion of the lockdown from March 8th, which became nationwide shortly afterwards. Unlike in the other countries, Twitter usage seems to be dominated by the lockdown announcements. The beginning of the end of the lockdown also left its mark on Twitter usage: we can see minor peaks around the 4th of May, the start of the so-called phase two.

*** disclaimer: Italian Regions ‘Valle d’Aosta’ and ‘Molise’ are not present in the graphs because tweets coming from these regions were not present for each date in the dataset, resulting in a curve with value 0 most of the time.

Summary

This report only served as a high-level overview of the structure of the data, and the availability of data on the countries and regions that we wish to explore. However, even with this very coarse exploration of the data we uncovered some interesting insights.

It is remarkable how clearly Twitter usage in Italy was focused on the lockdown, and how this was not such a feature in the United Kingdom or France. This may be because Italy was the front-runner in the disease outbreak in Europe, so the United Kingdom and France had time to come to expect similar actions in their own countries. In particular, in France, though there was a slight increase in Twitter usage during March, there weren’t any particularly high peaks.

We have seen that it is not always easy to match known key dates in the progression of the Coronavirus outbreak to changes in Twitter behaviour, at least at the level of Twitter usage. There are, however, some clear correspondences some of the time, so this information is still useful. Our intuitions about how Twitter usage responds to key milestones in the outbreak are nevertheless only a reasonable starting point.