One of the simplest yet most informative approaches to automatically uncovering semantic and syntactic relationships in data is via what are called word embeddings. These are vector representations of the vocabulary (unique words) of a body of text. The idea is to make words comparable by placing them in a large space (a high-dimensional vector space, to be precise) where we can measure how close they are. This closeness captures the similarity between words. For example, consider the words ‘Covid-19’ and ‘Coronavirus’: though the first is the disease and the second is the virus, they are often used interchangeably in different contexts, so we would expect their word embedding vectors to be similar.
Without getting into too many details, we used a Word2Vec model, in which a neural network learns word vectors by predicting the neighbouring words of a word from that word’s vector representation. The idea is “to know a word by the company it keeps”, as Firth said in 1957 concerning the context-dependent nature of meaning. In this case, the neural network learns a representation of a word that is useful for predicting its neighbourhood, and words with similar neighbourhoods end up having similar vectors.
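To make the “neighbouring words” idea concrete, here is a minimal sketch of how training examples for this kind of model can be built: each word is paired with the words that appear within a small window around it. The function name `skipgram_pairs`, the window size and the toy tweet are assumptions for illustration only; the actual window and preprocessing we used are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for each token and its neighbours.

    `window` is the number of words considered on each side of the centre
    word. The default of 2 is an illustrative assumption.
    """
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbour
                pairs.append((centre, tokens[j]))
    return pairs


# Toy tokenised tweet (hypothetical, not taken from the dataset).
tweet = ["stay", "home", "save", "lives"]
pairs = skipgram_pairs(tweet, window=1)
print(pairs)
```

A network trained on such pairs is asked to predict the second element from the vector of the first, which is why words that share many neighbours are pushed towards similar vectors.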
To explore the semantic content of the dataset, it helps to visualise the word embeddings by projecting the high-dimensional word vectors into lower dimensions. In the following graph, the embedded representation of the vocabulary for the English dataset is shown in two dimensions. We used t-SNE as our dimensionality reduction algorithm to represent word vectors in two dimensions for this plot and subsequent plots. We’ve biased t-SNE to give local context more importance than global context. This means that points close to each other are similar word vectors and hence similar words, but there might not be a meaningful relationship between points that are far away from one another.
Feel free to explore the embeddings below to uncover similar words in the different datasets. Each point on the graph is the embedding for one word in the vocabulary. If you zoom into the plot by selecting a region and hover over a point, you will see the word it represents. Points are coloured by their frequency rank, from highest rank (yellow) to lowest rank (blue), where the most frequent word is rank 1 and the least frequent is around rank 12,000. When completely zoomed out you can see rough clusters of words that are likely to be groups of similar words. There is also a semi-ring on the left-hand side of the figure that resembles an asteroid belt (to my mind at least). These are likely to be word vectors without close neighbours that t-SNE had to place somewhere.
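For readers who want to reproduce the colouring, a frequency rank of this kind can be computed along the following lines. This is only a sketch under the assumption that ranks come from raw token counts; the helper name `frequency_ranks` and the toy tokens are hypothetical.

```python
from collections import Counter


def frequency_ranks(tokens):
    """Map each word to its frequency rank, with rank 1 the most frequent."""
    counts = Counter(tokens)
    # most_common() orders words by descending count.
    ordered = counts.most_common()
    return {word: rank for rank, (word, _count) in enumerate(ordered, start=1)}


# Toy token stream (hypothetical, not taken from the dataset).
tokens = "lockdown test lockdown news lockdown test".split()
ranks = frequency_ranks(tokens)
print(ranks)
```

The resulting rank can then be passed to whatever plotting library is in use as the colour value for each point.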
As we said at the beginning, word embeddings are a powerful technique for exploring (semantic and syntactic) synonyms and similar words in a dataset. We decided to investigate the thirty closest words for topical words or hash-tags in English, French and Italian.
The words that we investigated are as follows:
We chose these because our various exploratory analyses had shown them to be important terms across the countries.
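The “thirty closest words” query described above can be sketched as a nearest-neighbour search over the word vectors. The version below assumes cosine similarity as the measure of closeness, which is a common choice for word embeddings but is an assumption here; the function names and the two-dimensional toy vectors are made up for illustration and are not values from our models.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_words(query, vectors, k=30):
    """Return the k words whose vectors are most similar to the query word."""
    q = vectors[query]
    scored = (
        (cosine(q, vec), word)
        for word, vec in vectors.items()
        if word != query  # skip the query word itself
    )
    # nlargest returns the pairs sorted from most to least similar.
    return [word for _score, word in heapq.nlargest(k, scored)]


# Toy vectors (hypothetical); real embeddings have far more dimensions.
toy_vectors = {
    "lockdown": [1.0, 0.0],
    "quarantine": [0.9, 0.1],
    "isolation": [0.7, 0.3],
    "holiday": [0.0, 1.0],
}
neighbours = closest_words("lockdown", toy_vectors, k=2)
print(neighbours)
```

With real embeddings, the same query with `k=30` would give the kind of neighbour lists discussed in the rest of this section.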
We can see that the UK and Italy discuss the duration of the lockdown:
The UK and Italy also relate the lockdown to social distancing and self-isolation:
In the UK embeddings, ‘holidays’ is present, most likely in relation to the Easter holidays and the fear that people would break the lockdown during this period. ‘vacance’ (holiday) also appears in the France embeddings, but alongside ‘teletravail’ (work from home), so it is used more in the sense of time off work.
In France, the focus is on the lockdown itself and on its duration, with words such as ‘prolonger’ (to prolong, most likely the lockdown), ‘confinement_prolong’ (prolonging of the lockdown) and ‘deconfinement’ (relaxing the lockdown).
In the Italian embeddings, the main focus is on the “riapertura” (the reopening of the country) after the long lockdown period. Words such as “uscire” (to go out) or “ripartenza” (restart) also indicate Italians’ wish to go back to normality after the lockdown. This is not the case for the United Kingdom or France. This could be explained by the fact that Italy was the first of the three countries to be seriously hit by the spread of the virus. So, if we consider data from the same period of time, it is quite natural that Italians are the first to tweet about a possible end of lockdown and the beginning of the so-called “fase due” (phase two).
Generally, across the three languages, the hash-tag #stayhome is related to other hash-tags that capture the importance of staying at home for saving lives and stopping the virus. Supportive hash-tags that can instill courage are also present for the three countries.
First of all, the UK and Italy embeddings feature Easter, with ‘Easter weekend’ in the UK embeddings and ‘pasqua’ (Easter), ‘pasquetta’ (Easter Monday) and ‘buona_pasqua’ (Happy Easter) in the Italy embeddings. In France, by contrast, we see a focus on healthcare workers: ‘merciauxsoignants’ (thank healthcare workers), ‘soutineauxsoignants’ (support healthcare workers) and ‘onaplaudit’ (we applaud). We can also see in France and Italy that COVID-19 and coronavirus are closely tied to the stayathome message.
The word embeddings for all three countries include other countries, indicating that people tweet about countries from around the world. This is unsurprising, as the coronavirus affects most of the world. The most prominent countries for each of the three datasets are:
Territories of China such as Taiwan and Hong-Kong, as well as Hubei province and Pekin (Beijing), are present in all three datasets. Finally, France and Italy question the death tolls (‘bilancio_vittime’ for Italian or ‘bilan_depasse’ and ‘bilan_monthe’ for French).
Across the three countries there are common words associated with school such as pupil, classroom, kid, child, teacher, exam, student (‘etudiant’ in French, ‘studenti’ in Italian), high school (‘lycee’ in French and ‘liceo’ in Italian), universities (‘universita’ in Italian) and elementary school (‘ecole_primaire’ in French and ‘scuola elementare’ in Italian). Moreover, in all three countries, non-school places are discussed, such as museums (‘museo’ in Italian and ‘musee’ in French), restaurants and bars (pubs for the United Kingdom), libraries (‘bibliotheque’ in French and ‘biblioteca’ in Italian) and theatres (‘teatro’ in Italian).
Across all of the embeddings, the words most similar to ‘test’ relate to the diagnosis of the virus. Similar words include ‘diagnostic_test’, ‘diagnose’ (‘diagnosi’ in Italian or ‘test_diagnostic’ in French), ‘symptom’ (‘sintomo’ in Italian or ‘symptomatique’ in French) and ‘screening’ (‘sottoporre_tampone’ in Italian or ‘test_depistage’ in French). It’s also worth noting that ‘PCR’, the main test used to detect Coronavirus cases, is present in the France and UK embeddings.
In all three countries, the embeddings contain synonyms of ‘news’, probably because most of the Tweets shared news about Coronavirus, either in their content or through hashtags.
Moreover, word embeddings of particular newspapers or news websites are present:
In all three countries there are mentions of the spread of the disease:
In both France and Italy, word embeddings for forms (‘formulaire’ in French or ‘modulo’ in Italian) are present. Words for travel or movement are also present in both languages (‘deplacement’ in French, ‘spostamento’ or ‘spostarsi’ in Italian). In addition, ‘pdf’ is found in the Italy embeddings and ‘format_numerique’ (digital format) in the France embeddings. This is probably a reference to the digital forms in the two countries. Finally, word embeddings related to ‘controls’ and ‘police’ are present: for example, in the France embeddings we can find ‘policenationale’ (‘national police’), ‘infraction’ (‘offence’), ‘verbaliser’ (‘to issue a fine’) and ‘gendarmerie’, while in the Italian embeddings we have ‘controllo_polizia’ (‘police control’) and ‘schedare’ (‘to keep a file on’).