Exploring Synonyms with Word Embeddings

Martina Galletti
SONY Computer Science Lab - Paris

One of the simplest yet most informative approaches to automatically uncovering semantic and syntactic relationships in data is to use what are called word embeddings. These are vector representations of the vocabulary (unique words) of a body of text. The idea is to make words comparable by placing them in a large space (a high-dimensional vector space, to be precise) where we can measure how close they are. This closeness captures the similarity between words. For example, consider the words ‘Covid-19’ and ‘Coronavirus’. Though the first is the disease and the second is the virus, they are often used interchangeably in different contexts, so we would expect their word embedding vectors to be similar.
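To make "closeness" concrete, here is a minimal sketch that measures it with cosine similarity. This is an assumption: the article does not say which distance measure was used, and the three-dimensional vectors below are invented for illustration (real embeddings have many more dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from the trained model.
toy_vectors = {
    "covid-19": [0.9, 0.8, 0.1],
    "coronavirus": [0.8, 0.9, 0.2],
    "school": [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["covid-19"], toy_vectors["coronavirus"])
sim_far = cosine_similarity(toy_vectors["covid-19"], toy_vectors["school"])

# Words used interchangeably should score higher than unrelated words.
print(sim_close > sim_far)
```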

Without getting into too many details, we used a Word2Vec model, in which a neural network learns word vectors by predicting the neighbouring words of a word from that word’s vector representation. The idea is “to know a word by the company it keeps”, as Firth said in 1957 concerning the context-dependent nature of meaning. In this case, the neural network learns a representation of a word that is useful for predicting its neighbourhood, and words with similar neighbourhoods end up having similar vectors.
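The "neighbouring words" idea can be sketched by listing the (word, neighbour) pairs that such a network would be trained on. This is only an illustration of the training signal, not the model itself. The function name, the window size and the example tweet are all hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for each word and its neighbours within `window`."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbour
                pairs.append((centre, tokens[j]))
    return pairs

# Hypothetical tokenised tweet.
tweet = ["stay", "home", "save", "lives"]
pairs = skipgram_pairs(tweet, window=1)
for centre, context in pairs:
    print(centre, "->", context)
```

Two words that appear with the same neighbours produce the same kind of pairs, which is why their learned vectors end up close together.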

To explore the semantic content of the dataset, it is helpful to visualise the word embeddings by projecting the high-dimensional word vectors into lower dimensions. In the following graph, the embedded representation of the vocabulary for the English dataset is shown in two dimensions. We used t-SNE as our dimensionality reduction algorithm to represent word vectors in two dimensions for this plot and subsequent plots. We’ve biased t-SNE to portray local context with more importance than global context. What this means is that the points closest to each other are similar word vectors and hence similar words, but there might not be a meaningful relationship between points that are far away from one another.
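As a rough sketch of such a projection, the snippet below runs scikit-learn's `TSNE` on random stand-in vectors. Several things here are assumptions: the article does not say which t-SNE implementation or settings were used, and a low `perplexity` is only one common way of favouring local over global structure. The data is randomly generated and stands in for real embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for real embeddings: 30 hypothetical words, 50 dimensions each.
word_vectors = rng.normal(size=(30, 50))

# Assumption: a small perplexity makes each point attend mostly to its
# nearest neighbours, so local neighbourhoods are preserved first.
tsne = TSNE(n_components=2, perplexity=5, init="pca", random_state=0)
coords = tsne.fit_transform(word_vectors)

# One (x, y) position per word, ready to be drawn as a scatter plot.
print(coords.shape)
```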

Feel free to explore the embeddings below to uncover similar words in the different datasets. Each point on the graph is the embedding for one word in the vocabulary. If you zoom into the plot by selecting a region and hover over a point, you will see the word it represents. Points are coloured by their frequency rank, from the highest rank (yellow) to the lowest rank (blue), so the most frequent word is rank 1 and the least frequent is around rank 12,000. When completely zoomed out, you can see rough clusters of words that are likely to be groups of similar words. There is also a semi-ring on the left-hand side of the figure that resembles an asteroid belt (to my mind at least). These are likely to be word vectors without close neighbours that t-SNE had to place somewhere.
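The frequency rank used for colouring can be computed along these lines. This is a minimal sketch: the helper name and the tiny corpus are hypothetical, and breaking ties alphabetically is an assumption, since the article does not say how ties were handled.

```python
from collections import Counter

def frequency_ranks(tokens):
    """Map each word to its frequency rank; rank 1 is the most frequent word."""
    counts = Counter(tokens)
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

# Hypothetical tokenised corpus.
corpus = "lockdown virus lockdown home virus lockdown".split()
ranks = frequency_ranks(corpus)
print(ranks)
```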

1. United Kingdom:

2. France:

3. Italy:

What did we use it for?

As we said at the beginning, word embeddings are a powerful technique for exploring (semantic and syntactic) synonyms and similar words in a dataset. We decided to investigate the thirty closest words for topical words or hash-tags in English, French and Italian.

The words that we investigated are as follows:

Lockdown (UK), Confinement (France) and Quarantena (Italy)
Stayathome (UK), resterchezvous (France) and Iorestoacasa (Italy)
China (UK), Chine (France) and Cina (Italy)
School (UK), École (France) and Scuola (Italy)
Test (UK, France, Italy)
Coronavirus (UK, France, Italy)
Declaration of your planned activity (“Attestation” for French and “Autodichiarazione” for Italian). This is a document that citizens needed in order to leave their homes; it existed in Italy and France, but not in the United Kingdom.

We chose these terms because our various exploratory analyses showed them to be important across nations.
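Retrieving the closest words for a query term can be sketched as a ranked nearest-neighbour search. The vectors, the function names and the use of cosine similarity are assumptions made for illustration; the article asks for the thirty closest words, while the toy example asks for two.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=30):
    """Return the `topn` words closest to `query`, best match first."""
    query_vector = vectors[query]
    scored = [
        (word, cosine_similarity(query_vector, vector))
        for word, vector in vectors.items()
        if word != query  # skip the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy vectors, not taken from the trained model.
toy_vectors = {
    "lockdown": [0.9, 0.1, 0.0],
    "quarantine": [0.8, 0.2, 0.1],
    "easter": [0.5, 0.5, 0.1],
    "school": [0.0, 0.1, 0.9],
}

neighbours = most_similar("lockdown", toy_vectors, topn=2)
for word, score in neighbours:
    print(word, round(score, 3))
```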

1. Lockdown - most similar words

1.1. Similarities

We can see that the UK and France discuss the duration of the lockdown:

UK: ‘lockdown_lift’, ‘easter’
France: ‘periode_confinement’ (lockdown period), ‘confinement_prolong’ (lockdown extension)

The UK and Italy also relate the lockdown to social distancing and self-isolation:

UK: ‘self isolation’, ‘stayhome’ and ‘socialdistancing’
Italy: ‘isolamento’ (isolation), ‘isolamento_volontario’ (self isolation) and ‘autoisolmento’ (self isolation)

1.2. Some peculiarities

In the UK embeddings, ‘holidays’ is present, most likely related to the Easter holidays and the fear that people would break the lockdown during this period. ‘vacance’ (holiday) is also mentioned in France, but alongside ‘teletravail’ (work from home), so it is used closer to the sense of time off work.

In France, the focus is on the lockdown itself and on its extent, with words such as ‘prolonger’ (to prolong, most likely the lockdown), ‘confinement_prolong’ (extension of the lockdown) and ‘deconfinement’ (relaxing the lockdown).

In the Italian embeddings, the main focus is on the “riapertura” (the reopening of the country) after the long lockdown period. Words such as “uscire” (to go out) or “ripartenza” (restart) also indicate the will of Italians to go back to normality after the lockdown. This is not the case for the United Kingdom or France. This could be explained by the fact that Italy was the first of the three countries to be seriously hit by the spread of the virus. So, if we consider data from the same period of time, it is quite natural that Italians are the first to tweet about a possible end of lockdown and the beginning of the so-called “fase due” (phase two).

2. Stayathome - most similar words

2.1. Similarities

Generally, across the three languages, the hash-tag #stayhome is related to other hash-tags that capture the importance of staying at home to save lives and stop the virus. Supportive hash-tags that can instill courage are also present for the three countries.

UK: ‘stayhomestaysafe’, ‘stayhomesavelives’, ‘flattenthecurve’
France: ‘restezchezvous’ (stay at home), ‘prendre_soin’ (take care), ‘protegezvous’ (protect yourself)
Italy: ‘andratuttobene’ (everything will go all right), ‘iostocases’ (I stay home), ‘fermiamoloinsieme’ (we can defeat it together)

2.2. Some peculiarities

First of all, the UK and Italy embeddings both feature Easter, with Easter weekend in the UK embeddings and ‘pasqua’ (Easter), ‘pasquetta’ (Easter Monday) and ‘buona_pasqua’ (Happy Easter) in the Italy embeddings. In France, by contrast, we see a focus on healthcare workers: ‘merciauxsoignants’ (thank healthcare workers), ‘soutineauxsoignants’ (support healthcare workers) and ‘onaplaudit’ (we applaud). We can also see in France and Italy that COVID-19 and coronavirus are closely tied to the stayathome message.

3. China - most similar words

3.1. Similarities

The word embeddings for all three countries include other foreign countries, indicating that people tweet about countries from around the world. This is unsurprising, as the coronavirus affects most of the world. The most prominent countries for each of the three datasets are:

UK: the United States, Italy, France, South Korea, Japan, Russia and Iran.
France: Italy, South Korea, Russia, the United States.
Italy: Italy, Japan, Russia, South Korea, Asia.

Places such as Taiwan and Hong Kong are present in all three datasets, as are Hubei province in China and Pekin (Beijing). Finally, France and Italy question the death tolls (‘bilancio_vittime’ for Italian, and ‘bilan_depasse’ and ‘bilan_monte’ for French).

3.2. Some peculiarities

UK: import and export are mentioned, perhaps related to the close trading relationship between the United Kingdom and China.
France: Afp (the news agency Agence France-Presse), Sars and Oms (the French acronym for the World Health Organization) are also present.
Italy: Italians seem concerned about flights and airplanes.

4. School - most similar words

4.1. Similarities

Across the three countries, there are common words associated with school such as pupil, classroom, kid, child, teacher, exam, student (‘etudiant’ in French, ‘studenti’ in Italian), high school (‘lycee’ in French and ‘liceo’ in Italian), universities (‘universita’ in Italian) and elementary school (‘ecole_primaire’ in French and ‘scuola elementare’ in Italian). Moreover, in all three countries, non-school places are discussed, such as museums (‘museo’ in Italian and ‘musee’ in French), restaurants and bars (pubs for the United Kingdom), libraries (‘bibliotheque’ in French and ‘biblioteca’ in Italian) and theatres (‘teatro’ in Italian).

4.2. Some peculiarities

UK: Specific places are mentioned, such as Edinburgh or Sussex. Moreover, the presence of ‘voucher’ is quite significant: children eligible for free school meals before the lockdown were able to maintain their eligibility using vouchers for local shops or supermarkets.
France: May and “rentree” (back-to-school season) are unsurprisingly present, since France introduced the return to school starting in early May 2020 for some regions.
Italy: Azzolina, the Minister of Education, is present.

Person	Role
Jean-Michel Blanquer	Minister of Education in France
Lucia Azzolina	Minister of Education in Italy

5. Test - most similar words

5.1. Similarities

Across all of the word embeddings, the words most similar to “test” are related to the diagnosis of the virus. This is why words similar to “test” include ‘diagnostic_test’, ‘diagnose’ (‘diagnosi’ in Italian or ‘test_diagnostic’ in French), ‘symptom’ (‘sintomo’ in Italian or ‘symptomatique’ in French) and ‘screening’ (‘sottoporre_tampone’ in Italian or ‘test_depistage’ in French). It’s also worth noting that “PCR”, the main test used to detect Coronavirus cases, is present in the France and UK embeddings.

5.2. Some peculiarities

UK: ‘Testing_kit’ appears because, in the United Kingdom, news reports said that fake COVID-19 testing kits were being sold on the internet.
France: The France embeddings contain a reference to chloroquine, a medication used to treat malaria that some doctors indicated as a possible cure for Coronavirus. In France, this caused a debate in the press starting in the middle of April. In Italy, by contrast, there is a reference to an experimental vaccine (‘vaccinare_sperimentale’), which is not the case for the other embeddings. Moreover, Tocilizumab is also present; it is an active ingredient indicated in France as a possible treatment for Coronavirus, although its serious adverse effects are still being studied.
Italy: ‘Spallanzani’ refers to the Spallanzani Hospital in Rome, which specializes in infectious diseases and which isolated SARS-CoV-2 and carried out its molecular characterization from the first case.

6. Coronavirus - most similar words

6.1. Similarities

In all three countries, the closest words include synonyms for news, probably because most of the tweets were sharing news about Coronavirus, either in their content or through hashtags.

UK: ‘new’, ‘live_update’
France: ‘flash’, ‘actualite’ (update), ‘direct’ (breaking news)
Italy: ‘news’, ‘cronaca’ (news), ‘ultimora’ (breaking news)

Moreover, word embeddings of particular newspapers or news websites are present:

UK: mailonline.
France: afp, lemondefr.
Italy: agi, repubblica.

In all three countries there are mentions of the spread of the disease:

UK: ‘disease’, ‘spread’ and ‘virus’
France: ‘cas’ (cases), ‘bilan_salourdit’/‘bilan_monte’ (death/case tolls getting heavier).
Italy: ‘contagiare’ (to infect), ‘salire_numerare’ (numbers increasing)

6.2. Some peculiarities

France: one of the synonyms of Coronavirus is ‘apres’ (after). This could be because people are wondering what the world after Coronavirus will be like.

7. Movement certificate - most similar words

7.1. Similarities

In both France and Italy, word embeddings for forms (‘formulaire’ in French or ‘modulo’ in Italian) are present. Moreover, words for movement or travel are also present in both languages (‘deplacement’ in French, ‘spostamento’ or ‘spostarsi’ in Italian). In addition, ‘pdf’ is found in the Italy embeddings and ‘format_numerique’ (digital format) in the France embeddings. This is probably a reference to the digital forms in the two countries. Finally, word embeddings related to checks and police are present: for example, in the France embeddings we can find ‘policenationale’ (national police), ‘infraction’ (offence), ‘verbaliser’ (to issue a fine) and ‘gendarmerie’, while in the Italy embeddings we have ‘controllo_polizia’ (police check) and ‘schedare’ (to keep a file on).

7.2. Some peculiarities

Italy: ‘violare_quarantena’ (breach quarantine) is present, while there is no equivalent for French. There are also some Italy-specific words such as ‘dpcm_marzo’ (March Decree of the President of the Council of Ministers) and ‘viminale’ (the Viminale, home to the Ministry of the Interior).