Spatial-Temporal Word Clouds

Michael Anslow
SONY Computer Science Lab - Paris

Word clouds are a ubiquitous way to capture the most frequent terms in a text in a visually appealing way. The idea is simple: words are sized according to their frequency, with more frequent words being larger. In our exploration of COVID-19, we wanted to capture overall word usage in different countries over time, so we modified the concept of a word cloud to accommodate these new dimensions. We call these spatial-temporal word clouds. To our knowledge, this is the first time word clouds have been adapted to show both regional and temporal variation in this way.

Interpreting the Visualisations

The following spatial-temporal word cloud visualisations portray tweets from the United Kingdom, France and Italy from the earliest data we have (the end of January 2020) to the end of April. Only nouns, proper nouns and verbs are used, and these words are reduced to their lemmas, so words like ‘running’ and ‘ran’ would both become ‘run’.

Groups of consecutive words (n-grams) are found using pointwise mutual information and treated as unique composite words. For example, “Easter” and “Holiday” could merge to become “Easter Holiday”. This grouping provides more context about word usage than individual words alone can. For example, the words “World”, “Mobile” and “Congress” might seem fairly unrelated when considered separately, but they could refer to “Mobile World Congress”, a large trade show that was cancelled due to the COVID-19 outbreak.
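To make the idea concrete, here is a minimal sketch of scoring adjacent word pairs with pointwise mutual information. It is an illustration under assumptions rather than the code used for this report: the function name pmi_bigrams, the min_count filter and the toy sentences are all hypothetical, and the report does not state which counts or thresholds were actually used.

```python
import math
from collections import Counter


def pmi_bigrams(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    sentences: list of token lists (already lemmatised and filtered).
    min_count: hypothetical cut-off; rare pairs give unreliable estimates.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # PMI > 0 means the pair co-occurs more often than chance predicts.
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy data, invented for illustration only.
sentences = [
    ["easter", "holiday", "cancel"],
    ["easter", "holiday", "lockdown"],
    ["lockdown", "cancel", "holiday"],
    ["lockdown", "extend"],
]
scores = pmi_bigrams(sentences)
for pair, score in scores.items():
    print(" ".join(pair), round(score, 2))
```

Pairs with a high enough score would then be merged into a single composite word such as “easter holiday”; where to draw that line is a tuning decision.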

As these groups of words are found after filtering based on part of speech tags and lemmatisation, they can sometimes lose some of their original context or be grammatically incorrect. For example, “hadn’t adhered to social distancing” could reduce to “adhere social distancing”. This is typically not a problem, but can occur.

The regional and temporal aspects of the word clouds should be interpreted as follows:

Regional Variation

The area outside the country consists of words common to all regions within the country; these are essentially the topics you would hear spoken about in all regions of the country. The words within a particular region are the most distinctive (but still frequent) words used within that region. In other words, they are words that are less common in other regions but more common in that region.

Temporal Variation

As you click on the right arrow you move forward through time by one week. Each week the spatial-temporal word cloud shows three ‘types’ of words. Words that are:

(Trending): popular new words, or existing words that are becoming more popular (biggest, brightest words).
(Staying): existing words that stay as popular as in previous weeks (medium sized and less bright).
(Sinking): words that are becoming less popular (small and dark).

In the first week all words are new; in subsequent weeks these changes start to show.

Details on how these word clouds are made are given at the end of the report.

Highlights

Before showing the complete word clouds, here are some highlights indicating phenomena that are common across the word clouds during particular events. These examples show that the spatial-temporal word clouds identify useful trends in the data. We highlight words related to the phenomena in magenta.

Lockdown week

Words for lockdown are seen in each country:

Italy: Quarantena (quarantine) and zone rosse (red zones are quarantined areas)
France: Confinement, #confinementtotal (total lockdown).
United Kingdom: Lockdown

The week of the lockdown birthed various hashtags encouraging people to stay at home:

Italy: #iorestoacasa (I stay at home). #andratuttobene (everything will be fine).
France: #jerestechezmoi (I stay at my home).
United Kingdom: #stayathomesavelives

As well as these commonalities, there are some country specific aspects:

Italy: There is a focus on quarantined zones (zona rossa). Italy differed from France and the United Kingdom as Italy quarantined areas (red zones) in different stages rather than all at once.
France: The word ‘attestation’ is found in the France word cloud. This refers to a document that people living in France had to fill out in order to leave home during the lockdown.
United Kingdom: In the United Kingdom, there is a focus on the National Health Service (NHS) with hashtags like #nhsheroes and #nhscovidheroes.

Whistle-blower Dr Li Wenliang

We can see that in the UK and France, the whistle-blower (lanceur d’alerte in French) Li Wenliang features in the first week of February.

Purpose of Analysis

Word clouds provide a very basic understanding of the composition of texts. The addition of the temporal and regional information helps to further identify pertinent aspects of the data in a visually intuitive and appealing way. It is a crude form of analysis but it helps gain an overview of the dataset and allows for visual comparison between countries.

United Kingdom

Here are the word clouds for the United Kingdom. Click left and right to scroll through the weeks available. In the first week, all words are counted as new and there is no history yet, so there is no temporal filtering of key terms.

Italy

Here are the word clouds for Italy. Click left and right to scroll through the weeks available. In the first week, all words are counted as new and there is no history yet, so there is no temporal filtering of key terms.

France

Here are the word clouds for France. Click left and right to scroll through the weeks available. In the first week, all words are counted as new and there is no history yet, so there is no temporal filtering of key terms.

Conclusion

The spatial-temporal word cloud representation has significant value in picking out salient terms over time, especially considering its conceptual simplicity. Spatial-temporal word clouds capture a remarkable amount of information in a visually appealing representation:

Word frequency (larger words tend to be more frequent).
Word phrases (n-grams), giving additional context beyond the word level.
Regionally distinctive words.
The global picture of what is discussed.
The directions of words: whether they are trending, staying popular or sinking in popularity.
All the goodness of a map in familiarising oneself with the locations of regions.

There are various ways in which the spatial-temporal word clouds could be improved:

There is a lot of redundancy among terms:
- A repeated concept with surface variation such as: “confinementjour1” (lockdownday1), “confinementjour2” (lockdownday2) etc.
- Spelling mistakes such as corona[r]virus, and co[r]vid.
- Missing accents such as d[é]confinement and d[e]confinement.
- Alternatives of spellings such as colour and color.
- Semantically close concepts such as ‘easing lockdown’ and ‘relaxing lockdown’.
Identifying trends at a weekly level can hide key events throughout the week that may have been important. If a particular politician is supported at the start of the week, faces a scandal, and is hated at the end of the week, we may only capture one of those events or perhaps neither.

Expanding upon the spatial-temporal word clouds presented here would be a great project for an internship here at Sony CSL. If you are interested in this, please feel free to contact us at the language team.

Constructing Spatial-Temporal Word Clouds

There is a lot of commonality between what people speak about in different regions of the same country. However, the points at which these differ can help to shed light on the individual interests and concerns of particular regions. To uncover this, the spatial-temporal word clouds consider which words are proportionally more talked about in one region than in another and, moreover, where this gap is large (significantly more talked about).

This is accomplished by doing the following:

Take all regional words and create a standard word cloud from all of them. This is shown in a common area; in our case, as we are looking at countries, we visualise it in the area outside the country. This gives an overall view of what is talked about across all regions.
Rank words for each region so that we elicit an ordering over words which could be viewed as the ‘preferred’ words for a region. This ranking is based on a value given to each word of each region which is defined later.
Starting from the most desirable words (all rank 1 words) and moving to the least desirable (the lowest-ranked words), assign each word to the region that desires it and block other regions from acquiring it. If there is a draw, the region with the highest value for that word wins. If the values are the same, the winner is selected uniformly at random.
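The steps above can be sketched roughly as follows. This is an illustrative reading of the procedure, not the original implementation: the function name assign_words, the data layout and the example values are hypothetical, and the sketch assumes that a region whose word at a given rank is already taken simply gets nothing for that rank.

```python
import random


def assign_words(values, seed=0):
    """Greedily assign each word to exactly one region.

    values: {region: {word: value}}; a higher value means the region
    'prefers' the word more.
    """
    rng = random.Random(seed)
    # Each region's words, ordered from most to least preferred.
    rankings = {
        region: sorted(words, key=words.get, reverse=True)
        for region, words in values.items()
    }
    assigned = {region: [] for region in values}
    taken = set()
    max_len = max((len(r) for r in rankings.values()), default=0)

    for rank in range(max_len):
        # Collect which regions want which still-free word at this rank.
        claims = {}
        for region, ranking in rankings.items():
            if rank < len(ranking) and ranking[rank] not in taken:
                claims.setdefault(ranking[rank], []).append(region)
        for word, regions in claims.items():
            best = max(values[r][word] for r in regions)
            tied = [r for r in regions if values[r][word] == best]
            winner = rng.choice(tied)  # uniform random pick on exact ties
            assigned[winner].append(word)
            taken.add(word)  # block every other region from this word
    return assigned


# Invented values for illustration only.
values = {
    "Scotland": {"scotland": 0.9, "lockdown": 0.8, "nhs": 0.3},
    "Wales": {"wales": 0.9, "lockdown": 0.6, "nhs": 0.5},
}
result = assign_words(values)
print(result)
```

In this toy example both regions want ‘lockdown’ at the same rank, so it goes to the region with the higher value, and ‘nhs’ is resolved in the same way one rank later.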

The values assigned to words to elicit a ranking are a weighted combination of a) the normalised word frequency and b) the normalised tf-idf, where normalisation is across regions. The purpose of the tf-idf values is to penalise terms that are common across time steps (weeks) so that rarer words are weighted higher than common words. We found that using normalised frequency alone would present words that were too generic.
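As a rough sketch of such a weighted combination, consider the snippet below. Several things in it are assumptions rather than facts from this report: the weight alpha, the function names, and in particular the reading of “normalisation across regions” as dividing each word's score by its total over all regions. The tf-idf scores are taken as given inputs.

```python
def normalise_across_regions(scores):
    """Divide each word's score by that word's total over all regions.

    This is one possible reading of 'normalisation across regions'.
    """
    totals = {}
    for words in scores.values():
        for word, score in words.items():
            totals[word] = totals.get(word, 0.0) + score
    return {
        region: {
            word: (score / totals[word] if totals[word] else 0.0)
            for word, score in words.items()
        }
        for region, words in scores.items()
    }


def word_values(freq, tfidf, alpha=0.5):
    """Blend normalised frequency and normalised tf-idf.

    alpha is a hypothetical weight: 1.0 uses frequency only,
    0.0 uses tf-idf only.
    """
    norm_freq = normalise_across_regions(freq)
    norm_tfidf = normalise_across_regions(tfidf)
    return {
        region: {
            word: alpha * norm_freq[region][word]
            + (1 - alpha) * norm_tfidf[region].get(word, 0.0)
            for word in norm_freq[region]
        }
        for region in freq
    }


# Invented relative frequencies and tf-idf scores, for illustration only.
freq = {
    "North": {"lockdown": 0.30, "milano": 0.10},
    "South": {"lockdown": 0.10, "napoli": 0.05},
}
tfidf = {
    "North": {"lockdown": 0.1, "milano": 0.6},
    "South": {"lockdown": 0.1, "napoli": 0.4},
}
combined = word_values(freq, tfidf)
print(combined)
```

With these toy numbers a region-specific word such as ‘milano’ ends up with a higher value in the North than the generic ‘lockdown’, which is the kind of effect the combination is meant to produce.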

In the following plots, we show the first 15 selected words for each region of the United Kingdom, Italy and France during their main lockdown week. The value shown is the weighted combination of frequency and tf-idf described above. Some words appear blank because the plotting library we use cannot render emojis.

Keep in mind that identifying and ranking these words is only one part of the algorithm for creating the spatial-temporal word clouds. It takes place before incorporating the temporal aspects of our word clouds, which are detailed later.

The United Kingdom

Reassuringly, we see ‘scotland’, ‘wales’ and ‘belfast’ appearing within their nation’s winning words. It’s reasonable to assume each category discusses places within it and so it’s good that our algorithm identified this. We can also see that there is a lot of commonality in the nations of the UK with all nations focusing on the lockdown and social distancing.

Italy

Reassuringly, we see regions of Italy mentioned in their corresponding categories: North (Lombardia, Milano), Centre (Abruzzo) and South (Napoli, Calabria). It’s reasonable to assume each category discusses places within it and so it’s good that our algorithm identified this. As in the UK, we can see that there is a lot of commonality across the regions of Italy, with all regions focusing on the lockdown and social distancing.

France

Again, we see that in general, the distinctive words of regions in France include places within those regions. For example, Bourgogne is in Bourgogne-Franche-Comté, and Paris is a prominent word in Centre-Val de Loire & Île-de-France.

Showing Temporal Variation

The regional word cloud is extended to a temporal word cloud as follows:

On the first time slice (in our case a time slice is a week) create the word cloud just based on regional variations described previously. As will become apparent, this treats every word as new and without any former history.
On subsequent time steps, keep track of the most frequent words and store the relative change for all words.
- New words are pushed to the front of the list gaining positive rank improvement. Their position in the front of the list is higher if their frequency is higher.
- Previously seen words are compared to where they were in their last rank. They can go up, down or stay roughly the same.
Words are then ordered by their relative rank change, from the most positive values to the most negative.
A rolling average over ranks is used to smooth out variations over time.
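A minimal sketch of this bookkeeping is given below. It makes several assumptions that are not stated in this report: ranks are derived from weekly frequencies, a new word is treated as entering from the bottom of the current list (so more frequent new words gain more), and the rolling average is taken over the last few rank changes using a hypothetical window parameter.

```python
from collections import deque


def rank_words(freq):
    """Map each word to its rank (0 = most frequent); ties broken by name."""
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {word: i for i, word in enumerate(ordered)}


def smoothed_changes(weekly_freqs, window=3):
    """Return, per week, each word's rolling-average rank improvement."""
    history = {}     # word -> recent rank changes
    prev_ranks = {}  # ranks from the previous week
    smoothed = []
    for freq in weekly_freqs:
        curr_ranks = rank_words(freq)
        bottom = len(curr_ranks)  # assumed previous rank of unseen words
        week = {}
        for word, rank in curr_ranks.items():
            # Positive change = the word moved up the list.
            change = prev_ranks.get(word, bottom) - rank
            recent = history.setdefault(word, deque(maxlen=window))
            recent.append(change)
            week[word] = sum(recent) / len(recent)
        smoothed.append(week)
        prev_ranks = curr_ranks
    return smoothed


# Invented weekly frequencies, for illustration only.
weeks = [
    {"virus": 10, "wuhan": 8, "china": 5},
    {"virus": 12, "lockdown": 9, "wuhan": 3, "china": 4},
]
smoothed = smoothed_changes(weeks)
print(smoothed[1])
```

In the first week every word is unseen, so all changes are positive, which matches the idea that every word is treated as new. In the second week the new word ‘lockdown’ has the largest improvement, while ‘wuhan’ drops.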

To aid in discriminating between the direction of change of popular words, an additional visual cue is used. Words are grouped into one of three categories and styled differently. These categories are:

Trending: Words with a positive relative rank (new words or words that increased in frequency) are shown in bright colours and are larger because they are higher in rank.
Staying: Words with a neutral relative rank (words that stay roughly as frequent as before) are shown in slightly darker colours and at medium sizes.
Sinking: Words with a negative relative rank (words that became less frequent) are shown very small and dark.

As the ranking is defined over all words and we have a limited amount of space to display words, we select n different samples from words that are Trending, Staying and Sinking as follows:

Trending: Samples taken from the highest ranked words.
Staying: Samples are taken from words closest to the median rank. These are then ordered based on frequency to ensure that more prominent words are larger than less prominent words.
Sinking: Samples are taken from words with the lowest rank. Like the staying samples, these are then ordered based on frequency to ensure that more prominent words are larger than less prominent words.
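The sampling described above might look something like the following sketch. The function name sample_words, the parameter n and the example numbers are hypothetical, and the sketch assumes there are enough words that the three samples do not overlap.

```python
import statistics


def sample_words(changes, freq, n=2):
    """Pick n trending, n staying and n sinking words.

    changes: {word: relative rank change}; freq: {word: frequency}.
    """
    ordered = sorted(changes, key=changes.get, reverse=True)
    trending = ordered[:n]  # highest relative rank
    sinking = ordered[-n:]  # lowest relative rank

    # Staying: the remaining words closest to the median change.
    median = statistics.median(changes.values())
    rest = [w for w in ordered if w not in trending and w not in sinking]
    staying = sorted(rest, key=lambda w: abs(changes[w] - median))[:n]

    def by_frequency(words):
        # More prominent words come first so they can be drawn larger.
        return sorted(words, key=freq.get, reverse=True)

    return trending, by_frequency(staying), by_frequency(sinking)


# Invented numbers, for illustration only.
changes = {"lockdown": 5, "nhs": 3, "virus": 0, "china": 0,
           "wuhan": -1, "flight": -4, "brexit": -6}
freq = {"lockdown": 50, "nhs": 20, "virus": 90, "china": 30,
        "wuhan": 10, "flight": 5, "brexit": 8}
trending, staying, sinking = sample_words(changes, freq)
print(trending, staying, sinking)
```

Note that only the staying and sinking samples are re-ordered by frequency here, following the description above; the trending sample keeps its rank order.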

As such, the words occur on three planes of depth: close and bright, medium and dull, and far and dark.

Rather than setting specific thresholds for words to classify them into trending, staying or sinking we take the inter-quartile range of relative changes. Words with values above Q3 are considered trending, values between Q1 and Q3 are considered staying and values less than Q1 are considered sinking.
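As an illustration, the quartile rule can be written in a few lines using Python's standard statistics module. The function name classify and the example values are hypothetical, and the exact quartile estimate depends on the interpolation method, which this report does not specify; the sketch uses the library default.

```python
import statistics


def classify(changes):
    """Label each word as trending, staying or sinking using Q1 and Q3."""
    q1, _, q3 = statistics.quantiles(list(changes.values()), n=4)
    labels = {}
    for word, change in changes.items():
        if change > q3:
            labels[word] = "trending"
        elif change < q1:
            labels[word] = "sinking"
        else:
            labels[word] = "staying"
    return labels


# Invented relative changes, for illustration only.
changes = {"lockdown": 5, "nhs": 3, "virus": 0, "china": 0,
           "wuhan": -1, "flight": -4, "brexit": -6}
labels = classify(changes)
print(labels)
```

Because the cut-offs come from the data itself, roughly a quarter of the words fall into each of the outer classes every week, whatever the absolute size of the changes.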

Setting the thresholds in this way is a design choice; one might instead want strict thresholds for the three classes of words, for example to be sure that only words decreasing in frequency are styled as ‘sinking’. I used the above thresholds to keep a reasonable proportion of words within each category over time, as this was more visually appealing and I found that it gave more useful information about the distribution of relative word frequencies.