Word clouds are a ubiquitous way to capture the most frequent terms in a text in a visually appealing way. The idea is simple: words are sized according to their frequency, with more frequent words being larger. In our exploration of COVID-19, we wanted to capture overall word usage in different countries over time, so we modified the concept of a word cloud to accommodate these new dimensions. We call these spatial-temporal word clouds. To our knowledge, this is the first time word clouds have been adapted to show both regional and temporal variations in this way.
The following spatial-temporal word cloud visualisations portray tweets from the United Kingdom, France and Italy from the earliest data we have (the end of January 2020) to the end of April. Only nouns, proper nouns and verbs are used, and these words are reduced to their lemmas, so words like ‘running’ and ‘ran’ would both become ‘run’.
Groups of consecutive words (n-grams) are found using pointwise mutual information and treated as unique composite words. For example, “Easter” and “Holiday” could merge to become “Easter Holiday”. This grouping provides more context about word usage than individual words alone can. For example, the words “World”, “Mobile” and “Congress” might seem fairly unrelated when considered separately, but they could refer to “Mobile World Congress”, a large trade show that was cancelled due to the COVID-19 outbreak.
As these groups of words are found after filtering based on part of speech tags and lemmatisation, they can sometimes lose some of their original context or be grammatically incorrect. For example, “hadn’t adhered to social distancing” could reduce to “adhere social distancing”. This is typically not a problem, but can occur.
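As a rough illustration of the n-gram step described above, the following sketch scores adjacent word pairs by pointwise mutual information and merges high-scoring pairs into composite words. It is not the code used for this report: the function names `bigram_pmi` and `merge_bigrams`, the `min_count` cut-off, the score threshold and the toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Score adjacent word pairs with PMI(x, y) = log2(p(x, y) / (p(x) * p(y))).

    `sentences` is a list of token lists (already filtered and lemmatised).
    `min_count` is a hypothetical cut-off that skips very rare pairs,
    whose PMI estimates are unreliable.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def merge_bigrams(tokens, keep):
    """Join adjacent pairs found in `keep` into one composite token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in keep:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy data, invented for illustration only.
sentences = [
    ["easter", "holiday", "cancel"],
    ["easter", "holiday", "plan"],
    ["holiday", "plan"],
]
scores = bigram_pmi(sentences)
# Assumed threshold: keep pairs whose PMI is above 1 bit.
keep = {pair for pair, score in scores.items() if score > 1.0}
print(merge_bigrams(["easter", "holiday", "cancel"], keep))
```

Because merging happens on the filtered, lemmatised tokens, a composite word produced this way can differ from the phrase as it appeared in the original tweet.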
The regional and temporal aspects of the word clouds should be interpreted as follows:
The area outside the country consists of words common to all regions within the country; these are essentially the topics you would hear discussed everywhere in the country. The words within a particular region are the most distinctive (but still frequent) words used within that region. In other words, they are words that are less common in other regions but more common in that region.
As you click on the right arrow you move forward through time by one week. Each week the spatial-temporal word cloud shows three ‘types’ of words. Words that are:
On the first week all words are new, then in subsequent weeks these changes start to show.
Details on how these word clouds are made are given at the end of the report.
Before showing the complete word clouds, here are some highlights indicating phenomena that are common across the word clouds during particular events. These examples show that the spatial-temporal word clouds identify useful trends in the data. We highlight words related to each phenomenon in magenta.
Words for lockdown are seen in each country:
The week of the lockdown birthed various hashtags encouraging people to stay at home:
As well as these commonalities, there are some country-specific aspects:
We can see that in the UK and France, the whistle-blower (lanceur d’alerte in French) Li Wenliang features in the first week of February.
Word clouds provide a very basic understanding of the composition of texts. The addition of the temporal and regional information helps to further identify pertinent aspects of the data in a visually intuitive and appealing way. It is a crude form of analysis but it helps gain an overview of the dataset and allows for visual comparison between countries.
Here are the word clouds for the United Kingdom. Click left and right to scroll through the weeks available. In the first week, all words are counted as new and there is no history yet, so there is no temporal filtering of key terms.
Here are the word clouds for Italy. Click left and right to scroll through the weeks available. In the first week, all words are counted as new and there is no history yet, so there is no temporal filtering of key terms.
Here are the word clouds for France. Click left and right to scroll through the weeks available. In the first week, all words are counted as new and there is no history yet, so there is no temporal filtering of key terms.
The spatial-temporal word cloud representation has significant value in picking out salient terms over time, especially considering its conceptual simplicity. Spatial-temporal word clouds capture a remarkable amount of information in a visually appealing representation:
There are various ways in which the spatial-temporal word clouds could be improved:
Expanding upon the spatial-temporal word clouds presented here would be a great project for an internship here at Sony CSL. If you are interested in this, please feel free to contact us at the language team.
There is a lot of commonality between what people speak about in different regions of the same country. However, the points at which these differ can help to shed light on the individual interests and concerns of particular regions. To uncover this, the spatial-temporal word clouds consider which words are proportionally more talked about in one region than in the others and, moreover, where this gap is large (that is, where a word is talked about significantly more).
This is accomplished by doing the following:
The values assigned to words to produce a ranking are a weighted combination of a) the normalised word frequency and b) the normalised tf-idf, where normalisation is across regions. The purpose of the tf-idf values is to penalise terms that are common across time steps (weeks) so that rarer words are weighted higher than common words. We found that using normalised frequency alone would present words that were too generic.
In the following plots, we show the first 15 selected words for each region of the United Kingdom, Italy and France during their main lockdown week. The value shown is the combined frequency and tf-idf value. Some words are blank because the plotting library we use cannot render emojis.
Keep in mind that identifying and ranking these words is only one part of the algorithm for creating the spatial-temporal word clouds. This is prior to incorporating the temporal aspects of our word clouds, which are detailed later.
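To make the weighting concrete, here is a minimal sketch of one way such a score could be computed for a single week. It is an illustration under stated assumptions rather than the implementation behind this report: the names `score_words`, `idf_over_weeks` and `scale_by_global_max`, the weight `alpha`, the smoothed idf formula, and the choice to normalise by the largest value seen in any region are all hypothetical, and the data is invented.

```python
import math


def idf_over_weeks(word, weekly_vocab):
    """Smoothed inverse document frequency, treating each week as a document.

    `weekly_vocab` is a list with one set of words per week (assumed format).
    """
    weeks_with_word = sum(1 for vocab in weekly_vocab if word in vocab)
    return math.log((1 + len(weekly_vocab)) / (1 + weeks_with_word)) + 1


def scale_by_global_max(table):
    """Divide every value by the largest value found in any region.

    This is one possible reading of normalising across regions.
    """
    peak = max((v for row in table.values() for v in row.values()), default=0.0)
    if peak == 0:
        return {region: {w: 0.0 for w in row} for region, row in table.items()}
    return {region: {w: v / peak for w, v in row.items()}
            for region, row in table.items()}


def score_words(region_counts, weekly_vocab, alpha=0.5):
    """Return {region: {word: score}} for one week.

    score = alpha * normalised frequency + (1 - alpha) * normalised tf-idf
    """
    tf = {}
    for region, counts in region_counts.items():
        total = sum(counts.values())
        tf[region] = {w: c / total for w, c in counts.items()} if total else {}

    tfidf = {region: {w: f * idf_over_weeks(w, weekly_vocab)
                      for w, f in freqs.items()}
             for region, freqs in tf.items()}

    tf_n = scale_by_global_max(tf)
    tfidf_n = scale_by_global_max(tfidf)
    return {region: {w: alpha * tf_n[region][w] + (1 - alpha) * tfidf_n[region][w]
                     for w in tf[region]}
            for region in tf}


# Toy counts for one week in two regions, invented for illustration.
region_counts = {
    "north": {"lockdown": 8, "milano": 2},
    "south": {"lockdown": 8, "napoli": 2},
}
# Vocabulary seen in each of three weeks; "lockdown" appears every week.
weekly_vocab = [{"lockdown", "milano", "napoli"}, {"lockdown"}, {"lockdown"}]

combined = score_words(region_counts, weekly_vocab, alpha=0.5)
frequency_only = score_words(region_counts, weekly_vocab, alpha=1.0)
print(combined["north"])
print(frequency_only["north"])
```

In this toy example the rarer word “milano” receives a higher score once tf-idf is mixed in than it does under frequency alone, which is the effect the weighting is meant to achieve.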
Reassuringly, we see ‘scotland’, ‘wales’ and ‘belfast’ appearing within their nation’s winning words. It’s reasonable to assume each category discusses places within it and so it’s good that our algorithm identified this. We can also see that there is a lot of commonality in the nations of the UK with all nations focusing on the lockdown and social distancing.
Reassuringly, we see regions of Italy mentioned in their corresponding categories: North (Lombardia, Milano), Centre (Abruzzo) and South (Napoli, Calabria). It’s reasonable to assume each category discusses places within it and so it’s good that our algorithm identified this. As in the UK, there is a lot of commonality between the regions, with all of them focusing on the lockdown and social distancing.
Again, we see that in general, the distinctive words of regions in France include places within those regions. For example, Bourgogne is in Bourgogne-Franche-Comté, and Paris is a prominent word in Centre-Val de Loire & Île-de-France.
The regional word cloud is extended to a temporal word cloud as follows:
To aid in discriminating between the direction of change of popular words, an additional visual cue is used. Words are grouped into one of three categories and styled differently. These categories are:
As the ranking is defined over all words and we have a limited amount of space to display words, we select n different samples from words that are Trending, Staying and Sinking as follows:
As such, the words occur on three planes of depth: close and bright, medium and dull, and far and dark.
Rather than setting specific thresholds to classify words as trending, staying or sinking, we use the quartiles of the relative changes. Words with values above Q3 are considered trending, values between Q1 and Q3 are considered staying, and values below Q1 are considered sinking.
Setting the thresholds in this way is really a design choice, as one might instead want to attribute strict thresholds to the three classes of words, for example to be sure that only words decreasing in frequency are styled as ‘sinking’. I used the above thresholds to keep a reasonable proportion of words within each category over time, as this was more visually appealing and I found that it gave more useful information about the distribution of relative word frequencies.
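The quartile rule can be sketched in a few lines. This is only an illustration: the function name `classify_by_quartiles` and the example values are assumptions, and the relative change for each word is taken as given, since how it is computed is not shown here.

```python
import statistics


def classify_by_quartiles(relative_change):
    """Label each word as trending, staying or sinking.

    `relative_change` maps each word to its week-on-week relative change
    (assumed to be computed elsewhere). At least two words are needed.
    """
    # statistics.quantiles with n=4 returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(list(relative_change.values()), n=4)
    labels = {}
    for word, change in relative_change.items():
        if change > q3:
            labels[word] = "trending"
        elif change < q1:
            labels[word] = "sinking"
        else:
            labels[word] = "staying"
    return labels


# Invented relative changes for seven words.
changes = {"a": -3, "b": -1, "c": 0, "d": 1, "e": 2, "f": 5, "g": 9}
labels = classify_by_quartiles(changes)
print(labels)
```

Note that with quartile thresholds a word can be labelled ‘sinking’ even when its frequency rose, if most other words rose by more; this is the trade-off against strict thresholds discussed above.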