Exploring Hashtags

Michael Anslow
SONY Computer Science Lab - Paris

Hashtags are a fundamental component of tweets. They provide a set of tags that capture or expand upon the context of a tweet. Understanding which hashtags are commonly used and how they occur together gives us some insight into the fundamental topics communicated by users. In this report, we explore the usage of hashtags on Twitter from a global perspective (the entire Twitterverse) and then from a regional perspective, focusing in particular on the United Kingdom, France and Italy.

Purpose of Analysis

The focus of this analysis is to provide a high-level summary of key groups of hashtags, thereby identifying the fundamental topics (each corresponding to a particular hashtag group) that people communicate about. Further research is needed to refine these initial results and to answer specific scientific questions.

Caveats and Data Processing

The Twitter dataset was gathered using a particular set of hashtags related to the Coronavirus. These hashtags are ubiquitous across the dataset and so add very little additional context to the hashtag communities we are trying to identify. As a consequence, we removed them from the set of all hashtags. This substantially improved the results of the community detection algorithm.
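As a minimal sketch of this filtering step, the snippet below drops collection hashtags from each tweet's hashtag list. The seed list, the lower-casing rule and the function names are assumptions made for illustration; the actual list of hashtags used to gather the data is not given in this report.

```python
# Hypothetical seed hashtags; the real collection list is not stated in the report.
SEED_HASHTAGS = {"coronavirus", "covid19", "covid_19"}


def normalise(tag):
    """Lower-case a hashtag and drop a leading '#' (an assumed normalisation)."""
    return tag.lstrip("#").lower()


def strip_seed_hashtags(tweet_hashtags, seeds=SEED_HASHTAGS):
    """Return the tweet's hashtags with the collection hashtags removed."""
    return [t for t in map(normalise, tweet_hashtags) if t not in seeds]


# Toy input: one hashtag list per tweet.
tweets = [
    ["#Coronavirus", "#StayHome", "#NHS"],
    ["#COVID19", "#coronavirus"],
]
cleaned = [strip_seed_hashtags(t) for t in tweets]
```

A tweet that only carried collection hashtags ends up with an empty list and therefore contributes no nodes or edges to the network.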

If a hashtag does not appear among the most frequently used hashtags, it does not mean that it is absent from the dataset, only that it is less frequent. There are also variations of the same concept among hashtags, such as #boris and #borisjohnson, so the number of distinct concepts is smaller than the number of unique hashtags. This is ubiquitous across the Twitterverse and so we expect that the partitions of the data into nations will be affected by this phenomenon in the same way.

Global Hashtags

In the following visualisations, each node/point is a hashtag. We use Louvain Modularity to identify communities of hashtags, indicated by the colours of the nodes. Hashtags that are not in a particular group are white. The larger a node is, the more frequently the hashtag is used, and the thicker a link/edge is between two hashtags, the more frequently they co-occurred in the same tweet. To make the visualisation clearer, edges are only kept if their weight is above a hand-picked threshold, with separate thresholds for edges within a community and edges that cross communities. Otherwise, the network would just be a very dense clump of nodes.
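The construction described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the report: a hashtag is counted once per tweet, an edge is kept when its weight is at least the threshold, and the community labels are written by hand here, whereas in practice they would come from the community detection step. All names and threshold values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(tweets):
    """Count hashtag usage (node size) and pairwise co-occurrence (edge weight)."""
    node_freq, edge_weight = Counter(), Counter()
    for tags in tweets:
        unique = sorted(set(tags))  # count each hashtag once per tweet (assumption)
        node_freq.update(unique)
        edge_weight.update(combinations(unique, 2))  # keys are sorted pairs
    return node_freq, edge_weight


def prune_edges(edge_weight, community, intra_min, inter_min):
    """Keep an edge only if its weight meets the threshold for its kind."""
    kept = {}
    for (a, b), w in edge_weight.items():
        ca, cb = community.get(a), community.get(b)
        same = ca is not None and ca == cb  # unassigned ("white") nodes never match
        if w >= (intra_min if same else inter_min):
            kept[(a, b)] = w
    return kept


tweets = [
    ["stayhome", "nhs"],
    ["stayhome", "nhs", "ppe"],
    ["stayhome", "china"],
    ["nhs", "ppe"],
]
node_freq, edge_weight = cooccurrence(tweets)

# Hand-written community labels, standing in for the output of community detection.
community = {"nhs": 0, "ppe": 0, "stayhome": 1, "china": 1}
kept = prune_edges(edge_weight, community, intra_min=1, inter_min=2)
```

Using a lower threshold inside communities than across them keeps each group visibly connected while removing most of the weak cross-community links that make the picture dense.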

The following visualisation shows the network of the 1000 most frequent hashtags used in the dataset and their communities. This covers the entire Twitterverse as captured by our dataset, in all languages and locations.

Hashtag summary

(Google Translate was used for languages unfamiliar to the authors. Additional context would be appreciated if any of the above are incorrect.)

From these hashtags we can see that:

  1. Though the dataset is broadly about COVID-19, people are still discussing many other topics:
    • Hong Kong and its relationship with China.
    • Technology
    • Right-wing US politics
  2. The message to stay at home is ubiquitous across languages and countries.
  3. Many countries are discussed, demonstrating the global importance of the pandemic.
  4. Hashtags for particular localities appear together, including Europe, Africa, the United States, Latin America, the Middle East and Asia. This likely reflects the presence of different Twitter communities in the dataset.

Below we explore the hashtag communities of the top 300 hashtags used in the United Kingdom, Italy and France. These communities represent groups of hashtags used in the dataset and how they co-occur in tweets within their particular regions. For each region we will identify similarities and differences.
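The comparisons in the following sections are qualitative, but a first pass at spotting shared and region-specific hashtags could be sketched as below. This assumes each tweet already carries a region label; how tweets were assigned to countries is not described here, and the data, labels and function name are hypothetical.

```python
from collections import Counter, defaultdict


def top_by_region(tweets, n):
    """tweets: iterable of (region, hashtags). Returns {region: set of top-n hashtags}."""
    freq = defaultdict(Counter)
    for region, tags in tweets:
        freq[region].update(set(tags))  # count each hashtag once per tweet
    return {r: {t for t, _ in c.most_common(n)} for r, c in freq.items()}


# Toy data; the report uses the top 300 hashtags per region.
tweets = [
    ("UK", ["stayhome", "nhs"]), ("UK", ["nhs", "ppe"]), ("UK", ["stayhome", "nhs"]),
    ("IT", ["stayhome", "conte"]), ("IT", ["conte", "salvini"]), ("IT", ["stayhome", "conte"]),
    ("FR", ["stayhome", "raoult"]), ("FR", ["raoult"]), ("FR", ["stayhome", "raoult"]),
]
tops = top_by_region(tweets, n=2)

# Hashtags that are in every region's top list.
shared = set.intersection(*tops.values())
# Hashtags that are only in the UK's top list.
uk_only = tops["UK"] - (tops["IT"] | tops["FR"])
```

Set operations like these only flag candidates; variations such as #boris and #borisjohnson would still need to be reconciled by hand.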

United Kingdom

Hashtag summary

Similarities

As with the global hashtags, the message to stay at home is very pertinent, as are the variations of the hashtags for the Coronavirus and COVID-19. China and its international relations are present, as are mentions of various countries in which the outbreak is occurring. Furthermore, attributing blame to national leaders is present in both the global dataset and the United Kingdom dataset, with hashtags like #trumpliesamericansdie in the former and #borisresign in the latter.

Differences

The most distinctive trait among the top hashtags in the United Kingdom is the presence of hashtags concerning the National Health Service. Among these hashtags, the shortage of personal protective equipment (PPE) is a particularly important topic. Unsurprisingly, there is also a focus on Brexit, as this has been at the forefront of British politics over the last few years.

Remarks

The presence of the hashtag #coronavirususa in the United Kingdom dataset may indicate that we have erroneously included United States tweets in the United Kingdom dataset. However, this may simply be Twitter users from the United States tweeting about the United States from within the United Kingdom. Also, people within the United Kingdom are often exposed to news from the United States and so it’s not unusual for topics relevant to the United States to be discussed within the United Kingdom.

Italy

Hashtag summary

Similarities

As with the global hashtags, the message to stay at home is very pertinent, as are the variations of the hashtags for the Coronavirus and COVID-19. China and its international relations are present, as are mentions of various countries in which the outbreak is occurring.

Differences

It is remarkable how many regions and cities are present among the Italian hashtags, along with representatives of these regions. Perhaps this is an indication of the strong regional identities of Italians themselves. This may be an artifact of the relatively recent unification of Italy into a single nation in 1861. This is of course pure speculation and would require very deep historical and sociological exploration to verify.

It’s also striking how large a role politics plays in the dataset, given that it is primarily about the Coronavirus outbreak in Italy. Particular politicians and political events appear in all the datasets, but in Italy they are a much more prominent feature, for example the two factions surrounding Giuseppe Conte and Matteo Salvini.

Remarks

The Italian hashtags are much more jumbled than the global, United Kingdom or French hashtags. That is, some hashtag groups contain spurious hashtags that are only tangentially related to the group. For example, the “Global Locations and News” group includes Boris Johnson, Brexit and the news website AGI, which are perhaps related to the concept of news about global locations but aren’t global locations themselves.

There are various potential reasons for this:

  1. The most obvious reason is that there may be a lack of available data so that the general trends of hashtag use aren’t as prominent as they should be, allowing lower probability events to seep into the top 300 hashtags.
  2. Finding hashtag groups depends on the co-occurrence of hashtags. Aspects of the dataset that cause a hashtag to co-occur with a larger number of other hashtags lead to less coherent groups. We have already established that regions and city names in Italy are very prominent in the dataset. These can be used in various different contexts, such as the location of a football match, the home town of a politician or the outbreak of a Coronavirus cluster. This may lead to less coherent hashtag groups.
  3. The Louvain Modularity algorithm itself may be less effective on this dataset. However, given that the algorithm worked well on the other datasets, and different parameters were explored, it’s more likely that the dataset is the culprit.
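The second reason can be illustrated with a toy calculation of the modularity score that the Louvain method tries to maximise. For a fixed partition, each community contributes its share of the total edge weight that lies inside it, minus the square of its share of the total degree. The sketch below uses invented hashtags and weights, not values from the dataset, and shows how a single place-name hashtag that co-occurs with every other hashtag lowers the score of an otherwise clean two-group partition.

```python
def modularity(edges, community):
    """Weighted modularity of a partition.

    edges: {(a, b): weight}, each undirected edge listed once, no self-loops.
    community: {node: label} covering every node that appears in edges.
    """
    m = sum(edges.values())  # total edge weight
    inside, degree = {}, {}
    for (a, b), w in edges.items():
        ca, cb = community[a], community[b]
        degree[ca] = degree.get(ca, 0) + w
        degree[cb] = degree.get(cb, 0) + w
        if ca == cb:
            inside[ca] = inside.get(ca, 0) + w
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Two clean topical groups (hypothetical hashtags), each a triangle of weight-1 edges.
politics = ["conte", "salvini", "governo"]
football = ["seriea", "juventus", "calcio"]
edges = {
    ("conte", "salvini"): 1, ("conte", "governo"): 1, ("salvini", "governo"): 1,
    ("seriea", "juventus"): 1, ("seriea", "calcio"): 1, ("juventus", "calcio"): 1,
}
community = {**{t: "politics" for t in politics}, **{t: "football" for t in football}}
q_clean = modularity(edges, community)

# Add a city hashtag that co-occurs with every hashtag in both groups.
hub_edges = dict(edges)
hub_edges.update({("milano", t): 1 for t in politics + football})
hub_community = {**community, "milano": "politics"}
q_hub = modularity(hub_edges, hub_community)
```

In this toy case the score drops from 0.5 to about 0.22 once the hub is added, which is consistent with the idea that widely used place names blur the boundaries between groups.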

France

Hashtag summary

Similarities

As with the global hashtags, the message to stay at home is very pertinent, as are the variations of the hashtags for the Coronavirus and COVID-19. China and its international relations are present, as are mentions of various countries in which the outbreak is occurring.

Differences

The most distinctive trait of the French hashtags is the inclusion of hydroxychloroquine as a talking point. This is unsurprising, as it was suggested by Didier Raoult, a French physician and microbiologist. Other than this, there is a stronger focus on businesses and how they are affected by the Coronavirus outbreak than in the other nations.

Conclusions

Hashtag groups found using the Louvain Modularity algorithm show clear and generally coherent commonalities in all datasets. This indicates that it is a suitable algorithm for identifying groups of hashtags.

There are several important commonalities and differences in the hashtag groups across the United Kingdom, Italy and France.

The lack of certain topics in the top hashtags does not indicate that topics are not