In my last blog post I referred to the use of a word cloud as a way of making data more visually appealing and ‘fun’. Today I’ll run through the process I use to generate word clouds and hopefully some of you will enjoy creating your own.
Why a word cloud?
Word clouds can be useful tools for identifying sentiment within a dataset or, as in my previous example, for summarising the Twitter discussion around a certain hashtag. When customers tweet about a product or a service, businesses can often gather useful feedback that both interests customers and gives managers a way of tracking the quality of the service provided. Famous TV advertising campaigns, like the Christmas campaigns by John Lewis, Tesco and Sainsbury’s, now start online, and by analysing the online reaction the companies investing in these expensive campaigns can begin to understand customer engagement before moving to mainstream media. A simple yet effective way to do this is to create word clouds of the online discussion, which could include tens of thousands of Tweets or Facebook comments.
Generating a word cloud using R
So how is the word cloud actually generated? A number of visualisation tools, such as Tableau, make generating word clouds easy. Programming them is not too difficult either if you use the right tools. Here I explain how to generate a word cloud for a specific Twitter hashtag using R. I’ve briefly summarised each step involved and provided the code as a download at the bottom of the page.
The first step is to acquire the Tweets you are interested in. For this, an application needs to be registered with Twitter, which then supplies a set of credentials that the code uses to access the Twitter feeds. The consumer and access tokens should be added to the application as part of the authentication process. In the example I’ve pulled the last 1500 Tweets, in English, using the hashtag #talkbigdata. However, you could of course select your Tweets in other ways.
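As a rough illustration of this step, here is a minimal sketch that assumes the twitteR package, which is my choice for the example and may not be what the downloadable code uses. The function name fetch_tweets and the environment variable names are hypothetical, and the credentials are the ones Twitter issues when you register an application.

```r
# Sketch only: assumes the twitteR package is installed.
# fetch_tweets and the environment variable names are illustrative.
fetch_tweets <- function(hashtag = "#talkbigdata", n = 1500, lang = "en") {
  if (!requireNamespace("twitteR", quietly = TRUE)) {
    stop("The twitteR package is needed for this sketch.")
  }

  # Authenticate with the consumer and access credentials from Twitter,
  # read here from environment variables rather than typed into the script
  twitteR::setup_twitter_oauth(
    consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET")
  )

  # Ask for up to n Tweets in the given language that match the hashtag
  tweets <- twitteR::searchTwitter(hashtag, n = n, lang = lang)

  # Each result is a status object; keep only the text of each Tweet
  vapply(tweets, function(tweet) tweet$getText(), character(1))
}

# Usage (needs valid credentials and network access):
# tweet_text <- fetch_tweets("#talkbigdata", n = 1500, lang = "en")
```

Keeping the credentials out of the script itself is a design choice for the sketch, so the code can be shared without exposing them.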
Once you have your list of Tweets there will be many words you’ll want to exclude. These include URLs and words such as “and”, “the”, “this” and “how”, which aren’t likely to mean a great deal to the person viewing your word cloud. You’ll also want to remove punctuation characters.
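The filtering described above could look something like the following sketch, which uses base R only. The sample Tweets, the short stop word list and the function name clean_tweets are all made up for illustration; a real stop word list would be much longer.

```r
# Hypothetical sample Tweets standing in for the downloaded ones
tweets <- c(
  "How big data is changing retail: https://example.com/a #talkbigdata",
  "This is the future of analytics, and it is exciting! #talkbigdata"
)

# A short, illustrative stop word list
stop_words <- c("and", "the", "this", "how", "is", "of", "it")

clean_tweets <- function(text, stop_words) {
  # Remove URLs first, while their punctuation is still intact
  text <- gsub("https?://[^[:space:]]+", " ", text)
  # Replace punctuation (including the # of a hashtag) with spaces
  text <- gsub("[[:punct:]]+", " ", text)
  # Split into individual words and drop empty strings
  words <- unlist(strsplit(text, "[[:space:]]+"))
  words <- words[nzchar(words)]
  # Drop stop words, comparing case-insensitively so "How" is caught too
  words[!tolower(words) %in% stop_words]
}

words <- clean_tweets(tweets, stop_words)
print(words)
```

URLs are removed before punctuation because stripping punctuation first would break a URL into fragments that no longer match the pattern.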
Once you’ve got your final list it’s time to tidy it up. I always convert all the words to lower case to ensure words stand out because of their relative importance (indicated in a word cloud by their size) rather than because of the letter case used. Then, using your final list of words, you need to create a frequency table to indicate how many times each word appears in your dataset and therefore how large it should appear in the word cloud.
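Here is a small base R sketch of the lower-casing and frequency table. The word list is invented for the example and the variable names are my own.

```r
# Hypothetical list of words left after filtering
words <- c("Data", "data", "cloud", "Big", "big", "data", "retail")

# Lower case first, so "Data" and "data" are counted as one word
words <- tolower(words)

# Count how often each distinct word occurs, most frequent first
counts <- sort(table(words), decreasing = TRUE)

# Store the result as a two-column frequency table
word_freq <- data.frame(
  word = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
print(word_freq)
```

Without the tolower() step, “Data” and “data” would be counted separately and each would appear smaller in the cloud than it should.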
Finally, the word cloud function within R takes your words and their frequencies and draws the word cloud.
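As an illustration, the sketch below assumes the wordcloud package, which is one common option and my assumption rather than something named above. The frequency table is made up, and the code only draws the cloud if the package is installed.

```r
# Hypothetical frequency table: one row per word with its count
word_freq <- data.frame(
  word = c("data", "big", "analytics", "retail"),
  freq = c(12, 9, 5, 2),
  stringsAsFactors = FALSE
)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(42)  # the layout is partly random; a seed makes it repeatable
  wordcloud::wordcloud(
    words        = word_freq$word,
    freq         = word_freq$freq,
    min.freq     = 1,      # include words that appear only once
    max.words    = 100,    # cap the number of words drawn
    random.order = FALSE   # place the most frequent words first, near the centre
  )
} else {
  message("Install the wordcloud package to draw the cloud.")
}
```

Raising min.freq is a quick way to thin out a crowded cloud when you have thousands of Tweets.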
Access the code
You can view the code I used to create the word cloud on this page as a download.
Although the sample code does produce a sample word cloud, it must be stressed that there is more than one way to write the code. There is a range of alternative toolkits within R that may provide a more complete application. One of those toolkits is the Natural Language Processing library, which may make the identification or filtering of relevant words more robust and flexible than the simple example I’ve provided.
The whole point of this blog post is to show just how easy and accessible the right tool can make the creation of a visualisation of big data.
Published 18 November 2016