Sentiment Analysis of Twitter Data (saotd)

Twitter Introduction

Recent years have witnessed the rapid growth of social media platforms in which users can publish their individual thoughts and opinions (e.g., Facebook, Twitter, Google+ and several blogs). The rise in popularity of social media has changed the world wide web from a static repository to a dynamic forum for anyone to voice their opinion across the globe. This new dimension of User Generated Content opens up a new and dynamic source of insight to individuals, organizations and governments.

Social network sites, or platforms, are defined as web-based services that allow individuals to:

  • Construct a public or semi-public profile within a bounded system.
  • Articulate a list of other users with whom they share a connection.
  • View and traverse their list of connections and those made by others within the system.

The nature and nomenclature of these connections may vary from site to site.

This package, saotd, focuses on Twitter data because of the platform's widespread global acceptance. Harvested data, analyzed for sentiment, can provide powerful insight into a population. This insight can help organizations better understand their target population. The package allows a user to acquire tweets through the public Twitter Application Programming Interface (API).

The saotd package is broken down into five different phases:

  • Acquire
  • Explore
  • Topic Analysis
  • Sentiment Calculation
  • Visualization

The image below shows the saotd package workflow, which carries an analysis from the Twitter API through to completion.

Packages

library(saotd)
library(dplyr)
library(stringr)
library(knitr)

Acquire

To explore the data manipulation functions of saotd, we will use the built-in dataset saotd::raw_tweets.

However, if you want to acquire your own tweets, you will first have to:

  1. Create a Twitter account or sign in to an existing account.

  2. Use your Twitter login to sign in to Twitter Developers.

  3. Navigate to My Applications.

  4. Fill out the new application form.

  5. Create an access token.

    • Record your Twitter access keys and tokens.

With these steps complete, you now have access to the Twitter API.

To acquire your own dataset of tweets you can use the saotd::tweet_acquire function and insert your consumer key, consumer secret key, access token and access secret key gained from the Twitter Developers page. You will also need to select the #hashtags you are interested in and the number of tweets requested per #hashtag.

# Keys and tokens from the Twitter Developers page (placeholders shown here).
# Note: the call below reads them from environment variables of the same
# names (for example, set in an .Renviron file) rather than from these objects.
consumer_api_key <- "XXXXXXXXXXXXXXXXXXXXXXXXX"
consumer_api_secret_key <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token <- "XXXXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token_secret <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

# Hashtags of interest; the call below queries a single hashtag.
hashtags <- c("#job", "#Friday", "#fail", "#icecream", "#random", "#kitten", "#airline")

tweets <- tweet_acquire(
  twitter_app = "twitter_app",
  consumer_api_key = Sys.getenv('consumer_api_key'),
  consumer_api_secret_key = Sys.getenv('consumer_api_secret_key'),
  access_token = Sys.getenv('access_token'),
  access_token_secret = Sys.getenv('access_token_secret'),
  query = "#icecream",
  num_tweets = 100,
  distinct = TRUE)
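Because the call above reads the credentials with Sys.getenv, it can help to confirm that every variable is set before calling the API. The following is a minimal sketch in base R; the helper name missing_credentials is hypothetical and is not part of saotd.

```r
# Hypothetical helper (not part of saotd): returns the names of the
# environment variables that are unset or empty.
missing_credentials <- function(var_names) {
  values <- Sys.getenv(var_names, unset = "")
  var_names[!nzchar(values)]
}

credential_names <- c(
  "consumer_api_key",
  "consumer_api_secret_key",
  "access_token",
  "access_token_secret")

not_set <- missing_credentials(credential_names)
if (length(not_set) > 0) {
  message("Missing credentials: ", paste(not_set, collapse = ", "))
}
```

Keeping the keys in environment variables, rather than in the script, avoids committing them to version control by accident.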

Explore

You can acquire your own data or use the dataset included with the package. We will be using the included data raw_tweets. This dataset was acquired from a Twitter US Airline Sentiment Kaggle competition, from December 2017. The dataset contains 14,487 tweets from 6 different hashtags (2,604 x #American, 2,220 x #Delta, 2,420 x #Southwest, 3,822 x #United, 2,913 x #US Airways, 504 x #Virgin America).

set.seed(4321)

data("raw_tweets")
TD <- raw_tweets %>% 
  dplyr::sample_n(size = 5000, 
                  replace = TRUE)

The first tweet of the dataset is: “@SouthwestAir I filled in the form on the website too. Darn it all. I guess I’ll just have to cross my fingers.”, and when it is cleaned and tidied it becomes:

TD_Tidy <- 
  saotd::tweet_tidy(
    DataFrame = TD)

TD_Tidy$Token[1:9] %>% 
  knitr::kable("html")

x
southwestair
filled
form
website
darn
guess
ill
cross
fingers

The cleaning process removes “@”, “#” and “RT” symbols, web links, punctuation, emojis, and stop words such as “the” and “of”.
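To make the cleaning steps concrete, here is a minimal base R sketch of the same idea. It is not the implementation of saotd::tweet_tidy; the function name clean_tweet and the short stop word list are assumptions made for illustration only.

```r
# Illustrative stop word list (an assumption); a real analysis would use a
# much fuller lexicon.
stop_words_sketch <- c("i", "in", "the", "on", "too", "it", "all",
                       "just", "have", "to", "my", "of", "a", "and", "rt")

# Hypothetical helper: lower-case, strip links and symbols, tokenize,
# then drop stop words.
clean_tweet <- function(text, stop_words = stop_words_sketch) {
  text <- tolower(text)
  text <- gsub("http[^[:space:]]+", " ", text)   # drop web links
  text <- gsub("[^a-z0-9[:space:]]", "", text)   # drop @, #, punctuation, emojis
  tokens <- strsplit(text, "[[:space:]]+")[[1]]  # split on whitespace
  tokens <- tokens[nzchar(tokens)]               # drop empty strings
  tokens[!tokens %in% stop_words]                # drop stop words and "rt"
}

clean_tweet("@SouthwestAir I filled in the form on the website too. Darn it all. I guess I'll just have to cross my fingers.")
```

With this small stop word list the sketch returns the same nine tokens shown in the table above. Note that removing the apostrophe is what turns "I'll" into "ill".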

We will now investigate Uni-Grams, Bi-Grams and Tri-Grams.

saotd::unigram(DataFrame = TD) %>% 
  dplyr::top_n(10) %>% 
  knitr::kable("html", caption = "Twitter data Uni-Grams")
## Selecting by n
Twitter data Uni-Grams
word n
united 1454
flight 1314
usairways 1073
americanair 930
southwestair 860
jetblue 813
cancelled 380
service 319
time 288
im 270

saotd::bigram(DataFrame = TD) %>% 
  dplyr::top_n(10) %>% 
  knitr::kable("html", caption = "Twitter data Bi-Grams")
Twitter data Bi-Grams
word1 word2 n
customer service 198
cancelled flightled 178
late flight 85
cancelled flighted 80
late flightr 52
cancelled flight 49
2 hours 40
usairways americanair 38
3 hours 34
flight booking 31

saotd::trigram(DataFrame = TD) %>% 
  dplyr::top_n(10) %>% 
  knitr::kable("html", caption = "Twitter data Tri-Grams")
Twitter data Tri-Grams
word1 word2 word3 n
NA NA NA 54
cancelled flightled flight 20
flight cancelled flightled 17
worst customer service 16
poor customer service 10
customer service rep 8
hours late flightr 8
southwestair flight cancelled 8
cancelled flighted flight 7
cancelled flightled flights 7
flight cancelled flighted 7
hours late flight 7
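For readers unfamiliar with N-Grams, the counting idea can be sketched in a few lines of base R. This is an illustration only, not how saotd computes its tables; the helper count_ngrams is hypothetical, and it works on a single token vector, whereas a real analysis would build N-Grams within each tweet.

```r
# Hypothetical helper: counts the n-grams in one vector of tokens.
count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) {
    return(table(character(0)))
  }
  # Every position where a full n-gram can start.
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1))
  # Most frequent n-grams first.
  sort(table(grams), decreasing = TRUE)
}

tokens <- c("late", "flight", "customer", "service", "late", "flight")
count_ngrams(tokens, 2)
```

Here "late flight" is the only Bi-Gram that occurs twice, so it is listed first.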

Now that we have the N-Grams, we can see that “cancelled” and “flight” refer to a cancelled flight and may be a good set of words to merge into a single term. Additionally, singular and plural forms of a word (for example, pet and pets) could also be merged to observe more uniqueness in the data.

TD_Merge <- 
  merge_terms(
    DataFrame = TD, 
    term = "cancelled flight", 
    term_replacement = "cancelled_flight")
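One thing to watch when merging terms is whether the pattern also matches inside longer words. The sketch below uses base R gsub to show the difference; that merge_terms behaves like the first call is an assumption, not something confirmed here.

```r
tweets <- c("Cancelled Flight again", "Cancelled Flightled my trip")

# Plain pattern replacement: also matches inside "Flightled".
plain <- gsub("cancelled flight", "cancelled_flight", tweets,
              ignore.case = TRUE)

# Word-boundary variant: only whole-word matches are replaced.
bounded <- gsub("\\bcancelled flight\\b", "cancelled_flight", tweets,
                ignore.case = TRUE, perl = TRUE)

plain
bounded
```

In the plain version the second tweet becomes "cancelled_flightled my trip", while the word-boundary version leaves it unchanged. Which behavior is preferable depends on whether variants such as "flightled" should be folded into the merged term.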

Now that the terms have been merged, the new N-Grams are re-computed.

saotd::unigram(DataFrame = TD_Merge) %>% 
  dplyr::top_n(10) %>% 
  knitr::kable("html", caption = "Twitter data Uni-Grams")
Twitter data Uni-Grams
word n
united 1454
flight 1265
usairways 1073
americanair 930
southwestair 860
jetblue 813
service 319
time 288
im 270
customer 263

saotd::bigram(DataFrame = TD_Merge) %>% 
  dplyr::top_n(10) %>% 
  knitr::kable("html", caption = "Twitter data Bi-Grams")
Twitter data Bi-Grams
word1 word2 n
customer service 198
late flight 85
late flightr 52
2 hours 40
usairways americanair 38
3 hours 34
flight booking 31
gate agent 29
united im 26
usairways flight 23

saotd::trigram(DataFrame = TD_Merge) %>% 
  dplyr::top_n(10) %>% 
  knitr::kable("html", caption = "Twitter data Tri-Grams")
Twitter data Tri-Grams
word1 word2 word3 n
NA NA NA 54
worst customer service 16
poor customer service 10
customer service rep 8
hours late flightr 8
hours late flight 7
30 min late 6
cent latinasciilatinasciilatinascii cent 6
customer service desk 6
jetblue flight delayed 6
min late flight 6
southwestair flight cancelledflightled 6

Now we can look at Bi-Gram Networks.

TD_Bigram <- saotd::bigram(DataFrame = TD_Merge)

saotd::bigram_network(
  BiGramDataFrame = TD_Bigram,
  number = 30,
  layout = "fr",
  edge_color = "blue",
  node_color = "black",
  node_size = 3,
  set_seed = 1234)

Additionally we can observe the Correlation Network.

TD_Corr <- 
  saotd::word_corr(
    DataFrameTidy = TD_Tidy, 
    number = 100, 
    sort = TRUE)

saotd::word_corr_network(
  WordCorr = TD_Corr, 
  Correlation = .1, 
  layout = "fr", 
  edge_color = "blue", 
  node_color = "black", 
  node_size = 1)
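As a rough illustration of what a word correlation measures, the sketch below computes the Pearson correlation between binary "word appears in tweet" indicators in base R. Treating the correlation this way is an assumption about the general technique; it is not taken from the saotd::word_corr source, and the toy tweets are made up.

```r
# Toy tokenized tweets (made-up data for illustration).
tweet_tokens <- list(
  c("late", "flight"),
  c("customer", "service"),
  c("late", "flight", "service"),
  c("customer", "service", "rep"))

vocabulary <- sort(unique(unlist(tweet_tokens)))

# One row per tweet, one column per word: 1 if the word appears, else 0.
presence <- t(vapply(
  tweet_tokens,
  function(tokens) as.numeric(vocabulary %in% tokens),
  numeric(length(vocabulary))))
colnames(presence) <- vocabulary

# Pairwise correlation between the word indicator columns.
word_cor <- cor(presence)
round(word_cor["late", "flight"], 2)
round(word_cor["customer", "late"], 2)
```

Words that always appear together ("late" and "flight") correlate at 1, and words that never share a tweet in this toy data ("customer" and "late") correlate at -1. A network plot would then keep only the pairs above a chosen threshold.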

Sentiment Calculation

Now that the data has been explored, we will compute the sentiment scores for the hashtags.

TD_Scores <- 
  saotd::tweet_scores(
    DataFrameTidy = TD_Tidy,
    HT_Topic = "hashtag")
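The score tables later in this section report a negative count, a positive count, and a TweetSentimentScore equal to positive minus negative. A minimal base R sketch of that arithmetic follows; the helper score_tokens and the two tiny word lists are hypothetical stand-ins for a real sentiment lexicon.

```r
# Hypothetical helper: counts lexicon hits and returns the net score.
score_tokens <- function(tokens, positive_words, negative_words) {
  positive <- sum(tokens %in% positive_words)
  negative <- sum(tokens %in% negative_words)
  c(negative = negative,
    positive = positive,
    TweetSentimentScore = positive - negative)
}

# Tiny illustrative lexicon (an assumption, not the Bing lexicon itself).
positive_words <- c("love", "great", "thanks")
negative_words <- c("worst", "lost", "delayed")

score_tokens(c("worst", "customer", "service", "lost", "love"),
             positive_words, negative_words)
```

In this example the tweet has two negative words and one positive word, giving a net score of -1.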

With the scores computed we can then observe the positive and negative words within the dataset.

saotd::posneg_words(
  DataFrameTidy = TD_Tidy, 
  num_words = 10)
## Selecting by n

As an example, we can see that the negative term “fail” is dwarfing all other responses. If we would like to remove “fail”, we can easily do so.

saotd::posneg_words(
  DataFrameTidy = TD_Tidy, 
  num_words = 10, 
  filterword = "fail")
## Selecting by n

We can see the most positive hashtag tweets within the data set.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")
## # A tibble: 6 × 10
##   text    method hashtags created_at key   negative positive TweetSentimentScore
##   <chr>   <chr>  <chr>    <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @Ameri… Bing   American 2015-02-2… polp…        0       12                  12
## 2 @South… Bing   Southwe… 2015-02-1… waln…        0       10                  10
## 3 @South… Bing   Southwe… 2015-02-2… Nico…        0        9                   9
## 4 @South… Bing   Southwe… 2015-02-2… Walt…        0        9                   9
## 5 @unite… Bing   United   2015-02-2… Core…        0        9                   9
## 6 @JetBl… Bing   Delta    2015-02-2… Dres…        0        6                   6
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

We can also see the most negative hashtag tweets within the data set.

saotd::tweet_min_scores(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")
## # A tibble: 6 × 10
##   text    method hashtags created_at key   negative positive TweetSentimentScore
##   <chr>   <chr>  <chr>    <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @JetBl… Bing   Delta    2015-02-1… Grac…       10        0                 -10
## 2 @USAir… Bing   US Airw… 2015-02-1… thec…        9        0                  -9
## 3 @USAir… Bing   US Airw… 2015-02-2… lj_v…        9        0                  -9
## 4 @JetBl… Bing   Delta    2015-02-1… Cure…        8        0                  -8
## 5 @South… Bing   Southwe… 2015-02-2… Dead…        8        0                  -8
## 6 @unite… Bing   United   2015-02-2… mace…        8        0                  -8
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

Furthermore, if we wanted to observe the most positive or negative scores associated with a specific hashtag, we could also do that.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Scores, 
  HT_Topic = "hashtag", 
  HT_Topic_Selection = "United")
## # A tibble: 6 × 10
##   text    method hashtags created_at key   negative positive TweetSentimentScore
##   <chr>   <chr>  <chr>    <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @unite… Bing   United   2015-02-2… Core…        0        9                   9
## 2 @unite… Bing   United   2015-02-1… vmnk…        0        6                   6
## 3 @unite… Bing   United   2015-02-1… sash…        0        4                   4
## 4 @unite… Bing   United   2015-02-1… SFWW…        0        4                   4
## 5 @unite… Bing   United   2015-02-1… mcho…        2        6                   4
## 6 @unite… Bing   United   2015-02-1… Greg…        0        4                   4
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>
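The selection logic behind these tables can be sketched in base R as "optionally filter to one hashtag, sort by score, keep the top rows". The helper top_scores and the small data frame below are hypothetical; they only mirror the column names seen in the output above.

```r
# Made-up scores using the column names from the output above.
scores <- data.frame(
  hashtag = c("United", "Delta", "United", "Delta", "United"),
  TweetSentimentScore = c(9, 6, -8, -10, 4),
  stringsAsFactors = FALSE)

# Hypothetical helper: highest-scoring rows, optionally for one hashtag.
top_scores <- function(scores, n = 6, selection = NULL) {
  if (!is.null(selection)) {
    scores <- scores[scores$hashtag == selection, , drop = FALSE]
  }
  ranked <- scores[order(scores$TweetSentimentScore, decreasing = TRUE), ,
                   drop = FALSE]
  head(ranked, n)
}

top_scores(scores, n = 2)
top_scores(scores, n = 2, selection = "United")
```

Sorting in increasing order instead would give the most negative tweets.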

Topic Analysis

If we were interested in conducting a topic analysis on the tweets we would then determine the number of latent topics within the tweet data.

saotd::number_topics(
  DataFrame = TD, 
  num_cores = 4L, 
  min_clusters = 2, 
  max_clusters = 12, 
  skip = 1, 
  set_seed = 1234)

The number of topics plot shows that between 5 and 7 latent topics reside within the dataset. For this example we could select between 5 and 7 topics to categorize this data. In this case 5 topics will be selected to continue the analysis.

TD_Topics <- 
  saotd::tweet_topics(
    DataFrame = TD, 
    clusters = 5, 
    method = "Gibbs", 
    set_seed = 1234, 
    num_terms = 10)

In a markdown product the topics table does not print clearly, unlike when it is printed in the console. However the words associated with each topic can be observed in the below table.

Number  Topic 1   Topic 2    Topic 3      Topic 4        Topic 5
1       united    usairways  americanair  southwestair   flight
2       service   time       usairways    jetblue        cancelled
3       customer  plane      amp          im             hours
4       dont      gate       hold         virginamerica  flights
5       bag       jetblue    call         guys           2
6       check     hour       phone        fly            delayed
7       luggage   waiting    wait         airline        flightled
8       dm        delay      ive          flying         late
9       lost      people     cange        seat           3
10      worst     minutes    day          love           weather

One of the challenges of using a topic model is selecting the correct number of topics. As we can see in the above table, we went from 6 hashtags to 5 different topics.

While this may not be the best example to use, we will continue the topic modeling example. We would first want to rename the topics into something that makes sense. In this case, Topic 1 could be luggage, Topic 2 could be gate_delay, Topic 3 could be customer_service, Topic 4 could be enjoy, and Topic 5 could be other_delay. These topics were chosen by observing the words associated with each topic. This selection could be different depending on experience and a deeper understanding of the topics.

We would then want to rename the topics in the dataframe:

TD_Topics <- TD_Topics %>% 
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^1$", "luggage")) %>% 
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^2$", "gate_delay")) %>% 
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^3$", "customer_service")) %>% 
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^4$", "enjoy")) %>% 
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^5$", "other_delay"))
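As an alternative to chaining five replacements, the same renaming can be sketched with a named lookup vector in base R. This is only an illustration on a toy vector; it assumes the topic identifiers are the numbers 1 to 5, either numeric or character.

```r
# Names are the topic numbers, values are the new labels.
topic_labels <- c(
  "1" = "luggage",
  "2" = "gate_delay",
  "3" = "customer_service",
  "4" = "enjoy",
  "5" = "other_delay")

# Toy topic assignments (made-up data).
topics <- c(1, 3, 5, 2, 4, 1)

# Index the lookup vector by name to translate every topic at once.
renamed <- unname(topic_labels[as.character(topics)])
renamed
```

A lookup vector keeps all the labels in one place, which makes them easier to revise when the topics are reinterpreted.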

Next we would want to tidy and then score the new topic dataset.

TD_Topics_Tidy <- 
  saotd::tweet_tidy(
    DataFrame = TD_Topics)

TD_Topics_Scores <- 
  saotd::tweet_scores(
    DataFrameTidy = TD_Topics_Tidy,
    HT_Topic = "topic")

We can see the most positive topic tweets within the data set.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")
## # A tibble: 6 × 10
##   text       method Topic created_at key   negative positive TweetSentimentScore
##   <chr>      <chr>  <chr> <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @American… Bing   lugg… 2015-02-2… polp…        0       12                  12
## 2 @Southwes… Bing   lugg… 2015-02-1… waln…        0       10                  10
## 3 @Southwes… Bing   lugg… 2015-02-2… Nico…        0        9                   9
## 4 @Southwes… Bing   lugg… 2015-02-2… Walt…        0        9                   9
## 5 @united W… Bing   lugg… 2015-02-2… Core…        0        9                   9
## 6 @JetBlue … Bing   enjoy 2015-02-2… Dres…        0        6                   6
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

We can also see the most negative topic tweets within the data set.

saotd::tweet_min_scores(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")
## # A tibble: 6 × 10
##   text       method Topic created_at key   negative positive TweetSentimentScore
##   <chr>      <chr>  <chr> <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @JetBlue … Bing   enjoy 2015-02-1… Grac…       10        0                 -10
## 2 @USAirway… Bing   gate… 2015-02-1… thec…        9        0                  -9
## 3 @USAirway… Bing   cust… 2015-02-2… lj_v…        9        0                  -9
## 4 @JetBlue … Bing   enjoy 2015-02-1… Cure…        8        0                  -8
## 5 @Southwes… Bing   cust… 2015-02-2… Dead…        8        0                  -8
## 6 @united y… Bing   cust… 2015-02-2… mace…        8        0                  -8
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

Furthermore if we wanted to observe the most positive or negative scores associated with a specific topic we could also do that.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic",
  HT_Topic_Selection = "luggage")
## # A tibble: 6 × 10
##   text       method Topic created_at key   negative positive TweetSentimentScore
##   <chr>      <chr>  <chr> <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @American… Bing   lugg… 2015-02-2… polp…        0       12                  12
## 2 @Southwes… Bing   lugg… 2015-02-1… waln…        0       10                  10
## 3 @Southwes… Bing   lugg… 2015-02-2… Nico…        0        9                   9
## 4 @Southwes… Bing   lugg… 2015-02-2… Walt…        0        9                   9
## 5 @united W… Bing   lugg… 2015-02-2… Core…        0        9                   9
## 6 @Southwes… Bing   lugg… 2015-02-2… woaw…        0        6                   6
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

Visualizations

Hashtags

Now we will begin visualizing the hashtag data. The distribution of the sentiment scores can be found in the below plot.

saotd::tweet_corpus_distribution(
  DataFrameTidyScores = TD_Scores, 
  color = "black", 
  fill = "white")

Additionally, if we wanted to see the score distribution for each hashtag, we can find it below.

saotd::tweet_distribution(
  DataFrameTidyScores = TD_Scores, 
  HT_Topic = "hashtag", 
  bin_width = 1, 
  color = "black", 
  fill = "white")
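With a bin width of 1 and integer sentiment scores, each bar of the distribution corresponds to the number of tweets with that exact score. The base R sketch below tabulates made-up scores in that way; it does not reproduce the plotting functions themselves.

```r
# Made-up integer sentiment scores.
scores <- c(-2, -1, -1, 0, 0, 0, 1, 3)

# Count tweets per score, keeping empty bins between the minimum and maximum.
score_counts <- table(factor(scores, levels = min(scores):max(scores)))
score_counts
```

Using a factor with explicit levels keeps the empty bin for a score of 2, which a plain table(scores) would drop.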

We can also observe the hashtag distributions as a Box plot.

saotd::tweet_box(
  DataFrameTidyScores = TD_Scores, 
  HT_Topic = "hashtag")

We can also view the distributions as a Violin plot. The chevrons in each violin plot denote the median of the data and provide a quick reference point to see if a hashtag is generally positive or negative. For example, a hashtag whose chevron sits below zero has a generally negative sentiment, whereas a hashtag whose chevron sits above zero has a generally positive sentiment.

saotd::tweet_violin(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")
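The medians that the chevrons mark can also be computed directly. The following base R sketch uses a made-up data frame that borrows the hashtag and TweetSentimentScore column names; it is an illustration rather than part of the package workflow.

```r
# Made-up scores for two hashtags.
scores <- data.frame(
  hashtag = c("United", "United", "United", "Delta", "Delta", "Delta"),
  TweetSentimentScore = c(-2, -1, 3, 1, 2, 4),
  stringsAsFactors = FALSE)

# Median sentiment score per hashtag.
medians <- tapply(scores$TweetSentimentScore, scores$hashtag, median)
medians
```

In this toy data the median for "Delta" is 2 and the median for "United" is -1, so the first would read as generally positive and the second as generally negative.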

One of the more interesting ways to visualize the Twitter data is to observe the change in sentiment over time. This dataset was acquired on a single day and therefore some of the hashtags did not overlap days. However some did and we can see the change in sentiment scores through time.

saotd::tweet_time(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")
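A sentiment-over-time view rests on aggregating scores by date. The base R sketch below averages made-up scores per day; the choice of the mean as the daily summary is an assumption, and the column names simply mirror those in the score tables.

```r
# Made-up timestamps and scores.
scores <- data.frame(
  created_at = c("2015-02-17 10:00:00",
                 "2015-02-17 18:30:00",
                 "2015-02-18 09:15:00"),
  TweetSentimentScore = c(-2, 4, 3),
  stringsAsFactors = FALSE)

# Keep only the calendar date, then average the scores for each day.
scores$date <- as.Date(substr(scores$created_at, 1, 10))
daily <- aggregate(TweetSentimentScore ~ date, data = scores, FUN = mean)
daily
```

Here the two tweets from 17 February average to 1, and the single tweet from 18 February keeps its score of 3.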

Finally, if a Twitter user has not disabled georeferencing data, the location of the tweet can be observed. However, in many cases this may not be very insightful because of the lack of data.

Topics

Now we will begin visualizing the topic data. The distribution of the sentiment scores can be found in the below plot.

saotd::tweet_corpus_distribution(
  DataFrameTidyScores = TD_Topics_Scores, 
  color = "black", 
  fill = "white")

Additionally, if we wanted to see the score distribution for each topic, we can find it below.

saotd::tweet_distribution(
  DataFrameTidyScores = TD_Topics_Scores, 
  HT_Topic = "topic", 
  bin_width = 1, 
  color = "black", 
  fill = "white")

We can also observe the topic distributions as a Box plot.

saotd::tweet_box(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")

We can also view the distributions as a Violin plot. The chevrons in each violin plot denote the median of the data and provide a quick reference point to see if a topic is generally positive or negative. For example, a topic whose chevron sits below zero has a generally negative sentiment, whereas a topic whose chevron sits above zero has a generally positive sentiment.

saotd::tweet_violin(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")

One of the more interesting ways to visualize the Twitter data is to observe the change in sentiment over time. This dataset was acquired on a single day and therefore some of the topics did not overlap days. However, some did, and we can see the change in sentiment scores through time.

saotd::tweet_time(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")