Sentiment Analysis on Twitter Data

In this article, I'll go through the steps of carrying out sentiment analysis. Sentiment analysis is a natural language processing (NLP) method that can be used to make assessments about the positive, negative, or neutral values of data. Sentiment analysis is often performed on textual data like customer reviews and tweets for example to help businesses monitor how their customer/userbase feels about their product and thus make adjustments to their products, brand as needed. I'll show the process of tokenizing data, getting collection frequency plots, preparing the data for modeling, visualizing the dataset with wordclouds, and using VADER for polarity classification.

Rough Outline for Sentiment Analysis using NLP

Installing NLTK and Downloading the Data
Tokenizing the Data
Normalizing the Data
Removing Noise from the Data
- Converting Tokens to a Dictionary
- Splitting the Dataset for Training and Testing the Model
Determining Word Density
Preparing Data for the Model
- Contructing dictionary
- Splitting data into training and testing sets
Building and Testing the Model
Put tweets into dataframe (this could be done earlier)
Get polarity scores via VADER
Data visualizations (Word Clouds, frequency plots, ...)

Modules used in this work

The following modules are used in sentiment analysis applications.

nltk : Used for natural language processing.
re : Used for regular expression operations.
string : Used for common string operations

The API documentation for each of these modules can be found here:

In [224]:

#Stuff for general sentiment analysis
import nltk
from nltk.corpus import twitter_samples,stopwords
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import FreqDist, classify, NaiveBayesClassifier

#Stuff for VADER
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

#Stuff for dealing with strings and regular expressions
import re, string, random

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\vmurc\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

Step 1: Downloading data from NLTK

This section will delineate the basic steps of loading and preliminary inspection of our data using NLTK.

Loading data with NLTK

The first thing that needs to be done is load the data onto our jupyter notebook. NLTK has a download method from which we can extract a variety of data from a variety of sources. You can find the complete list of datasets available to NLTK here: https://www.nltk.org/nltk_data/

Here I'll be downloading the datasets shown below

In [2]:

#Downloading the data
nltk.download('twitter_samples') #30000 tweets. 5000 positive, 5000 negative. Rest are neutral.
nltk.download('punkt') #Pretrained model to tokenize words
nltk.download('wordnet') #Lexical database to help determine base word
nltk.download('averaged_perceptron_tagger') #Used to determine context of word in sentence
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\vmurc\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vmurc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\vmurc\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\vmurc\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\vmurc\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vmurc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Out[2]:

True

Step 2 — Tokenizing the Data

Now that we have downloaded the data we want, we have to tokenize it.

The process of tokenizing splits strings into smaller parts called tokens. Think of a token as the smallest unit of text that can be parsed by the computer (analogues to phonemes and morphemes). This is done because computers don't know how to accurately process language in its original form.

Tokens can be any sequence of words, emojis, hashtags, links, or an individual character.

One of the datasets that was download was the "twitter_samples" dataset which is made up of 30,000 tweets. 5000 tweets have negative sentiments associated with them, 5000 tweets have positive sentiments associated with them, and 20000 tweets have no sentiments associated with them. The data associated with each of these 3 sentiments (positive, negative, neutral) will be stored in the variables below:

In [3]:

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

Before tokenizing the data, we need a pre-trained model that will our subsequent tokenizing efforts. In this example, the pre-trained model will be the "punkt" resource which was downloaded earlier. Now we can tokenize the data using the tokenized method which produces the array below

In [371]:

tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(tweet_tokens[:5])

[['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)'], ['@Lamb2ja', 'Hey', 'James', '!', 'How', 'odd', ':/', 'Please', 'call', 'our', 'Contact', 'Centre', 'on', '02392441234', 'and', 'we', 'will', 'be', 'able', 'to', 'assist', 'you', ':)', 'Many', 'thanks', '!'], ['@DespiteOfficial', 'we', 'had', 'a', 'listen', 'last', 'night', ':)', 'As', 'You', 'Bleed', 'is', 'an', 'amazing', 'track', '.', 'When', 'are', 'you', 'in', 'Scotland', '?', '!'], ['@97sides', 'CONGRATS', ':)'], ['yeaaaah', 'yippppy', '!', '!', '!', 'my', 'accnt', 'verified', 'rqst', 'has', 'succeed', 'got', 'a', 'blue', 'tick', 'mark', 'on', 'my', 'fb', 'profile', ':)', 'in', '15', 'days']]

This tokenization returns the content of a bunch of tweets from that dataset as an array. As can be seen, the array is composed of a bunch of words and emojis. In addition to this, many of these entries also have unwanted characters within them (i.e., @ and _ ). These characters will need to be removed as part of the data cleanup.

Step 3 — Normalizing the Data

Words can be expressed in a variety of ways to convey meaning. For instance, ate and eaten are different tenses of the verb eat. An analysis may require these different tenses to be converted into their base, fundamental, canonical form. This process is known as normalization.

Normalization is helpful because it allows us to group words together that have the same overarching meaning but have different written forms. You can think of the normalization process of a sort of clustering method but applied to words. Two methods that are commonly employed for this are stemming and lemmatization. The differences between the two are shown in the image below:

In stemming we are removing affixes. An affix is a morpheme that is added to a word to change its meaning. Common affixes are prefixes (an affix that is added to the beginning of a word) and suffixes (an affix that is added at the end of a word). Stemming is useful when dealing with simple verb forms and is thus a bit of a crude heuristic process.

In lemmatization, the algorithm normalizes a word within the context of vocabulary and through the use of morphological analysis in order to produce a lemma. A lemma in morphology is known as the canonical or dictionary form of a set of word forms. This lemmatization process allows different words to be indexed,mapped, or traced back to a single word. For instance, speak, speaks, spoke, spoken and speaking are all forms of speak (known as the lexeme).

As is the case with algorithms, one must always consider the tradeoffs between speed and accuracy. Stemming is generally faster than lemmatization but is also less accurate. Lemmatization is slower but generates more accurate results. I'll be using lemmatization in this demo.

Now that normalization schemes have been established let's determine the context of each word in the text. This process will give a tag (tagging algorithm) to each word that allows words to be related to one another. This is done via the pos_tag function in nltk.

In [5]:

tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(pos_tag(tweet_tokens[0]))

[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]

The output of this function is an array of tuples. Each array element is composed of a word in our tweet list and a tag value (i.e. JJ, NNP, etc.). You can find the meaning of each of these tags here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

Some of the tags found in the twitter sample are listed below with their corresponding meaning:

NNP: Noun, proper, singular
NN: Noun, common, singular or mass
IN: Preposition or conjunction, subordinating
VBG: Verb, gerund or present participle
VBN: Verb, past participle

Therefore, through these tags we can determine whether a word is a noun or a verb and so on and so forth. The function below will lemmatize a sentence and assigns each word as either a verb or a noun. The function below can be readily modified to incorporate things like adjectives, pronouns ,symbols and other things.

In [23]:

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

print(lemmatize_sentence(tweet_tokens[0]))

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']

From the output, we can see that some of the words on the original tweet changed after applying the lemmatizer. For example, the word being changed to be amd the word members was changed to member. Now that we have a way to normalize our data, we can start removing noise from the data.

Step 4- Removing Noise from Text Data

Noise in any kind of data can be defined as any unwanted signal. When dealing with text data for sentiment analysis, noise will depend on what features of the data do not add meaning or information to our dataset and must be chosen carefully.

A word that can be typically considered as text noise is known as a stop word. Stop words are the most common words in a language and are generally filtered out from language processing. Some examples of common stop words in English are words like "the", "is", "for","was" and "a". You can also define the stop words for your particular dataset by building a stop list based on the collection frequency of the words. Collection frequency refers to the total number of times a term appears in a document. The words with highest collection frequency are generally the stop words.

In addition to removing stop words from our dataset, other sources of noise that will be removed from our Twitter data are:

Hyperlinks
Twitter Username handles
Punctuation
Special Characters

To remove hyperlinks, we'll use regular expressions that will replace any strings that begin with http:// or https:// with an empty string using the .sub method.

re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)

To remove twitter handles we'll remove stuff that begins with @

re.sub("(@[A-Za-z0-9_]+)","", token)

To remove punctuation we can use the string.punctuation method.

To remove stop words we can use the stop words dictionary built into nltk

stop_words = stopwords.words('english')

Putting it all together, allows us to define the denoising function below:

In [8]:

stop_words = stopwords.words('english')

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub("(@[A-Za-z0-9_]+)","", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

Let's see what happens after running our denoising function

In [9]:

print(remove_noise(tweet_tokens[0], stop_words))

['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

Cool! We've succesfully removed twitter handles from our data. Stop words have also been removed. Notice how the words have also all been turned to lowercase.

Let's apply our denoising function to all the tweets in our data now using the code below.

In [85]:

#Clean up of all tweets
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

As a sanity check let's look at the original tweet and then the denoised tweet:

In [86]:

print('This is the tweet that\'s being denoised')
print(positive_tweet_tokens[300])

This is the tweet that's being denoised
['Stats', 'for', 'the', 'day', 'have', 'arrived', '.', '2', 'new', 'followers', 'and', 'NO', 'unfollowers', ':)', 'via', 'http://t.co/xxlXs6xYwe', '.']

In [88]:

print('This is the denoised tweet:')
print(positive_cleaned_tokens_list[300])

This is the denoised tweet:
['stats', 'day', 'arrive', '2', 'new', 'follower', 'unfollowers', ':)', 'via']

Something to keep in mind here, is that words that haven't been spaced out are treated as a single word. For example, the phrase 'IamVeryHappy' would be treated as a single word instead of four different words and will require a specific script to handle these kinds of cases.

Step 5 - Determining Word Density

Let's find out what the most common words are on our denoised tweet database by building a collection frequency generator function and pair it with the .most_common method from nltk. I've also plotted the distribution for each of the positive and negative tokens below.

In [22]:

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

In [220]:

all_pos_words = get_all_words(positive_cleaned_tokens_list)
freq_dist_pos = FreqDist(all_pos_words)
print('These are the 20 most common words in the positive tweets:')
print(freq_dist_pos.most_common(20))
fig = plt.figure(figsize=(12,8))
freq_dist_pos.plot(20, cumulative=False,color = 'purple', linestyle = ':', marker='.', markersize=16)
plt.show()

These are the 20 most common words in the positive tweets:
[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253), ('u', 245), ('day', 242), ('like', 229), ('see', 195), ('happy', 192), ("i'm", 183), ('great', 175), ('hi', 173), ('go', 167), ('back', 163)]

In [221]:

all_neg_words = get_all_words(negative_cleaned_tokens_list)
freq_dist_neg = FreqDist(all_neg_words)
print('These are the 20 most common words in the negative tweets:')
print(freq_dist_neg.most_common(20))
fig = plt.figure(figsize=(12,8))
freq_dist_neg.plot(20, cumulative=False,color = 'purple', linestyle = ':', marker='.', markersize=16)
plt.show()

These are the 20 most common words in the negative tweets:
[(':(', 4585), (':-(', 501), ("i'm", 343), ('...', 332), ('get', 325), ('miss', 291), ('go', 275), ('please', 275), ('want', 246), ('like', 218), ('♛', 210), ('》', 210), ('u', 193), ("can't", 180), ('time', 160), ('follow', 156), ('sorry', 149), ('one', 149), ('see', 145), ('day', 144)]

Look at that! The :) and :( emoji's are parsed as positive and negative respectively and are also the most common elements in their respective lists. Now let's start building the model.

Step 6 - Preparing Data for the Model

For now, the model we are trying to build involves just two sentiments, positive and negative. One can incorporate more sentiments into the model but I'll cover that in a different section here or maybe a different post...

Regardless of how many sentiments we want to break our data into, we need to split our data into training and testing datasets for modeling in a similar way as one does with regression and classification tasks. Before splitting the data however, we'll need to convert our tokenized tweets and denoised tweets into a dictionary. This can be done with the generator function below.

In [89]:

def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

In the function above, each dictionary entry is being associated with a 'True' value since I'll be using a Naive Bayes Classifier built into nltk to start things off. Different classifiers will have different dictionary requirements which I'll explore later. An example of the dictionary for a positive and negative tweet from the denoised list are shown below

In [370]:

print('This is the dictionary for the tokenized and denoised positive tweets')
positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
list(positive_tokens_for_model)[:5]

This is the dictionary for the tokenized and denoised positive tweets

Out[370]:

[{'#followfriday': True,
  'top': True,
  'engage': True,
  'member': True,
  'community': True,
  'week': True,
  ':)': True},
 {'hey': True,
  'james': True,
  'odd': True,
  ':/': True,
  'please': True,
  'call': True,
  'contact': True,
  'centre': True,
  '02392441234': True,
  'able': True,
  'assist': True,
  ':)': True,
  'many': True,
  'thanks': True},
 {'listen': True,
  'last': True,
  'night': True,
  ':)': True,
  'bleed': True,
  'amazing': True,
  'track': True,
  'scotland': True},
 {'congrats': True, ':)': True},
 {'yeaaaah': True,
  'yippppy': True,
  'accnt': True,
  'verify': True,
  'rqst': True,
  'succeed': True,
  'get': True,
  'blue': True,
  'tick': True,
  'mark': True,
  'fb': True,
  'profile': True,
  ':)': True,
  '15': True,
  'day': True}]

In [369]:

print('This is the dictionary for the tokenized and denoised negative tweets')
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
list(negative_tokens_for_model)[:5]

This is the dictionary for the tokenized and denoised negative tweets

Out[369]:

[{'hopeless': True, 'tmr': True, ':(': True},
 {'everything': True,
  'kid': True,
  'section': True,
  'ikea': True,
  'cute': True,
  'shame': True,
  "i'm": True,
  'nearly': True,
  '19': True,
  '2': True,
  'month': True,
  ':(': True},
 {'heart': True, 'slide': True, 'waste': True, 'basket': True, ':(': True},
 {'“': True,
  'hate': True,
  'japanese': True,
  'call': True,
  'ban': True,
  ':(': True,
  '”': True},
 {'dang': True,
  'start': True,
  'next': True,
  'week': True,
  'work': True,
  ':(': True}]

Great!! Now we can split our dataset into training and testing sets. To do this, we'll start by adding a label of 'Positive' to the positive tweets and a label of 'Negative' to the negative tweets. Then, we'll combine these two dictionaries together into the variable called dataset. Next, the elements in dataset will be randomly shuffled. Finally, we can split out data into training and testing sets into whatever split we think best. In this case, our dataset is made up of 10000 tweeets total. I'll be doing an 80:20 split so my test set is comprised of 8000 tweets and my training sets will be comprised of 2000 tweets.

In [229]:

positive_dataset = [(tweet_dict, "Positive") 
                    for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "Negative") 
                    for tweet_dict in negative_tokens_for_model]

#Combine positive and negative tweets
dataset = positive_dataset + negative_dataset

#Shuffle generated dataset
random.shuffle(dataset)

#Do an 80:20 split on the dataset. 80 on the testing set and 20 on the training set
train_data = dataset[:2000]
test_data  = dataset[8000:]

In [366]:

#I'm only showing the first 5 tweets because the output is ridiculous otherwise
train_data[0:5]

Out[366]:

[({'stress': True, 'come': True, ':(': True}, 'Negative'),
 ({'never': True,
   'see': True,
   'positive': True,
   'kha': True,
   'u': True,
   'could': True,
   'also': True,
   'mention': True,
   'atleast': True,
   'go': True,
   ':)': True},
  'Positive'),
 ({'question': True,
   'flaw': True,
   'pain': True,
   'negate': True,
   'strength': True,
   ':)': True},
  'Positive'),
 ({'thank': True,
   'lovely': True,
   'weekend': True,
   'everyone': True,
   ':-)': True},
  'Positive'),
 ({':(': True, 'asleep': True}, 'Negative')]

Step 7 - Building and Testing the Model

Now that we have our data split into training and test sets, we can use the Naive Bayes Classiier to build our model. We can also gauge the accuracy of the model using the accuracy method on the resulting model. The accuracy of the model is based on the percentage of tweets for which the model correctly predicted the sentiment. Let's run it and see what we get!

In [141]:

classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))
print(classifier.show_most_informative_features(10))

Accuracy is: 0.9935
Most Informative Features
                      :( = True           Negati : Positi =    587.0 : 1.0
                      :) = True           Positi : Negati =    474.1 : 1.0
                   great = True           Positi : Negati =     16.7 : 1.0
                  friday = True           Positi : Negati =     14.4 : 1.0
                   sorry = True           Negati : Positi =     12.1 : 1.0
                    miss = True           Negati : Positi =     10.0 : 1.0
                   enjoy = True           Positi : Negati =      9.7 : 1.0
                    hate = True           Negati : Positi =      9.6 : 1.0
                   thank = True           Positi : Negati =      9.1 : 1.0
                   can't = True           Negati : Positi =      8.6 : 1.0
None

Wow! The model has a 99.4% accuracy when it comes to predicting the sentiment! The informative features shows the ratio of positive and negative tweets associated with each token. The :) and :( emojis as we saw before were the most frequently occurring tokens. The :) emoji had a positive to negative ratio of 474.1:1 which means that for every 475 tweets that have this emoji in it, 474 of them are expected to have a positive sentiment. On the other hand the :( emoji has a negative to positive ratio of 587:1 which means that our of every 588 tweets containing this emoji, 587 are expected to have a negative sentiment.

Now I can easily test the performance of this model using any random phrases. I generated these 4 sentences using a random sentence generator I found online here: https://randomwordgenerator.com/sentence.php

In [195]:

custom_tweet = "They did nothing as the raccoon attacked the lady’s bag of food.\
                After coating myself in vegetable oil I found my success rate skyrocketed. \
                The beach was crowded with snow leopards.\
                Lets all be unique together until we realise we are all the same."

custom_tokens = remove_noise(word_tokenize(custom_tweet))
print(custom_tokens)
print(classifier.classify(dict([token, True] for token in custom_tokens)))

['they', 'do', 'nothing', 'as', 'the', 'raccoon', 'attack', 'the', 'lady', '’', 's', 'bag', 'of', 'food', 'after', 'coat', 'myself', 'in', 'vegetable', 'oil', 'i', 'find', 'my', 'success', 'rate', 'skyrocket', 'the', 'beach', 'be', 'crowd', 'with', 'snow', 'leopard', 'lets', 'all', 'be', 'unique', 'together', 'until', 'we', 'realise', 'we', 'be', 'all', 'the', 'same']
Negative

The classifier determined that the overall sentiment of these 4 sentences is Negative. Which I agree with. I don't think I'd be too happy with either of those randomly posited scenarios lol

Visualizing the Data

Now that we have our data cleaned up and a working model let's visualize some things! First, I'll make a wordcloud for each of the positive and negative dataset as shown below. I'll apply a twitter logo mask to it because it seems appropriate given the context.

In [360]:

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

twitter_mask = np.array(Image.open("twitter logo.png"))

# generate the word cloud for the positive tweets   
posWC = WordCloud(
                          max_words=2000,
                          mask = twitter_mask,
                          contour_width=0,
                          max_font_size=500,
                          font_step=2,
                          background_color='white',
                          width=550,
                          height=550
                          ).generate(str(positive_cleaned_tokens_list))

# generate the word cloud for the negative tweets
negWC = WordCloud(
                          max_words=2000,
                          mask = twitter_mask,
                          contour_width=0,
                          max_font_size=500,
                          font_step=2,
                          background_color='black',
                          width=550,
                          height=550
                          ).generate(str(negative_cleaned_tokens_list))

fig,axs = plt.subplots(1,2, figsize=(25,30))
axs[0].axis("off")
axs[1].axis("off")
axs[0].imshow(posWC)
axs[1].imshow(negWC)
plt.show()

Neat!! I'll definitely have fun playing with these more in the future!

Using VADER for Polarity Scores

VADER ( Valence Aware Dictionary for Sentiment Reasoning) is a model used for sentiment analysis of text. It is sensitive to positive and negative emotions (also referred to as polarity) as well as the intensity or strength of emotion.

VADER relies on a dictionary that maps lexical features to emotion intensities which are known as sentiment scores or polarity scores. The sentiment score is obtained by summing up the intensity of each word in the piece of text one is documenting. Therefore, we can use VADER to determine whether a sentence, a paragraph, a tweet, or even a whole document has a particular sentiment. So long as your data is in string format you are good to go.

I'll start by generating an instance of VADER as shown below

In [299]:

sid = SentimentIntensityAnalyzer()

We can now use this analyzer on any text data we want. Next, I'll import the pandas library to place all of the tweets we used before into a dataframe. I'm doing this because then I can readily apply the polarity calculations to each tweet in a clean and efficient way. I will then loop over each of the positive tweets that we denoised earlier and use the join method to convert each tweet from a comma separated list into a string that is readily parsable by VADER. I'll do the same for the denoised negative tweets.

In [301]:

import pandas as pd

pos_tweets_list = []
for i in range(len(positive_cleaned_tokens_list)):
    tweet = ' '.join(positive_cleaned_tokens_list[i])
    pos_tweets_list.append(tweet)
    
df_pos = pd.DataFrame(pos_tweets_list,columns=['positive_tweets'])
df_pos

Out[301]:

	positive_tweets
0	#followfriday top engage member community week :)
1	hey james odd :/ please call contact centre 02...
2	listen last night :) bleed amazing track scotland
3	congrats :)
4	yeaaaah yippppy accnt verify rqst succeed get ...
...	...
4995	chris that's great hear :) due time reminder i...
4996	thanks shout-out :) great aboard
4997	hey :) long time talk ...
4998	matt would say welcome adulthood ... :)
4999	could say egg face :-)

5000 rows × 1 columns

And the same thing for the negative tweets

In [302]:

neg_tweets_list = []
for i in range(len(negative_cleaned_tokens_list)):
    tweet = ' '.join(negative_cleaned_tokens_list[i])
    neg_tweets_list.append(tweet)
    
df_neg = pd.DataFrame(neg_tweets_list,columns=['negative_tweets'])
df_neg

Out[302]:

	negative_tweets
0	hopeless tmr :(
1	everything kid section ikea cute shame i'm nea...
2	heart slide waste basket :(
3	“ hate japanese call ban :( :( ”
4	dang start next week work :(
...	...
4995	wanna change avi usanele :(
4996	puppy broke foot :(
4997	where's jaebum baby picture :(
4998	mr ahmad maslan cook :(
4999	hull supporter expect misserable week :-(

5000 rows × 1 columns

Now we can apply the polarity scores onto each tweet in each of our dataframes and generate columns that contain the positive, neutral, negative, and compound polarity score for each tweet which I also visualize in histograms below

In [305]:

df_pos['compound'] = [sid.polarity_scores(x)['compound'] for x in df_pos['positive_tweets']]
df_pos['neg']      = [sid.polarity_scores(x)['neg'] for x in      df_pos['positive_tweets']]
df_pos['neu']      = [sid.polarity_scores(x)['neu'] for x in      df_pos['positive_tweets']]
df_pos['pos']      = [sid.polarity_scores(x)['pos'] for x in      df_pos['positive_tweets']]
df_pos

Out[305]:

	positive_tweets	compound	neg	neu	pos
0	#followfriday top engage member community week :)	0.7351	0.000	0.357	0.643
1	hey james odd :/ please call contact centre 02...	0.5423	0.215	0.411	0.374
2	listen last night :) bleed amazing track scotland	0.7783	0.000	0.469	0.531
3	congrats :)	0.7506	0.000	0.000	1.000
4	yeaaaah yippppy accnt verify rqst succeed get ...	0.7351	0.000	0.677	0.323
...	...	...	...	...	...
4995	chris that's great hear :) due time reminder i...	0.7964	0.000	0.608	0.392
4996	thanks shout-out :) great aboard	0.8779	0.000	0.083	0.917
4997	hey :) long time talk ...	0.4588	0.000	0.625	0.375
4998	matt would say welcome adulthood ... :)	0.7184	0.000	0.455	0.545
4999	could say egg face :-)	0.3182	0.000	0.635	0.365

5000 rows × 5 columns

In [363]:

import seaborn as sns
fig,axs = plt.subplots(2,2, figsize = (30,20))
sns.histplot(df_pos['compound'] , ax = axs[0,0], color ='#011d59' , alpha = 0.75, stat='density')
sns.kdeplot(df_pos['compound'], color='crimson', ax=axs[0,0])
axs[0,0].legend(labels=["Compound"], title = "Positive Tweets")

sns.histplot(df_pos['neg']      , ax=axs[0,1] , color ='#a95927' , alpha = 0.75, stat='density')
sns.kdeplot(df_pos['neg'], color='crimson', ax=axs[0,1])
axs[0,1].legend(labels=["Negative"], title = "Positive Tweets")

sns.histplot(df_pos['neu']      , ax=axs[1,0], color ='#581845' , alpha = 0.75, stat='density')
sns.kdeplot(df_pos['neu'], color='crimson', ax=axs[1,0])
axs[1,0].legend(labels=["Neutral"], title = "Positive Tweets")

sns.histplot(df_pos['pos']      , ax=axs[1,1] , color ='#4da13f' , alpha = 0.75, stat='density')
sns.kdeplot(df_pos['pos'], color='crimson', ax=axs[1,1])
axs[1,1].legend(labels=["Positive"], title = "Positive Tweets")

plt.show()

In [306]:

df_neg['compound'] = [sid.polarity_scores(x)['compound'] for x in df_neg['negative_tweets']]
df_neg['neg']      = [sid.polarity_scores(x)['neg'] for x in      df_neg['negative_tweets']]
df_neg['neu']      = [sid.polarity_scores(x)['neu'] for x in      df_neg['negative_tweets']]
df_neg['pos']      = [sid.polarity_scores(x)['pos'] for x in      df_neg['negative_tweets']]
df_neg

Out[306]:

	negative_tweets	compound	neg	neu	pos
0	hopeless tmr :(	-0.7096	0.855	0.145	0.000
1	everything kid section ikea cute shame i'm nea...	-0.4588	0.353	0.471	0.176
2	heart slide waste basket :(	-0.6908	0.655	0.345	0.000
3	“ hate japanese call ban :( :( ”	-0.9201	0.868	0.132	0.000
4	dang start next week work :(	-0.4404	0.367	0.633	0.000
...	...	...	...	...	...
4995	wanna change avi usanele :(	-0.4404	0.420	0.580	0.000
4996	puppy broke foot :(	-0.6908	0.740	0.260	0.000
4997	where's jaebum baby picture :(	-0.4404	0.420	0.580	0.000
4998	mr ahmad maslan cook :(	-0.4404	0.420	0.580	0.000
4999	hull supporter expect misserable week :-(	-0.1027	0.291	0.465	0.244

5000 rows × 5 columns

In [362]:

fig,axs = plt.subplots(2,2, figsize = (30,20))

sns.histplot(df_neg['compound'] , ax = axs[0,0], color ='#011d59' , alpha = 0.75, stat='density')
sns.kdeplot(df_neg['compound'], color='crimson', ax=axs[0,0])
axs[0,0].legend(labels=["Compound"], title = "Negative Tweets")

sns.histplot(df_neg['neg']      , ax=axs[0,1] , color ='#a95927' , alpha = 0.75, stat='density')
sns.kdeplot(df_neg['neg'], color='crimson', ax=axs[0,1])
axs[0,1].legend(labels=["Negative"], title = "Negative Tweets")

sns.histplot(df_neg['neu']      , ax=axs[1,0], color ='#581845' , alpha = 0.75, stat='density')
sns.kdeplot(df_neg['neu'], color='crimson', ax=axs[1,0])
axs[1,0].legend(labels=["Neutral"], title = "Negative Tweets")

sns.histplot(df_neg['pos']      , ax=axs[1,1] , color ='#4da13f' , alpha = 0.75, stat='density')
sns.kdeplot(df_neg['pos'], color='crimson', ax=axs[1,1])
axs[1,1].legend(labels=["Positive"], title = "Negative Tweets")

plt.show()
plt.show()

Conclusions

I've shown the necessary steps required to carry out sentiment analysis. Though Twitter data was used for this example, the methods shown here can be readily adapteed to other text formats. I have ideas of projects to try using this so stay tuned!