Sentiment Analysis with Natural Language Processing
jupyter_notebook
machine_learning
data_science
natural_language_processing
Sentiment Analysis on Twitter Data
In this article, I'll go through the steps of carrying out sentiment analysis. Sentiment analysis is a natural language processing (NLP) method that can be used to make assessments about the positive, negative, or neutral values of data. Sentiment analysis is often performed on textual data like customer reviews and tweets for example to help businesses monitor how their customer/userbase feels about their product and thus make adjustments to their products, brand as needed. I'll show the process of tokenizing data, getting collection frequency plots, preparing the data for modeling, visualizing the dataset with wordclouds, and using VADER for polarity classification.Rough Outline for Sentiment Analysis using NLP
- Installing NLTK and Downloading the Data
- Tokenizing the Data
- Normalizing the Data
- Removing Noise from the Data
- Converting Tokens to a Dictionary
- Splitting the Dataset for Training and Testing the Model
- Determining Word Density
- Preparing Data for the Model
- Contructing dictionary
- Splitting data into training and testing sets
- Building and Testing the Model
- Put tweets into dataframe (this could be done earlier)
- Get polarity scores via VADER
- Data visualizations (Word Clouds, frequency plots, ...)
Modules used in this work
The following modules are used in sentiment analysis applications.
- nltk : Used for natural language processing.
- re : Used for regular expression operations.
- string : Used for common string operations
The API documentation for each of these modules can be found here:
#Stuff for general sentiment analysis
import nltk
from nltk.corpus import twitter_samples,stopwords
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import FreqDist, classify, NaiveBayesClassifier
#Stuff for VADER
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
#Stuff for dealing with strings and regular expressions
import re, string, random
[nltk_data] Downloading package vader_lexicon to [nltk_data] C:\Users\vmurc\AppData\Roaming\nltk_data... [nltk_data] Package vader_lexicon is already up-to-date!
Step 1: Downloading data from NLTK
This section will delineate the basic steps of loading and preliminary inspection of our data using NLTK.
Loading data with NLTK
The first thing that needs to be done is load the data onto our jupyter notebook. NLTK has a download method from which we can extract a variety of data from a variety of sources. You can find the complete list of datasets available to NLTK here: https://www.nltk.org/nltk_data/Here I'll be downloading the datasets shown below
#Downloading the data
nltk.download('twitter_samples') #30000 tweets. 5000 positive, 5000 negative. Rest are neutral.
nltk.download('punkt') #Pretrained model to tokenize words
nltk.download('wordnet') #Lexical database to help determine base word
nltk.download('averaged_perceptron_tagger') #Used to determine context of word in sentence
nltk.download('omw-1.4')
nltk.download('stopwords')
[nltk_data] Downloading package twitter_samples to [nltk_data] C:\Users\vmurc\AppData\Roaming\nltk_data... [nltk_data] Package twitter_samples is already up-to-date! [nltk_data] Downloading package punkt to [nltk_data] C:\Users\vmurc\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date! [nltk_data] Downloading package wordnet to [nltk_data] C:\Users\vmurc\AppData\Roaming\nltk_data... [nltk_data] Package wordnet is already up-to-date! [nltk_data] Downloading package averaged_perceptron_tagger to [nltk_data] C:\Users\vmurc\AppData\Roaming\nltk_data... [nltk_data] Package averaged_perceptron_tagger is already up-to- [nltk_data] date! [nltk_data] Downloading package omw-1.4 to [nltk_data] C:\Users\vmurc\AppData\Roaming\nltk_data... [nltk_data] Package omw-1.4 is already up-to-date! [nltk_data] Downloading package stopwords to [nltk_data] C:\Users\vmurc\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date!
True
Step 2 — Tokenizing the Data
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
Before tokenizing the data, we need a pre-trained model that will our subsequent tokenizing efforts. In this example, the pre-trained model will be the "punkt" resource which was downloaded earlier. Now we can tokenize the data using the tokenized method which produces the array below
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(tweet_tokens[:5])
[['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)'], ['@Lamb2ja', 'Hey', 'James', '!', 'How', 'odd', ':/', 'Please', 'call', 'our', 'Contact', 'Centre', 'on', '02392441234', 'and', 'we', 'will', 'be', 'able', 'to', 'assist', 'you', ':)', 'Many', 'thanks', '!'], ['@DespiteOfficial', 'we', 'had', 'a', 'listen', 'last', 'night', ':)', 'As', 'You', 'Bleed', 'is', 'an', 'amazing', 'track', '.', 'When', 'are', 'you', 'in', 'Scotland', '?', '!'], ['@97sides', 'CONGRATS', ':)'], ['yeaaaah', 'yippppy', '!', '!', '!', 'my', 'accnt', 'verified', 'rqst', 'has', 'succeed', 'got', 'a', 'blue', 'tick', 'mark', 'on', 'my', 'fb', 'profile', ':)', 'in', '15', 'days']]
This tokenization returns the content of a bunch of tweets from that dataset as an array. As can be seen, the array is composed of a bunch of words and emojis. In addition to this, many of these entries also have unwanted characters within them (i.e., @ and _ ). These characters will need to be removed as part of the data cleanup.
Step 3 — Normalizing the Data
Words can be expressed in a variety of ways to convey meaning. For instance, ate and eaten are different tenses of the verb eat. An analysis may require these different tenses to be converted into their base, fundamental, canonical form. This process is known as normalization.Normalization is helpful because it allows us to group words together that have the same overarching meaning but have different written forms. You can think of the normalization process of a sort of clustering method but applied to words. Two methods that are commonly employed for this are stemming and lemmatization. The differences between the two are shown in the image below:
In stemming we are removing affixes. An affix is a morpheme that is added to a word to change its meaning. Common affixes are prefixes (an affix that is added to the beginning of a word) and suffixes (an affix that is added at the end of a word). Stemming is useful when dealing with simple verb forms and is thus a bit of a crude heuristic process.
In lemmatization, the algorithm normalizes a word within the context of vocabulary and through the use of morphological analysis in order to produce a lemma. A lemma in morphology is known as the canonical or dictionary form of a set of word forms. This lemmatization process allows different words to be indexed,mapped, or traced back to a single word. For instance, speak, speaks, spoke, spoken and speaking are all forms of speak (known as the lexeme).
As is the case with algorithms, one must always consider the tradeoffs between speed and accuracy. Stemming is generally faster than lemmatization but is also less accurate. Lemmatization is slower but generates more accurate results. I'll be using lemmatization in this demo.
Now that normalization schemes have been established let's determine the context of each word in the text. This process will give a tag (tagging algorithm) to each word that allows words to be related to one another. This is done via the pos_tag
function in nltk.
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(pos_tag(tweet_tokens[0]))
[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]
The output of this function is an array of tuples. Each array element is composed of a word in our tweet list and a tag value (i.e. JJ, NNP, etc.). You can find the meaning of each of these tags here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
Some of the tags found in the twitter sample are listed below with their corresponding meaning:
- NNP: Noun, proper, singular
- NN: Noun, common, singular or mass
- IN: Preposition or conjunction, subordinating
- VBG: Verb, gerund or present participle
- VBN: Verb, past participle
Therefore, through these tags we can determine whether a word is a noun or a verb and so on and so forth. The function below will lemmatize a sentence and assigns each word as either a verb or a noun. The function below can be readily modified to incorporate things like adjectives, pronouns ,symbols and other things.
def lemmatize_sentence(tokens):
lemmatizer = WordNetLemmatizer()
lemmatized_sentence = []
for word, tag in pos_tag(tokens):
if tag.startswith('NN'):
pos = 'n'
elif tag.startswith('VB'):
pos = 'v'
else:
pos = 'a'
lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
return lemmatized_sentence
print(lemmatize_sentence(tweet_tokens[0]))
['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']
From the output, we can see that some of the words on the original tweet changed after applying the lemmatizer. For example, the word being
changed to be
amd the word members
was changed to member
. Now that we have a way to normalize our data, we can start removing noise from the data.
Step 4- Removing Noise from Text Data
Noise in any kind of data can be defined as any unwanted signal. When dealing with text data for sentiment analysis, noise will depend on what features of the data do not add meaning or information to our dataset and must be chosen carefully.A word that can be typically considered as text noise is known as a stop word
. Stop words are the most common words in a language and are generally filtered out from language processing. Some examples of common stop words in English are words like "the", "is", "for","was" and "a". You can also define the stop words for your particular dataset by building a stop list
based on the collection frequency
of the words. Collection frequency refers to the total number of times a term appears in a document. The words with highest collection frequency are generally the stop words.
In addition to removing stop words from our dataset, other sources of noise that will be removed from our Twitter data are:
- Hyperlinks
- Twitter Username handles
- Punctuation
- Special Characters
To remove hyperlinks, we'll use regular expressions that will replace any strings that begin with http://
or https://
with an empty string using the .sub
method.
re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
To remove twitter handles we'll remove stuff that begins with @
re.sub("(@[A-Za-z0-9_]+)","", token)
To remove punctuation we can use the string.punctuation
method.
To remove stop words we can use the stop words dictionary built into nltk
stop_words = stopwords.words('english')
Putting it all together, allows us to define the denoising function below:
stop_words = stopwords.words('english')
def remove_noise(tweet_tokens, stop_words = ()):
cleaned_tokens = []
for token, tag in pos_tag(tweet_tokens):
token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
'(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
token = re.sub("(@[A-Za-z0-9_]+)","", token)
if tag.startswith("NN"):
pos = 'n'
elif tag.startswith('VB'):
pos = 'v'
else:
pos = 'a'
lemmatizer = WordNetLemmatizer()
token = lemmatizer.lemmatize(token, pos)
if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
cleaned_tokens.append(token.lower())
return cleaned_tokens
Let's see what happens after running our denoising function
print(remove_noise(tweet_tokens[0], stop_words))
['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']
Cool! We've succesfully removed twitter handles from our data. Stop words have also been removed. Notice how the words have also all been turned to lowercase.
Let's apply our denoising function to all the tweets in our data now using the code below.
#Clean up of all tweets
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []
for tokens in positive_tweet_tokens:
positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
for tokens in negative_tweet_tokens:
negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
As a sanity check let's look at the original tweet and then the denoised tweet:
print('This is the tweet that\'s being denoised')
print(positive_tweet_tokens[300])
This is the tweet that's being denoised ['Stats', 'for', 'the', 'day', 'have', 'arrived', '.', '2', 'new', 'followers', 'and', 'NO', 'unfollowers', ':)', 'via', 'http://t.co/xxlXs6xYwe', '.']
print('This is the denoised tweet:')
print(positive_cleaned_tokens_list[300])
This is the denoised tweet: ['stats', 'day', 'arrive', '2', 'new', 'follower', 'unfollowers', ':)', 'via']
Something to keep in mind here, is that words that haven't been spaced out are treated as a single word. For example, the phrase 'IamVeryHappy' would be treated as a single word instead of four different words and will require a specific script to handle these kinds of cases.
Step 5 - Determining Word Density
Let's find out what the most common words are on our denoised tweet database by building a collection frequency generator function and pair it with the.most_common
method from nltk. I've also plotted the distribution for each of the positive and negative tokens below.
def get_all_words(cleaned_tokens_list):
for tokens in cleaned_tokens_list:
for token in tokens:
yield token
all_pos_words = get_all_words(positive_cleaned_tokens_list)
freq_dist_pos = FreqDist(all_pos_words)
print('These are the 20 most common words in the positive tweets:')
print(freq_dist_pos.most_common(20))
fig = plt.figure(figsize=(12,8))
freq_dist_pos.plot(20, cumulative=False,color = 'purple', linestyle = ':', marker='.', markersize=16)
plt.show()
These are the 20 most common words in the positive tweets: [(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253), ('u', 245), ('day', 242), ('like', 229), ('see', 195), ('happy', 192), ("i'm", 183), ('great', 175), ('hi', 173), ('go', 167), ('back', 163)]
all_neg_words = get_all_words(negative_cleaned_tokens_list)
freq_dist_neg = FreqDist(all_neg_words)
print('These are the 20 most common words in the negative tweets:')
print(freq_dist_neg.most_common(20))
fig = plt.figure(figsize=(12,8))
freq_dist_neg.plot(20, cumulative=False,color = 'purple', linestyle = ':', marker='.', markersize=16)
plt.show()
These are the 20 most common words in the negative tweets: [(':(', 4585), (':-(', 501), ("i'm", 343), ('...', 332), ('get', 325), ('miss', 291), ('go', 275), ('please', 275), ('want', 246), ('like', 218), ('♛', 210), ('》', 210), ('u', 193), ("can't", 180), ('time', 160), ('follow', 156), ('sorry', 149), ('one', 149), ('see', 145), ('day', 144)]
Look at that! The :) and :( emoji's are parsed as positive and negative respectively and are also the most common elements in their respective lists. Now let's start building the model.
Step 6 - Preparing Data for the Model
For now, the model we are trying to build involves just two sentiments, positive and negative. One can incorporate more sentiments into the model but I'll cover that in a different section here or maybe a different post...Regardless of how many sentiments we want to break our data into, we need to split our data into training and testing datasets for modeling in a similar way as one does with regression and classification tasks. Before splitting the data however, we'll need to convert our tokenized tweets and denoised tweets into a dictionary. This can be done with the generator function below.
def get_tweets_for_model(cleaned_tokens_list):
for tweet_tokens in cleaned_tokens_list:
yield dict([token, True] for token in tweet_tokens)
In the function above, each dictionary entry is being associated with a 'True' value since I'll be using a Naive Bayes Classifier built into nltk to start things off. Different classifiers will have different dictionary requirements which I'll explore later. An example of the dictionary for a positive and negative tweet from the denoised list are shown below
print('This is the dictionary for the tokenized and denoised positive tweets')
positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
list(positive_tokens_for_model)[:5]
This is the dictionary for the tokenized and denoised positive tweets
[{'#followfriday': True, 'top': True, 'engage': True, 'member': True, 'community': True, 'week': True, ':)': True}, {'hey': True, 'james': True, 'odd': True, ':/': True, 'please': True, 'call': True, 'contact': True, 'centre': True, '02392441234': True, 'able': True, 'assist': True, ':)': True, 'many': True, 'thanks': True}, {'listen': True, 'last': True, 'night': True, ':)': True, 'bleed': True, 'amazing': True, 'track': True, 'scotland': True}, {'congrats': True, ':)': True}, {'yeaaaah': True, 'yippppy': True, 'accnt': True, 'verify': True, 'rqst': True, 'succeed': True, 'get': True, 'blue': True, 'tick': True, 'mark': True, 'fb': True, 'profile': True, ':)': True, '15': True, 'day': True}]
print('This is the dictionary for the tokenized and denoised negative tweets')
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
list(negative_tokens_for_model)[:5]
This is the dictionary for the tokenized and denoised negative tweets
[{'hopeless': True, 'tmr': True, ':(': True}, {'everything': True, 'kid': True, 'section': True, 'ikea': True, 'cute': True, 'shame': True, "i'm": True, 'nearly': True, '19': True, '2': True, 'month': True, ':(': True}, {'heart': True, 'slide': True, 'waste': True, 'basket': True, ':(': True}, {'“': True, 'hate': True, 'japanese': True, 'call': True, 'ban': True, ':(': True, '”': True}, {'dang': True, 'start': True, 'next': True, 'week': True, 'work': True, ':(': True}]
Great!! Now we can split our dataset into training and testing sets. To do this, we'll start by adding a label of 'Positive' to the positive tweets and a label of 'Negative' to the negative tweets. Then, we'll combine these two dictionaries together into the variable called dataset
. Next, the elements in dataset
will be randomly shuffled. Finally, we can split out data into training and testing sets into whatever split we think best. In this case, our dataset is made up of 10000 tweeets total. I'll be doing an 80:20 split so my test set is comprised of 8000 tweets and my training sets will be comprised of 2000 tweets.
positive_dataset = [(tweet_dict, "Positive")
for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "Negative")
for tweet_dict in negative_tokens_for_model]
#Combine positive and negative tweets
dataset = positive_dataset + negative_dataset
#Shuffle generated dataset
random.shuffle(dataset)
#Do an 80:20 split on the dataset. 80 on the testing set and 20 on the training set
train_data = dataset[:2000]
test_data = dataset[8000:]
#I'm only showing the first 5 tweets because the output is ridiculous otherwise
train_data[0:5]
[({'stress': True, 'come': True, ':(': True}, 'Negative'), ({'never': True, 'see': True, 'positive': True, 'kha': True, 'u': True, 'could': True, 'also': True, 'mention': True, 'atleast': True, 'go': True, ':)': True}, 'Positive'), ({'question': True, 'flaw': True, 'pain': True, 'negate': True, 'strength': True, ':)': True}, 'Positive'), ({'thank': True, 'lovely': True, 'weekend': True, 'everyone': True, ':-)': True}, 'Positive'), ({':(': True, 'asleep': True}, 'Negative')]
Step 7 - Building and Testing the Model
Now that we have our data split into training and test sets, we can use the Naive Bayes Classiier to build our model. We can also gauge the accuracy of the model using theaccuracy
method on the resulting model. The accuracy of the model is based on the percentage of tweets for which the model correctly predicted the sentiment. Let's run it and see what we get!
classifier = NaiveBayesClassifier.train(train_data)
print("Accuracy is:", classify.accuracy(classifier, test_data))
print(classifier.show_most_informative_features(10))
Accuracy is: 0.9935 Most Informative Features :( = True Negati : Positi = 587.0 : 1.0 :) = True Positi : Negati = 474.1 : 1.0 great = True Positi : Negati = 16.7 : 1.0 friday = True Positi : Negati = 14.4 : 1.0 sorry = True Negati : Positi = 12.1 : 1.0 miss = True Negati : Positi = 10.0 : 1.0 enjoy = True Positi : Negati = 9.7 : 1.0 hate = True Negati : Positi = 9.6 : 1.0 thank = True Positi : Negati = 9.1 : 1.0 can't = True Negati : Positi = 8.6 : 1.0 None
Wow! The model has a 99.4% accuracy when it comes to predicting the sentiment! The informative features shows the ratio of positive and negative tweets associated with each token. The :) and :( emojis as we saw before were the most frequently occurring tokens. The :) emoji had a positive to negative ratio of 474.1:1 which means that for every 475 tweets that have this emoji in it, 474 of them are expected to have a positive sentiment. On the other hand the :( emoji has a negative to positive ratio of 587:1 which means that our of every 588 tweets containing this emoji, 587 are expected to have a negative sentiment.
Now I can easily test the performance of this model using any random phrases. I generated these 4 sentences using a random sentence generator I found online here: https://randomwordgenerator.com/sentence.php
custom_tweet = "They did nothing as the raccoon attacked the lady’s bag of food.\
After coating myself in vegetable oil I found my success rate skyrocketed. \
The beach was crowded with snow leopards.\
Lets all be unique together until we realise we are all the same."
custom_tokens = remove_noise(word_tokenize(custom_tweet))
print(custom_tokens)
print(classifier.classify(dict([token, True] for token in custom_tokens)))
['they', 'do', 'nothing', 'as', 'the', 'raccoon', 'attack', 'the', 'lady', '’', 's', 'bag', 'of', 'food', 'after', 'coat', 'myself', 'in', 'vegetable', 'oil', 'i', 'find', 'my', 'success', 'rate', 'skyrocket', 'the', 'beach', 'be', 'crowd', 'with', 'snow', 'leopard', 'lets', 'all', 'be', 'unique', 'together', 'until', 'we', 'realise', 'we', 'be', 'all', 'the', 'same'] Negative
The classifier determined that the overall sentiment of these 4 sentences is Negative. Which I agree with. I don't think I'd be too happy with either of those randomly posited scenarios lol
Visualizing the Data
Now that we have our data cleaned up and a working model let's visualize some things! First, I'll make a wordcloud for each of the positive and negative dataset as shown below. I'll apply a twitter logo mask to it because it seems appropriate given the context.from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
twitter_mask = np.array(Image.open("twitter logo.png"))
# generate the word cloud for the positive tweets
posWC = WordCloud(
max_words=2000,
mask = twitter_mask,
contour_width=0,
max_font_size=500,
font_step=2,
background_color='white',
width=550,
height=550
).generate(str(positive_cleaned_tokens_list))
# generate the word cloud for the negative tweets
negWC = WordCloud(
max_words=2000,
mask = twitter_mask,
contour_width=0,
max_font_size=500,
font_step=2,
background_color='black',
width=550,
height=550
).generate(str(negative_cleaned_tokens_list))
fig,axs = plt.subplots(1,2, figsize=(25,30))
axs[0].axis("off")
axs[1].axis("off")
axs[0].imshow(posWC)
axs[1].imshow(negWC)
plt.show()
Neat!! I'll definitely have fun playing with these more in the future!
Using VADER for Polarity Scores
VADER ( Valence Aware Dictionary for Sentiment Reasoning) is a model used for sentiment analysis of text. It is sensitive to positive and negative emotions (also referred to as polarity) as well as the intensity or strength of emotion.VADER relies on a dictionary that maps lexical features to emotion intensities which are known as sentiment scores or polarity scores. The sentiment score is obtained by summing up the intensity of each word in the piece of text one is documenting. Therefore, we can use VADER to determine whether a sentence, a paragraph, a tweet, or even a whole document has a particular sentiment. So long as your data is in string format you are good to go.
I'll start by generating an instance of VADER as shown below
sid = SentimentIntensityAnalyzer()
We can now use this analyzer on any text data we want. Next, I'll import the pandas library to place all of the tweets we used before into a dataframe. I'm doing this because then I can readily apply the polarity calculations to each tweet in a clean and efficient way. I will then loop over each of the positive tweets that we denoised earlier and use the join
method to convert each tweet from a comma separated list into a string that is readily parsable by VADER. I'll do the same for the denoised negative tweets.
import pandas as pd
pos_tweets_list = []
for i in range(len(positive_cleaned_tokens_list)):
tweet = ' '.join(positive_cleaned_tokens_list[i])
pos_tweets_list.append(tweet)
df_pos = pd.DataFrame(pos_tweets_list,columns=['positive_tweets'])
df_pos
positive_tweets | |
---|---|
0 | #followfriday top engage member community week :) |
1 | hey james odd :/ please call contact centre 02... |
2 | listen last night :) bleed amazing track scotland |
3 | congrats :) |
4 | yeaaaah yippppy accnt verify rqst succeed get ... |
... | ... |
4995 | chris that's great hear :) due time reminder i... |
4996 | thanks shout-out :) great aboard |
4997 | hey :) long time talk ... |
4998 | matt would say welcome adulthood ... :) |
4999 | could say egg face :-) |
5000 rows × 1 columns
And the same thing for the negative tweets
neg_tweets_list = []
for i in range(len(negative_cleaned_tokens_list)):
tweet = ' '.join(negative_cleaned_tokens_list[i])
neg_tweets_list.append(tweet)
df_neg = pd.DataFrame(neg_tweets_list,columns=['negative_tweets'])
df_neg
negative_tweets | |
---|---|
0 | hopeless tmr :( |
1 | everything kid section ikea cute shame i'm nea... |
2 | heart slide waste basket :( |
3 | “ hate japanese call ban :( :( ” |
4 | dang start next week work :( |
... | ... |
4995 | wanna change avi usanele :( |
4996 | puppy broke foot :( |
4997 | where's jaebum baby picture :( |
4998 | mr ahmad maslan cook :( |
4999 | hull supporter expect misserable week :-( |
5000 rows × 1 columns
Now we can apply the polarity scores onto each tweet in each of our dataframes and generate columns that contain the positive, neutral, negative, and compound polarity score for each tweet which I also visualize in histograms below
df_pos['compound'] = [sid.polarity_scores(x)['compound'] for x in df_pos['positive_tweets']]
df_pos['neg'] = [sid.polarity_scores(x)['neg'] for x in df_pos['positive_tweets']]
df_pos['neu'] = [sid.polarity_scores(x)['neu'] for x in df_pos['positive_tweets']]
df_pos['pos'] = [sid.polarity_scores(x)['pos'] for x in df_pos['positive_tweets']]
df_pos
positive_tweets | compound | neg | neu | pos | |
---|---|---|---|---|---|
0 | #followfriday top engage member community week :) | 0.7351 | 0.000 | 0.357 | 0.643 |
1 | hey james odd :/ please call contact centre 02... | 0.5423 | 0.215 | 0.411 | 0.374 |
2 | listen last night :) bleed amazing track scotland | 0.7783 | 0.000 | 0.469 | 0.531 |
3 | congrats :) | 0.7506 | 0.000 | 0.000 | 1.000 |
4 | yeaaaah yippppy accnt verify rqst succeed get ... | 0.7351 | 0.000 | 0.677 | 0.323 |
... | ... | ... | ... | ... | ... |
4995 | chris that's great hear :) due time reminder i... | 0.7964 | 0.000 | 0.608 | 0.392 |
4996 | thanks shout-out :) great aboard | 0.8779 | 0.000 | 0.083 | 0.917 |
4997 | hey :) long time talk ... | 0.4588 | 0.000 | 0.625 | 0.375 |
4998 | matt would say welcome adulthood ... :) | 0.7184 | 0.000 | 0.455 | 0.545 |
4999 | could say egg face :-) | 0.3182 | 0.000 | 0.635 | 0.365 |
5000 rows × 5 columns
import seaborn as sns
fig,axs = plt.subplots(2,2, figsize = (30,20))
sns.histplot(df_pos['compound'] , ax = axs[0,0], color ='#011d59' , alpha = 0.75, stat='density')
sns.kdeplot(df_pos['compound'], color='crimson', ax=axs[0,0])
axs[0,0].legend(labels=["Compound"], title = "Positive Tweets")
sns.histplot(df_pos['neg'] , ax=axs[0,1] , color ='#a95927' , alpha = 0.75, stat='density')
sns.kdeplot(df_pos['neg'], color='crimson', ax=axs[0,1])
axs[0,1].legend(labels=["Negative"], title = "Positive Tweets")
sns.histplot(df_pos['neu'] , ax=axs[1,0], color ='#581845' , alpha = 0.75, stat='density')
sns.kdeplot(df_pos['neu'], color='crimson', ax=axs[1,0])
axs[1,0].legend(labels=["Neutral"], title = "Positive Tweets")
sns.histplot(df_pos['pos'] , ax=axs[1,1] , color ='#4da13f' , alpha = 0.75, stat='density')
sns.kdeplot(df_pos['pos'], color='crimson', ax=axs[1,1])
axs[1,1].legend(labels=["Positive"], title = "Positive Tweets")
plt.show()
df_neg['compound'] = [sid.polarity_scores(x)['compound'] for x in df_neg['negative_tweets']]
df_neg['neg'] = [sid.polarity_scores(x)['neg'] for x in df_neg['negative_tweets']]
df_neg['neu'] = [sid.polarity_scores(x)['neu'] for x in df_neg['negative_tweets']]
df_neg['pos'] = [sid.polarity_scores(x)['pos'] for x in df_neg['negative_tweets']]
df_neg
negative_tweets | compound | neg | neu | pos | |
---|---|---|---|---|---|
0 | hopeless tmr :( | -0.7096 | 0.855 | 0.145 | 0.000 |
1 | everything kid section ikea cute shame i'm nea... | -0.4588 | 0.353 | 0.471 | 0.176 |
2 | heart slide waste basket :( | -0.6908 | 0.655 | 0.345 | 0.000 |
3 | “ hate japanese call ban :( :( ” | -0.9201 | 0.868 | 0.132 | 0.000 |
4 | dang start next week work :( | -0.4404 | 0.367 | 0.633 | 0.000 |
... | ... | ... | ... | ... | ... |
4995 | wanna change avi usanele :( | -0.4404 | 0.420 | 0.580 | 0.000 |
4996 | puppy broke foot :( | -0.6908 | 0.740 | 0.260 | 0.000 |
4997 | where's jaebum baby picture :( | -0.4404 | 0.420 | 0.580 | 0.000 |
4998 | mr ahmad maslan cook :( | -0.4404 | 0.420 | 0.580 | 0.000 |
4999 | hull supporter expect misserable week :-( | -0.1027 | 0.291 | 0.465 | 0.244 |
5000 rows × 5 columns
fig,axs = plt.subplots(2,2, figsize = (30,20))
sns.histplot(df_neg['compound'] , ax = axs[0,0], color ='#011d59' , alpha = 0.75, stat='density')
sns.kdeplot(df_neg['compound'], color='crimson', ax=axs[0,0])
axs[0,0].legend(labels=["Compound"], title = "Negative Tweets")
sns.histplot(df_neg['neg'] , ax=axs[0,1] , color ='#a95927' , alpha = 0.75, stat='density')
sns.kdeplot(df_neg['neg'], color='crimson', ax=axs[0,1])
axs[0,1].legend(labels=["Negative"], title = "Negative Tweets")
sns.histplot(df_neg['neu'] , ax=axs[1,0], color ='#581845' , alpha = 0.75, stat='density')
sns.kdeplot(df_neg['neu'], color='crimson', ax=axs[1,0])
axs[1,0].legend(labels=["Neutral"], title = "Negative Tweets")
sns.histplot(df_neg['pos'] , ax=axs[1,1] , color ='#4da13f' , alpha = 0.75, stat='density')
sns.kdeplot(df_neg['pos'], color='crimson', ax=axs[1,1])
axs[1,1].legend(labels=["Positive"], title = "Negative Tweets")
plt.show()
plt.show()