Oscar Wilde is one of my favorite authors ever. 'The Picture of Dorian Gray', published in 1890, remains one of the stories that has resonated with and impacted me the most. The idea of a painting granting you carte blanche and serving as the receptacle of one's evil while you live a life devoid of consequences was fascinating to me. In particular, the gradual deterioration of the painting as Dorian Gray commits increasingly terrible acts will always be one of my favorite concepts in literature.

As such, it's only natural for me to want to investigate his works through the lens of data science and natural language processing (NLP) techniques.

NLP is an amalgam of concepts and techniques from linguistics, computer science, and artificial intelligence that seeks to understand how computers can process human language. Common NLP tasks range from text and speech processing, as used in speech recognition applications (e.g., Siri from Apple and Alexa from Amazon), to higher-level applications like discourse management (i.e., how a computer can converse with a human), amongst many others. It is an absolutely fascinating field and one that I've been deeply interested in from the moment I learned about it.

Some questions I'd want to answer are:

  • What are the words most commonly used by Oscar Wilde?
  • What is the overall sentiment present in his works?
  • How does the sentiment change (if at all) from chapter to chapter? Is there a particular emotional structure that is present?
  • Are the emotional assessments in agreement with the themes present in the narratives?

Project Gutenberg is a library of over 60,000 free eBooks that one can download or read online. Oscar Wilde is one of the many authors whose works Project Gutenberg has readily available.

I'll write some code to download his works from Project Gutenberg. These routines should be readily applicable to get stuff from other authors too!

This project will show more of my code directly since I will be doing a bit of querying and webscraping to get the data I need, which I think is worth showing. If you are interested in seeing the full set of code, you can see my post here or my repository here.

In terms of the sentiment/emotion analysis, I'll start by using a Naive Bayes classifier trained on Twitter data to determine whether the sentiment of a chapter is positive or negative. Then, I'll use VADER to get polarity scores. Finally, I'll use the NRC lexicon to break down each chapter into 10 different emotions (joy, positive, anticipation, sadness, surprise, negative, anger, disgust, trust, and fear).

With that said, let's dive in!

Prelude and Exposition - Structuring the Query for Oscar Wilde in Project Gutenberg

The first thing I'll need is a way to retrieve the HTML contents of the Project Gutenberg site. This means I'll need to communicate with the site/server, which I can readily do with the urlopen function. The URL that I'm trying to connect to is associated with a search query for 'Oscar Wilde' and looks as follows:

https://www.gutenberg.org/ebooks/search/?query=oscar+wilde&submit_search=Search

I'll be replacing the string to make it more general by using a formatted string of the form

f'https://www.gutenberg.org/ebooks/search/?query={first_name}+{last_name}&submit_search=Search'

I should note that Project Gutenberg queries are limited to 25 search results, but that should be enough for this initial exploration.

Now that I know what URL to query, I can start to build the functions required to get the info I need from Project Gutenberg.

Step 1 - Communicating With the Query URL

Now that I know the structure of the URL that I need to query, I can write a function that will communicate with the site, as shown below.
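The original snippet isn't reproduced here, so here is a minimal sketch of what such a function could look like, using urlopen as described above (the name get_query_html is my own placeholder):

```python
# Minimal sketch (assumed implementation): fetch the raw HTML of a
# Project Gutenberg author search query.
from urllib.request import urlopen


def get_query_html(first_name, last_name):
    """Return the HTML of the search results page for an author query."""
    url = (f'https://www.gutenberg.org/ebooks/search/'
           f'?query={first_name}+{last_name}&submit_search=Search')
    with urlopen(url) as response:
        return response.read().decode('utf-8')
```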

Step 2 - Webscraping the Query URL for All URLs on the Page

Next, I need a way to retrieve the URLs associated with book files from the search query. BeautifulSoup allows us to do this easily since all book links are contained within <a> tags. Hence, all that needs to be done is to tell BeautifulSoup to find all the 'a' tags and then extract the parts that contain an actual link, denoted by the 'href' attribute. The function below takes care of that.
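Since the original function isn't shown here, below is a sketch of how this could look with BeautifulSoup and the built-in html.parser (get_all_links is a placeholder name):

```python
# Sketch: collect every href attribute from the <a> tags on the results page.
from bs4 import BeautifulSoup


def get_all_links(html):
    """Return all link targets (href values) found on the page."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]
```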

Step 3 - Get Book Identifiers

Now that we have the links present in the search query, I need to get the book identifiers that Project Gutenberg uses. This is done via the function below, which will return a list of bookIDs that we can then use to download all the books associated with the results of our search query.
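As a sketch, and assuming book links follow Project Gutenberg's '/ebooks/&lt;id&gt;' pattern, the ID extraction could look like this:

```python
# Sketch: keep only the numeric IDs from links of the form '/ebooks/<id>'.
def get_book_ids(links):
    book_ids = []
    for link in links:
        if link.startswith('/ebooks/'):
            suffix = link[len('/ebooks/'):]
            if suffix.isdigit():  # skips search and navigation links
                book_ids.append(suffix)
    return book_ids
```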

Step 4 - Downloading A Single Book from Project Gutenberg

We are almost done with the extraction prep for our books! Now that I have the bookIDs associated with the query results, I can write a function that will download a book into a directory on our computer. This is done with the function below.
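Here is a sketch of such a download function; it assumes Gutenberg's '.txt.utf-8' plain-text endpoint, which may differ from the URL pattern used in my actual code:

```python
# Sketch: download the UTF-8 plain-text version of a book as '<id>.txt'.
import os
from urllib.request import urlopen


def download_book(book_id, out_dir='books'):
    url = f'https://www.gutenberg.org/ebooks/{book_id}.txt.utf-8'
    os.makedirs(out_dir, exist_ok=True)
    with urlopen(url) as response:
        text = response.read().decode('utf-8')
    path = os.path.join(out_dir, f'{book_id}.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path
```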

Let's test it out! The book ID for 'The Picture of Dorian Gray' is 174. So I'll just pass that into my function, and the result is a nice file in my directory called '174.txt' that contains 'The Picture of Dorian Gray'. Awesome!

Perfect! Now that I have my base download function, I can simply put my routines into a wrapper function that will allow me to download all the books present in our search query page, as shown below.

Rising Action - Downloading Oscar Wilde's Literary Works from Project Gutenberg

Here is the wrapper function that will allow me to download all the books in the Oscar Wilde search query page.
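Since the original isn't reproduced, here is a sketch built from the placeholder helpers above:

```python
# Sketch: chain the query, scraping, ID extraction, and download steps.
def download_all_books(first_name, last_name):
    html = get_query_html(first_name, last_name)
    book_ids = get_book_ids(get_all_links(html))
    downloaded = []
    for book_id in book_ids:
        try:
            downloaded.append(download_book(book_id))
        except Exception as err:  # e.g. entries with no plain-text file
            print(f'Skipping book {book_id}: {err}')
    return downloaded


download_all_books('oscar', 'wilde')
```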

The results of running this function are shown below.

Our function returned 23 out of the 25 results that were on the search query page. The 2 results that were omitted corresponded to an audiobook of 'The Importance of Being Earnest' and the German version of 'The Picture of Dorian Gray', titled 'Das Bildnis des Dorian Gray'. This is totally okay though, since for this work I'm only interested in text data and documents in English, so I'll proceed to the next stage of the process.

Breaking the Book into Chapters

Now that I have the books downloaded, I'm going to write a routine that will split each book into Chapters. This will allow me to process and analyze each book chapter individually.
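One way to sketch this split is shown below, assuming chapter headings of the form 'CHAPTER I.' as in the Gutenberg text; the exact numbering of the output files depends on how the front matter and preface get split:

```python
# Sketch: split a book on 'CHAPTER ...' headings and save each piece as '<n>.txt'.
import os
import re


def split_into_chapters(book_path, out_dir='chapters'):
    os.makedirs(out_dir, exist_ok=True)
    with open(book_path, encoding='utf-8') as f:
        text = f.read()
    pieces = re.split(r'CHAPTER [IVXLC\d]+\.?', text)
    for i, piece in enumerate(pieces):
        with open(os.path.join(out_dir, f'{i}.txt'), 'w', encoding='utf-8') as f:
            f.write(piece.strip())
    return len(pieces)
```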

Next, I'll load the book chapter files that we have generated so that they are accessible.

Finally, I'll read the contents of the book and place them in memory. I'll start with Chapter 1, which was saved as a file named '2.txt' since the current book has a prelude chapter.
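A minimal version of that read, assuming the chapter files produced above:

```python
# Read Chapter 1 into memory; it lives in '2.txt' because of the prelude.
with open('chapters/2.txt', encoding='utf-8') as f:
    chapter = f.read()

print(chapter[:500])  # peek at the opening lines
```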

The output of this routine is shown below:

We are ready to roll!

The Midpoint - Tokenizing, Lemmatizing, and Denoising A Chapter

I now have the entire book at my disposal to work with. However, there's a bit of preparation I need to do on the text data before I can carry out the sentiment analysis. Before I move on to processing an entire book, I want to ensure that all my routines will work as intended on a single chapter. This is how I like to code, learn, design experiments, and do research in general: start with a simple foundation on which I can build as much or as little detail and complexity as needed.

I'll start with Chapter 1 from 'The Picture of Dorian Gray'.

Tokenizing The Chapter

Tokenizing is the process of splitting strings into smaller parts called tokens. A token is the smallest unit of text that can be parsed by the computer (analogous to phonemes and morphemes in human languages). Tokens can be any sequence of words, emojis, hashtags, links, or even an individual character. Tokenization is an important part of the preparation for sentiment analysis since it provides the framework from which the computer will try to ascribe meaning.

Let's start by tokenizing the chapter. I'll use the word_tokenize function from nltk. I'll also remove the first three elements from the tokenized list since those are just the chapter numbers and empty lines.
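A sketch of that step with nltk (the slice matches the three leading elements mentioned above):

```python
# Tokenize the chapter and drop the chapter-number tokens at the front.
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, needed once

tokens = word_tokenize(chapter)[3:]
print(tokens[:20])
```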

The output of this process is shown below, where it can be seen that the tokenizer generated a list of words from the chapter:

Normalizing the Words in a Chapter Via Lemmatization

Words can be expressed in a variety of ways to convey meaning. For instance, ate and eaten are different tenses of the verb eat. An analysis may require these different tenses to be converted into their base, fundamental, canonical form. This process is known as normalization and is commonly achieved via stemming and/or lemmatization algorithms.

In lemmatization, the algorithm normalizes a word within the context of vocabulary and through the use of morphological analysis in order to produce a lemma. A lemma in morphology is known as the canonical or dictionary form of a set of word forms. This lemmatization process allows different words to be indexed, mapped, or traced back to a single word. For instance, speak, speaks, spoke, spoken, and speaking are all forms of speak (known as the lexeme).

To lemmatize properly, I'll first tag words and denote them as nouns, verbs, prepositions, and so on. This can be readily done with nltk, as shown below.
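Here is a sketch of the tag-then-lemmatize step using nltk's pos_tag and WordNetLemmatizer; mapping every non-noun, non-verb tag to the adjective lemma is a simplification of my own:

```python
# POS-tag each token, then lemmatize it with a WordNet-compatible tag.
import nltk
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'  # noun
        elif tag.startswith('VB'):
            pos = 'v'  # verb
        else:
            pos = 'a'  # treat everything else as an adjective
        lemmas.append(lemmatizer.lemmatize(word, pos))
    return lemmas
```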

The output of this is shown below:

That worked! Every tokenized word has a tag associated with it now. Notice how there is a comma token after 'rose'. I'll remove that and other things as part of the denoising/cleanup process.

Denoising a Chapter

I'm almost done preparing the chapter for sentiment analysis. Before moving on, however, I need to remove the punctuation, stop words, and special characters that are muddying the data. This process is known as denoising.

To denoise the data, I'll be using regular expressions (regexes) through the function below:
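The exact regular expressions aren't reproduced here, so the sketch below uses a simple pair of substitutions plus nltk's stop word list, then chains the earlier steps to produce a cleaned chapter:

```python
# Sketch: strip links, non-letter characters, and English stop words.
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')


def denoise(tokens):
    stop_words = set(stopwords.words('english'))
    cleaned = []
    for token in tokens:
        token = re.sub(r'https?://\S+', '', token)  # remove links
        token = re.sub(r'[^A-Za-z]', '', token)     # remove punctuation/specials
        if token and token.lower() not in stop_words:
            cleaned.append(token.lower())
    return cleaned


denoised_chapter = denoise(lemmatize_tokens(tokens))
```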

The result of running this function on the chapter is shown below:

Lovely! Now we are ready to start doing analysis on this chapter!

Calculating Word Density for Chapter

Now that the chapter is tokenized, lemmatized, and denoised, let me find the 20 most frequently occurring words in the chapter.
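With nltk's FreqDist this is nearly a one-liner (denoised_chapter is the token list from the previous step):

```python
# Count token frequencies and show the 20 most common words.
from nltk import FreqDist

freq = FreqDist(denoised_chapter)
print(freq.most_common(20))
```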

Looks like the most common word in this chapter is 'say' with 29 instances, followed by the word 'one' with 27 instances.

The Crisis - Naive Bayes Classifier

Now let's train a model! I'll start by using a Naive Bayes classifier based on the twitter_samples dataset and see how that works out. I've covered this in a previous post, so I'll just include the results here. This will allow me to readily classify text as carrying positive or negative sentiment. I will later try to classify the words, sentences, chapters, and the entire book with more detailed emotions like joy, sadness, anger, and such.

The results of the classifier are shown below:

The classifier has a 99.2% classification accuracy. Let's run it on the chapter!

I'll prepare the chapter to be classified via the Naive Bayes classifier, which requires me to convert the denoised chapter into a dictionary where each word is paired with a True value, using the function below.

Then, I can run the classifier on the chapter dictionary I just made, and the output will be whether the chapter is considered to be 'Positive' or 'Negative'.
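A sketch of both steps, assuming 'classifier' is the Naive Bayes model trained on twitter_samples from my earlier post:

```python
# Convert the denoised chapter into the {token: True} feature format
# that nltk's NaiveBayesClassifier expects, then classify it.
def chapter_to_features(tokens):
    return {token: True for token in tokens}


sentiment = classifier.classify(chapter_to_features(denoised_chapter))
print(sentiment)  # 'Positive' or 'Negative'
```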

The output of running the Naive Bayes classifier for this chapter turned out to be 'Negative'. This is a sentiment that I think I can agree with, though there is a lot more to be said about the emotions within this chapter. Regardless, now I'm ready to apply the routines developed so far to the entire book!

The Big Event - Tokenizing, Lemmatizing, and Denoising A Book

Now I'm ready to process the entire book and determine the dominant sentiment in each chapter using the wrapper function below.
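A sketch of that wrapper, reusing the placeholder helpers defined earlier and sorting the chapter files numerically:

```python
# Sketch: tokenize, lemmatize, denoise, and classify every chapter file.
import glob
import os


def classify_book(chapter_dir='chapters'):
    results = {}
    paths = sorted(glob.glob(os.path.join(chapter_dir, '*.txt')),
                   key=lambda p: int(os.path.basename(p).split('.')[0]))
    for path in paths:
        with open(path, encoding='utf-8') as f:
            chapter_tokens = denoise(lemmatize_tokens(word_tokenize(f.read())))
        results[path] = classifier.classify(chapter_to_features(chapter_tokens))
    return results
```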

The results of running this routine are shown below:

Yikes! The classifier determined that every chapter in the book is negative... Though I'm inclined to agree with that sentiment, it is a pretty boring result that oversimplifies the nuances that make the book interesting. As such, I decided to look into the VADER and NRC lexicons to get additional color on the emotions present in each chapter.

The Climax - VADER and NRC Lexicon

VADER (Valence Aware Dictionary and sEntiment Reasoner) will give us polarity scores. Even though the sentiment categorization is still limited to a simple positive, negative, or neutral, we will get a quantifiable metric for each of those values and start getting a better idea of the sentiment composition of each chapter. I'll first modify my wrapper routine from above so that it returns the chapter contents as part of the output.

Next, I'll create a pandas dataframe that will hold the contents of each chapter, one chapter per row, using the code below.
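A minimal sketch, assuming 'chapters' is a list holding the raw text of each chapter returned by the modified wrapper:

```python
# Build a dataframe with one chapter per row.
import pandas as pd

book_df = pd.DataFrame({'chapter_text': chapters})
book_df.index.name = 'chapter'
```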

The resulting dataframe for the book is shown below:

And now I'll run the SentimentIntensityAnalyzer with VADER and save the polarity scores for each chapter into the dataframe using the code below:
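A sketch of that step with nltk's VADER implementation:

```python
# Score each chapter with VADER and store the neg/neu/pos/compound columns.
import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
scores = book_df['chapter_text'].apply(sia.polarity_scores).apply(pd.Series)
book_df[['neg', 'neu', 'pos', 'compound']] = scores[['neg', 'neu', 'pos', 'compound']]
```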

The resulting dataframe is shown below:

Okay, now things are starting to get a bit more interesting. Each chapter has different positive, negative, and neutral scores associated with it. Furthermore, the compound scores (which represent the overall sentiment, with +1 being the most positive and -1 being the most negative) also differ between chapters. Let me plot them and see what they look like!

From these plots we can see the following:

  • Negative feelings appear to trend upwards over the course of the book and peak in the final chapters.
  • Positive feelings appear to trend downwards over the course of the book and reach their lowest point in Chapter 13.
  • Neutral feelings appear to trend upwards over the course of the book.
  • Most chapters have a compound score that is positive, with a few noticeable exceptions (Chapters 13, 16, and 18).
  • Neutral feelings have the highest scores out of the 3 types of polarity scores.

Hmmm, the sentiment assessments from the Naive Bayes classifier and the VADER polarity scores are opposite to one another. Why? Also, what happened in Chapters 13, 16, and 18 that caused VADER to categorize those chapters as predominantly negative?

Chapter 13, for instance, is one of the most climactic chapters in the book. This is the chapter where Dorian Gray unveils the portrait to Basil after not having seen him for 18 years. The revelation of the shocking state of the portrait confirms that all the awful rumors Basil had heard about Dorian Gray are indeed true. Upon realizing that Dorian is indeed a terrible person, Basil asks him, in what is perhaps the one truly pure act in the book, to seek forgiveness for all the bad things that he's done. Finally, the portrait pushes Dorian to murder Basil, an act he carries out with a chilling sense of calmness and indifference.

Clearly, a negative sentiment is not enough to convey the full extent of negative emotions that are happening in this chapter. Sure, the chapter is negative, but it is the details that make it interesting! This is where the NRC lexicon comes into play.

NRC Lexicon for Emotion Recognition

Now let's try the NRC lexicon, which can categorize text data into 10 different emotions: joy, positive, anticipation, sadness, surprise, negative, anger, disgust, trust, and fear. The NRC lexicon is composed of over 27,000 words at the time of writing.

I'll use a similar strategy to what I used for the VADER polarity scores to get the scores for the various emotions of each chapter into the book dataframe. However, I'll normalize the emotion scores by the chapter word count, since I noticed that chapter length was inflating the scores.
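Here is a sketch using the nrclex package (pip install nrclex); my actual code may use a different NRC interface, and the exact scores depend on the lexicon version installed:

```python
# Score each chapter on the 10 NRC emotions, normalized by word count.
import pandas as pd
from nrclex import NRCLex

EMOTIONS = ['joy', 'positive', 'anticipation', 'sadness', 'surprise',
            'negative', 'anger', 'disgust', 'trust', 'fear']


def nrc_scores(text):
    raw = NRCLex(text).raw_emotion_scores
    n_words = max(len(text.split()), 1)
    return pd.Series({e: raw.get(e, 0) / n_words for e in EMOTIONS})


book_df[EMOTIONS] = book_df['chapter_text'].apply(nrc_scores)
```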

The pandas dataframe now looks like this:

Beautiful! Let's start to see how these different emotions interact with one another and progress over the course of the book!

Correlation Matrix for Emotions in 'The Picture of Dorian Gray'

The correlation matrix below shows the interactions between the word count normalized emotion scores obtained from the NRC lexicon analysis. The colorscale indicates whether the emotions change positively or negatively with respect to one another. Brighter colors indicate positive trends. Darker colors indicate negative trends.
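The plot itself isn't reproduced here, but a minimal way to build such a matrix with pandas and seaborn looks like this (assuming the emotion columns added above):

```python
# Correlate the normalized emotion columns and render them as a heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

corr = book_df[EMOTIONS].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='viridis')
plt.title("Emotion correlations in 'The Picture of Dorian Gray'")
plt.show()
```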

What do we make of these correlations, however? Oscar Wilde did say it best...

The correlations between these sentiment scores could be seen as the degree to which an author considers a pair of emotions to be similar. At the very least, the emotion correlations could be interpreted as the way a writer intermixes emotions in a particular piece of literature. The next section will visualize the behavior of each emotion over the course of the book.

The Realization - Visualization of the Emotional Analysis

In the previous section I showed the correlation matrix between all of the emotions that we obtained from the NRC analysis. Let's see what some of those correlations look like!

Joy

Joy is defined as 'a feeling of great pleasure and happiness'. It rises from Chapter 1 to Chapter 6, where it peaks, and subsequently declines until reaching its minimum in Chapter 10. It then oscillates for the remainder of the book. Joy is mostly correlated with feelings of positive and anticipation and mostly anticorrelated with feelings of fear.

Positive

Positive sentiments are defined as 'a good, affirmative, or constructive quality or attribute'. Positive feelings decay gradually over the first 13 chapters, after which they start to climb again, though they never reach their initial levels. Positive is mostly correlated with feelings of joy and mostly anticorrelated with feelings of fear.

Anticipation

Anticipation is defined as 'excitement about something that's going to happen'. Anticipation rises between Chapters 1-5 and then gradually drops until Chapter 10. After that, it gradually increases once again before dropping at the end. Anticipation is mostly correlated with joy and mostly anticorrelated with disgust and fear.

Sadness

Sadness is defined as 'an emotional pain associated with, or characterized by, feelings of disadvantage, loss, despair, grief, helplessness, disappointment and sorrow'. Sadness generally trends upwards until Chapter 8 before decreasing until Chapter 15. It then jumps and levels off between Chapters 16-19 before dropping in Chapter 20. Sadness is mostly correlated with feelings of surprise and mostly anticorrelated with feelings of trust.

Surprise

Surprise is defined as 'to cause to feel wonder or amazement because of being unexpected'. Surprise is fairly constant until Chapter 9 before suddenly dropping in Chapter 10. After this, surprise increases until the penultimate chapter before dropping in Chapter 20. Surprise is mostly correlated with feelings of anticipation and mostly anticorrelated with feelings of fear and disgust.

Negative

Negative is defined as 'marked by absence, withholding, or removal of something positive'. Negative feelings are fairly constant over the course of the book, peaking in Chapters 16 and 18. Negative is mostly correlated with feelings of anger and mostly anticorrelated with feelings of trust.

Anger

Anger is defined as 'a strong feeling of displeasure or annoyance and often of active opposition to an insult, injury, or injustice'. Anger behaves in a wavelike manner. It gradually rises until Chapter 8 before mildly decaying until Chapter 15. It suddenly jumps in Chapter 16 and then decays until the end of the book. Anger is mostly correlated with negative feelings and mostly anticorrelated with feelings of trust.

Disgust

Disgust is defined as 'a feeling of revulsion or strong disapproval aroused by something unpleasant or offensive'. Feelings of disgust rise from the Prelude until Chapter 2 and subsequently decay until Chapter 5. This behavior repeats once more as disgust rises until Chapter 7 before dropping until Chapter 9. Feelings of disgust are fairly constant between Chapters 10-15. Disgust peaks in Chapter 16 and gradually decays until the end of the book. Disgust is mostly correlated with feelings of fear and mostly anticorrelated with feelings of positive.

Trust

Trust is defined as the 'firm belief in the reliability, truth, ability, or strength of someone or something'. Feelings of trust increase from the Prelude until Chapter 3. They drop in Chapter 4 and start rising again until Chapter 6. Feelings of trust drop in Chapter 7 and remain fairly constant over the remainder of the book. Trust is mostly correlated with feelings of positive and mostly anticorrelated with feelings of fear.

Fear

Fear is defined as 'an unpleasant emotion caused by the belief that someone or something is dangerous, likely to cause pain, or a threat'. Fear trends upward from the Prelude until Chapter 8. After that, fear decreases gradually until Chapter 15. It surges in Chapters 16 and 18 before dying down in the last two chapters. Fear is mostly correlated with feelings of negative and mostly anticorrelated with feelings of positive.

Everything together

Nice! That gives us an idea of how these different emotions progress and interact with one another throughout the narrative. The stacked bar plot below summarizes my findings for 'The Picture of Dorian Gray'. The x-axis represents each chapter in the book, while the y-axis represents the percentage of each emotion present in that chapter. The colors represent each of the 10 different emotions processed by the NRC lexicon: lighter colors are associated with positive emotions, while darker colors are associated with more negative emotions. The size of each emotion's band represents how much it contributes to the total emotion score for that chapter.
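For reference, here is a sketch of how such a stacked bar chart can be built from the normalized emotion columns (row-wise percentages so each bar sums to 100):

```python
# Convert emotion scores to per-chapter percentages and stack them.
import matplotlib.pyplot as plt

emotion_pct = book_df[EMOTIONS].div(book_df[EMOTIONS].sum(axis=1), axis=0) * 100
emotion_pct.plot(kind='bar', stacked=True, colormap='viridis', figsize=(12, 6))
plt.xlabel('Chapter')
plt.ylabel('Percentage of total emotion score')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()
```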

This is a lot more nuanced than the simple Naive Bayes classification of plain positive or negative feelings that I tried earlier!

Word Cloud

As a final visualization, here is a word cloud plot for 'The Picture of Dorian Gray'. The bigger the word, the more frequently it appears. From this plot, we can readily see that some of the most common words in the book are: one, know, life, go, and look.
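A sketch of how the cloud can be generated with the wordcloud package, assuming 'denoised_chapters' is the list of denoised token lists produced earlier (one per chapter):

```python
# Build one string from all denoised tokens and render the cloud.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

all_words = ' '.join(token for ch in denoised_chapters for token in ch)
cloud = WordCloud(width=1200, height=800, background_color='white').generate(all_words)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```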

Conclusion

Sentiment analysis and emotion recognition were used to analyze 'The Picture of Dorian Gray' by Oscar Wilde. The routines showcased here elucidated some of the thematic and emotional structure of the book by applying a Naive Bayes classifier, VADER, and the NRC lexicon to every chapter.

The next steps for this project involve processing the remaining works from Oscar Wilde and seeing if there is a common emotional structure in his books. In addition to this, I would like to explore other authors and genres and see if I can extract common sentiment structures! Maybe this would provide a bit of a formulaic way for a writer (or potentially an AI) to write in specific ways.

This was a fun and interesting project to work on, and I hope you got something out of it. Thanks for reading!