
Text Mining and sentiment analysis on Amazon Kindle reviews data

  • Writer: Kaivalya Kandukuri
  • Aug 2, 2018

Updated: Aug 3, 2018

Data is everywhere. And it is truly fascinating to see what kind of things you can achieve with the data around you.

I was bored this summer and, being the avid reader that I am, I am always on the lookout for new books to read. Tired of all the recommendations from friends, Goodreads and various other websites, I decided to do some exploratory data analysis on the reviews for Kindle ebooks on Amazon to find out what I want to read next. This is how my efforts went –

Data and description:

I was thinking of scraping the Amazon website for reviews, but I was fortunate enough to find a dataset of Kindle ebook reviews. The data consisted of 3374 reviews on 298 books. It had the following columns –

· asin – The book ID. The title of the book wasn’t given, so I had to figure that out.

· overall – The overall rating of the book.

· reviewerID

· reviewText – This was the attribute that I performed text mining and sentiment analysis on.

Basic Exploratory Data Analysis:

The data was clean and had no missing values. It didn’t require any extra cleaning.

Using the dplyr package for basic EDA, I first grouped the reviews by book ID and found the average rating of each book. The output table was as follows –



I then found the 10 books with the highest average rating, as well as the 10 books with the lowest.
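The post did this step with dplyr in R. For readers who want to see the idea spelled out, here is a minimal sketch in plain Python of the same group-and-average step, followed by picking the best and worst rated books. The rows, the book IDs and the helper names (`average_rating_by_book`, `rank_books`) are made up for illustration and are not taken from the original code or dataset.

```python
from collections import defaultdict
from statistics import mean

# Toy rows standing in for the dataset; IDs and ratings are made up.
reviews = [
    {"asin": "BOOK_A", "overall": 5},
    {"asin": "BOOK_A", "overall": 4},
    {"asin": "BOOK_B", "overall": 2},
    {"asin": "BOOK_B", "overall": 3},
    {"asin": "BOOK_C", "overall": 5},
]


def average_rating_by_book(rows):
    """Group ratings by book ID and return {asin: mean rating}."""
    ratings = defaultdict(list)
    for row in rows:
        ratings[row["asin"]].append(row["overall"])
    return {asin: mean(values) for asin, values in ratings.items()}


def rank_books(averages, n=10):
    """Return (top n, bottom n) as lists of (asin, average) pairs."""
    ordered = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    # Reverse the tail so the lowest-rated book comes first in the bottom list.
    return ordered[:n], ordered[-n:][::-1]


averages = average_rating_by_book(reviews)
top, bottom = rank_books(averages, n=1)  # n=1 only because the toy data is tiny
print("Best rated:", top)
print("Worst rated:", bottom)
```

With the real data, `n=10` would give the two shortlists used in the rest of the post.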

Text mining:

I used the ‘tm’ package in R for text mining. I was curious to see how the words used in the reviews differed between the best rated and the worst rated books. So, I created two Term Document Matrices (TDMs), one each for the top 10 and the bottom 10 books. Then, I generated a wordcloud (from the wordcloud package in R) for each. A wordcloud is an interesting way to look at the words in a corpus: it shows the most frequently used terms, and the size of each word is directly proportional to its frequency.

One more thing to mention here: to create a TDM, the text must be in the form of a corpus. A corpus is a collection of documents. Here, a document refers to each row in the reviewText column. That means each review is a document and the collection of all these reviews is the corpus. The corpus was then cleaned: all the letters were converted to lower case, the extra white space was stripped and all the stop words (words like ‘a’, ‘the’, ‘is’, ‘was’ etc.) were removed using the appropriate ‘tm’ functions. The cleaned corpus was then used to generate the term document matrix and the document term matrix (just the transpose of a TDM, where the documents are rows and the terms are columns).
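To make the corpus, cleaning and TDM ideas concrete, here is a rough sketch in plain Python rather than the ‘tm’ package the post used. The two sample reviews, the short stop-word list and the function names are all illustrative assumptions. The punctuation stripping is also an extra step that is not among the cleaning steps listed above.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list; a real English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "it", "this"}


def clean(document):
    """Lower-case a document, drop extra whitespace and remove stop words."""
    tokens = document.lower().split()  # split() also discards repeated spaces
    # Assumption: strip surrounding punctuation so "book," matches "book".
    tokens = [token.strip(string.punctuation) for token in tokens]
    return [token for token in tokens if token and token not in STOP_WORDS]


def term_document_matrix(corpus):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in document j."""
    counts = [Counter(clean(document)) for document in corpus]
    terms = sorted(set().union(*counts))
    return terms, [[count[term] for count in counts] for term in terms]


def transpose(matrix):
    """Turn a term document matrix into a document term matrix."""
    return [list(row) for row in zip(*matrix)]


# Each review is one document; the list of reviews is the corpus.
corpus = [
    "This was a GREAT   book, great characters.",
    "The plot was annoying and the ending was bad.",
]

terms, tdm = term_document_matrix(corpus)
dtm = transpose(tdm)

# Total frequency per term: the quantity a wordcloud uses to size each word.
frequencies = {term: sum(row) for term, row in zip(terms, tdm)}
print(terms)
print(frequencies)
```

Summing each row of the TDM gives the total frequency of a term across the corpus, which is what decides how large the word is drawn in a wordcloud.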

The wordcloud generated for the top 10 books is as follows –



There are a lot of positive terms here, like great, love and right.

On the other hand, the worst 10 books’ wordcloud was as follows –



Words like annoying and bad can be seen here. I wonder who Allesandro is! :P Also, there are a lot more words in the negative wordcloud than in the positive one. Maybe the readers were frustrated and poured their hearts out while writing these reviews.

Sentiment Analysis:

I knew I wanted to choose one of the 10 best rated books I had shortlisted from the 298. To narrow it down to a single book, I performed sentiment analysis. The book whose reviews had the most positive sentiment would be my next summer read.

I performed sentiment analysis the tidy way. I first filtered my data to keep only the rows for the 10 best rated books. Using the ‘unnest_tokens’ function from the ‘tidytext’ package in R, I split each review into words so that there was only one word per row. I then performed an inner join with the ‘afinn’ sentiment lexicon to calculate the sentiment scores.

A sentiment lexicon is a collection of words with their sentiment scores. The scores can be assigned in various ways. For example, the ‘afinn’ lexicon gives each word a score ranging from -5 to 5, from most negative to most positive. The ‘bing’ lexicon, on the other hand, classifies each word as either positive or negative. All these lexicons are also available in the tidytext package.

I then averaged the sentiment scores across all the reviews of each of the top 10 books. The book with the ID B001E50WMG had the highest average sentiment score, about 2.326.
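The tokenize, join and average steps can be sketched in plain Python as follows (the post itself used tidytext in R). The lexicon below is a toy stand-in in the style of ‘afinn’: its words and scores are placeholders, not the real lexicon entries, and the reviews and book IDs are made up. I am also assuming the average is taken over every matched word of a book, which is one plausible reading of the step described above.

```python
import re
from collections import defaultdict
from statistics import mean

# Toy lexicon in the style of AFINN (integer scores between -5 and 5).
# These words and scores are illustrative placeholders, not real entries.
TOY_LEXICON = {"love": 3, "great": 3, "good": 2, "boring": -2, "awful": -3}

# Made-up reviews standing in for the filtered top-10 data.
reviews = [
    {"asin": "BOOK_A", "reviewText": "Great story, love the characters"},
    {"asin": "BOOK_A", "reviewText": "Good but a little boring"},
    {"asin": "BOOK_B", "reviewText": "Awful pacing and boring plot"},
]


def tokenize(text):
    """Split a review into lower-cased words, one word per list item."""
    return re.findall(r"[a-z']+", text.lower())


def mean_sentiment_by_book(rows, lexicon):
    """Keep only words found in the lexicon (the inner join), then average per book."""
    scores = defaultdict(list)
    for row in rows:
        for word in tokenize(row["reviewText"]):
            if word in lexicon:  # words missing from the lexicon are dropped
                scores[row["asin"]].append(lexicon[word])
    return {asin: mean(values) for asin, values in scores.items()}


sentiment = mean_sentiment_by_book(reviews, TOY_LEXICON)
best_book = max(sentiment, key=sentiment.get)
print(sentiment)
print("Most positive:", best_book)
```

Because the join is an inner join, words that do not appear in the lexicon contribute nothing to the score, so a long but neutral review has no effect on a book's average.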

Finding the book:

Now, another challenge lay ahead of me. How do I know what book it is? I tried looking for the book ID on Amazon.com, but it didn’t return any results. Again, wordclouds came to my rescue. I assumed the name of the book or the author must be mentioned in at least one of the reviews. So I generated a wordcloud of only the reviews of the book I wanted to read, and it looked like this –



From this wordcloud, I thought the word Debbie seemed important (it could be a character in the book or the author). I also realized that the reviewers were talking about a book series, and that it could have something to do with a cove, cedar and a lighthouse.

Typing all these ideas into the search bar on Amazon.com gave me this result –

To the lighthouse (Cedar Cove Series Book 1) by Debbie Macomber

This series has also apparently been made into a TV series. So, there I had it. My next summer read.

If there are any book lovers out there, why don’t you also give it a try? I’ll read this and maybe write a review on my personal blog. Happy reading to me! 😊

Also, if you are interested, you can check out my code on my GitHub page –

 
 
 
