Text Analytics - Sentiment Analysis

  • This tutorial explains how to infer the sentiments expressed in tweets. The model is trained on a labelled dataset available on Kaggle.
  • The trained model is then used to classify tweets about the movie Azhar.
  • The steps to build the model are:
    • Load the training dataset
    • Tokenize the sentences in the training dataset
    • Remove the stop words
    • Create a document matrix with TF-IDF values, which takes each word as a feature and its TF-IDF score as the feature value
    • Split the dataset into train and test sets
    • Build the model
    • Evaluate the accuracy of the model
    • Apply the model to the Azhar dataset
    • Store the results in a CSV file
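The tokenize/TF-IDF steps above can be sketched with scikit-learn's TfidfVectorizer. This is a minimal sketch on toy sentences, not the Kaggle data; note the notebook below actually builds features with raw counts via CountVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy sentences standing in for the training texts
docs = ["the movie was awesome", "the movie was terrible"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)      # sparse document-term matrix of TF-IDF weights
print(sorted(tfidf.vocabulary_))   # the learned dictionary of features
print(X.shape)                     # (n_documents, n_features)
```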
In [1]:
import pandas as pd
import numpy as np
In [2]:
train_ds = pd.read_csv( "sentiment_train", delimiter="\t" )
In [3]:
train_ds.head( 10 )
Out[3]:
sentiment text
0 1 The Da Vinci Code book is just awesome.
1 1 this was the first clive cussler i've ever rea...
2 1 i liked the Da Vinci Code a lot.
3 1 i liked the Da Vinci Code a lot.
4 1 I liked the Da Vinci Code but it ultimatly did...
5 1 that's not even an exaggeration ) and at midni...
6 1 I loved the Da Vinci Code, but now I want some...
7 1 i thought da vinci code was great, same with k...
8 1 The Da Vinci Code is actually a good movie...
9 1 I thought the Da Vinci Code was a pretty good ...
  • Some of the texts are truncated by the display. To show full tweets, increase the max_colwidth option (its value is in characters, not pixels)
In [4]:
pd.set_option('display.max_colwidth', 800)
  • Let's look at some negative sentiment tweets
In [6]:
train_ds[train_ds.sentiment == 0][0:10]
Out[6]:
sentiment text
3943 0 da vinci code was a terrible movie.
3944 0 Then again, the Da Vinci code is super shitty movie, and it made like 700 million.
3945 0 The Da Vinci Code comes out tomorrow, which sucks.
3946 0 i thought the da vinci code movie was really boring.
3947 0 God, Yahoo Games has this truly-awful looking Da Vinci Code-themed skin on it's chessboard right now.
3948 0 Da Vinci Code does suck.
3949 0 And better...-We all know Da Vinci code is bogus and inaccurate.
3950 0 Last time, Da Vinci code is also a bit disappointing to me, because many things written in the book is never mentioned in movie.
3951 0 And better...-We all know Da Vinci code is bogus and inaccurate.
3952 0 And better..-We all know Da Vinci code is bogus and inaccurate.
In [7]:
train_ds.shape
Out[7]:
(6918, 2)
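Before building features it is worth checking how balanced the two sentiment classes are, since accuracy is only meaningful against the class split. Shown here on a toy frame (in the notebook, call it on train_ds.sentiment directly; the counts below are from the toy data, not the real dataset):

```python
import pandas as pd

# toy frame standing in for train_ds
toy = pd.DataFrame({"sentiment": [1, 1, 0, 1, 0],
                    "text": ["a", "b", "c", "d", "e"]})
counts = toy.sentiment.value_counts()  # number of rows per class
print(counts)
```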

Create a dictionary of features

  • Create a dictionary of features, then count the occurrences of each feature in every sentence
  • Every word that appears in any sentence becomes part of the dictionary
  • We will limit the number of features used in this tutorial to 5000
In [9]:
from sklearn.feature_extraction.text import CountVectorizer
In [10]:
count_vectorizer = CountVectorizer( max_features = 5000 )
In [15]:
feature_vector = count_vectorizer.fit( train_ds.text )
train_ds_features = count_vectorizer.transform( train_ds.text )
In [12]:
features = feature_vector.get_feature_names_out()

What are the features extracted?

In [14]:
features[0:10]
Out[14]:
['00', '007', '10', '10pm', '12', '16', '17', '1984', '1st', '200']

Count the occurrences of these features across all sentences

In [11]:
features_counts = np.sum( train_ds_features.toarray(), axis = 0 )
In [12]:
feature_counts = pd.DataFrame( dict( features = features,
                                  counts = features_counts ) )
In [13]:
feature_counts.head(5)
Out[13]:
counts features
0 1 00
1 1 007
2 4 10
3 1 10pm
4 1 12
In [17]:
feature_counts.sort_values( "counts", ascending = False )[1:20]
Out[17]:
counts features
93 2154 and
864 2093 harry
1466 2093 potter
355 2002 code
2009 2001 vinci
442 2001 da
1272 2000 mountain
259 2000 brokeback
1171 1624 love
1018 1520 is
2029 1176 was
151 1127 awesome
1252 1094 mission
977 1093 impossible
1132 974 like
1022 901 it
1916 808 to
1275 783 movie
1862 719 that
  • The features list contains quite a few stop words such as "and", "is" and "to", which carry little meaning for sentiment. Let's remove them and rebuild the dictionary
In [24]:
count_vectorizer = CountVectorizer( stop_words = "english",
                                 max_features = 5000 )
feature_vector = count_vectorizer.fit( train_ds.text )
train_ds_features = count_vectorizer.transform( train_ds.text )
In [25]:
features = feature_vector.get_feature_names_out()
features_counts = np.sum( train_ds_features.toarray(), axis = 0 )
feature_counts = pd.DataFrame( dict( features = features,
                                  counts = features_counts ) )
feature_counts.sort_values( "counts", ascending = False )[0:20]
Out[25]:
counts features
1328 2093 potter
790 2093 harry
314 2002 code
1823 2001 vinci
399 2001 da
1167 2000 mountain
223 2000 brokeback
1074 1624 love
126 1127 awesome
1150 1094 mission
892 1093 impossible
1035 974 like
1169 783 movie
1646 602 sucks
1644 600 sucked
792 578 hate
1393 374 really
1170 366 movies
1637 365 stupid
967 287 just

Build a Naive Bayes classifier

In [26]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
In [27]:
clf = GaussianNB()
In [28]:
train_X, test_X, train_y, test_y = train_test_split( train_ds_features,
                                                  train_ds.sentiment,
                                                  test_size = 0.3,
                                                  random_state = 42 )
In [29]:
clf.fit( train_X.toarray(), train_y )
Out[29]:
GaussianNB()

Test the model

In [30]:
test_ds_predicted = clf.predict( test_X.toarray() )

Evaluate the model

In [31]:
from sklearn import metrics
In [32]:
cm = metrics.confusion_matrix( test_y, test_ds_predicted )
In [33]:
cm
Out[33]:
array([[ 809,   64],
     [  19, 1184]])
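Reading the matrix: rows are actual labels, columns are predicted labels, in sorted label order (0 = negative, 1 = positive). Per-class precision and recall follow directly from the counts above; a quick check, assuming label 1 marks positive sentiment as in the training data:

```python
import numpy as np

# confusion matrix from the output above: rows = actual, cols = predicted
cm = np.array([[809, 64],
               [19, 1184]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions over all predictions
precision_pos = cm[1, 1] / cm[:, 1].sum()  # predicted-positive tweets that really were positive
recall_pos = cm[1, 1] / cm[1, :].sum()     # actually-positive tweets the model found
print(accuracy, precision_pos, recall_pos)
```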
In [34]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [39]:
sn.heatmap(cm, annot=True, fmt='d');
In [40]:
score = metrics.accuracy_score( test_y, test_ds_predicted )
In [41]:
score
Out[41]:
0.96001926782273606

The model looks good: accuracy on the held-out test set is about 96%.

Let's load the Azhar movie tweet data and classify it

In [60]:
# read the entire file into a python array
with open('azhar.json', 'r') as f:
  data = f.readlines()

# remove the trailing "\n" from each line
data = [line.rstrip() for line in data]
In [61]:
data_json_str = "[" + ','.join(data) + "]"
In [62]:
azhar_df = pd.read_json(data_json_str)
In [64]:
azhar_df.head( 1 )
Out[64]:
contributors coordinates created_at entities extended_entities favorite_count favorited filter_level geo id ... quoted_status_id quoted_status_id_str retweet_count retweeted retweeted_status source text timestamp_ms truncated user
0 NaN NaN 2016-05-13 08:26:27 {'symbols': [], 'urls': [], 'user_mentions': [{'name': 'BookMyShow', 'id': 10650592, 'id_str': '10650592', 'screen_name': 'bookmyshow', 'indices': [3, 14]}, {'name': 'Azhar', 'id': 3192221821, 'id_str': '3192221821', 'screen_name': 'AzharTheFilm', 'indices': [61, 74]}], 'hashtags': [{'text': 'Azhar', 'indices': [33, 39]}, {'text': 'AzharToday', 'indices': [49, 60]}]} NaN 0 False low NaN 731037880219504640 ... NaN NaN 0 False {'id': 731037553357283328, 'is_quote_status': False, 'in_reply_to_status_id_str': None, 'favorite_count': 37, 'lang': 'en', 'retweet_count': 41, 'coordinates': None, 'created_at': 'Fri May 13 08:25:09 +0000 2016', 'in_reply_to_screen_name': None, 'in_reply_to_status_id': None, 'retweeted': False, 'geo': None, 'truncated': False, 'text': '8. Name the city #Azhar is from? #AzharToday @AzharTheFilm', 'id_str': '731037553357283328', 'contributors': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'filter_level': 'low', 'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>', 'place': None, 'user': {'id': 10650592, 'profile_link_color': '0099CC', 'time_zone': 'Mumbai', 'protected': False, 'statuses_count': 26221, 'profile_background... <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a> RT @bookmyshow: 8. Name the city #Azhar is from? 
#AzharToday @AzharTheFilm 1463127987328 False {'id': 130861912, 'profile_link_color': '0084B4', 'time_zone': 'New Delhi', 'protected': False, 'statuses_count': 1237, 'profile_background_color': 'C0DEED', 'utc_offset': 19800, 'following': None, 'screen_name': 'Emmii69', 'lang': 'en', 'default_profile_image': False, 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/723584748564676609/OyUYX9-7_normal.jpg', 'created_at': 'Thu Apr 08 15:08:31 +0000 2010', 'is_translator': False, 'favourites_count': 506, 'profile_image_url': 'http://pbs.twimg.com/profile_images/723584748564676609/OyUYX9-7_normal.jpg', 'listed_count': 1, 'default_profile': True, 'profile_use_background_image': True, 'id_str': '130861912', 'url': None, 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'name': 'pratik...

1 rows × 31 columns

In [65]:
azhar_df = azhar_df[['text']]
In [66]:
azhar_df.head( 5 )
Out[66]:
text
0 RT @bookmyshow: 8. Name the city #Azhar is from? #AzharToday @AzharTheFilm
1 RT @bookmyshow: 3. True Or False: @ItsPrachiDesai plays Naureen in #Azhar? #AzharToday @AzharTheFilm
2 @bookmyshow @ItsPrachiDesai @AzharTheFilm #Azhar #AzharToday Q.3 True
3 RT @bookmyshow: 8. Name the city #Azhar is from? #AzharToday @AzharTheFilm
4 @bookmyshow @AzharTheFilm Q8 Ans : Hyderabad #AzharToday #Azhar
  • There are too many promotional tweets from bookmyshow; let's remove them.
In [67]:
azhar_df = azhar_df[~azhar_df.text.str.contains( "@bookmyshow" )]
In [68]:
azhar_df.head( 2 )
Out[68]:
text
6 RT @bollywood_life: @emraanhashmi hits a sixer with #Azhar. Read our 3 star review:https://t.co/bLmOS5slfW @balajimotionpic https://t.co/09…
9 RT @taran_adaarsh: #AZHAR is Outstanding..Don't miss it!\n@emraanhashmi's career's best performance &amp; his best film by far!! Must Watch film…

Convert these tweets into a document matrix using the dictionary created while building the model

In [69]:
azhar_text = count_vectorizer.transform( azhar_df.text )
In [70]:
azhar_text[1]
Out[70]:
<1x1921 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in Compressed Sparse Row format>
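The transform returns a SciPy sparse (CSR) matrix, which stores only the non-zero counts; GaussianNB works on dense arrays, which is why .toarray() appears in the next cell. A small illustration of the two representations:

```python
import numpy as np
from scipy.sparse import csr_matrix

row = csr_matrix(np.array([[0, 3, 0, 1]]))  # mostly-zero count vector
print(row.nnz)         # stored (non-zero) elements: 2
print(row.toarray())   # dense form: [[0 3 0 1]]
```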

Classify the tweets

In [72]:
azhar_df["sentiment"] = clf.predict( azhar_text.toarray() )
In [73]:
azhar_df[0:10]
Out[73]:
text sentiment
6 RT @bollywood_life: @emraanhashmi hits a sixer with #Azhar. Read our 3 star review:https://t.co/bLmOS5slfW @balajimotionpic https://t.co/09… 0
9 RT @taran_adaarsh: #AZHAR is Outstanding..Don't miss it!\n@emraanhashmi's career's best performance &amp; his best film by far!! Must Watch film… 1
10 RT @ursmehreen: Omg! Today is Friday! #Azhar release 😍 how can I be thinking that it's Thursday? 🙈 #oops 0
11 RT @taran_adaarsh: #AZHAR is Outstanding..Don't miss it!\n@emraanhashmi's career's best performance &amp; his best film by far!! Must Watch film… 1
13 RT @girishjohar: #Azhar starts on a comfortable note, strong WOM will ensure it has a good day 1 at the BO.... feedback is encouraging @emr… 1
14 RT @bobbytalkcinema: AZHAR - Interesting twist in the court case is here focussing on Lara Dutta. 0
16 Azhar Movie Review and Rating Hit or Flop Public Talk https://t.co/JFqo2P7WKP https://t.co/FesHpSZOzw 0
17 RT @itimestweets: Live #Azhar review: @emraanhashmi's portrayal of @azharflicks is truly admirable! @EmraanAddicted @TahaAmeen24 @BrotherOf… 0
19 RT @rajcheerfull: Looking forward to #Azhar . Congratulations &amp; best wishes @EkmainaurEktu7 @ItsPrachiDesai @emraanhashmi @azharflicks http… 1
20 @TrollKejri your review on #Azhar #Azharthefilm 0

Store the final results

In [74]:
azhar_df.to_csv( "azhar_sentiments.csv", index = False )