Deep Learning 4: IMDB Classification using Bag of Words and Embeddings

This tutorial walks through the steps for building a deep learning model for sentiment analysis. We will classify IMDB movie reviews as either positive or negative. This tutorial will be used for teaching during the workshop.

The tutorial draws content from various sources, including the tutorial at http://www.hvass-labs.org/, for the purpose of teaching the deep learning class.

The topics addressed in the tutorial:

  1. Basic exploration of the IMDB movies dataset.
  2. Tokenization, text to sequences, padding and truncating
  3. Building NN Model using Bag Of Words
  4. Building NN Model using Embeddings
  5. Peeking into Word Embeddings

We will mostly explore how to use Bag of Words and Word Embeddings vector representations of text and build plain vanilla NN models. In future tutorials, we will explore RNN and LSTM models.

IMDB Movie Reviews

The dataset is available at https://www.kaggle.com/c/word2vec-nlp-tutorial/data

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews.

Data Fields

  • id - Unique ID of each review
  • sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
  • review - Text of the review

Loading the dataset

In [1]:
import pandas as pd
import numpy as np
In [2]:
imdb_df = pd.read_csv('./labeledTrainData.tsv', sep = '\t')
In [3]:
pd.set_option('display.max_colwidth', 500)
imdb_df.head(5)
Out[3]:
id sentiment review
0 5814_8 1 With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle m...
1 2381_9 1 \The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different thin...
2 7759_3 0 The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwh...
3 3630_4 0 It must be assumed that those who praised this film (\the greatest filmed opera ever,\" didn't I read somewhere?) either don't care for opera, don't care for Wagner, or don't care about anything except their desire to appear Cultured. Either as a representation of Wagner's swan-song, or as a movie, this strikes me as an unmitigated disaster, with a leaden reading of the score matched to a tricksy, lugubrious realisation of the text.<br /><br />It's questionable that people with ideas as to w...
4 9495_8 1 Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-credits opening sequences somewhat give the false impression that we're dealing with a serious and harrowing drama, but you need not fear because barely ten minutes later we're up until our necks in nonsensical chainsaw battles, rough fist-fights, lurid dialogs and gratuitous nudity! Bo and Ingrid are two orphaned siblings with an unusually close and even slightly perverted relationship. Can you imagine playfully...

Data Tokenization

The text data needs to be converted into vectors using either a bag of words or an embeddings model. We will first explore the bag of words (BOW) model. In the BOW model, a sentence is represented as a vector whose dimensions correspond to the words (also called tokens) in the vocabulary.

To create these vectors, we first need to tokenize the sentences and find all unique tokens (words) used across all sentences. The corpus of unique words can be very large, so we limit the vocabulary to the most popular (most frequently used) words. In this example, we will use 10000 words.
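
As a quick illustration of what a BOW vector looks like, the Keras Tokenizer can produce such count vectors directly via its texts_to_matrix method. The snippet below is a minimal sketch on a made-up toy corpus (the variable names are only for illustration):

from keras.preprocessing.text import Tokenizer

toy_reviews = ["the movie was great great fun",
               "the movie was terrible"]

toy_tokenizer = Tokenizer(num_words=10)
toy_tokenizer.fit_on_texts(toy_reviews)

# Each review becomes a fixed-length vector with one dimension per token;
# mode='count' stores how many times each token occurs in the review.
bow_vectors = toy_tokenizer.texts_to_matrix(toy_reviews, mode='count')
print(bow_vectors.shape)   # (2, 10)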

In [4]:
from keras.preprocessing.text import Tokenizer
Using TensorFlow backend.
In [5]:
all_tokenizer = Tokenizer()
In [6]:
all_tokenizer.fit_on_texts( imdb_df.review )
In [7]:
all_tokenizer.document_count
Out[7]:
25000
In [8]:
len(all_tokenizer.word_counts)
Out[8]:
88582

There are 25000 documents (reviews) and 88582 unique words.

High-Frequency Words

In [9]:
list(all_tokenizer.word_counts.items())[0:10]
Out[9]:
[('with', 44122),
 ('all', 23953),
 ('this', 75974),
 ('stuff', 1171),
 ('going', 4094),
 ('down', 3707),
 ('at', 23507),
 ('the', 336148),
 ('moment', 1104),
 ('mj', 30)]

Low-Frequency Words

In [10]:
list(all_tokenizer.word_counts.items())[-10:]
Out[10]:
[('bear\x97and', 1),
 ('unflinchingly\x97what', 1),
 ('acids', 1),
 ("gaye's", 1),
 ('crahan', 1),
 ('guggenheim', 2),
 ('substitutions', 1),
 ("daeseleire's", 1),
 ('shortsightedness', 1),
 ('unfairness', 1)]

We can assume that the low-frequency words are rarely used to express sentiment, as they appear only once across all reviews, and choose to keep only the top N (for example 10000) words for our analysis. So, let's tokenize again with the number of words limited to 10000.

In [11]:
num_words = 10000
In [12]:
tokenizer = Tokenizer(num_words = num_words)
In [13]:
tokenizer.fit_on_texts( imdb_df.review )

Tokenizer provides 4 attributes that you can use to query what has been learned about your documents:

  • word_counts: A dictionary of words and their counts.
  • word_docs: A dictionary of words and how many documents each appeared in.
  • word_index: A dictionary of words and their uniquely assigned integers.
  • document_count: An integer count of the total number of documents that were used to fit the Tokenizer.
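
For example, once the tokenizer above has been fit, these attributes can be queried directly (a small illustrative snippet; the exact counts are not shown here):

# Number of reviews the tokenizer was fit on
print(tokenizer.document_count)      # 25000

# Number of reviews in which the word 'movie' appears
print(tokenizer.word_docs['movie'])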

Checking first few words and their counts

In [14]:
import itertools

x = itertools.islice(tokenizer.word_counts.items(), 0, 5)

for key, value in x:
    print(key, value)
with 44122
all 23953
this 75974
stuff 1171
going 4094

Checking words and their indexes

In [15]:
list(tokenizer.word_index.items())[0:10]
Out[15]:
[('clues', 3620),
 ('punch', 2841),
 ('reviewed', 6814),
 ("force's", 68085),
 ('unremittingly', 29449),
 ('dropped', 3451),
 ('mnouchkine', 61031),
 ('overindulgent', 77191),
 ('grossly', 8597),
 ('captured', 1999)]

The indexes are not in any particular order. We can sort the words by their index values, i.e. starting from 1.

In [16]:
from collections import OrderedDict
In [17]:
words_by_sorted_index = sorted(tokenizer.word_index.items(), 
                                           key=lambda idx: idx[1])
In [18]:
type(words_by_sorted_index)
Out[18]:
list
In [19]:
words_by_sorted_index[0:10]
Out[19]:
[('the', 1),
 ('and', 2),
 ('a', 3),
 ('of', 4),
 ('to', 5),
 ('is', 6),
 ('br', 7),
 ('in', 8),
 ('it', 9),
 ('i', 10)]

Encoding a text using the dictionary of tokens

Finding indexes of the words

In [27]:
tokenizer.word_index['the']
Out[27]:
1
In [28]:
tokenizer.word_index['movie']
Out[28]:
17
In [24]:
tokenizer.word_index['brilliant']
Out[24]:
526
In [34]:
tokenizer.texts_to_sequences( ["The movie gladiator is a brilliant movie"])
Out[34]:
[[1, 17, 8623, 6, 3, 526, 17]]
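
Note that texts_to_sequences only returns indices for words that the tokenizer has seen and that fall within the top num_words; any other word is simply dropped from the sequence (unless an oov_token is passed to the Tokenizer, in recent Keras versions). A small sketch of this behaviour, using a made-up word for illustration:

# 'zzzunknownzzz' was never seen by fit_on_texts, so it is skipped;
# the remaining known words are encoded as before.
tokenizer.texts_to_sequences(["The movie was zzzunknownzzz but brilliant"])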

Encoding all the movie reviews

Now the documents (reviews) will be encoded as per the dictionary.

In [41]:
%%time
sequences = tokenizer.texts_to_sequences(imdb_df.review)
CPU times: user 3.49 s, sys: 15.7 ms, total: 3.5 s
Wall time: 3.51 s

Let's look at the word index sequence for a specific review.

In [48]:
imdb_df.review[10:11]
Out[48]:
10    What happens when an army of wetbacks, towelheads, and Godless Eastern European commies gather their forces south of the border? Gary Busey kicks their butts, of course. Another laughable example of Reagan-era cultural fallout, Bulletproof wastes a decent supporting cast headed by L Q Jones and Thalmus Rasulala.
Name: review, dtype: object
In [50]:
np.array(sequences[10:11])
Out[50]:
array([[  48,  567,   51,   32, 1268,    4,    2, 4940, 1867, 5171,   65,
        1919, 1221,    4,    1, 3607, 1993, 6887, 3398,   65,    4,  261,
         157, 1319,  459,    4, 7801,  996, 2652, 6986,    3,  539,  693,
         174, 2847,   31, 2007, 3866, 1526,    2]])

Encode Y Variable

In [51]:
y = np.array(imdb_df.sentiment)
In [52]:
y[0:5]
Out[52]:
array([1, 1, 0, 0, 1])

How many classes are available?

In [193]:
imdb_df.sentiment.unique()
Out[193]:
array([1, 0])

Truncate and Pad Sequences

One of the problems in dealing with sentences is that they are not of the same size. Some sentences have more words and some have fewer. Neural networks require inputs of the same length for training a batch.

So, we need to choose an input length. Longer sentences will have to be truncated and shorter ones will need to be padded. But what length should we choose?

We should choose a length that covers most of the sentences, so that only a few need to be truncated. To decide, we will look at the distribution of review lengths in tokens.

In [53]:
num_tokens = [len(tokens) for tokens in sequences]
num_tokens = np.array(num_tokens)
In [54]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [57]:
sn.distplot( num_tokens );
In [58]:
mean_num_tokens = num_tokens.mean()
std_num_tokens = num_tokens.std()
In [59]:
mean_num_tokens
Out[59]:
224.05292
In [60]:
std_num_tokens
Out[60]:
164.12699046614364

If we want the chosen length to cover about 95% of the sentences, we can take the mean length plus 2 standard deviations.

In [63]:
max_review_length = int(mean_num_tokens + 2 * std_num_tokens)
In [64]:
max_review_length
Out[64]:
552

How many sentences will not be truncated at all?

In [65]:
np.sum(num_tokens < max_review_length) / len(num_tokens)
Out[65]:
0.94523999999999997

Almost 95%.

Now we will pad or truncate the sequences. Padding or truncating can be done either at the beginning or at the end of a sentence: 'pre' or 'post' specifies whether padding and truncating happen at the beginning or at the end.
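
A quick toy example of the difference between 'pre' and 'post' (a minimal sketch on a made-up 3-token sequence):

from keras.preprocessing.sequence import pad_sequences

toy_seq = [[5, 8, 13]]

# 'pre' adds zeros (and, when truncating, removes tokens) at the beginning
pad_sequences(toy_seq, maxlen=5, padding='pre')    # [[ 0,  0,  5,  8, 13]]

# 'post' pads and truncates at the end of the sequence
pad_sequences(toy_seq, maxlen=5, padding='post')   # [[ 5,  8, 13,  0,  0]]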

In [69]:
from keras.preprocessing.sequence import pad_sequences 
In [70]:
pad = 'pre'
In [71]:
X = pad_sequences(sequences, 
                  max_review_length, 
                  padding=pad, 
                  truncating=pad)
In [72]:
X[0:1]
Out[72]:
array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,   16,   29,   11,  535,  167,  177,   30,    1,  558,   16,
         204,  642, 2615,    5,   24,  225,  146,    1, 1028,  659,  130,
           2,   47,  293,    1,    2,  293,  171,  276,   10,   40,  178,
           5,   76,    3,  810, 2616,   80,   11,  229,   34,   10,  194,
          13,   63,  643,    8,    1, 4252,   40,    5,  276,   94,   53,
          58,  327,  723,   26,    6, 2512,   39, 1351,    6,  170, 5034,
         170,  788,   19,   60,   10,  374,  167,    5,   64,   30,    1,
         434,   51,    9,   13, 1816,  622,   46,    4,    9,   44, 1299,
        3432,   41,  544,  946,    1, 3513,    2,   79,    1,  574,  746,
           4, 1664,   23,   75,    7,    7, 2006, 1156,   18,    4,  261,
          11,    6,   29,   41,  485, 1878,   35,  891,   22, 2588,   37,
           8,  550,   92,   22,   23,  167,    5,  780,   11,    2,  166,
           9,  354,   46,  200,  680,   32,   15,    5,    1,  228,    4,
          11,   17,   18,    2,   88,    4,   24,  448,   59,  132,   12,
          26,   90,    9,   15,    1,  448,   60,   45,  280,    6,   63,
         324,    4,   87,    7,    7,    1,  776,  788,   19,  224,   51,
           9,  414,  514,    6,   61,   20,   15,  888,  231,   39,   35,
           1, 3537, 1670,  717,    2,  911,    6, 1075,   14,    3,   29,
         972, 1389, 1631,  135,   26,  490,  348,   35,   75,    6,  721,
          69,   85,   24, 2454,  911,  106,   12,   26,  470,   81,    5,
         121,    9,    6,   26,   34,    6, 1664,  520,   35,   10,  276,
          26,   40, 4138,  225,    7,    7,  772,    4,  643,  180,    8,
          11,   37, 1583,   80,    3,  516,    2,    3, 2353,    2,    1,
         223, 2119, 2718,  717,   79,    1,  164,  212,   25,   66,    1,
        5035,    4,    3, 5558,   51,    9,  382,    5, 1418,    1,   75,
         717,   14,  628,  904,  780,  777,   16,   28,  551,  384,  581,
           3,  223,  758,    4,   95, 3433,    3, 1311,  833,  133,    7,
           7, 1321,  344,   11,   17,    6,   15,   81,   34,   37,   20,
          28,  646,   39,  157,   60,   10,  101,    6,   88,   81,   45,
          21,   92,  785,  242,    9,  124,  350,    2,  199,  122,    3,
        7799,  746,    2, 3606, 2064,    8,   11,   17,    6,    3,  247,
         485, 1878,    6,  368,   28,    4,    1,   88, 1016,   81,  123,
           5, 1693,   11, 1220,   18,    6,   26, 2512,   70,   16,   29,
           1,  688,  204,  517,   11,  872, 7262,   70,   10,   89,  121,
          85,   81,   67,   27,  272,  493, 4558, 3584,   10,  121,   11,
          15,    3,  189,   26,    6,  342,   32,  573,  324,   18,  375,
         229,   39,   28,    4,    1,   88,   10,  437,   26,    6,   21,
           1, 1559]], dtype=int32)

Split Datasets

In [73]:
from sklearn.model_selection import train_test_split
In [74]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2)
In [75]:
X_train.shape
Out[75]:
(20000, 552)
In [76]:
X_test.shape
Out[76]:
(5000, 552)
In [78]:
input_shape = X_train.shape
In [79]:
input_shape
Out[79]:
(20000, 552)

Bag Of Words Model

Model Architecture

(Padded word-index sequences) -> Dense(1024) -> Relu -> Dense(256) -> Relu -> Dense(128) -> Relu -> Dense(64) -> Relu -> Dense(1) -> Sigmoid

In [124]:
from keras import backend as K
from keras.models import Sequential
from keras.layers import Flatten, Dense, Activation
In [125]:
K.clear_session()  # clear default graph


bow_model = Sequential()

bow_model.add(Dense(1024, input_shape=(input_shape[1],)))
bow_model.add(Activation('relu'))
# An "activation" is just a non-linear function applied to the output
# of the layer above. Here, with a "rectified linear unit",
# we clamp all values below 0 to 0.
bow_model.add(Dense(256))
bow_model.add(Activation('relu'))
bow_model.add(Dense(128))
bow_model.add(Activation('relu'))
bow_model.add(Dense(64))
bow_model.add(Activation('relu'))
bow_model.add(Dense(1))
# This special "softmax" activation among other things,
# ensures the output is a valid probaility distribution, that is
# that its values are all non-negative and sum to 1.
bow_model.add(Activation('sigmoid'))
In [126]:
bow_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 1024)              566272    
_________________________________________________________________
activation_1 (Activation)    (None, 1024)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               262400    
_________________________________________________________________
activation_2 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               32896     
_________________________________________________________________
activation_3 (Activation)    (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 64)                8256      
_________________________________________________________________
activation_4 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 65        
_________________________________________________________________
activation_5 (Activation)    (None, 1)                 0         
=================================================================
Total params: 869,889
Trainable params: 869,889
Non-trainable params: 0
_________________________________________________________________
In [127]:
bow_model.compile(loss='binary_crossentropy', 
              optimizer='rmsprop', 
              metrics=['accuracy'])
In [128]:
EPOCHS = 10
BATCH_SIZE = 256
In [129]:
%%time 
# fit model
bow_history = bow_model.fit(
    X_train, 
    y_train,  # prepared data
    batch_size = BATCH_SIZE,
    epochs = EPOCHS,
    validation_data = (X_test, y_test),
    shuffle = True,
    verbose=1,
)
Train on 20000 samples, validate on 5000 samples
Epoch 1/4
20000/20000 [==============================] - 2s 95us/step - loss: 8.0923 - acc: 0.4977 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 2/4
20000/20000 [==============================] - 2s 75us/step - loss: 8.0873 - acc: 0.4982 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 3/4
20000/20000 [==============================] - 2s 76us/step - loss: 8.0873 - acc: 0.4982 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 4/4
20000/20000 [==============================] - 2s 78us/step - loss: 8.0873 - acc: 0.4982 - val_loss: 7.9462 - val_acc: 0.5070
CPU times: user 30.6 s, sys: 2.27 s, total: 32.8 s
Wall time: 6.88 s
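
The model barely does better than chance (validation accuracy around 0.51) and the loss does not decrease. A likely reason is that X_train here contains padded word-index sequences rather than actual bag-of-words count vectors, so the dense layers are fed arbitrary integer IDs. A true BOW input can be built with the tokenizer's texts_to_matrix method and fed to a similar dense network. The sketch below shows the idea; it reuses imdb_df, tokenizer, y, num_words, BATCH_SIZE, EPOCHS and the imports from above, the train/test split is done afresh, and the smaller architecture is only one possible choice, so results will differ:

# Genuine bag-of-words vectors: one row per review, one column per token,
# values = how many times the token occurs in that review.
# Note: this builds a 25000 x 10000 float matrix, so it needs a few GB of RAM.
X_bow = tokenizer.texts_to_matrix(imdb_df.review, mode='count')

X_bow_train, X_bow_test, y_bow_train, y_bow_test = train_test_split(
    X_bow, y, test_size=0.2)

bow_count_model = Sequential()
bow_count_model.add(Dense(64, activation='relu', input_shape=(num_words,)))
bow_count_model.add(Dense(1, activation='sigmoid'))

bow_count_model.compile(loss='binary_crossentropy',
                        optimizer='rmsprop',
                        metrics=['accuracy'])

bow_count_model.fit(X_bow_train, y_bow_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_data=(X_bow_test, y_bow_test))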

Using Embeddings

In word embeddings, each word is represented by a vector, i.e. a series of numbers (weights). The vectors place the words in an N-dimensional space, in which words with similar meanings are located near each other while dissimilar words are kept far apart. The dimensions of the space represent latent factors by which the words can be described. Every word is assigned a weight for each latent factor, and words that share some common meaning have similar weights across those factors.

The word embedding weights can be estimated while building the NN model. There are also pre-built word embeddings available, which can be used in the model. We will discuss pre-built word embeddings later in the tutorial.

Word embeddings are commonly used in many Natural Language Processing (NLP) tasks because they are useful representations of words and often lead to better performance on the tasks performed. Given their widespread use, this tutorial seeks to introduce the concept of word embeddings to the prospective NLP practitioner.

Here is a good reference for understanding embeddings:

https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a

(Padded word-index sequences) -> Embedding(8) -> Flatten -> Dense(16) -> Relu -> Dense(1) -> Sigmoid

In [338]:
from keras.layers import Embedding
from keras.optimizers import SGD
In [165]:
K.clear_session()

emb_model = Sequential()
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs
emb_model.add(Embedding(num_words, 8, input_length=max_review_length))
# After the Embedding layer, 
# our activations have shape `(samples, maxlen, 8)`.

# We flatten the 3D tensor of embeddings 
# into a 2D tensor of shape `(samples, maxlen * 8)`
emb_model.add(Flatten())

emb_model.add(Dense(16))
emb_model.add(Activation('relu'))

# We add the classifier on top
emb_model.add(Dense(1))
emb_model.add(Activation('sigmoid'))
In [166]:
emb_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 552, 8)            80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 4416)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                70672     
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 150,689
Trainable params: 150,689
Non-trainable params: 0
_________________________________________________________________
In [167]:
sgd = SGD(lr=0.01, momentum=0.8)

emb_model.compile(optimizer=sgd, 
              loss='binary_crossentropy', 
              metrics=['accuracy'])

emb_history = emb_model.fit(X_train, 
                    y_train,
                    epochs=20,
                    batch_size=32,
                    validation_split=0.3)
Train on 14000 samples, validate on 6000 samples
Epoch 1/20
14000/14000 [==============================] - 1s 101us/step - loss: 0.6934 - acc: 0.5014 - val_loss: 0.6929 - val_acc: 0.5117
Epoch 2/20
14000/14000 [==============================] - 1s 96us/step - loss: 0.6919 - acc: 0.5211 - val_loss: 0.6925 - val_acc: 0.5098
Epoch 3/20
14000/14000 [==============================] - 1s 86us/step - loss: 0.6886 - acc: 0.5459 - val_loss: 0.6897 - val_acc: 0.5297
Epoch 4/20
14000/14000 [==============================] - 1s 83us/step - loss: 0.6790 - acc: 0.5910 - val_loss: 0.6722 - val_acc: 0.6012
Epoch 5/20
14000/14000 [==============================] - 1s 83us/step - loss: 0.6406 - acc: 0.6576 - val_loss: 0.6224 - val_acc: 0.6585
Epoch 6/20
14000/14000 [==============================] - 1s 87us/step - loss: 0.5617 - acc: 0.7290 - val_loss: 0.5301 - val_acc: 0.7433
Epoch 7/20
14000/14000 [==============================] - 1s 84us/step - loss: 0.4692 - acc: 0.7891 - val_loss: 0.4728 - val_acc: 0.7762
Epoch 8/20
14000/14000 [==============================] - 1s 89us/step - loss: 0.3990 - acc: 0.8295 - val_loss: 0.4433 - val_acc: 0.7930
Epoch 9/20
14000/14000 [==============================] - 1s 93us/step - loss: 0.3453 - acc: 0.8540 - val_loss: 0.4361 - val_acc: 0.7950
Epoch 10/20
14000/14000 [==============================] - 1s 89us/step - loss: 0.3113 - acc: 0.8731 - val_loss: 0.4238 - val_acc: 0.8095
Epoch 11/20
14000/14000 [==============================] - 1s 85us/step - loss: 0.2705 - acc: 0.8924 - val_loss: 0.3992 - val_acc: 0.8248
Epoch 12/20
14000/14000 [==============================] - 1s 92us/step - loss: 0.2399 - acc: 0.9086 - val_loss: 0.5166 - val_acc: 0.7938
Epoch 13/20
14000/14000 [==============================] - 1s 95us/step - loss: 0.2135 - acc: 0.9216 - val_loss: 0.4264 - val_acc: 0.8192
Epoch 14/20
14000/14000 [==============================] - 1s 97us/step - loss: 0.1928 - acc: 0.9297 - val_loss: 0.4707 - val_acc: 0.8125
Epoch 15/20
14000/14000 [==============================] - 1s 96us/step - loss: 0.1623 - acc: 0.9446 - val_loss: 0.4966 - val_acc: 0.7973
Epoch 16/20
14000/14000 [==============================] - 1s 94us/step - loss: 0.1451 - acc: 0.9534 - val_loss: 0.5332 - val_acc: 0.8038
Epoch 17/20
14000/14000 [==============================] - 1s 94us/step - loss: 0.1227 - acc: 0.9613 - val_loss: 0.5137 - val_acc: 0.8102
Epoch 18/20
14000/14000 [==============================] - 1s 83us/step - loss: 0.1062 - acc: 0.9690 - val_loss: 0.5468 - val_acc: 0.8092
Epoch 19/20
14000/14000 [==============================] - 1s 90us/step - loss: 0.0881 - acc: 0.9763 - val_loss: 0.6008 - val_acc: 0.8033
Epoch 20/20
14000/14000 [==============================] - 1s 82us/step - loss: 0.0766 - acc: 0.9811 - val_loss: 0.5865 - val_acc: 0.7805
In [133]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [134]:
def plot_accuracy(hist):
    plt.plot(hist['acc'])
    plt.plot(hist['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 
                'test'], 
               loc='upper left')
    plt.show()
    
def plot_loss(hist):
    plt.plot(hist['loss'])
    plt.plot(hist['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 
                'test'], 
               loc='upper left')
    plt.show()    
In [135]:
plot_accuracy( emb_history.history )
In [136]:
plot_loss( emb_history.history )

Conclusion:

The model is overfitting. The training accuracy is about 98%, whereas the validation accuracy is 80%.

Model 3

We will try another optimizer before applying regularization (dropouts). We will also add callbacks for reducing the learning rate and early stopping, and store TensorBoard logs for monitoring.

In [140]:
from keras_tqdm import TQDMNotebookCallback
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from keras.callbacks import TensorBoard
In [145]:
callbacks_list = [ReduceLROnPlateau(monitor='val_loss',
                                    factor=0.1, 
                                    patience=3),
                 EarlyStopping(monitor='val_loss',
                               patience=4),
                 ModelCheckpoint(filepath='imdb_model.h5',
                                 monitor='val_loss',
                                 save_best_only=True),
                 TensorBoard("./imdb_logs"),
                 TQDMNotebookCallback(leave_inner=True,
                                      leave_outer=True)]
In [146]:
K.clear_session()

emb_model_3 = Sequential()
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs
emb_model_3.add(Embedding(10000, 8, input_length=max_review_length))
# After the Embedding layer, 
# our activations have shape `(samples, maxlen, 8)`.

# We flatten the 3D tensor of embeddings 
# into a 2D tensor of shape `(samples, maxlen * 8)`
emb_model_3.add(Flatten())

emb_model_3.add(Dense(16))
emb_model_3.add(Activation('relu'))

# We add the classifier on top
emb_model_3.add(Dense(1))
emb_model_3.add(Activation('sigmoid'))

emb_model_3.compile(optimizer="adam", 
              loss='binary_crossentropy', 
              metrics=['accuracy'])

emb_history_3 = emb_model_3.fit(X_train, 
                                y_train,
                                epochs=20,
                                batch_size=32,
                                callbacks=callbacks_list,
                                validation_split=0.3)
Train on 14000 samples, validate on 6000 samples
Epoch 1/20
14000/14000 [==============================] - 2s 124us/step - loss: 0.5531 - acc: 0.6660 - val_loss: 0.3349 - val_acc: 0.8620
Epoch 2/20
14000/14000 [==============================] - 2s 116us/step - loss: 0.2133 - acc: 0.9169 - val_loss: 0.2886 - val_acc: 0.8802
Epoch 3/20
14000/14000 [==============================] - 2s 115us/step - loss: 0.0923 - acc: 0.9728 - val_loss: 0.3306 - val_acc: 0.8738
Epoch 4/20
14000/14000 [==============================] - 2s 114us/step - loss: 0.0380 - acc: 0.9925 - val_loss: 0.3852 - val_acc: 0.8668
Epoch 5/20
14000/14000 [==============================] - 2s 114us/step - loss: 0.0149 - acc: 0.9985 - val_loss: 0.4164 - val_acc: 0.8705
Epoch 6/20
14000/14000 [==============================] - 2s 115us/step - loss: 0.0076 - acc: 0.9994 - val_loss: 0.4601 - val_acc: 0.8685

In [290]:
plot_loss( emb_history_3.history )

Model 4

Add a dropout layer as a regularization layer for dealing with overfitting.

In [212]:
from keras.layers import Dropout

K.clear_session()

emb_model_4 = Sequential()
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs
emb_model_4.add(Embedding(10000, 
                          8, 
                          input_length=max_review_length,
                          name='layer_embedding'))
# After the Embedding layer, 
# our activations have shape `(samples, maxlen, 8)`.

# We flatten the 3D tensor of embeddings 
# into a 2D tensor of shape `(samples, maxlen * 8)`
emb_model_4.add(Flatten())

emb_model_4.add(Dense(16))
emb_model_4.add(Activation('relu'))

emb_model_4.add(Dropout(0.8))

# We add the classifier on top
emb_model_4.add(Dense(1))
emb_model_4.add(Activation('sigmoid'))
emb_model_4.compile(optimizer="adam", 
              loss='binary_crossentropy', 
              metrics=['accuracy'])

emb_history_4 = emb_model_4.fit(X_train, 
                    y_train,
                    epochs=20,
                    batch_size=32,
                    callbacks = callbacks_list,
                    validation_split=0.3)
Train on 14000 samples, validate on 6000 samples
Epoch 1/20
14000/14000 [==============================] - 2s 129us/step - loss: 0.6916 - acc: 0.5183 - val_loss: 0.6617 - val_acc: 0.7092
Epoch 2/20
14000/14000 [==============================] - 2s 125us/step - loss: 0.4864 - acc: 0.7567 - val_loss: 0.3346 - val_acc: 0.8675
Epoch 3/20
14000/14000 [==============================] - 2s 125us/step - loss: 0.3221 - acc: 0.8449 - val_loss: 0.2887 - val_acc: 0.8798
Epoch 4/20
14000/14000 [==============================] - 2s 126us/step - loss: 0.2663 - acc: 0.8671 - val_loss: 0.2879 - val_acc: 0.8825
Epoch 5/20
14000/14000 [==============================] - 2s 129us/step - loss: 0.2205 - acc: 0.8844 - val_loss: 0.3160 - val_acc: 0.8782
Epoch 6/20
14000/14000 [==============================] - 2s 124us/step - loss: 0.1905 - acc: 0.8912 - val_loss: 0.3265 - val_acc: 0.8797
Epoch 7/20
14000/14000 [==============================] - 2s 126us/step - loss: 0.1692 - acc: 0.9000 - val_loss: 0.3761 - val_acc: 0.8737
Epoch 8/20
14000/14000 [==============================] - 2s 123us/step - loss: 0.1548 - acc: 0.9046 - val_loss: 0.4000 - val_acc: 0.8735

In [215]:
plot_accuracy( emb_history_4.history )
In [216]:
plot_loss( emb_history_4.history )

Model 5

Increase the dropout rate from 0.8 to 0.9.

In [158]:
from keras.layers import Dropout

K.clear_session()

emb_model_5 = Sequential()
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs
emb_model_5.add(Embedding(10000, 8, input_length=max_review_length))
# After the Embedding layer, 
# our activations have shape `(samples, maxlen, 8)`.

# We flatten the 3D tensor of embeddings 
# into a 2D tensor of shape `(samples, maxlen * 8)`
emb_model_5.add(Flatten())

emb_model_5.add(Dense(16))
emb_model_5.add(Activation('relu'))

emb_model_5.add(Dropout(0.9))

# We add the classifier on top
emb_model_5.add(Dense(1))
emb_model_5.add(Activation('sigmoid'))
emb_model_5.compile(optimizer="adam", 
              loss='binary_crossentropy', 
              metrics=['accuracy'])

emb_history_5 = emb_model_5.fit(X_train, 
                                y_train,
                                epochs=20,
                                batch_size=32,
                                callbacks = callbacks_list,
                                validation_split=0.3)
Train on 14000 samples, validate on 6000 samples
Epoch 1/20
14000/14000 [==============================] - 2s 130us/step - loss: 0.6940 - acc: 0.4914 - val_loss: 0.6931 - val_acc: 0.5052
Epoch 2/20
14000/14000 [==============================] - 2s 125us/step - loss: 0.6928 - acc: 0.5029 - val_loss: 0.6929 - val_acc: 0.5115
Epoch 3/20
14000/14000 [==============================] - 2s 126us/step - loss: 0.6889 - acc: 0.5183 - val_loss: 0.6548 - val_acc: 0.7153
Epoch 4/20
14000/14000 [==============================] - 2s 124us/step - loss: 0.5774 - acc: 0.6536 - val_loss: 0.4220 - val_acc: 0.8590
Epoch 5/20
14000/14000 [==============================] - 2s 124us/step - loss: 0.4768 - acc: 0.7139 - val_loss: 0.3508 - val_acc: 0.8720
Epoch 6/20
14000/14000 [==============================] - 2s 125us/step - loss: 0.4347 - acc: 0.7359 - val_loss: 0.3312 - val_acc: 0.8690
Epoch 7/20
14000/14000 [==============================] - 2s 123us/step - loss: 0.3993 - acc: 0.7454 - val_loss: 0.3070 - val_acc: 0.8793
Epoch 8/20
14000/14000 [==============================] - 2s 124us/step - loss: 0.3825 - acc: 0.7544 - val_loss: 0.2976 - val_acc: 0.8797
Epoch 9/20
14000/14000 [==============================] - 2s 123us/step - loss: 0.3710 - acc: 0.7531 - val_loss: 0.3018 - val_acc: 0.8703
Epoch 10/20
14000/14000 [==============================] - 2s 124us/step - loss: 0.3533 - acc: 0.7644 - val_loss: 0.3013 - val_acc: 0.8727
Epoch 11/20
14000/14000 [==============================] - 2s 124us/step - loss: 0.3436 - acc: 0.7688 - val_loss: 0.3090 - val_acc: 0.8740
Epoch 12/20
14000/14000 [==============================] - 2s 126us/step - loss: 0.3294 - acc: 0.7694 - val_loss: 0.3107 - val_acc: 0.8750

In [161]:
plot_accuracy( emb_history_5.history )

Checking performance on test set

We will use model 4 to check performance on the test set and make predictions.

In [174]:
emb_model_4.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 552, 8)            80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 4416)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                70672     
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 150,689
Trainable params: 150,689
Non-trainable params: 0
_________________________________________________________________
In [180]:
X_train
Out[180]:
array([[   0,    0,    0, ...,  690,  454,  155],
       [   0,    0,    0, ...,   96,  114, 2324],
       [ 210,   77,  991, ..., 1570, 6945, 2720],
       ..., 
       [  25,   74,  446, ...,   36,    5,  741],
       [ 162,  181,   49, ..., 6855,  787,  155],
       [   0,    0,    0, ...,    5,  403, 1433]], dtype=int32)
In [182]:
result = emb_model_4.evaluate(X_test, y_test)
5000/5000 [==============================] - 0s 32us/step
In [183]:
print("Accuracy: {0:.2%}".format(result[1]))
Accuracy: 86.96%

Predicting Test Data and Confusion Matrix

We will predict the classes using model 4 and build the confusion matrix to understand precision and recall.

In [217]:
y_pred = emb_model_4.predict_classes(X_test[0:1000])
In [218]:
y_pred[0:10]
Out[218]:
array([[1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1]], dtype=int32)
In [219]:
from sklearn import metrics

cm = metrics.confusion_matrix( y_test[0:1000],
                            y_pred, [1,0] )
In [220]:
sn.heatmap(cm, annot=True,  
           fmt='.2f', 
           xticklabels = ["Positive", "Negative"] , 
           yticklabels = ["Positive", "Negative"] )

plt.ylabel('True label')
plt.xlabel('Predicted label');
plt.title( 'Confusion Matrix for Sentiment Classification');
In [221]:
from sklearn.metrics import classification_report
In [222]:
print( classification_report(y_test[0:1000], 
                             y_pred))
             precision    recall  f1-score   support

          0       0.88      0.84      0.86       485
          1       0.86      0.89      0.87       515

avg / total       0.87      0.87      0.87      1000

ROC Curve and AUC Score

In [223]:
y_pred_probs = emb_model_4.predict(X_test[0:1000])
In [224]:
auc_score = metrics.roc_auc_score( y_test[0:1000], 
                                  y_pred_probs  )

fpr, tpr, thresholds = metrics.roc_curve( y_test[0:1000],
                                         y_pred_probs,
                                         drop_intermediate = False )

plt.figure(figsize=(8, 6))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

The AUC score is 0.94, which indicates a very good model. We will not discuss the optimal cutoff probability here.

Peeking into Word Embeddings

We will look at the embeddings estimated for different words and check whether they are placed near or far from each other according to their meaning.

In [225]:
layer_embedding = emb_model_4.get_layer('layer_embedding')
In [226]:
weights_embedding = layer_embedding.get_weights()[0]
In [227]:
weights_embedding.shape
Out[227]:
(10000, 8)
In [241]:
def get_embeddings( word ):
    token = tokenizer.word_index[word]
    return weights_embedding[token]
In [262]:
good = get_embeddings('good')
good
Out[262]:
array([ 0.01776998,  0.0682123 , -0.01675226, -0.04782688,  0.03007047,
       -0.01011649, -0.01771963,  0.01861562], dtype=float32)
In [248]:
great = get_embeddings('great')
great
Out[248]:
array([-0.13709876, -0.20187837, -0.19008374,  0.07072515, -0.01821891,
       -0.02970931,  0.1139439 ,  0.32559609], dtype=float32)
In [249]:
bad = get_embeddings('bad')
bad
Out[249]:
array([ 0.18650791,  0.20808348,  0.13235158, -0.14478587, -0.01979927,
       -0.06167486, -0.26055247, -0.3227689 ], dtype=float32)
In [250]:
terrible = get_embeddings('terrible')
terrible
Out[250]:
array([ 0.21999243,  0.36748612,  0.3505215 , -0.1196411 , -0.13720652,
        0.0443155 , -0.36997846, -0.27958751], dtype=float32)

We will calculate the Euclidean distance between the word embeddings.

In [246]:
from scipy.spatial.distance import cdist
In [339]:
def get_distance( word1, word2 ):
    
    word1_token = tokenizer.word_index[word1]
    word2_token = tokenizer.word_index[word2]    
    
    return cdist([weights_embedding[word1_token]], 
                 [weights_embedding[word2_token]], 
                 metric = 'euclidean')
In [271]:
get_distance( 'good', 
             'great' )
Out[271]:
array([[ 0.37356725]])
In [272]:
get_distance( 'good', 'bad' )
Out[272]:
array([[ 0.66995235]])
In [273]:
get_distance( 'terrible', 'bad' )
Out[273]:
array([[ 0.33707253]])
In [274]:
get_distance( 'good', 'terrible' )
Out[274]:
array([[ 0.89196814]])
In [275]:
get_distance( 'great', 'terrible' )
Out[275]:
array([[ 1.18332124]])

It can be observed that the words good and great are placed close together, as are bad and terrible, while good and terrible are far apart. This indicates that the embeddings have captured the meanings of the words based on how they are used in sentences expressing positive and negative sentiment.

Some more examples with words expressing sentiment.

In [285]:
get_distance('wonderful','awesome')
Out[285]:
array([[ 0.37351485]])
In [286]:
get_distance('wonderful','pathetic')
Out[286]:
array([[ 0.9278662]])
In [287]:
get_distance('awesome','pathetic')
Out[287]:
array([[ 0.64035885]])
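
Euclidean distance is used above; cosine distance is another common way to compare embedding vectors, and cdist supports it directly. A small variation on get_distance, shown as a sketch:

def get_cosine_distance( word1, word2 ):
    # Cosine distance compares only the direction of the vectors,
    # ignoring their lengths.
    word1_token = tokenizer.word_index[word1]
    word2_token = tokenizer.word_index[word2]
    return cdist([weights_embedding[word1_token]],
                 [weights_embedding[word2_token]],
                 metric = 'cosine')

get_cosine_distance('good', 'great')
get_cosine_distance('good', 'terrible')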

Applying Pre-trained Embeddings

Word embeddings are generally computed using word-occurrence statistics (observations about which words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started to take off in research and industry applications after the release of one of the most famous and successful word-embedding schemes: the Word2vec algorithm (https://code.google.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties, such as gender.

There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available precomputed embeddings for millions of English tokens, obtained from Wikipedia data and Common Crawl data.

One of the most widely used pretrained word embeddings is GloVe, which can be downloaded from https://nlp.stanford.edu/projects/glove/

We will use GloVe embeddings pre-computed from 2014 English Wikipedia. They come as an 822 MB zip file named glove.6B.zip, containing 100-dimensional embedding vectors for 400,000 words (or non-word tokens).

In [288]:
import os

glove_dir = '/Users/manaranjan/Documents/Work/MyLearnings/DeepLearning/DL_DraftCourse/DLP Book/data'

embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
line_num = 0

for line in f:
    ## The following code is done for printing the first line 
    if( line_num == 0):
        print( line )
        line_num += 1
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062

Found 400000 word vectors.

Get the word indexes from our tokenizer, which contains the indexes of the words in our corpus.

In [291]:
word_index = tokenizer.word_index
In [292]:
embedding_dim = 100  # because we downloaded the 100-dimensional GloVe embeddings
max_words = 10000

### The embedding matrix will have shape (max_words, embedding_dim)
embedding_matrix = np.zeros((max_words, 
                             embedding_dim))

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

Embedding Model

In [303]:
K.clear_session()

pre_trained_emb_model = Sequential()
pre_trained_emb_model.add(Embedding(max_words, 
                                    embedding_dim, 
                                    input_length=max_review_length))
pre_trained_emb_model.add(Flatten())
pre_trained_emb_model.add(Dense(32, activation='relu'))
pre_trained_emb_model.add(Dense(1, activation='sigmoid'))
pre_trained_emb_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 552, 100)          1000000   
_________________________________________________________________
flatten_1 (Flatten)          (None, 55200)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                1766432   
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
=================================================================
Total params: 2,766,465
Trainable params: 2,766,465
Non-trainable params: 0
_________________________________________________________________

The Embedding layer has a single weight matrix: a 2D float matrix in which row i is the word vector associated with index i. Simple enough. Let's load the GloVe matrix we prepared into our Embedding layer, the first layer in our model:

We could also freeze the embedding layer (set its trainable attribute to False), following the same rationale as with pre-trained convnet features: when parts of a model are pre-trained (like our Embedding layer) and parts are randomly initialized (like our classifier), the pre-trained parts ideally should not be updated during training, to avoid forgetting what they already know, since the large gradient updates triggered by the randomly initialized layers would be very disruptive to the already-learned features. In this tutorial, however, we leave the embedding layer trainable, so the GloVe weights serve as an initialization and are fine-tuned during training.

In [304]:
pre_trained_emb_model.layers[0].set_weights([embedding_matrix])
pre_trained_emb_model.layers[0].trainable = True
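
The cell above keeps the embedding layer trainable, so the GloVe vectors are fine-tuned along with the rest of the network. To actually freeze them, as the rationale above suggests, the trainable flag would be set to False before compiling. A minimal sketch, assuming the same pre_trained_emb_model and embedding_matrix:

# Load the pre-trained GloVe weights into the Embedding layer
pre_trained_emb_model.layers[0].set_weights([embedding_matrix])

# Freeze the layer so the GloVe vectors are not updated during training;
# the flag must be set before compile() for it to take effect.
pre_trained_emb_model.layers[0].trainable = False

pre_trained_emb_model.compile(optimizer='rmsprop',
                              loss='binary_crossentropy',
                              metrics=['accuracy'])
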
In [332]:
callbacks_list = [ReduceLROnPlateau(monitor='val_loss',
                                    factor=0.1, 
                                    patience=3),
                 EarlyStopping(monitor='val_loss',
                               patience=4),
                 ModelCheckpoint(filepath='imdb_pretrained_model.h5',
                                 monitor='val_loss',
                                 save_best_only=True),
                 TensorBoard("./imdb_pretrained_logs"),
                 TQDMNotebookCallback(leave_inner=True,
                                      leave_outer=True)]
In [333]:
pre_trained_emb_model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

pre_trained_emb_history = pre_trained_emb_model.fit(X_train, 
                                                    y_train,
                                                    epochs=10,
                                                    batch_size=128,
                                                    callbacks = callbacks_list,
                                                    validation_split=0.1)
Train on 18000 samples, validate on 2000 samples
Epoch 1/10
18000/18000 [==============================] - 14s 758us/step - loss: 0.3039 - acc: 0.8707 - val_loss: 0.4080 - val_acc: 0.8310
Epoch 2/10
18000/18000 [==============================] - 12s 674us/step - loss: 0.2425 - acc: 0.8995 - val_loss: 0.5436 - val_acc: 0.8065
Epoch 3/10
18000/18000 [==============================] - 13s 736us/step - loss: 0.2050 - acc: 0.9137 - val_loss: 0.5876 - val_acc: 0.8415
Epoch 4/10
18000/18000 [==============================] - 14s 796us/step - loss: 0.1789 - acc: 0.9231 - val_loss: 0.5079 - val_acc: 0.8375
Epoch 5/10
18000/18000 [==============================] - 13s 699us/step - loss: 0.1501 - acc: 0.9359 - val_loss: 0.6891 - val_acc: 0.8215

Embedding Layer with Dropouts

In [334]:
K.clear_session()

pre_trained_emb_model = Sequential()
pre_trained_emb_model.add(Embedding(max_words, 
                                    embedding_dim, 
                                    input_length=max_review_length))
pre_trained_emb_model.add(Flatten())
pre_trained_emb_model.add(Dense(64))
pre_trained_emb_model.add(Activation('relu'))
pre_trained_emb_model.add(Dropout(0.4))

pre_trained_emb_model.add(Dense(1))
pre_trained_emb_model.add(Activation('sigmoid'))
pre_trained_emb_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 552, 100)          1000000   
_________________________________________________________________
flatten_1 (Flatten)          (None, 55200)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                3532864   
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 4,532,929
Trainable params: 4,532,929
Non-trainable params: 0
_________________________________________________________________
In [335]:
pre_trained_emb_model.layers[0].set_weights([embedding_matrix])
pre_trained_emb_model.layers[0].trainable = True
In [336]:
pre_trained_emb_model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

pre_trained_emb_history = pre_trained_emb_model.fit(X_train, 
                                                    y_train,
                                                    epochs=30,
                                                    batch_size=256,
                                                    callbacks=callbacks_list,
                                                    validation_split=0.2)
Train on 16000 samples, validate on 4000 samples
Epoch 1/30
16000/16000 [==============================] - 9s 570us/step - loss: 1.1220 - acc: 0.5217 - val_loss: 0.6799 - val_acc: 0.6235
Epoch 2/30
16000/16000 [==============================] - 9s 572us/step - loss: 0.6608 - acc: 0.5940 - val_loss: 0.6325 - val_acc: 0.6767
Epoch 3/30
16000/16000 [==============================] - 10s 613us/step - loss: 0.6135 - acc: 0.6401 - val_loss: 0.5601 - val_acc: 0.7710
Epoch 4/30
16000/16000 [==============================] - 10s 647us/step - loss: 0.5521 - acc: 0.6788 - val_loss: 0.5099 - val_acc: 0.7907
Epoch 5/30
16000/16000 [==============================] - 10s 641us/step - loss: 0.5029 - acc: 0.7136 - val_loss: 0.4985 - val_acc: 0.7673
Epoch 6/30
16000/16000 [==============================] - 11s 658us/step - loss: 0.4508 - acc: 0.7638 - val_loss: 0.4241 - val_acc: 0.8255
Epoch 7/30
16000/16000 [==============================] - 9s 573us/step - loss: 0.3899 - acc: 0.8123 - val_loss: 0.3860 - val_acc: 0.8372
Epoch 8/30
16000/16000 [==============================] - 9s 574us/step - loss: 0.3500 - acc: 0.8449 - val_loss: 0.3771 - val_acc: 0.8417
Epoch 9/30
16000/16000 [==============================] - 9s 567us/step - loss: 0.2962 - acc: 0.8683 - val_loss: 0.3944 - val_acc: 0.8322
Epoch 10/30
16000/16000 [==============================] - 9s 576us/step - loss: 0.2705 - acc: 0.8867 - val_loss: 0.3698 - val_acc: 0.8535
Epoch 11/30
16000/16000 [==============================] - 9s 572us/step - loss: 0.2176 - acc: 0.9112 - val_loss: 0.4714 - val_acc: 0.8392
Epoch 12/30
16000/16000 [==============================] - 9s 574us/step - loss: 0.1964 - acc: 0.9218 - val_loss: 0.4646 - val_acc: 0.8260
Epoch 13/30
16000/16000 [==============================] - 9s 579us/step - loss: 0.1671 - acc: 0.9346 - val_loss: 0.4578 - val_acc: 0.8532
Epoch 14/30
16000/16000 [==============================] - 9s 571us/step - loss: 0.1485 - acc: 0.9426 - val_loss: 0.4541 - val_acc: 0.8523

In [337]:
plot_accuracy(pre_trained_emb_history.history)

Excellent References

For further exploration and better understanding, you can use the following references.