# Deep Learning 5: IMDB Classification - RNN and LSTM¶

We have explored techniques for encoding text data with the Bag of Words and Embeddings models, and then used the encoded vectors to build simple NN models to predict whether the sentiment of a sentence is positive or negative. In this tutorial we will go further and explore advanced models for sequence modelling: Recurrent Neural Networks (RNNs) and LSTMs. RNN and LSTM models are widely used in natural language processing and time series prediction because they can incorporate the temporal or sequential dependency of the features (words), i.e. the meaning of a sentence depends on where each word appears in it.

But first, we will start by loading and encoding the sentences.

### IMDB Movie Reviews¶

The dataset is available at https://www.kaggle.com/c/word2vec-nlp-tutorial/data

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews.

Data Fields

• id - Unique ID of each review
• sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
• review - Text of the review

In [35]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Activation, Dense, Embedding, SimpleRNN
from keras import backend as K
from keras_tqdm import TQDMNotebookCallback
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from keras.callbacks import TensorBoard
In [3]:
imdb_df = pd.read_csv('./labeledTrainData.tsv', sep = '\t')
In [4]:
pd.set_option('display.max_colwidth', 500)
imdb_df.head()
Out[4]:
id sentiment review
0 5814_8 1 With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle m...
1 2381_9 1 \The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different thin...
2 7759_3 0 The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwh...
3 3630_4 0 It must be assumed that those who praised this film (\the greatest filmed opera ever,\" didn't I read somewhere?) either don't care for opera, don't care for Wagner, or don't care about anything except their desire to appear Cultured. Either as a representation of Wagner's swan-song, or as a movie, this strikes me as an unmitigated disaster, with a leaden reading of the score matched to a tricksy, lugubrious realisation of the text.<br /><br />It's questionable that people with ideas as to w...
4 9495_8 1 Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-credits opening sequences somewhat give the false impression that we're dealing with a serious and harrowing drama, but you need not fear because barely ten minutes later we're up until our necks in nonsensical chainsaw battles, rough fist-fights, lurid dialogs and gratuitous nudity! Bo and Ingrid are two orphaned siblings with an unusually close and even slightly perverted relationship. Can you imagine playfully...

### Data Tokenization¶

In [19]:
from keras.preprocessing.text import Tokenizer
In [20]:
num_words = 10000
In [21]:
tokenizer = Tokenizer(num_words = num_words)
In [22]:
tokenizer.fit_on_texts( imdb_df.review )

Tokenizer provides 4 attributes that you can use to query what has been learned about your documents:

• word_counts: A dictionary of words and their counts.
• word_docs: A dictionary of words and how many documents each appeared in.
• word_index: A dictionary of words and their uniquely assigned integers.
• document_count: An integer count of the total number of documents that were used to fit the Tokenizer.
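As an illustration of what these attributes hold, the bookkeeping can be sketched in plain Python (a toy corpus and a simplified index; this is not the Keras implementation):

```python
from collections import Counter

# Toy corpus standing in for the reviews (illustration only).
docs = ["the movie was good", "the movie was bad", "good plot"]

# word_counts: total occurrences of each word across all documents.
word_counts = Counter(w for d in docs for w in d.split())
# word_docs: number of documents each word appears in.
word_docs = Counter(w for d in docs for w in set(d.split()))
# document_count: how many documents were used to fit the tokenizer.
document_count = len(docs)
# word_index: more frequent words get smaller integers; 0 is reserved for padding.
word_index = {w: i + 1 for i, (w, _) in enumerate(word_counts.most_common())}

print(word_counts["the"], word_docs["good"], document_count)  # 2 2 3
```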

### Encoding the documents¶

#### Sequences of word indexes¶

In [23]:
sequences = tokenizer.texts_to_sequences(imdb_df.review)
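What this call does can be sketched in plain Python (toy vocabulary assumed): each word is replaced by its integer index, and words that are unknown or whose index is not below num_words are dropped.

```python
# Toy vocabulary standing in for tokenizer.word_index (illustration only).
word_index = {"the": 1, "movie": 2, "was": 3, "good": 4, "bad": 5}
num_words = 5

def to_sequence(text):
    # Keras keeps a word only if its index is below num_words; unknown words are skipped.
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < num_words]

print(to_sequence("The movie was good"))  # [1, 2, 3, 4]
print(to_sequence("bad acting"))          # [] ('bad' has index 5; 'acting' is unknown)
```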

### Encode Y Variable¶

In [24]:
y = np.array(imdb_df.sentiment)
In [25]:
y[0:5]
Out[25]:
array([1, 1, 0, 0, 1])

### Trim X¶

In [26]:
from keras.preprocessing.sequence import pad_sequences

max_review_length = 552

max_review_length,
In [27]:
X[0:1]
Out[27]:
array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
0,   16,   29,   11,  535,  167,  177,   30,    1,  558,   16,
204,  642, 2615,    5,   24,  225,  146,    1, 1028,  659,  130,
2,   47,  293,    1,    2,  293,  171,  276,   10,   40,  178,
5,   76,    3,  810, 2616,   80,   11,  229,   34,   10,  194,
13,   63,  643,    8,    1, 4252,   40,    5,  276,   94,   53,
58,  327,  723,   26,    6, 2512,   39, 1351,    6,  170, 5034,
170,  788,   19,   60,   10,  374,  167,    5,   64,   30,    1,
434,   51,    9,   13, 1816,  622,   46,    4,    9,   44, 1299,
3432,   41,  544,  946,    1, 3513,    2,   79,    1,  574,  746,
4, 1664,   23,   75,    7,    7, 2006, 1156,   18,    4,  261,
11,    6,   29,   41,  485, 1878,   35,  891,   22, 2588,   37,
8,  550,   92,   22,   23,  167,    5,  780,   11,    2,  166,
9,  354,   46,  200,  680,   32,   15,    5,    1,  228,    4,
11,   17,   18,    2,   88,    4,   24,  448,   59,  132,   12,
26,   90,    9,   15,    1,  448,   60,   45,  280,    6,   63,
324,    4,   87,    7,    7,    1,  776,  788,   19,  224,   51,
9,  414,  514,    6,   61,   20,   15,  888,  231,   39,   35,
1, 3537, 1670,  717,    2,  911,    6, 1075,   14,    3,   29,
972, 1389, 1631,  135,   26,  490,  348,   35,   75,    6,  721,
69,   85,   24, 2454,  911,  106,   12,   26,  470,   81,    5,
121,    9,    6,   26,   34,    6, 1664,  520,   35,   10,  276,
26,   40, 4138,  225,    7,    7,  772,    4,  643,  180,    8,
11,   37, 1583,   80,    3,  516,    2,    3, 2353,    2,    1,
223, 2119, 2718,  717,   79,    1,  164,  212,   25,   66,    1,
5035,    4,    3, 5558,   51,    9,  382,    5, 1418,    1,   75,
717,   14,  628,  904,  780,  777,   16,   28,  551,  384,  581,
3,  223,  758,    4,   95, 3433,    3, 1311,  833,  133,    7,
7, 1321,  344,   11,   17,    6,   15,   81,   34,   37,   20,
28,  646,   39,  157,   60,   10,  101,    6,   88,   81,   45,
21,   92,  785,  242,    9,  124,  350,    2,  199,  122,    3,
7799,  746,    2, 3606, 2064,    8,   11,   17,    6,    3,  247,
485, 1878,    6,  368,   28,    4,    1,   88, 1016,   81,  123,
5, 1693,   11, 1220,   18,    6,   26, 2512,   70,   16,   29,
1,  688,  204,  517,   11,  872, 7262,   70,   10,   89,  121,
85,   81,   67,   27,  272,  493, 4558, 3584,   10,  121,   11,
15,    3,  189,   26,    6,  342,   32,  573,  324,   18,  375,
229,   39,   28,    4,    1,   88,   10,  437,   26,    6,   21,
1, 1559]], dtype=int32)
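The zeros at the front of the array above are the effect of pad_sequences' defaults (padding='pre', truncating='pre'). The behaviour can be sketched in plain Python:

```python
# Sketch of pre-padding and pre-truncation, the pad_sequences defaults.
def pad(seq, maxlen, value=0):
    seq = seq[-maxlen:]                          # truncating='pre': keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # padding='pre': prepend zeros

print(pad([5, 6, 7], 5))           # [0, 0, 5, 6, 7]
print(pad([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```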

### Split Datasets¶

In [28]:
from sklearn.model_selection import train_test_split
In [29]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.2)
In [30]:
X_train.shape
Out[30]:
(20000, 552)
In [31]:
X_test.shape
Out[31]:
(5000, 552)
In [32]:
input_shape = X_train.shape
In [33]:
input_shape
Out[33]:
(20000, 552)

### Model 1 - RNN and Embeddings¶

In a simple NN model, the model learns only from whether the features (here, words) exist in the sentence or not. The presence or absence of a word, along with the weight associated with it, decides the sentiment of the sentence; the position or sequence of the word in the sentence does not matter. This is a limitation, because the same word in a different position can alter the meaning of the sentence.

For example,

1. The movie is good, not bad at all (positive sentiment)
2. The movie is bad, not good at all (negative sentiment)
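These two sentences make the limitation concrete: an order-free bag-of-words encoding assigns them identical representations, so no classifier on top of it can separate them. A quick check:

```python
from collections import Counter

s1 = "the movie is good not bad at all"   # positive
s2 = "the movie is bad not good at all"   # negative

# Same multiset of words, hence identical bag-of-words vectors.
print(Counter(s1.split()) == Counter(s2.split()))  # True
```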

Data which is temporal (series or time-sequence) in nature should be treated differently, and the sequential order associated with it should be incorporated into the model.

An RNN (Recurrent Neural Network) model takes care of this dependency.

In an RNN, a neuron takes into consideration not only the current input but also what it has learned from the inputs it received previously.

The sentences are fed to the RNN model token by token, in the same sequence as the tokens appear in the sentence. When a new token is fed to the RNN unit, the unit applies a weight to the token and also takes in the output it produced for the previous token.

With $x_{t}$ denoting the token (word embedding) at step $t$ and $y$ the sentiment of the sentence, the weights are applied through the recurrence

$$h_{t} = \tanh(W_{x} x_{t} + W_{h} h_{t-1} + b_{h})$$

and the final hidden state is mapped to the prediction

$$y = \sigma(W_{y} h_{T} + b_{y})$$

where $h_{t}$ is the hidden state after step $t$, $T$ is the number of tokens, and $\sigma$ is the sigmoid function.
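The forward pass of a simple RNN can be sketched in NumPy (tiny, assumed dimensions matching the model below, i.e. 8-dimensional embeddings and 32 hidden units; this is an illustration, not the Keras implementation):

```python
import numpy as np

# h_t = tanh(x_t @ W_x + h_{t-1} @ W_h + b): the state carries information
# from earlier tokens forward through the sequence.
rng = np.random.default_rng(0)
embed_dim, hidden_dim, steps = 8, 32, 5

W_x = rng.normal(size=(embed_dim, hidden_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for t in range(steps):
    x_t = rng.normal(size=embed_dim)      # stand-in for the word embedding at step t
    h = np.tanh(x_t @ W_x + h @ W_h + b)  # same weights reused at every step

print(h.shape)  # (32,)
```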

There are excellent articles online on understanding a simple RNN model.

In [39]:
K.clear_session()

rnn_model = Sequential()
# We specify the maximum input length to our Embedding layer.
rnn_model.add(Embedding(num_words,
                        8,
                        input_length=max_review_length))
rnn_model.add(SimpleRNN(32))
rnn_model.add(Dense(1))
rnn_model.add(Activation('sigmoid'))

rnn_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 552, 8)            80000
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                1312
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0
=================================================================
Total params: 81,345
Trainable params: 81,345
Non-trainable params: 0
_________________________________________________________________
In [40]:
callbacks_list = [ReduceLROnPlateau(monitor='val_loss',
                                    factor=0.1,
                                    patience=3),
                  EarlyStopping(monitor='val_loss',
                                patience=4),
                  ModelCheckpoint(filepath='imdb_rnn_model.h5',
                                  monitor='val_loss',
                                  save_best_only=True),
                  TensorBoard("./imdb_rnn_logs"),
                  TQDMNotebookCallback(leave_inner=True,
                                       leave_outer=True)]
In [41]:
rnn_model.compile(optimizer='adam',  # optimizer not shown in the original cell; 'adam' is assumed
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

rnn_history = rnn_model.fit(X_train,
                            y_train,
                            epochs=10,
                            batch_size=32,
                            callbacks=callbacks_list,
                            validation_split=0.3)
Train on 14000 samples, validate on 6000 samples
Epoch 1/10
14000/14000 [==============================] - 117s 8ms/step - loss: 0.6872 - acc: 0.5449 - val_loss: 0.6598 - val_acc: 0.6238
Epoch 2/10
14000/14000 [==============================] - 117s 8ms/step - loss: 0.4744 - acc: 0.7906 - val_loss: 0.4278 - val_acc: 0.8115
Epoch 3/10
14000/14000 [==============================] - 118s 8ms/step - loss: 0.3086 - acc: 0.8744 - val_loss: 0.3968 - val_acc: 0.8338
Epoch 4/10
14000/14000 [==============================] - 118s 8ms/step - loss: 0.1951 - acc: 0.9298 - val_loss: 0.4053 - val_acc: 0.8378
Epoch 5/10
14000/14000 [==============================] - 118s 8ms/step - loss: 0.1276 - acc: 0.9583 - val_loss: 0.4917 - val_acc: 0.8208
Epoch 6/10
14000/14000 [==============================] - 125s 9ms/step - loss: 0.0856 - acc: 0.9729 - val_loss: 0.5592 - val_acc: 0.8030
Epoch 7/10
14000/14000 [==============================] - 120s 9ms/step - loss: 0.0606 - acc: 0.9809 - val_loss: 0.5937 - val_acc: 0.8170
In [42]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [43]:
def plot_accuracy(hist):
    plt.plot(hist['acc'])
    plt.plot(hist['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

def plot_loss(hist):
    plt.plot(hist['loss'])
    plt.plot(hist['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()
In [44]:
plot_accuracy( rnn_history.history )

### LSTM¶

RNNs typically learn dependencies between words that are next to each other or near each other. When words that depend on each other are far apart in a sentence, RNNs are not known to learn the dependency well. For example, consider the sentence:

I grew up in france, hence beside several other languages I also speak .....

So, if the model needs to predict the word French, then it needs to remember the word France, which appears at the very beginning of the sentence.

RNNs suffer from something called exploding and vanishing gradients.

These problems are addressed by LSTM (Long Short-Term Memory) models, which are capable of learning long-term dependencies.

LSTMs enable RNNs to remember their inputs over a long period of time. This is because an LSTM holds information in a memory cell that is much like the memory of a computer: the LSTM can read, write, and delete information from its memory.
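A single LSTM step can be sketched in NumPy to show the read/write/delete gating (toy dimensions and a simplified gate layout; this is an illustration, not the Keras implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step: the gates decide what to write to, keep in, and read
# from the cell state c.
def lstm_step(x, h, c, W, U, b, hidden):
    z = x @ W + h @ U + b            # all four gate pre-activations at once
    i = sigmoid(z[:hidden])          # input gate: what to write
    f = sigmoid(z[hidden:2*hidden])  # forget gate: what to keep (or delete)
    o = sigmoid(z[2*hidden:3*hidden])  # output gate: what to read out
    g = np.tanh(z[3*hidden:])        # candidate values to write
    c = f * c + i * g                # update the memory cell
    h = o * np.tanh(c)               # expose a gated read of the memory
    return h, c

hidden, embed = 4, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(embed, 4 * hidden))
U = rng.normal(size=(hidden, 4 * hidden))
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
h, c = lstm_step(rng.normal(size=embed), h, c, W, U, b, hidden)
print(h.shape, c.shape)  # (4,) (4,)
```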

An excellent article on Understanding LSTM Networks is available here

In [57]:
from keras.layers import LSTM

K.clear_session()

lstm_model = Sequential()
# We specify the maximum input length to our Embedding layer.
lstm_model.add(Embedding(num_words,
                         32,
                         input_length=max_review_length))
lstm_model.add(LSTM(32))
lstm_model.add(Dense(1))
lstm_model.add(Activation('sigmoid'))

lstm_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 552, 32)           320000
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0
=================================================================
Total params: 328,353
Trainable params: 328,353
Non-trainable params: 0
_________________________________________________________________
In [58]:
lstm_model.compile(optimizer='adam',  # optimizer not shown in the original cell; 'adam' is assumed
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

lstm_history = lstm_model.fit(X_train,
                              y_train,
                              epochs=10,
                              batch_size=128,
                              callbacks=callbacks_list,
                              validation_split=0.3)
Train on 14000 samples, validate on 6000 samples
Epoch 1/10
14000/14000 [==============================] - 124s 9ms/step - loss: 0.5557 - acc: 0.7106 - val_loss: 0.3586 - val_acc: 0.8483
Epoch 2/10
14000/14000 [==============================] - 124s 9ms/step - loss: 0.4026 - acc: 0.8371 - val_loss: 0.3457 - val_acc: 0.8473
Epoch 3/10
14000/14000 [==============================] - 123s 9ms/step - loss: 0.2316 - acc: 0.9129 - val_loss: 0.3057 - val_acc: 0.8755
Epoch 4/10
14000/14000 [==============================] - 123s 9ms/step - loss: 0.1618 - acc: 0.9449 - val_loss: 0.3270 - val_acc: 0.8725
Epoch 5/10
14000/14000 [==============================] - 123s 9ms/step - loss: 0.1083 - acc: 0.9672 - val_loss: 0.3389 - val_acc: 0.8687
Epoch 6/10
14000/14000 [==============================] - 123s 9ms/step - loss: 0.0791 - acc: 0.9786 - val_loss: 0.3635 - val_acc: 0.8677
Epoch 7/10
14000/14000 [==============================] - 124s 9ms/step - loss: 0.0651 - acc: 0.9821 - val_loss: 0.4351 - val_acc: 0.8618
In [59]:
plot_accuracy( lstm_history.history )

### LSTM with Dropouts¶

In [67]:
from keras.layers import LSTM

K.clear_session()

lstm_model = Sequential()
# The summary below shows shape (None, None, 32), so no input_length is given here.
lstm_model.add(Embedding(num_words,
                         32))
# Dropout rates are not shown in the notebook output; 0.2 is assumed here.
lstm_model.add(LSTM(32,
                    dropout=0.2,
                    recurrent_dropout=0.2))
lstm_model.add(Dense(1))
lstm_model.add(Activation('sigmoid'))

lstm_model.summary()

lstm_model.compile(optimizer='adam',  # optimizer not shown in the original cell; 'adam' is assumed
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

lstm_history = lstm_model.fit(X_train,
                              y_train,
                              epochs=10,
                              batch_size=32,
                              callbacks=callbacks_list,
                              validation_split=0.3)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, None, 32)          320000
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0
=================================================================
Total params: 328,353
Trainable params: 328,353
Non-trainable params: 0
_________________________________________________________________
Train on 14000 samples, validate on 6000 samples
Epoch 1/10
14000/14000 [==============================] - 100s 7ms/step - loss: 0.5324 - acc: 0.7248 - val_loss: 0.4401 - val_acc: 0.7997
Epoch 2/10
14000/14000 [==============================] - 100s 7ms/step - loss: 0.3426 - acc: 0.8612 - val_loss: 0.4263 - val_acc: 0.8035
Epoch 3/10
14000/14000 [==============================] - 100s 7ms/step - loss: 0.2810 - acc: 0.8924 - val_loss: 0.4474 - val_acc: 0.8005
Epoch 4/10
14000/14000 [==============================] - 98s 7ms/step - loss: 0.2279 - acc: 0.9152 - val_loss: 0.4634 - val_acc: 0.8127
Epoch 5/10
14000/14000 [==============================] - 96s 7ms/step - loss: 0.1883 - acc: 0.9291 - val_loss: 0.4887 - val_acc: 0.8178
Epoch 6/10
14000/14000 [==============================] - 99s 7ms/step - loss: 0.1549 - acc: 0.9436 - val_loss: 0.5405 - val_acc: 0.8147
Epoch 7/10
14000/14000 [==============================] - 99s 7ms/step - loss: 0.1376 - acc: 0.9502 - val_loss: 0.5890 - val_acc: 0.8172
Epoch 8/10
14000/14000 [==============================] - 102s 7ms/step - loss: 0.1172 - acc: 0.9594 - val_loss: 0.6202 - val_acc: 0.7953
Epoch 9/10
14000/14000 [==============================] - 103s 7ms/step - loss: 0.1010 - acc: 0.9652 - val_loss: 0.6968 - val_acc: 0.8092
Epoch 10/10
14000/14000 [==============================] - 95s 7ms/step - loss: 0.1258 - acc: 0.9551 - val_loss: 0.6491 - val_acc: 0.8035
In [53]:
plot_accuracy( lstm_history.history )

### Sequence Processing with ConvNets¶

In [49]:
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, Flatten
In [50]:
K.clear_session()

model = Sequential()
model.add(Embedding(num_words,
                    8,
                    input_length=max_review_length))
# Kernel size 7 and pool size 5 are inferred from the output shapes below;
# the ReLU activations are assumed.
model.add(Conv1D(32, 7, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Conv1D(32, 7, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 552, 8)            80000
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 546, 32)           1824
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 109, 32)           0
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 103, 32)           7200
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
=================================================================
Total params: 89,057
Trainable params: 89,057
Non-trainable params: 0
_________________________________________________________________
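The layer output lengths in the summary follow from 'valid' 1D convolution and non-overlapping max pooling. A quick sketch (kernel size 7 and pool size 5 are inferred from the shapes):

```python
def conv_out(n, kernel):
    return n - kernel + 1   # 'valid' convolution adds no padding at the edges

def pool_out(n, size):
    return n // size        # non-overlapping pooling windows

n = 552
n = conv_out(n, 7); print(n)  # 546
n = pool_out(n, 5); print(n)  # 109
n = conv_out(n, 7); print(n)  # 103
```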
In [51]:
from keras.optimizers import RMSprop
In [52]:
rmsprop = RMSprop(lr=0.0001)
model.compile(optimizer=rmsprop,
loss='binary_crossentropy',
metrics=['accuracy'])

history = model.fit(X_train,
y_train,
epochs=10,
batch_size=128,
validation_split=0.2)
Train on 16000 samples, validate on 4000 samples
Epoch 1/10
16000/16000 [==============================] - 9s 564us/step - loss: 1.3781 - acc: 0.4978 - val_loss: 0.8659 - val_acc: 0.4943
Epoch 2/10
16000/16000 [==============================] - 9s 550us/step - loss: 0.7350 - acc: 0.5091 - val_loss: 0.6901 - val_acc: 0.5605
Epoch 3/10
16000/16000 [==============================] - 9s 554us/step - loss: 0.6844 - acc: 0.5881 - val_loss: 0.6845 - val_acc: 0.5740
Epoch 4/10
16000/16000 [==============================] - 9s 557us/step - loss: 0.6770 - acc: 0.6389 - val_loss: 0.6786 - val_acc: 0.6498
Epoch 5/10
16000/16000 [==============================] - 9s 561us/step - loss: 0.6686 - acc: 0.6778 - val_loss: 0.6708 - val_acc: 0.6388
Epoch 6/10
16000/16000 [==============================] - 9s 570us/step - loss: 0.6568 - acc: 0.7074 - val_loss: 0.6591 - val_acc: 0.6887
Epoch 7/10
16000/16000 [==============================] - 9s 580us/step - loss: 0.6396 - acc: 0.7349 - val_loss: 0.6410 - val_acc: 0.6900
Epoch 8/10
16000/16000 [==============================] - 9s 583us/step - loss: 0.6122 - acc: 0.7558 - val_loss: 0.6120 - val_acc: 0.7160
Epoch 9/10
16000/16000 [==============================] - 9s 588us/step - loss: 0.5736 - acc: 0.7726 - val_loss: 0.5732 - val_acc: 0.7482
Epoch 10/10
16000/16000 [==============================] - 9s 586us/step - loss: 0.5246 - acc: 0.7896 - val_loss: 0.5320 - val_acc: 0.7655

### Saving the Model¶

To apply the model for making predictions in the future, we need to save the following artifacts. For example, to reuse the model lstm_model we need:

1. The model object (it includes the model parameters)
2. The tokenizer (this contains the words and their indexes used to build the model; a new sentence must be encoded with the same indexes)
3. max_review_length
4. The padding and truncating strategy we used.
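The save/load round trip for items 2 and 3 can be sketched with pickle (a plain dict stands in for the real Tokenizer object so the example is self-contained; the file name is hypothetical):

```python
import pickle

# Stand-in for the real preprocessing details (illustration only).
details = {'max_review_length': 552,
           'tokenizer_word_index': {'the': 1, 'movie': 2}}

with open('word_preprocess_demo.pkl', 'wb') as f:
    pickle.dump(details, f)

with open('word_preprocess_demo.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored['max_review_length'])  # 552
```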
In [60]:
lstm_model.save_weights( 'lstm_model_file.h5' )
In [61]:
word_preprocess_details = {'max_review_length': max_review_length,
                           'tokenizer': tokenizer}
In [62]:
import pickle
In [66]:
with open('word_preprocess.pkl', 'wb') as f:
    pickle.dump(word_preprocess_details, f)
In [68]:
!ls -al *.h5
-rw-r--r--  1 manaranjan  staff   1836984 Jun  6 14:10 imdb_model.h5
-rw-r--r--  1 manaranjan  staff  36288576 Jun  7 10:41 imdb_pretrained_model.h5
-rw-r--r--  1 manaranjan  staff   3965720 Jun  7 13:52 imdb_rnn_model.h5
-rw-r--r--  1 manaranjan  staff   1329512 Jun  7 14:11 lstm_model_file.h5
In [69]:
!ls -al word_preprocess.pkl
-rw-r--r--  1 manaranjan  staff  3601056 Jun  7 14:15 word_preprocess.pkl

### Conclusion:¶

Through tutorials Deep Learning 4 and 5, we explored the steps of:

• Preprocessing text documents using Bag of Words and Embedding models
• Using an RNN model to predict sentiment from sequences of words
• Using an LSTM model to incorporate longer-range dependencies
• Using a Conv1D model to build higher-level features