Monday, 11 July 2016

Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)

Recurrent Neural Network (RNN)
In a CNN, we applied the same filter in space; that is, we applied the same filter over different areas of the image. We do something similar in an RNN: if we know the sequence of events is stationary over time, we can apply the same weights W at every time step.

If we are training a network on a sequence of data, it is a good idea to feed a summary of the past into the classifier along with the current input. Keeping the full history at every step would make the model very complicated, especially for very deep networks. Instead, each step has a single recurrent connection to the previous step, carrying information from the immediate past.
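As a rough sketch of that structure (my own illustration, not from the original notes; the names W_xh, W_hh and h are just placeholders), a vanilla RNN reuses the same weights at every time step and passes a single hidden state forward as the "summary of the past":

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence, reusing the SAME weights at every step.

    inputs: list of input vectors x_t, one per time step
    W_xh:   input-to-hidden weights  (shared across time)
    W_hh:   hidden-to-hidden weights (shared across time)
    b_h:    hidden bias
    """
    h = np.zeros(W_hh.shape[0])          # summary of the past, starts empty
    hidden_states = []
    for x_t in inputs:
        # the previous hidden state carries the information from the immediate past
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

# tiny usage example with random data
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]                      # 5 time steps, 3-dim inputs
W_xh, W_hh, b_h = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
states = rnn_forward(xs, W_xh, W_hh, b_h)
```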

In theory this should give us a good, stable model, but it is not friendly to SGD. When we back-propagate through time, all of the gradients try to modify the same W, which makes the updates highly correlated. This leads to either exploding or vanishing gradients.
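A quick toy illustration of why this happens (my own example, not part of the original notes): back-propagating through time multiplies the gradient by roughly the same recurrent matrix at every step, so its norm grows or shrinks geometrically depending on that matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=4)

for scale, label in [(1.2, "exploding"), (0.5, "vanishing")]:
    W_hh = scale * np.eye(4)        # toy recurrent matrix; scale sets its spectral radius
    g = grad.copy()
    for _ in range(50):             # 50 steps of backprop through time
        g = W_hh.T @ g              # the SAME matrix is applied at every step
    print(label, np.linalg.norm(g)) # huge when scale > 1, nearly zero when scale < 1
```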



Exploding gradients can be handled by gradient clipping: we rescale (normalize) the update whenever its norm grows too large. Vanishing gradients are harder: the classifier gradually forgets the older parts of the sequence and only remembers the recent ones.
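A minimal sketch of clipping by norm (my wording; the threshold max_norm is just an example value):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient so its norm never exceeds max_norm (gradient clipping)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # shrink the magnitude, keep the direction
    return grad

# usage: clip before applying the SGD update
# W -= learning_rate * clip_gradient(dW)
```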




Long Short-Term Memory (LSTM)
This is where the LSTM comes into action. We replace the W in the vanilla network with a memory cell; the goal is to help the RNN remember better. A memory cell must be able to do three things: write new information, read it back out, and forget what is no longer needed.

So we replace W with such a memory cell. At each cell (node of the network), we decide whether to write, read, or forget. In effect, the probabilities of writing, reading, and forgetting at each cell become the parameters we tune this time.

We could make these gates binary, 0 or 1, to indicate open or closed. Instead, we use a continuous (sigmoid) function for each gate, so that the gates are differentiable and we can back-propagate through them.
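As a sketch of what one step of such a memory cell looks like (standard LSTM equations, not code from the original post; W and b here jointly produce all four gate pre-activations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenation [x_t, h_prev] to four gate pre-activations.

    The sigmoids are the "soft", differentiable versions of the 0/1 open/closed gates.
    """
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(z, 4)       # write (input), forget, read (output), candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)                    # new candidate content to write into the memory
    c = f * c_prev + i * g            # forget part of the old memory, write some new memory
    h = o * np.tanh(c)                # read: expose part of the memory as the output
    return h, c

# tiny usage example
rng = np.random.default_rng(0)
X, H = 3, 4
W, b = rng.normal(size=(4 * H, X + H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in [rng.normal(size=X) for _ in range(5)]:
    h, c = lstm_step(x_t, h, c, W, b)
```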


All these little gates help the model remember longer when it needs to and ignore the past when it should not. So now we are tuning all of these gate weights W.
LSTM Regularization:
L2 regularization can always be used. Dropout also works fine, as long as we apply it only to the inputs and outputs of the LSTM and not to the recurrent connections that carry the past into the future.
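A rough sketch of that idea, reusing the lstm_step sketch above (the helper names and keep_prob value are illustrative, not from the original post); the dropout masks touch only the cell's input and output, never the recurrent h and c:

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Standard inverted dropout: zero units with probability 1 - keep_prob, rescale the rest."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

def lstm_forward_with_dropout(inputs, W, b, H, keep_prob=0.8, seed=0):
    rng = np.random.default_rng(seed)
    h, c = np.zeros(H), np.zeros(H)
    outputs = []
    for x_t in inputs:
        x_t = dropout(x_t, keep_prob, rng)           # dropout on the input: fine
        h, c = lstm_step(x_t, h, c, W, b)            # recurrent h and c are NOT dropped
        outputs.append(dropout(h, keep_prob, rng))   # dropout on the output: fine
    return outputs
```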




