RNN for sequence prediction

Hello,

Previously I used Keras for CNNs, so I am a newbie at both PyTorch and RNNs. In Keras you can write a script for an RNN for sequence prediction like this:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

in_out_neurons = 1
hidden_neurons = 300
# length_of_sequences is defined elsewhere (the fixed input length Keras expects)

model = Sequential()
model.add(LSTM(hidden_neurons, batch_input_shape=(None, length_of_sequences, in_out_neurons), return_sequences=False))
model.add(Dense(in_out_neurons))
model.add(Activation("linear"))

but when it comes to PyTorch I don't know how to implement it. I directly translated the code above into the code below, but it doesn't work.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.rnn1 = nn.GRU(input_size=seq_len,
                            hidden_size=128,
                            num_layers=1)
        self.dense1 = nn.Linear(128, 1)

    def forward(self, x, hidden):
        x, hidden = self.rnn1(x, hidden)
        x = self.dense1(x)
        return x, hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        return Variable(weight.new(128, batch_size, 1).zero_())

How can I implement something like the Keras code? Thank you.

The input_size argument to any RNN says how many features there will be for each step in the sequence, not what its length is going to be. Keras uses static graphs, so it needs to know the length of the sequence upfront; PyTorch has dynamic autodifferentiation, so it doesn't care about the sequence length - you can use a different one at every iteration.
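
For example, a minimal sketch (the sizes here are made up) showing one GRU with input_size=1 consuming sequences of two different lengths:

import torch
import torch.nn as nn
from torch.autograd import Variable

rnn = nn.GRU(input_size=1, hidden_size=128, num_layers=1)
h0 = Variable(torch.zeros(1, 4, 128))        # (num_layers, batch, hidden_size)

x_short = Variable(torch.randn(5, 4, 1))     # (seq_len=5, batch=4, input_size=1)
x_long = Variable(torch.randn(20, 4, 1))     # (seq_len=20, batch=4, input_size=1)

out_short, h = rnn(x_short, h0)              # out_short: (5, 4, 128)
out_long, h = rnn(x_long, h0)                # out_long: (20, 4, 128)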

See the GRU docs for more details on the arguments.

Apart from this, your module looks good to me!

Thank you for your quick response, but the word features in the context of RNNs is still unclear to me. The GRU docs say,

input : A (seq_len x batch x input_size) tensor containing the features of the input sequence.

and

input_size – The number of expected features in the input x

For example, if you input a sequence

[[[ 0.1,  0.2]],
 [[ 0.1,  0.2]],
 [[ 0.3,  0.1]]]

then seq_len is 3, batch is 1, and input_size, i.e. features, is 2?

Correct. features or input_size says how many dimensions each data point has.
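
A quick sketch to check those shapes (the hidden size here is arbitrary):

import torch
import torch.nn as nn
from torch.autograd import Variable

x = Variable(torch.Tensor([[[0.1, 0.2]],
                           [[0.1, 0.2]],
                           [[0.3, 0.1]]]))   # (seq_len=3, batch=1, input_size=2)

rnn = nn.GRU(input_size=2, hidden_size=4, num_layers=1)
h0 = Variable(torch.zeros(1, 1, 4))          # (num_layers, batch, hidden_size)
out, h = rnn(x, h0)
print(out.size())                            # (3, 1, 4): one output per time step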

Thanks a lot for your help; finally, the code below works:

import torch
import torch.nn as nn
from torch.autograd import Variable

features = 1
seq_len = 10
hidden_size = 128
batch_size = 32

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.rnn1 = nn.GRU(input_size=features,
                            hidden_size=hidden_size,
                            num_layers=1)
        self.dense1 = nn.Linear(hidden_size, 1)

    def forward(self, x, hidden):
        x, hidden = self.rnn1(x, hidden)
        x = x.select(0, seq_len-1).contiguous()  # take the output of the last time step
        x = x.view(-1, hidden_size)
        x = self.dense1(x)
        return x, hidden

    def init_hidden(self):
        weight = next(self.parameters()).data
        return Variable(weight.new(1, batch_size, hidden_size).zero_())

model = Net()
model.cuda()
hidden = model.init_hidden()

X_train_1 = X_train[0:batch_size].reshape(seq_len, batch_size, features)  # X_train / y_train are prepared elsewhere (not shown)
y_train_1 = y_train[0:batch_size]
model.zero_grad()
T = torch.Tensor
X_train_1, y_train_1 = T(X_train_1), T(y_train_1)
X_train_1, y_train_1 = Variable(X_train_1).cuda(), Variable(y_train_1).cuda()

output, hidden = model(X_train_1, Variable(hidden.data))

Thanks for your help. As I wrote above, the script “works”, in the sense that it runs, but the loss doesn't decrease over the epochs, so please give me some advice. I think the related parts are:

class Net(nn.Module):
    def __init__(self, features, cls_size):
        super(Net, self).__init__()
        self.rnn1 = nn.GRU(input_size=features,
                            hidden_size=hidden_size,
                            num_layers=1)
        self.dense1 = nn.Linear(hidden_size, cls_size)

    def forward(self, x, hidden):
        x, hidden = self.rnn1(x, hidden)
        x = x.select(0, maxlen-1).contiguous()
        x = x.view(-1, hidden_size)
        x = F.softmax(self.dense1(x))
        return x, hidden

    def init_hidden(self, batch_size=batch_size):
        weight = next(self.parameters()).data
        return Variable(weight.new(1, batch_size, hidden_size).zero_())

def var(x):
    x = Variable(x)
    if cuda:
        return x.cuda()
    else:
        return x

model = Net(features=features, cls_size=len(chars))
if cuda:
    model.cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

def train():
    model.train()
    hidden = model.init_hidden()
    for epoch in range(len(sentences) // batch_size):  # note: this loops over batches, not epochs
        X_batch = var(torch.FloatTensor(X[:, epoch*batch_size: (epoch+1)*batch_size, :]))
        y_batch = var(torch.LongTensor(y[epoch*batch_size: (epoch+1)*batch_size]))
        model.zero_grad()
        output, hidden = model(X_batch, var(hidden.data))
        loss = criterion(output, y_batch)
        loss.backward()
        optimizer.step()

for epoch in range(nb_epochs):
    train()

The input is a one-hot vector, and I tried changing the learning rate, but the result is the same.

I'm not sure; it's hard to spot bugs in code that you can't run. Why do you do this:

x = x.select(0, maxlen-1).contiguous()

Don’t you want to return predictions for the whole sequence? It seems to me that you’re only taking the last output.

In fact, I'm trying to (re)implement Keras's text generation example in PyTorch. In Keras's recurrent layers, there is

  • return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.

and in the example, this is false, so I think taking only the last output is needed.

I’m not sure, I don’t know Keras. I’m just pointing it out (it might be easier to do x[-1] to achieve the same thing).
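
For example, a tiny sketch (assuming the GRU output has shape (seq_len, batch, hidden_size)):

import torch
from torch.autograd import Variable

out = Variable(torch.randn(10, 32, 128))     # (seq_len, batch, hidden_size)
last_a = out[-1]                             # (batch, hidden_size): output of the final step
last_b = out.select(0, out.size(0) - 1)      # same values, written with select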

If you have the full code available somewhere I can take a look.

OK, thanks. Does

x = x[-1] i.e. x = x.select(0, maxlen-1).contiguous()

interfere with backpropagation?

I uploaded my code here

How would they interfere? They both should be ok.

I'm not certain, but since I use only the last output, I think this may have a bad influence on backprop.
I'll check the Keras example again. Thank you.

Finally I found that I had misused the loss function torch.nn.CrossEntropyLoss. I changed the loss to nn.NLLLoss, applied to log_softmax(output) and the target, and now the loss decreases as expected.
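
For reference, nn.CrossEntropyLoss already combines a log-softmax with nn.NLLLoss, so the extra softmax inside the model was throwing it off. A quick sketch with made-up logits showing that the two formulations give the same loss:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

logits = Variable(torch.randn(4, 10))                  # (batch, num_classes), raw scores
target = Variable(torch.LongTensor([1, 0, 3, 7]))

loss_a = nn.CrossEntropyLoss()(logits, target)         # expects raw logits
loss_b = nn.NLLLoss()(F.log_softmax(logits), target)   # expects log-probabilities
# loss_a and loss_b give the same value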

And you removed the softmax from the module, right?

Right. So now,

class Net(nn.Module):
    ...
    def forward(self, x, hidden):
        x, hidden = self.rnn1(x, hidden)
        x = x.select(0, maxlen-1).contiguous()
        x = x.view(-1, hidden_size)
        x = F.relu(self.dense1(x))
        x = F.log_softmax(self.dense2(x))
        return x, hidden
...
criterion = nn.NLLLoss()
...
def train():
    model.train()
    hidden = model.init_hidden()
    for epoch in range(len(sentences) // batch_size):  # note: this loops over batches, not epochs
        X_batch = var(torch.FloatTensor(X[:, epoch*batch_size: (epoch+1)*batch_size, :]))
        y_batch = var(torch.LongTensor(y[epoch*batch_size: (epoch+1)*batch_size]))
        model.zero_grad()
        output, hidden = model(X_batch, var_pair(hidden))  # var_pair: helper from the full script (not shown here)
        loss = criterion(output, y_batch)
        loss.backward()
        optimizer.step()

Yup, that looks good! Note that you can now pass in hidden = None in the first iteration; the RNN will initialize a zero-filled hidden state for you. You might need to update PyTorch, though.
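
For example, a minimal sketch with a recent PyTorch (the sizes are arbitrary):

import torch
import torch.nn as nn
from torch.autograd import Variable

rnn = nn.GRU(input_size=1, hidden_size=128, num_layers=1)
x = Variable(torch.randn(10, 32, 1))   # (seq_len, batch, input_size)

out, hidden = rnn(x)                   # no initial hidden state given: defaults to zeros
out, hidden = rnn(x, hidden)           # later calls can reuse the returned hidden state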

I have a question about the number of parameters in an RNN. I defined an RNN layer and got its parameters. I thought the number of parameters in an RNN layer should differ for different input lengths. However, when I use parameters() to inspect them, the number of parameters seems to be the same as that of an RNN layer with only one time step.

How to understand this fact? Thank you!

Your model is going to be the same whatever the length of your input is.
In Torch we used to clone the model as many times as there were time steps, while sharing the parameters, because it is the same model, just unrolled over time.
The number of parameters will change when your input dimensionality changes (the size of x[t], for a given t = 1, ..., T), not when T changes.
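
For example, a quick sketch (the sizes are arbitrary):

import torch
import torch.nn as nn
from torch.autograd import Variable

def count_params(m):
    return sum(p.data.numel() for p in m.parameters())

rnn_a = nn.GRU(input_size=10, hidden_size=20, num_layers=1)
rnn_b = nn.GRU(input_size=50, hidden_size=20, num_layers=1)
print(count_params(rnn_a))   # depends on input_size and hidden_size ...
print(count_params(rnn_b))   # ... so this one is larger

# the same rnn_a handles any T with the same parameters
h0 = Variable(torch.zeros(1, 3, 20))
out_short, _ = rnn_a(Variable(torch.randn(5, 3, 10)), h0)     # T = 5
out_long, _ = rnn_a(Variable(torch.randn(100, 3, 10)), h0)    # T = 100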

If it is still not clear, you can go over my lectures on RNNs (ref.).
And if it is still confusing, wait for the PyTorch video tutorials I’m currently working on.

I see. Thank you very much!

Hi,

Sorry for reopening this topic. I also just moved to PyTorch from Keras, and I am super confused about how RNNs work.
I am confused about:

  1. I don't understand what 'batch' means in the context of PyTorch.
  2. Since RNNs can accept variable-length sequences, can someone please give a small example of this?
  3. What is the difference between an RNN cell and an RNN?
    http://pytorch.org/docs/nn.html#torch.nn.RNNCell
    http://pytorch.org/docs/nn.html#rnn
  4. For the RNN cell, why does the documentation say the input is input (batch, input_size), while in the example given in the documentation the input is input = Variable(torch.randn(6, 3, 10))?

Thank you