RNN for sequence prediction

In fact I’m trying to (re)implement Keras’s text generation example in PyTorch. In Keras’s recurrent layers, there is

  • return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.

and in the example this is set to False, so I think I only need to take the last output.

I’m not sure, I don’t know Keras. I’m just pointing it out (it might be easier to do x[-1] to achieve the same thing).

If you have the full code available somewhere I can take a look.

OK, thanks. Does

x = x[-1] i.e. x = x.select(0, maxlen-1).contiguous()

interfere with backpropagation?

I uploaded my code here

How would they interfere? They both should be ok.
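
For example, here is a quick check (a sketch with made-up sizes, not from the thread) showing that the two expressions select the same time step and that gradients flow through the slice:

import torch

maxlen, batch_size, hidden_size = 5, 4, 8
x = torch.randn(maxlen, batch_size, hidden_size, requires_grad=True)

a = x[-1]                                  # last time step via indexing
b = x.select(0, maxlen - 1).contiguous()   # same slice via select
assert torch.equal(a, b)

a.sum().backward()                         # autograd handles the slicing fine
print(x.grad[-1].sum())                    # only the last time step receives gradient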

I’m not certain, but since I use only the last output, I thought this might have a bad effect on backprop.
I’ll check Keras again. Thank you.

Finally I found that I had misused the loss function torch.nn.CrossEntropyLoss. I changed the loss to nn.NLLLoss()(log_softmax(output), target), and now the loss decreases as expected.
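
For anyone hitting the same issue: CrossEntropyLoss already applies log_softmax internally, so a model that outputs softmax probabilities and then feeds them to CrossEntropyLoss applies it twice. A minimal sketch (with made-up shapes) of the two equivalent setups:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)          # (batch, num_classes), raw scores from the model
target = torch.randint(0, 10, (4,))  # class indices

# Option A: raw logits + CrossEntropyLoss (log_softmax is applied inside)
loss_a = nn.CrossEntropyLoss()(logits, target)

# Option B: log_softmax in the model + NLLLoss
loss_b = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

print(loss_a.item(), loss_b.item())  # the two values match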

And you removed the softmax from the module, right?

Right. So now,

class Net(nn.Module):
    ...
    def forward(self, x, hidden):
        x, hidden = self.rnn1(x, hidden)
        x = x.select(0, maxlen-1).contiguous()  # keep only the last time step
        x = x.view(-1, hidden_size)
        x = F.relu(self.dense1(x))
        x = F.log_softmax(self.dense2(x), dim=1)
        return x, hidden
...
criterion = nn.NLLLoss()
...
def train():
    model.train()
    hidden = model.init_hidden()
    for epoch in range(len(sentences) // batch_size):
        X_batch = var(torch.FloatTensor(X[:, epoch*batch_size: (epoch+1)*batch_size, :]))
        y_batch = var(torch.LongTensor(y[epoch*batch_size: (epoch+1)*batch_size]))
        model.zero_grad()
        output, hidden = model(X_batch, var_pair(hidden))
        loss = criterion(output, y_batch)
        loss.backward()
        optimizer.step()

Yup, that looks good! Note that you can now pass in hidden = None in the first iteration. The RNN will initialize a zero-filled hidden state for you. You might need to update PyTorch, though.
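
For example (a standalone sketch with a bare nn.RNN rather than the full model above):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20)
x = torch.randn(6, 3, 10)     # (seq_len, batch, input_size)

out, hidden = rnn(x, None)    # None -> zero-filled initial hidden state
out2, hidden2 = rnn(x)        # omitting it entirely does the same thing
print(hidden.shape)           # torch.Size([1, 3, 20])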

I have a question about the number of parameters in an RNN. I defined an RNN layer and got its parameters. I thought the number of parameters in an RNN layer would differ for different input lengths. However, when I use parameters() to inspect it, the count seems to be the same as for an RNN layer with only one time step.

How should I understand this? Thank you!

Your model is going to be the same, whatever the length of your input is.
In Torch we used to clone the model as many times as there were time steps while sharing the parameters, because it is the same model, just unrolled over time.
The number of parameters changes when your input dimensionality changes (the size of x[t], for a given t = 1, ..., T), not when T changes.
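
To make that concrete, a quick sketch with arbitrary sizes:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20)
print(sum(p.numel() for p in rnn.parameters()))   # fixed, regardless of sequence length

# The same module handles sequences of any length T
out5, _ = rnn(torch.randn(5, 3, 10))      # T = 5
out50, _ = rnn(torch.randn(50, 3, 10))    # T = 50

# Only changing the input dimensionality changes the parameter count
wider = nn.RNN(input_size=50, hidden_size=20)
print(sum(p.numel() for p in wider.parameters()))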

If it is still not clear, you can go over my lectures on RNNs (ref.).
And if it is still confusing, wait for the PyTorch video tutorials I’m currently working on.

I see. Thank you very much!

Hi,

Sorry for reopening this topic. I also just moved to PyTorch from Keras, and I am super confused about how RNNs work.
In particular:

  1. I don’t understand what ‘batch’ means in the context of PyTorch.
  2. Since RNNs can accept variable-length sequences, can someone please give a small example of this?
  3. What is the difference between an RNN cell and an RNN?
    http://pytorch.org/docs/nn.html#torch.nn.RNNCell
    http://pytorch.org/docs/nn.html#rnn
  4. For the RNN cell, why does the documentation say the input is input (batch, input_size), while in the example given in the documentation the input is input = Variable(torch.randn(6, 3, 10))?

Thank you

1 & 4)
batch is the number of samples within the minibatch

The dimension corresponding to the batch varies, depending on the batch_first argument of the RNN modules.

By default you process samples with shape (timesteps, batch_samples, input_size), while with batch_first=True the RNN expects sequences with shape (batch_samples, timesteps, input_size), just like Keras does.
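
For instance (a small sketch with arbitrary sizes):

import torch
import torch.nn as nn

timesteps, batch_samples, input_size, hidden_size = 6, 3, 10, 20

# Default layout: (timesteps, batch_samples, input_size)
rnn = nn.RNN(input_size, hidden_size)
out, _ = rnn(torch.randn(timesteps, batch_samples, input_size))
print(out.shape)     # torch.Size([6, 3, 20])

# Keras-like layout: (batch_samples, timesteps, input_size)
rnn_bf = nn.RNN(input_size, hidden_size, batch_first=True)
out_bf, _ = rnn_bf(torch.randn(batch_samples, timesteps, input_size))
print(out_bf.shape)  # torch.Size([3, 6, 20])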


2)
Check the documentation for pack_padded_sequence and pad_packed_sequence.

Basically you have to pass your input as a PackedSequence, which contains the sequence length information, and PyTorch’s native RNN modules will deal with the variable lengths without the need for explicit masking.
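
Roughly like this (a minimal sketch, with padded random data standing in for real sequences; lengths must be in decreasing order here):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences of lengths 5 and 3, zero-padded to length 5
padded = torch.randn(5, 2, 10)    # (max_seq_len, batch, input_size)
lengths = [5, 3]

packed = pack_padded_sequence(padded, lengths)   # PackedSequence carries the lengths
rnn = nn.RNN(10, 20)
packed_out, hidden = rnn(packed)                 # no explicit masking needed
out, out_lengths = pad_packed_sequence(packed_out)
print(out.shape, out_lengths)                    # torch.Size([5, 2, 20]) tensor([5, 3])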

3)
RNNCell does the forward pass for a single time step of a sequence.
RNN applies the RNNCell forward pass to every time step of an input sequence -> this is your traditional RNN.
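
Schematically (a sketch; the two modules below have independently initialized weights, so their outputs will not match numerically):

import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 6, 3, 10, 20
x = torch.randn(seq_len, batch, input_size)

# RNNCell: one forward call per time step, with the loop written by hand
cell = nn.RNNCell(input_size, hidden_size)
h = torch.zeros(batch, hidden_size)
outputs = []
for t in range(seq_len):
    h = cell(x[t], h)
    outputs.append(h)

# RNN: the loop over time steps is done for you
rnn = nn.RNN(input_size, hidden_size)
out, h_n = rnn(x)    # out has shape (seq_len, batch, hidden_size)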

@miguelvr Thank you for your reply.
Just to make sure I understand: you are talking about the RNN layer
http://pytorch.org/docs/nn.html#rnn
where you say that input (seq_len, batch, input_size) is equivalent to input (timesteps, batch, input_size). Am I correct?

you’re correct…

I edited my previous post to answer your other questions.

Thank you very much @miguelvr , this is much clearer now

Hi, I have a small question related to @osm3000’s q3:
I am trying to figure out the difference and relation between RNN and RNNCell (or LSTM and LSTMCell)…
Say we assume that we only have one layer. According to @miguelvr’s answer and the documentation,
it seems like LSTMCell (or RNNCell) lets me process each time step separately,
while with LSTM (or RNN) we put the entire input sequence in and get all the outputs back.

If the above understanding is correct, then my question is why I get totally different results when I try to process a sequence of data.
I followed the simple example in the documentation, i.e.

rnn = nn.LSTMCell(10, 20)
input = Variable(torch.randn(6, 3, 10))
hx = Variable(torch.randn(3, 20))
cx = Variable(torch.randn(3, 20))
output = []
for i in range(6):
    hx, cx = rnn(input[i], (hx, cx))
    output.append(hx)

and I also used the same set of data (I did not generate new random data, but reused the same data), and put the entire input into the LSTM as follows:

lstm = nn.LSTM(10, 20)  # num_layers = 1

and I compared the output results from both strategies,
but they gave me totally different results…
I am wondering whether this is because of the underlying implementation, or whether I am using it wrongly?

thank you!

Hi @jdily, I encountered the same problem! Did you figure it out? I posted my problem here.

Don’t we just do x[:,:,-1] to get the last output and then pass it to a Dense layer?