Understanding RNN

I have not had much experience with RNNs, so I have been looking at some examples in the PyTorch repository, and I have a question about the example provided here: https://github.com/pytorch/examples/blob/master/word_language_model/model.py

In this example, the RNNModel forward function looks as follows:

def forward(self, input, hidden):
    emb = self.drop(self.encoder(input))
    output, hidden = self.rnn(emb, hidden)
    output = self.drop(output)
    decoded = self.decoder(output)
    return decoded, hidden

I have 2 questions:

1: Why is the output being used in the decoder step? I thought the decoder should take the hidden state as input? Should this not be self.decoder(hidden)?

2: Assuming there is an explanation for (1), why is dropout usually applied to the output? I am guessing the output here is a tensor, so how is dropout applied to a tensor? I thought dropout was a property of the weights?

I hope my questions make some sense.

  1. output will contain the output features of the last layer in your RNN for all time steps, while hidden will contain the hidden state for the last time step only.
    This small code example shows the equality:
import torch
import torch.nn as nn

rnn = nn.RNN(10, 20)
input = torch.randn(5, 3, 10)
h0 = torch.randn(1, 3, 20)
output, hn = rnn(input, h0)

print((output[-1] == hn).all())
> tensor(True)
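As a small caveat (my own addition, not part of the original example): with a stacked RNN this equality only holds against the last layer's hidden state, i.e. `output[-1] == hn[-1]`, since `output` only contains the last layer's features while `hn` has one entry per layer:

```python
import torch
import torch.nn as nn

# Sketch: with num_layers=2, output still holds only the last layer's
# features for all time steps, while hn holds the final hidden state
# of every layer.
rnn = nn.RNN(10, 20, num_layers=2)
input = torch.randn(5, 3, 10)
h0 = torch.randn(2, 3, 20)  # (num_layers, batch, hidden_size)
output, hn = rnn(input, h0)

print(output.shape)  # torch.Size([5, 3, 20]) - all steps, last layer
print(hn.shape)      # torch.Size([2, 3, 20]) - last step, all layers
print((output[-1] == hn[-1]).all())  # tensor(True)
```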
  2. Dropout will drop output activations randomly, as shown here:
import torch
import torch.nn as nn
import torch.nn.functional as F

lin = nn.Linear(10, 10)
x = torch.randn(1, 10)
out = lin(x)
print(F.dropout(out, 0.5))
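To make the effect easier to see, here is a small sketch (my own, using a tensor of ones) showing that dropout zeroes activations at random and rescales the survivors by 1/(1-p), and that it becomes a no-op when `training=False` (as in `model.eval()`):

```python
import torch
import torch.nn.functional as F

# Dropout on a tensor of ones: surviving entries are scaled to 2.0
# (1 / (1 - 0.5)), the rest are zeroed.
torch.manual_seed(0)
x = torch.ones(1, 10)
out = F.dropout(x, p=0.5, training=True)
print(out)  # a mix of 0.0 and 2.0 entries

# With training=False dropout does nothing:
print(F.dropout(x, p=0.5, training=False))  # unchanged ones
```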

Thank you for your answer! This really clarifies a few things for me. Many thanks.

Looking at the code again last night, one more thing stuck out. Usually these autoencoder-like models have one RNN for encoding and another one for decoding (reconstruction). However, in this code, the decoder is just a linear layer. I wonder why that is?

self.decoder = nn.Linear(nhid, ntoken)
decoded = self.decoder(output)

Why should the reconstruction be a linear combination of the output? I am trying to figure out what this PyTorch example is trying to achieve with this. I also ask because I have seen similar things in RNN-based anomaly detection but am confused about the role of this linear decoder layer.

I’m not sure about the architecture decision for this model and I would guess @vdw might know more about this particular decoding strategy. :confused:

However, if you are looking for an encoder and decoder model with RNNs in both parts, have a look at the Seq2Seq tutorial :slight_smile:


Well, I had a look at the code. While I’m not familiar with this setup for a Language Model (LM) either, a look at the training data made it a bit clearer to me. Still, everything that follows is not much more than educated guesses.

Most fundamentally, an LM aims to predict the next word given an input sequence of words. Most LM architectures reflect this “directly” in the training data. For example, given a long document like “A B C D E F G H I J K L M N O P …”, the training data (input => output) is generated as follows (assuming an input sequence length of 5):

A B C D E => F
B C D E F => G
C D E F G => H
D E F G H => I
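This sliding-window layout can be sketched in a few lines (my own illustration of the idea above, not code from the example):

```python
# Each input window of seq_len tokens is paired with the single
# token that follows it (many-to-one training pairs).
tokens = "A B C D E F G H I".split()
seq_len = 5
pairs = [(tokens[i:i + seq_len], tokens[i + seq_len])
         for i in range(len(tokens) - seq_len)]
for inp, target in pairs:
    print(" ".join(inp), "=>", target)
# A B C D E => F
# B C D E F => G
# ...
```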

So yeah, in this case you would simply feed the input through the RNN and use the hidden state as input for the final linear layer, e.g.:

output, hidden = self.rnn(emb, hidden)
# optional dropout or what not
decoded = self.decoder(hidden)

That is essentially just an RNN-based classifier where the classes are all the words in the vocabulary.
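A minimal, self-contained sketch of such a many-to-one classifier (the sizes and layer names here are my own placeholders, not the example's actual values):

```python
import torch
import torch.nn as nn

# Many-to-one: only the final hidden state feeds the linear layer,
# so we get exactly one vocabulary prediction per input sequence.
nhid, ntoken, emb_dim = 20, 100, 10
rnn = nn.RNN(emb_dim, nhid)
decoder = nn.Linear(nhid, ntoken)

emb = torch.randn(5, 3, emb_dim)   # (seq_len, batch, emb_dim)
output, hidden = rnn(emb)          # hidden: (num_layers, batch, nhid)
decoded = decoder(hidden[-1])      # last layer's final hidden state
print(decoded.shape)  # torch.Size([3, 100]) - one score row per sequence
```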

The architecture here works differently. If I understand example/main.py correctly, the training data is generated differently:

A B C D E => B C D E F

The structure of subsequent training examples depends on the batch size; check the methods batchify() and get_batch(). Anyway, the output sequence is the input sequence shifted by one word to the left.
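The shifted-target idea can be sketched like this (my own simplification, not the actual batchify()/get_batch() code):

```python
import torch

# The target sequence is the input sequence shifted one position
# to the left, so every position has a "next word" label.
data = torch.arange(10)  # stand-in for a stream of token ids
seq_len = 5
i = 0
inp = data[i:i + seq_len]             # tensor([0, 1, 2, 3, 4])
target = data[i + 1:i + 1 + seq_len]  # tensor([1, 2, 3, 4, 5])
print(inp, target)
```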

Now they treat the whole training as a sequence labeling task, which is kind of a prediction for/after each time step and not just the last. As @ptrblck said, output[i] is the hidden state of step i. I always use the figure below to check whether I am using the right return values of an LSTM/GRU:
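This also answers why a plain nn.Linear suffices as the decoder: it acts on the last dimension, so applying it to the full output tensor produces a vocabulary prediction at every time step. A short sketch (shapes are my own placeholders):

```python
import torch
import torch.nn as nn

# nn.Linear operates on the last dimension, so the decoder yields
# a score over the vocabulary for every time step and batch element.
seq_len, batch, nhid, ntoken = 5, 3, 20, 100
output = torch.randn(seq_len, batch, nhid)  # stand-in for rnn output
decoder = nn.Linear(nhid, ntoken)
decoded = decoder(output)                   # (seq_len, batch, ntoken)
print(decoded.shape)  # torch.Size([5, 3, 100])
```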

Anyway, that’s how I understand the proposed architecture. I assume it yields better results compared to more traditional architectures. I hope that helps…well, at least gives some food for thought.


Thank you for such a detailed reply. This really clears a lot of doubts!