Word-by-word Training of an RNN

The PyTorch tutorials do a great job of illustrating a bare-bones RNN by defining the input and hidden layers and manually feeding the hidden state back into the network to remember the state across steps. This flexibility then makes it very easy to perform teacher forcing.

Question 1: How do you perform teacher forcing when using the native nn.RNN() module (since the entire sequence is fed at once)? An example of a simple RNN network would be:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import autograd


class SimpleRNN(nn.Module):

    def __init__(self, vocab_size,
                 embedding_dim,
                 batch_sz,
                 hidden_size=128,
                 nlayers=1,
                 num_directions=1,
                 dropout=0.1):

        super(SimpleRNN, self).__init__()

        self.batch_sz = batch_sz
        self.hidden_size = hidden_size
        self.nlayers = nlayers

        self.encoder = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_size, nlayers, dropout=dropout)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def init_hidden(self, batch_sz=None):
        # default to the training batch size; pass batch_sz=1 for generation
        if batch_sz is None:
            batch_sz = self.batch_sz
        return autograd.Variable(torch.zeros(self.nlayers, batch_sz, self.hidden_size)).cuda()

    def forward(self, inputs, hidden):

        # -- encoder returns:
        # -- [batch_sz, seq_len, embed_dim]
        encoded = self.encoder(inputs) 
        _, seq_len, _ = encoded.size()

        # -- rnn returns:
        # -- output.size() = [seq_len, batch_sz, hidden_sz]
        # -- hidden.size() = [nlayers, batch_sz, hidden_sz]
        output, hidden = self.rnn(encoded.view(seq_len, self.batch_sz, -1), hidden)

        # -- decoder returns:
        # -- output.size() = [batch_sz, seq_len, vocab_size]
        output = F.log_softmax(self.decoder(output.view(self.batch_sz, seq_len, self.hidden_size)))

        return output, hidden

I can then call the network with:

model = SimpleRNN(vocab_size, embedding_dim, batch_sz).cuda()
x_data, y_data = get_sequence_data(train_batches[0])
output, hidden = model(x_data, model.init_hidden())

Just for completeness, here are the shapes of x_data, output, and hidden:

print(x_data.size(), output.size(), hidden.size())
torch.Size([32, 80]) torch.Size([32, 80, 4773]) torch.Size([1, 32, 128])

Question 2: Would it be possible to use this SimpleRNN network to then generate a sequence word-by-word, by first feeding it a <GO_TOKEN> and iterating until an <END_TOKEN> is reached? I ask because when I run this:

x_data = autograd.Variable(torch.LongTensor([[word2idx['<GO>']]]), volatile=True).cuda()
output, hidden = model(x_data, model.init_hidden(1))

print(output, output.sum())

I get an output of all 0s, and output.sum() = 0. I get this even after training the network and backpropagating the loss. Any ideas why?

Question 3: If it's not terribly inefficient, is it possible to train the SimpleRNN network above word-by-word, analogous to the PyTorch tutorial shown here (albeit there they're training character-by-character)?

  1. There is no sampling in the RNN's forward pass, so teacher forcing should just work if I understand it correctly: feeding the ground-truth sequence as the input already is teacher forcing (see the sketch after this list).

  2. Since it's already outputting a sequence, why do you need to iterate on it to get a sequence? In the case of directly feeding <GO> alone, it seems that you are not giving it any input, so I'd expect it to fail.

  3. You’d have to use RNN cells or input sequences of length 1. But why would that be useful to you in this case?
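To make #1 concrete: since nn.RNN consumes the whole sequence at once, feeding the ground-truth tokens as input and scoring the outputs against the same tokens shifted by one position is exactly teacher forcing; the model never conditions on its own predictions during training. A minimal sketch, reusing model and vocab_size from your code above (batch is an assumed name for a [batch_sz, seq_len] LongTensor of ground-truth tokens, wrapped/moved to GPU the same way as your x_data):

criterion = nn.NLLLoss()  # the model outputs log-probabilities

inputs, targets = batch[:, :-1], batch[:, 1:]  # next-word prediction pairs
hidden = model.init_hidden(batch.size(0))

# one forward pass over the whole ground-truth sequence = teacher forcing
log_probs, hidden = model(inputs, hidden)      # [batch_sz, seq_len - 1, vocab_size]

loss = criterion(log_probs.contiguous().view(-1, vocab_size),
                 targets.contiguous().view(-1))
loss.backward()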

Hi Simon

Thanks for your answers.

  • In regards to #1: I get what you’re saying. Fair point.
  • In regards to #2: how do I know the length of the generated sequence a priori? What tensor dimensions should I feed it, and with what values?
  • In regards to #3: I agree that it doesn't seem useful if I can overcome the point made in the second bullet above (generating the sequence word-by-word without fixing the sequence length a priori)

A couple of points about #2:

  1. In your code, assuming your comments are correct, transforming to/from the RNN's input/output is a transpose of dims 0 and 1. view() wouldn't work in this case. I'd replace all those views with transpose(0, 1), or just add batch_first=True to the RNN constructor.

  2. I played with log_softmax a bit and did some investigation. The default implicit dimension choice in the 3-D case may not be what you want; it's somewhat arbitrary and will be deprecated soon. The next version will add a dim argument, which will be perfect for this use case. (This is also likely why you saw all zeros in Question 2: with a single <GO> token, the implicit softmax runs over a dimension of size 1, so every entry becomes log(1) = 0.) For now, I'd fix this as follows (including the fix for #1):

        self.rnn = nn.RNN(embedding_dim, hidden_size, nlayers, dropout=dropout, batch_first=True)

......


        # -- encoder returns:
        # -- [batch_sz, seq_len, embed_dim]
        encoded = self.encoder(inputs) 
        _, seq_len, _ = encoded.size()

        # -- rnn returns (with batch_first=True):
        # -- output.size() = [batch_sz, seq_len, hidden_sz]
        # -- hidden.size() = [nlayers, batch_sz, hidden_sz]
        output, hidden = self.rnn(encoded, hidden)

        # -- decoder returns:
        # -- output.size() = [batch_sz, seq_len, vocab_size]
        dec_out = self.decoder(output)
        output = F.log_softmax(dec_out.view(batch_sz * seq_len, -1).view(batch_sz, seq_len, self.hidden_size)
  3. Finally, are you trying to do a seq2seq task where you don't know the output seq-len beforehand, such as translation? If that is the case, encoder-decoder models where both the encoder and decoder are recurrent are a better fit!

Thanks Simon.

I am trying to do a seq2seq task and will definitely play around with an encoder-decoder where both are RNNs. However, I wanted to try simple models first. I'll make your edits and see if that works. I'm a little unclear now on when it's appropriate to use view() and when it's appropriate to use transpose(); I was under the impression that they do similar things.

Regarding the code you posted, I think the last parameter in the second .view() shouldn't be self.hidden_size, since the output from the decoder has dimensions [batch_sz, seq_len, vocab_size].

Also - apologies if this is beating it into the ground, but can you explain the reasoning behind this:

.view(batch_sz * seq_len, -1).view(batch_sz, seq_len, self.vocab_size)

The output from the decoder is already in dimensions [batch_sz, seq_len, vocab_size] – what's the point of resizing to [batch_sz * seq_len, -1] and then resizing again back to the original dimensions? Unless I'm misunderstanding something and you meant to actually write:

output = F.log_softmax(dec_out.view(batch_sz * seq_len, -1)).view(batch_sz, seq_len, self.vocab_size)

Sorry for the typo; you are 100% correct that I missed a ).
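Putting both fixes together (the batch_first change and the flatten-then-log_softmax), the full forward() would look like this; a sketch using the names from your original post:

    def forward(self, inputs, hidden):
        # -- inputs: [batch_sz, seq_len] token indices
        encoded = self.encoder(inputs)              # [batch_sz, seq_len, embed_dim]
        batch_sz, seq_len, _ = encoded.size()

        # -- with batch_first=True the embeddings go straight into the RNN
        output, hidden = self.rnn(encoded, hidden)  # [batch_sz, seq_len, hidden_sz]

        dec_out = self.decoder(output)              # [batch_sz, seq_len, vocab_size]

        # -- flatten to 2-D so log_softmax normalizes over the vocab dimension,
        # -- then restore the original 3-D shape
        output = F.log_softmax(dec_out.view(batch_sz * seq_len, -1))
        output = output.view(batch_sz, seq_len, -1)

        return output, hidden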

Thanks - I added the missing ) and it did fix some issues. The generated sequence doesn't make much sense, perhaps because it's not a sophisticated network; maybe an encoder-decoder where both are recurrent would do better in this case.

The task I'm trying to do is simple sequence generation. I have a bunch of tweets which I use to train the network, and I want to essentially generate a tweet by feeding a <GO> token and having the network output the next word of the generated tweet, word-by-word (since I don't want to impose a fixed length on the tweet; no 140-char limit for now), until an <END> token is reached. Would you call that a seq2seq problem? It's not exactly a translation problem.

They are very different. view() changes the shape of the tensor without touching the underlying order of elements. For example, say:

A = torch.randn(3, 4)   # any tensor of shape (3, 4)
B = A.view(4, 3)
C = A.transpose(0, 1)

# A[1, 2] is the value at flat index 1*4 + 2 = 6 of flattened A
# B[1, 2] is the value at flat index 1*3 + 2 = 5 of flattened A
A[1, 2] == B[2, 0]  # view keeps flat order: index 6 in shape (4, 3) is row 2, col 0
C[1, 2] == A[2, 1]  # transpose swaps the indices

Adding to this, you can do things like A.view(2, 2, 3) with no issue as long as the total number of elements is the same.
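One related gotcha: transpose() returns a non-contiguous view of the same storage, so you can't call view() on its result directly; you have to copy it into contiguous memory first. Continuing the example above:

C = A.transpose(0, 1)  # shape (4, 3), same storage, non-contiguous

# C.view(12) would raise an error because C is no longer laid out row-major
flat = C.contiguous().view(12)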


The network is deterministic; you can't expect it to output many different values for a single input token. At least give it some noise (see the sampling sketch below). Designing this will also potentially involve changing how you train your network.
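For example, one common trick (a sketch; temperature is an assumed hyperparameter): instead of always taking the top-scoring word when generating, sample the next word from the predicted distribution:

temperature = 0.8  # < 1 sharpens the distribution, > 1 flattens it

# output: [1, 1, vocab_size] log-probabilities for the next word
word_weights = output.data.view(-1).div(temperature).exp()  # unnormalized weights
next_idx = int(torch.multinomial(word_weights, 1)[0])       # sampled, not argmax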

Okay – thanks. I’ll continue playing around for a bit.

Thanks again for your responses – appreciate the clarification of view() vs transpose().

Deep generative models are a relatively advanced topic in DL. If you want to pursue this route to generate sequence data (text, in your case), here are some useful papers:

  • Variational Autoencoder (VAE) based: https://arxiv.org/pdf/1511.06349.pdf, https://arxiv.org/pdf/1511.06038.pdf
  • Generative Adversarial Network (GAN) based: https://arxiv.org/pdf/1609.05473.pdf [SeqGAN]


Another route you can take is a seq2seq language model, which should be a lot easier to build and train. To generate samples, you would need to feed some initial token, but that token can be sampled from a distribution estimated from the training data (see the sketch below).
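A sketch of that last point (train_sequences is an assumed name for your tokenized training tweets, each starting with <GO>): estimate how often each word opens a tweet and sample the initial token in proportion:

from collections import Counter

# count the first real word of every training tweet (the token after <GO>)
first_word_counts = Counter(seq[1] for seq in train_sequences)

words = list(first_word_counts.keys())
weights = torch.FloatTensor([first_word_counts[w] for w in words])

# sample an initial token in proportion to its training frequency
init_word = words[int(torch.multinomial(weights, 1)[0])]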


Thanks for those references.

For some reason, when I manually wrote the RNN as the char_rnn_classification tutorial showed (defining an input & hidden layer, concatenating them, then defining an input2output and an input2hidden layer), things actually worked pretty well. It's when I decided to switch over to the nn.RNN() module that things started to get quite confusing, and the results degraded significantly. It's a little unclear to me why that is at the moment.

There might be some issue with how you use the RNN module. Make sure that the final classification comes from the last hidden state, and maybe tune the parameters a bit.

BTW, I updated the above reference with seq2seq language models.

Thanks for those. As for your comments, I do carry the hidden state forward from the last step, like so:

        x_data, y_data = get_sequence_data([sample], train=False)

        model.zero_grad()
        hidden = model.init_hidden(x_data.size()[0])
        
        prev_word = autograd.Variable(torch.LongTensor([word2idx['<GO>']]), volatile=True).cuda()
        
        seq = [GO_TOKEN]

        while True:
            # feed only the previous word, carrying the hidden state forward
            output, hidden = model(prev_word.unsqueeze(1), hidden)

            # greedy decoding: take the highest-scoring word as the next word
            _, idx = output.topk(1)
            cur_word = idx.data[0][0][0]
            seq.append(cur_word)

            # stop at <END>, or after 50 words as a safety cap
            if cur_word == END_TOKEN or len(seq) > 50:
                break

            prev_word = idx.view(-1)

You are not even feeding x_data into the network, so it can never output something close to y_data (assuming y_data largely depends on x_data).

y_data is the same as x_data in this case, but shifted by one word. That lets me calculate the loss by predicting the next word and checking whether it turned out to be that word.

But you're right – fair point. This is actually a minimalistic example that I made for asking my question, and that's almost certainly why I was getting bad results; I'll add the other features back in to see if they help. In my actual network, I'm feeding in other features that would allow the network to distinguish between the <GO> token of input_1 and the <GO> token of input_2.
