Question about loss calculation and backpropagation

Hi,

I am creating an auto-encoder with batch training for a sequence prediction task. I have run into an issue that I suspect is because backpropagation doesn't propagate correctly through the entire network; as a result, the error doesn't go down. Is there a way to check which layers backward() actually back-propagates through?

Here is what my code looks like:

    import numpy
    import torch
    import torch.nn as nn
    from torch.autograd import Variable


    class autoencoder(nn.Module):

        def __init__(self, ...):
            super(autoencoder, self).__init__()

            self.embedding = nn.Embedding(input_size, embedding_size)

            self.encoder = nn.GRU(input_size=embedding_size,
                                  hidden_size=enc_hidden_size,
                                  num_layers=1)

            self.decoder = nn.GRU(input_size=enc_hidden_size,
                                  hidden_size=dec_hidden_size,
                                  num_layers=1)

            self.linear = nn.Linear(dec_hidden_size, input_size)

            self.logsoftmax = nn.LogSoftmax(dim=1)
        

        def forward(self, input_batch, target_batch):

            input = Variable(input_batch)

            input = self.embedding(input)

            # output of embedding is (batch_size x seq_len x embeddings_dim) but input of GRU is (seq_len x batch_size x input_size)
            # so the following transpose is necessary
            input = torch.transpose(input, 0, 1)

            input = torch.nn.functional.relu(input)

            h1 = Variable(torch.zeros(num_layers, batch_size, enc_hidden_size))

            out1, h1 = self.encoder(input, h1)

            # concatenate a start_of_sentence token to the beginning of each sequence
            # (each batch is a 2d matrix where rows represent sentences)
            sos = torch.from_numpy(numpy.full((batch_size, 1), index_of_sos, dtype=numpy.int64))
            target = Variable(torch.cat((sos, target_batch), 1))

            target = self.embedding(target)

            # same transpose as for the input: the GRU expects (seq_len x batch_size x input_size)
            target = torch.transpose(target, 0, 1)

            target = torch.nn.functional.relu(target)

            out2, h2 = self.decoder(target, h1)  # out2 is (seq_length x batch_size x hidden_length)

            out2 = self.linear(out2.view(-1, out2.size(2)))

            out2 = self.logsoftmax(out2)

            return out2

Then, in another function, train(), I call the model with the right input arguments:

    ae = autoencoder(...)
    optimizer = torch.optim.SGD(ae.parameters(), lr=0.05)
    criterion = torch.nn.NLLLoss()

    for it in range(0, num_iterations):

        for i in range(0, len(input_batches)):

            ae.zero_grad()

            result = ae(input_batches[i], target_batches[i])

            # concatenate an end_of_sentence token to the end of each sequence
            eos = torch.from_numpy(numpy.full((batch_size, 1), index_of_eos, dtype=numpy.int64))
            # flatten the targets in the same (seq_len x batch_size) order as the decoder output
            tar_batch = Variable(torch.cat((target_batches[i], eos), 1).t().contiguous().view(-1))

            error = criterion(result, tar_batch)

            error.backward()

            optimizer.step()

When I train the model, the error goes down only a little and then stops improving (it fluctuates). More specifically, the error stays quite high even if I dramatically decrease the size of the training data and increase the number of epochs. I wonder whether my code uses backward and torch.optim correctly, and whether the result returned by the autoencoder carries the entire computational graph.
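For reference, what I would like is something along the lines of the following check: inspect each parameter's .grad right after error.backward(), since a parameter that backward() never reaches keeps grad=None. A minimal sketch, assuming the ae model and the error from the code above:

    # after error.backward(): every parameter that backward() reached has a non-None grad
    for name, param in ae.named_parameters():
        if param.grad is None:
            print(name, ': no gradient reached this parameter')
        else:
            print(name, ': grad norm =', param.grad.data.norm())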


Why are you applying relu to your target?

There was a mistake in the above code that I have now fixed: I apply the relu after the embedding. Papers say it improves performance; I can't remember the exact details though. Do you think it can harm?

No, I'd forgotten it was after the embedding. OK, reasonable. So that's not the reason then :)

Is zero_grad working? Since you are not registering the encoder, decoder, etc. with the Module in any way (I think?), zero_grad won't zero their gradients? Do you need to somehow register the modules with the nn.Module parent class?

Can you elaborate on what you mean by registering the encoder and decoder in the parent class? Also, I corrected another mistake in the above snippet, in how the output is reshaped and "softmaxed" before being returned so that it is suitable for the loss calculation. Sorry about the mistake in the original post.

Apparently they're registered automatically with the formulation you are already using; see the documentation linked below. So that's not the reason :)

torch.nn — PyTorch master documentation
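A quick toy sketch (not code from this thread) to convince yourself: any nn.Module assigned as an attribute inside __init__ shows up in parameters(), so zero_grad() and the optimizer do see the encoder and decoder.

    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super(Toy, self).__init__()
            # assigning sub-modules as attributes registers them automatically
            self.embedding = nn.Embedding(10, 4)
            self.encoder = nn.GRU(input_size=4, hidden_size=8, num_layers=1)

    toy = Toy()
    # all sub-module parameters are visible to parameters(), zero_grad() and the optimizer
    for name, p in toy.named_parameters():
        print(name, tuple(p.size()))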


Wait, is your indentation correct? It seems like a bunch of your module bits are outside of __init__?

This indentation looks strange too.

It does seem strange here, you're right, mainly because I applied some edits; in the actual code the indentation looks quite OK. Fixing the indentation here…


Hmmm. Kind of looks ok.

  • what happens if you train on e.g. just two examples?
  • what do you mean by "the error stays quite high"? My own criterion for getting it working on two examples would be something like: the decoder gets at least the first 2-3 words/characters (depending on whether it is a word or character model) correct.

By the way, when I tried a simple encoder-decoder, I added "teacher forcing" to both the encoder and decoder. In Sean Robertson's tutorial, he adds it to at least the decoder: http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

I have some working seq2seq code here: https://github.com/hughperkins/pub-prototyping/blob/master/papers/attention/seq2seq_noattention_trainbyparts.py
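In case it helps, decoder-side teacher forcing boils down to feeding the ground-truth previous token at every step instead of the decoder's own prediction. A rough sketch, with toy sizes and names that are not taken from your code, assuming an embedding, a one-layer GRU decoder and a linear output layer:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, emb_size, hidden_size = 20, 8, 16          # toy sizes, not taken from your code
    embedding = nn.Embedding(vocab_size, emb_size)
    decoder = nn.GRU(input_size=emb_size, hidden_size=hidden_size, num_layers=1)
    out_layer = nn.Linear(hidden_size, vocab_size)
    criterion = nn.NLLLoss()

    batch_size, seq_len = 3, 5
    target = torch.randint(0, vocab_size, (seq_len, batch_size))   # ground-truth tokens, (seq_len x batch)
    sos = torch.zeros(1, batch_size, dtype=torch.long)             # fake <sos> row
    decoder_input = torch.cat((sos, target[:-1]), 0)               # shift right: feed the true previous token

    h = torch.zeros(1, batch_size, hidden_size)                    # initial hidden state (or the encoder's h1)
    loss = 0
    for t in range(seq_len):
        emb = embedding(decoder_input[t]).unsqueeze(0)             # (1 x batch x emb_size)
        out, h = decoder(emb, h)
        logp = F.log_softmax(out_layer(out.squeeze(0)), dim=1)     # (batch x vocab)
        loss = loss + criterion(logp, target[t])                   # score against the true token at step t
    loss = loss / seq_len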

  • By "error" I mean the loss.

  • when I try it on two examples, the loss goes down but very slowly, and even after 1K iterations, when I test the model on the same data used for training, the results look pretty bad.

  • I am already using teacher forcing for both the encoder and the decoder. I guess there is no way to train a recurrent network with batches other than using teacher forcing.

  • Thanks for the link to your code. I had a brief look at it but will come back to it for a more thorough understanding. To train your model you use one sentence at a time rather than a batch of sentences, right?

Yes, my code is pretty much the stupidest, simplest code I could come up with :) . But at least it provides something to compare to.

By the way, did you try playing around with different learning rates?

Yes, I tried a range of learning rates, but it doesn't seem to have a major impact on the outcome.

how about trying things like:

  • look at the encoder output: does the encoder correctly predict the next letter?

but wait, you are not currently using teacher forcing for the encoder. Perhaps you can add that?

well… what makes you feel you are using teacher forcing on the encoder? what do you feel I mean by ‘teacher forcing’?

As far as I understand, teacher forcing means not waiting for the output of time step t-1 to pass it as input to time step t; instead, you use the known inputs. For the encoder, the inputs would be the words of the input sentence; for the decoder, the words of the target sentence. Isn't it like that?

well… yeah, I guess you’re right.

So, maybe my meaning is not quite right. What I mean is: add the encoder's predicted next words/letters to the loss function.
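A rough sketch of what that could look like (the enc_out projection and all names/sizes here are made up, just to illustrate the idea): project the encoder output at each step onto the vocabulary, penalize it for not predicting the next input token, and add that term to the decoder loss.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, emb_size, hidden_size = 20, 8, 16         # toy sizes, not taken from your code
    embedding = nn.Embedding(vocab_size, emb_size)
    encoder = nn.GRU(input_size=emb_size, hidden_size=hidden_size, num_layers=1)
    enc_out = nn.Linear(hidden_size, vocab_size)          # hypothetical extra projection for the encoder
    criterion = nn.NLLLoss()

    batch_size, seq_len = 3, 6
    tokens = torch.randint(0, vocab_size, (seq_len, batch_size))   # input sentence, already (seq_len x batch)

    h0 = torch.zeros(1, batch_size, hidden_size)
    enc_outputs, h1 = encoder(embedding(tokens), h0)       # enc_outputs: (seq_len x batch x hidden)

    # at step t the encoder output is asked to predict token t+1,
    # so drop the last encoder step and the first token
    logp = F.log_softmax(enc_out(enc_outputs[:-1].contiguous().view(-1, hidden_size)), dim=1)
    encoder_loss = criterion(logp, tokens[1:].contiguous().view(-1))

    # the total loss would then be something like: error = decoder_loss + encoder_loss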

Did your ae eventually work? What did you have to change?