Question about loss calculation and backpropagation

Hi,

I am creating an auto-encoder with batch training for a sequence prediction task. I have run into an issue that I suspect is because backpropagation doesn't propagate correctly through the entire network; as a result, the error doesn't go down. Is there a way to check which layers backward() actually back-propagates through?

Here is what my code looks like:

    import numpy
    import torch
    import torch.nn as nn
    from torch.autograd import Variable


    class autoencoder(nn.Module):

        def __init__(self, ...):
            super(autoencoder, self).__init__()

            self.embedding = nn.Embedding(input_size, embedding_size)

            self.encoder = nn.GRU(input_size=embedding_size,
                                  hidden_size=enc_hidden_size,
                                  num_layers=1)

            self.decoder = nn.GRU(input_size=enc_hidden_size,
                                  hidden_size=dec_hidden_size,
                                  num_layers=1)

            self.linear = nn.Linear(dec_hidden_size, input_size)

            self.logsoftmax = nn.LogSoftmax(dim=1)
        

        def forward(self, input_batch, target_batch):

            input = Variable(input_batch)

            input = self.embedding(input)

            # output of embedding is (batch_size x seq_len x embeddings_dim) but input of GRU is (seq_len x batch_size x input_size)
            # so the following transpose is necessary
            input = torch.transpose(input, 0, 1)

            input = torch.nn.functional.relu(input)

            h1 = Variable(torch.zeros(num_layers, batch_size, enc_hidden_size))

            out1, h1 = self.encoder(input, h1)

            # concatenate a start_of_sentence token to the beginning of each sequence
            # (each batch is a 2d matrix where rows represent sentences)
            sos = torch.from_numpy(numpy.full((batch_size, 1), index_of_sos, dtype=numpy.int64))
            target = Variable(torch.cat((sos, target_batch), 1))

            target = self.embedding(target)

            # same transpose as for the input: the GRU expects (seq_len x batch_size x input_size)
            target = torch.transpose(target, 0, 1)

            target = torch.nn.functional.relu(target)

            out2, h2 = self.decoder(target, h1)  # out2 is (seq_length x batch_size x hidden_length)

            out2 = self.linear(out2.view(-1, out2.size(2)))

            out2 = self.logsoftmax(out2)

            return out2

Then, in another function, train(), I call the model with the right input arguments:

    ae = autoencoder(...)
    optimizer = torch.optim.SGD(ae.parameters(), lr=0.05)
    criterion = torch.nn.NLLLoss()

    for it in range(0, num_iterations):

        for i in range(0, len(input_batches)):

            ae.zero_grad()

            result = ae(input_batches[i], target_batches[i])

            # concatenate an end_of_sentence token to the end of each sequence
            eos = torch.from_numpy(numpy.full((batch_size, 1), index_of_eos, dtype=numpy.int64))
            # flatten the targets in the same (seq_len x batch_size) order as the decoder output
            tar_batch = Variable(torch.cat((target_batches[i], eos), 1).t().contiguous().view(-1))

            error = criterion(result, tar_batch)

            error.backward()

            optimizer.step()

When I train the model, the error goes down only a little and then stops improving (it fluctuates). More specifically, the error stays quite high even if I dramatically decrease the size of the training data and increase the number of epochs. I wonder whether my code uses backward and torch.optim correctly, and whether the result returned by the autoencoder carries the entire computational graph.
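For reference, what I would like is something along the lines of the following check: inspect each parameter's .grad right after error.backward(), since a parameter that backward() never reaches keeps grad=None. A minimal sketch, assuming the ae model and the error from the code above:

    # after error.backward(): every parameter that backward() reached has a non-None grad
    for name, param in ae.named_parameters():
        if param.grad is None:
            print(name, ': no gradient reached this parameter')
        else:
            print(name, ': grad norm =', param.grad.data.norm())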


Why are you applying relu to your target?

There was a mistake in the above code that I have now fixed: I apply the relu after the embedding. Papers say it improves performance; I can't remember the exact details though. Do you think it can harm?

No, I'd forgotten it was after the embedding. OK, reasonable. So that's not the reason then :)

Is zero_grad working? Since you are not registering the encoder, decoder, etc. with the Module in any way (I think?), zero_grad won't zero their gradients? Do you need to somehow register the modules with the nn.Module parent class?

Can you elaborate on what you mean by registering the encoder and decoder in the parent class? Also, I corrected another mistake in the above snippet, in how the output is reshaped and "softmaxed" before being returned so that it is suitable for the loss calculation. Sorry about the mistake in the original post.

Apparently they're registered automatically with the formulation you are already using; see the documentation linked below. So that's not the reason :)

torch.nn — PyTorch master documentation
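A quick toy sketch (not code from this thread) to convince yourself: any nn.Module assigned as an attribute inside __init__ shows up in parameters(), so zero_grad() and the optimizer do see the encoder and decoder.

    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super(Toy, self).__init__()
            # assigning sub-modules as attributes registers them automatically
            self.embedding = nn.Embedding(10, 4)
            self.encoder = nn.GRU(input_size=4, hidden_size=8, num_layers=1)

    toy = Toy()
    # all sub-module parameters are visible to parameters(), zero_grad() and the optimizer
    for name, p in toy.named_parameters():
        print(name, tuple(p.size()))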


Wait, is your indentation correct? It seems like a bunch of your module bits are outside of __init__?

This indentation looks strange too.

It does seem strange here, you're right, mainly because I applied some edits; in the actual code the indentation looks quite OK. Fixing the indentation here…


Hmmm. Kind of looks ok.

  • what happens if you train on e.g. just two examples?
  • what do you mean by "the error stays quite high"? My own criterion for getting it working on two examples would be something like: the decoder gets at least the first 2-3 words/characters (depending on whether it is a word or character model) correct.

By the way, when I tried a simple encoder-decoder, I added "teacher forcing" to both the encoder and decoder. In Sean Robertson's tutorial, he adds it to at least the decoder: http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

I have some working seq2seq code here: https://github.com/hughperkins/pub-prototyping/blob/master/papers/attention/seq2seq_noattention_trainbyparts.py
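In case it helps, decoder-side teacher forcing boils down to feeding the ground-truth previous token at every step instead of the decoder's own prediction. A rough sketch, with toy sizes and names that are not taken from your code, assuming an embedding, a one-layer GRU decoder and a linear output layer:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, emb_size, hidden_size = 20, 8, 16          # toy sizes, not taken from your code
    embedding = nn.Embedding(vocab_size, emb_size)
    decoder = nn.GRU(input_size=emb_size, hidden_size=hidden_size, num_layers=1)
    out_layer = nn.Linear(hidden_size, vocab_size)
    criterion = nn.NLLLoss()

    batch_size, seq_len = 3, 5
    target = torch.randint(0, vocab_size, (seq_len, batch_size))   # ground-truth tokens, (seq_len x batch)
    sos = torch.zeros(1, batch_size, dtype=torch.long)             # fake <sos> row
    decoder_input = torch.cat((sos, target[:-1]), 0)               # shift right: feed the true previous token

    h = torch.zeros(1, batch_size, hidden_size)                    # initial hidden state (or the encoder's h1)
    loss = 0
    for t in range(seq_len):
        emb = embedding(decoder_input[t]).unsqueeze(0)             # (1 x batch x emb_size)
        out, h = decoder(emb, h)
        logp = F.log_softmax(out_layer(out.squeeze(0)), dim=1)     # (batch x vocab)
        loss = loss + criterion(logp, target[t])                   # score against the true token at step t
    loss = loss / seq_len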

  • By "error" I mean the loss.

  • when I try it on two examples, the loss goes down but very slowly, and even after 1K iterations, when I test the model on the same data used for training, the results look pretty bad.

  • I am already using teacher forcing for both the encoder and the decoder. I guess there is no way to train a recurrent network with batches other than using teacher forcing.

  • Thanks for the link to your code. I had a brief look at it but will come back to it for a more thorough understanding. To train your model you use one sentence at a time rather than a batch of sentences, right?

Yes, my code is pretty much the stupidest, simplest code I could come up with :) . But at least it provides something to compare to.

By the way, did you try playing around with different learning rates?

Yes, I tried a range of learning rates, but it doesn't seem to have a major impact on the outcome.

how about trying things like:

  • look at the encoder output: does the encoder correctly predict the next letter?

but wait, you are not currently using teacher forcing for the encoder. Perhaps you can add that?

well… what makes you feel you are using teacher forcing on the encoder? what do you feel I mean by ‘teacher forcing’?

As far as I understand, teacher forcing means not waiting for the output of time step t-1 to pass it as input to time step t; instead, you use the known inputs. For the encoder, the inputs would be the words of the input sentence; for the decoder, the words of the target sentence. Isn't it like that?

well… yeah, I guess you’re right.

So, maybe my meaning is not quite right. What I mean is: add the encoder's predicted next words/letters to the loss function.
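A rough sketch of what that could look like (the enc_out projection and all names/sizes here are made up, just to illustrate the idea): project the encoder output at each step onto the vocabulary, penalize it for not predicting the next input token, and add that term to the decoder loss.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, emb_size, hidden_size = 20, 8, 16         # toy sizes, not taken from your code
    embedding = nn.Embedding(vocab_size, emb_size)
    encoder = nn.GRU(input_size=emb_size, hidden_size=hidden_size, num_layers=1)
    enc_out = nn.Linear(hidden_size, vocab_size)          # hypothetical extra projection for the encoder
    criterion = nn.NLLLoss()

    batch_size, seq_len = 3, 6
    tokens = torch.randint(0, vocab_size, (seq_len, batch_size))   # input sentence, already (seq_len x batch)

    h0 = torch.zeros(1, batch_size, hidden_size)
    enc_outputs, h1 = encoder(embedding(tokens), h0)       # enc_outputs: (seq_len x batch x hidden)

    # at step t the encoder output is asked to predict token t+1,
    # so drop the last encoder step and the first token
    logp = F.log_softmax(enc_out(enc_outputs[:-1].contiguous().view(-1, hidden_size)), dim=1)
    encoder_loss = criterion(logp, tokens[1:].contiguous().view(-1))

    # the total loss would then be something like: error = decoder_loss + encoder_loss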

Did your ae eventually work? What did you have to change?