Backpropagation through functions (skip layer)


I am sorry if this question is trivial to answer, but I am struggling to find an answer and I am a beginner.

I am trying to implement a simple autoencoder for denoising, following the RedNet architecture (I know many implementations already exist, but this is just a way for me to understand autoencoders). The encoder and decoder are symmetric (stride 2 for the encoder / upsample ×2 for the decoder), and this architecture uses skip layers: the outputs of some encoder layers are summed directly with the corresponding outputs of the decoder (i.e. the output of the first convolution layer of the encoder is summed with the output of the second-to-last deconvolution layer of the decoder).

Here is my code (for the encoder and decoder):

def encoder(self, features):

    x = self.enc_conv_1(features)
    fMap1 = F.leaky_relu(x)  # kept for the skip connection
    x = self.enc_conv_2(fMap1)
    fMap2 = F.leaky_relu(x)  # kept for the skip connection
    x = self.enc_conv_3(fMap2)
    fMap3 = F.leaky_relu(x)

    encoded = self.enc_linear(fMap3)

    return encoded, fMap1, fMap2

def decoder(self, encoded, fMap1, fMap2):
    x = self.dec_linear_1(encoded)
    x = x.view(-1, 64, 64, 64)

    x = self.dec_deconv_1(x)
    x = x + fMap2  # skip connection from the encoder
    x = F.leaky_relu(x)

    x = self.dec_deconv_2(x)
    x = x + fMap1  # skip connection from the encoder
    x = F.leaky_relu(x)

    x = self.dec_deconv_3(x)
    x = F.leaky_relu(x)

    decoded = torch.sigmoid(x)
    return decoded

def forward(self, features):
    encoded, fMap1, fMap2 = self.encoder(features)
    decoded = self.decoder(encoded,fMap1, fMap2)
    return decoded

My question is: when gradients are backpropagated, will the autograd mechanism also backpropagate gradients directly through the “skip path”? For instance, will the gradient backpropagate through the whole network, but also directly from the second-to-last layer of the decoder back to the first layer of the encoder?

If not, how can this be done? Do I need to merge the encoder and decoder functions into a single function?

I hope my question is clear. Thank you for your help.

Kind regards

Hi Francois,

Autograd will track all operations in the forward pass in a computation graph and use it in the backward pass to calculate the gradients for all parameters.

You can pass the output of one module/model to another one. As long as you don’t detach the activations e.g. via tensor = tensor.detach(), it should work out of the box.
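Here is a tiny toy check (generic tensor ops standing in for your layers, not your actual model) showing that the gradient flows through a summed skip path, and what would happen if the skip were detached:

```python
import torch

def input_grad(detach_skip):
    x = torch.ones(3, requires_grad=True)
    h = 2.0 * x                               # stand-in "encoder" feature map
    skip = h.detach() if detach_skip else h   # the skip connection
    out = 3.0 * h + skip                      # "decoder" output plus skip sum
    out.sum().backward()
    return x.grad

print(input_grad(False))  # tensor([8., 8., 8.]) -> d/dx (6x + 2x), skip path included
print(input_grad(True))   # tensor([6., 6., 6.]) -> detached skip contributes nothing
```

The difference between the two gradients (8 vs. 6) is exactly the contribution of the skip path, so as long as you sum the undetached tensors, autograd handles it for you.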

Let us know, if you need more information.

Thank you very much for your quick reply. I just checked whether there was any difference after merging the encoder and decoder functions, but no … it produces the same result.

I also realised that using this “skip layer” technique probably implies that you need a fair amount of noise in your input data; otherwise the network will not learn anything, due to the skip-layer mechanism.

What I don’t understand with my test, though, is this: I train my network with only one image (again, for testing purposes) that I blurred. The network seems to learn how to go from the blurred image to the unblurred one. However, during evaluation, if I input any other image, my network will always output the sharp version of the image used for training… I guess I have a massive bug in my code, because it does not make any sense.

Do you have any idea or an extra explanation?
Thank you again for your answer.

If you are using a single image during training, your model overfits on this particular sample.
It’s not learning to deblur images, but to output your desired target.

It’s useful as a sanity test to check if your code has some hidden bugs, but eventually you would have to scale up your experiment.
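For reference, this single-sample sanity check can be sketched roughly as below (the tiny model, shapes, and random input/target pair are placeholders, not your RedNet implementation): if the training code is wired correctly, the loss on one fixed pair should decrease.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model: two convs with a sigmoid output, as in a denoiser.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.LeakyReLU(),
    nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid(),
)

x = torch.rand(1, 1, 16, 16)       # one fixed "blurred" input
target = torch.rand(1, 1, 16, 16)  # its fixed "sharp" target
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(losses[0], losses[-1])  # the final loss should be lower than the first
```

If the loss does not go down on a single sample, something in the model, loss, or optimizer wiring is broken; if it does, the remaining generalization issues come from the data, as discussed above.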

Ok, thank you. It just sounds a bit curious to me, as during evaluation the weights are fixed and the code generated by the encoder will obviously be different from the one generated during training (again, using a single training image different from the evaluation image). As such, I was expecting a really strange output which would not make any sense.

Yeah, that also sounds reasonable.
However, some of my models were just killing the complete input signal (making it negative, so that ReLU would set it to zero), while the last output bias was creating my prediction. :wink:
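A contrived toy example of this failure mode (made-up linear layers, not one of my actual models): the hidden layer pushes every activation negative, ReLU zeroes the signal, and the output layer’s bias alone forms a constant prediction for any input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = nn.Linear(4, 4)
head = nn.Linear(4, 2)
with torch.no_grad():
    hidden.weight.fill_(0.0)
    hidden.bias.fill_(-1.0)  # every pre-activation is -1 -> ReLU outputs 0

a = head(torch.relu(hidden(torch.randn(4))))
b = head(torch.relu(hidden(torch.randn(4))))
print(torch.allclose(a, b))  # True: the output is just head's bias, whatever the input
```

This is why a constant output for every evaluation image is consistent with simple overfitting rather than a bug: the network can minimize the training loss by ignoring the input entirely.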

Oh, OK, that definitely makes sense now. I will try this with a larger dataset and see what the network learns. My hope is that, by using different sizes of blur kernels, the network will still be able to learn useful features rather than the process of deblurring images per se.

Thanks a lot for your answer; it is really helpful.