Is it possible to only store the total gradient of a custom module

ProkaryMonster · August 9, 2020, 10:38am

I’ve pretrained a CNN module that I want to insert into object detection deep neural networks.
However, the module I’ve designed first greatly increases the number of feature channels and then reduces it back to the number of input channels. This may require storing large feature maps during the forward pass, or large gradient maps during the backward pass, and it takes up too much CUDA memory, while all I need is the total gradient of the module with respect to the input feature map.
Since I don’t need the module parameters to be updated, I’ve set

for name, p in ce.module.named_parameters():
    p.requires_grad = False

But as shown in my previous post, PyTorch still has to keep the module differentiable with respect to its input, so it only skips computing the gradients with respect to the layer weights, and the amount of data it stores may not change.
So I’m wondering whether it is possible to accumulate the module’s gradient layer by layer during the forward pass, so that the backward pass only needs to multiply it with the gradients from the rest of the network, instead of saving inputs, outputs, and gradients for every layer.

Is it possible to do such a thing in pytorch or is there a more elegant way to do it?

albanD · August 10, 2020, 8:33pm

Hi,

The way to do this is to implement a new custom Function (instructions here https://pytorch.org/docs/stable/notes/extending.html) that will encapsulate both these ops.
You will then have control over what is saved for the backward pass, and you will need to implement the backward yourself to match the new computation that you want to do.
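
One possible way to apply this suggestion is sketched below. It is only an illustration, not necessarily what was meant above: the class name `FrozenModuleFn` and the small two-layer module are made up for the example, and the sketch assumes the wrapped module is frozen, deterministic, and in eval mode (no dropout, no batch-norm statistic updates). The Function saves only the input tensor, and the backward pass recomputes the module's forward to obtain the gradient with respect to the input, trading extra compute for memory.

```python
import torch
import torch.nn as nn


class FrozenModuleFn(torch.autograd.Function):
    """Hypothetical wrapper: keeps only the input, recomputes in backward."""

    @staticmethod
    def forward(ctx, x, module):
        # Inside a custom Function's forward, autograd does not record the
        # operations, so the module's intermediate feature maps are not
        # kept for backward.
        ctx.module = module
        ctx.save_for_backward(x)
        return module(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Re-run the module with autograd enabled on a detached copy of the
        # input, then take the vector-Jacobian product w.r.t. that input.
        with torch.enable_grad():
            x_ = x.detach().requires_grad_(True)
            y = ctx.module(x_)
        (grad_x,) = torch.autograd.grad(y, x_, grad_out)
        # No gradient for the (non-tensor) module argument.
        return grad_x, None


# Usage sketch with a stand-in module that expands and then reduces channels.
module = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 4, 3, padding=1),
).eval()
for p in module.parameters():
    p.requires_grad = False

x = torch.randn(2, 4, 8, 8, requires_grad=True)
out = FrozenModuleFn.apply(x, module)
out.sum().backward()  # x.grad now holds the gradient w.r.t. the input
```

The intermediate activations still exist temporarily while backward recomputes them, but they are freed as soon as this module's backward finishes instead of being held for the whole forward pass of the network. `torch.utils.checkpoint` follows a similar recompute-in-backward strategy and may be enough if no custom backward logic is needed.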