GPU Memory Usage During Pre-training

Hi All,

My question is: is it necessary to store the gradient/feature maps of the frozen (`requires_grad = False`) non-linear intermediate layers of a convolutional neural network?

The question comes from the following observations I made while pre-training a network.

  1. If I freeze the low-level layers of a network (set `requires_grad = False`) and only update the weights of the higher-level layers, PyTorch frees some memory, since it no longer needs to save the feature and gradient maps of the frozen layers. So freezing the low-level layers saves memory (a rough sketch of this comparison follows the list).
  2. If I only freeze the intermediate layers (neither the high-level nor the low-level layers), the memory usage is the same as when I update all the weights (low-, intermediate- and high-level layers).
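
For reference, here is a rough sketch of the kind of comparison I ran (the toy model, the layer split and the `peak_memory_mb` helper are made up for illustration, not my actual pre-training setup):

```python
import torch
import torch.nn as nn

# Toy network: a "low-level" block, an "intermediate" block and a "high-level" head.
model = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()),   # low-level
    nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU()),  # intermediate
    nn.Sequential(nn.Conv2d(64, 10, 3, padding=1)),             # high-level
).cuda()

def peak_memory_mb(frozen_blocks):
    """Freeze the given blocks, run one forward/backward, return peak GPU memory in MB."""
    for p in model.parameters():
        p.requires_grad_(True)
    for idx in frozen_blocks:
        for p in model[idx].parameters():
            p.requires_grad_(False)
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(16, 3, 224, 224, device="cuda")
    model(x).sum().backward()
    return torch.cuda.max_memory_allocated() / 1024 ** 2

print("freeze low-level    :", peak_memory_mb([0]))  # noticeably lower
print("freeze intermediate :", peak_memory_mb([1]))  # about the same as freezing nothing
print("freeze nothing      :", peak_memory_mb([]))
```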

So my question is: would it be possible for PyTorch to free the gradient/feature maps of these frozen intermediate layers to save some memory?

My initial thought is that if all the frozen intermediate layers perform only linear operations (though this is often not the case for a convolutional neural network), we don’t need to save their gradient/feature maps, since the frozen block as a whole is just one linear operation. But if there are non-linear operations (e.g., ReLU) among the frozen layers, is it still necessary to store the gradient/feature maps of these non-linear frozen layers?
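
To make that intuition concrete, here is a hand-written sketch (names like `FrozenLinearFn` and `ReLUFn` are just for illustration): the backward of a frozen linear map only needs the weight, while ReLU’s backward needs to know where its forward input was negative, so something from the forward pass has to be kept:

```python
import torch

class FrozenLinearFn(torch.autograd.Function):
    # y = x @ W^T with a frozen W: grad wrt x is grad_out @ W, so only the
    # (frozen) weight is saved -- the input activation x is not needed.
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        (weight,) = ctx.saved_tensors
        return grad_out @ weight, None  # no gradient for the frozen weight

class ReLUFn(torch.autograd.Function):
    # ReLU's backward masks grad_out where the input was <= 0, so the forward
    # input (or an equivalent mask) must be stored for the backward pass.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x > 0).to(grad_out.dtype)

x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 8)  # frozen weight
ReLUFn.apply(FrozenLinearFn.apply(x, w)).sum().backward()
print(x.grad.shape)  # torch.Size([4, 8])
```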

Best,
Zhuotun

Hi,

To be able to compute gradients for the previous layers, a layer that is frozen still needs to compute the gradient with respect to its input.
Each layer should be implemented so that it only saves what it will need during the backward pass for the inputs that require gradients.
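
As a quick illustration (not the exact bookkeeping PyTorch does internally), whether a frozen layer has to keep anything around depends on whether its input requires gradients:

```python
import torch
import torch.nn as nn

frozen = nn.ReLU()                     # stands in for a frozen non-linear block
trainable = nn.Conv2d(3, 8, 3, padding=1)

x = torch.randn(1, 3, 16, 16)

# Frozen block below all trainable layers: its input does not require grad,
# so no autograd graph is built and no activation needs to be kept.
out = frozen(x)
print(out.requires_grad, out.grad_fn)  # False None

# Frozen block above a trainable layer: its input requires grad, so autograd
# keeps what ReLU needs in order to push gradients back to the conv weights.
out = frozen(trainable(x))
print(out.requires_grad, out.grad_fn)  # True, a ReLU backward node
```

That is why freezing only the intermediate layers (your observation 2) does not reduce memory: their inputs come from trainable low-level layers, so they still have to support the backward pass for those earlier layers.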
