GPU Memory Usage During Pre-training

Hi All,

My question is: is it necessary to store the gradient/feature maps of the frozen (requires_grad = False) non-linear intermediate layers of a convolutional neural network?

This question comes from pre-training a network, where I made the following observations.

  1. If I fix (freeze) the low-level layers of a network and only update the weights of the higher-level layers, PyTorch frees some memory (there is no need to save the feature and gradient maps for the frozen layers). So freezing the low-level layers saves some memory.
  2. If I only fix the intermediate layers (neither the high-level nor the low-level layers), the memory usage is the same as when I update all weights (low, intermediate, and high-level layers). A rough sketch for reproducing both observations follows this list.
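
For reference, here is a self-contained sketch of how these observations could be measured. It assumes a CUDA device, and the tiny three-block model and its sizes are made up for illustration; they are not the actual network:

import torch
import torch.nn as nn

def peak_memory(freeze=None):
    # tiny stand-in for the low/intermediate/high-level blocks of a conv net
    def block(cin, cout):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())
    blocks = {"low": block(3, 64), "mid": block(64, 64), "high": block(64, 64)}
    model = nn.Sequential(blocks["low"], blocks["mid"], blocks["high"]).cuda()
    if freeze is not None:
        for p in blocks[freeze].parameters():
            p.requires_grad = False

    torch.cuda.reset_peak_memory_stats()
    out = model(torch.randn(16, 3, 128, 128, device="cuda"))
    out.sum().backward()
    return torch.cuda.max_memory_allocated() / 1024**2

print("all trainable:", peak_memory(), "MiB")
print("low frozen   :", peak_memory("low"), "MiB")   # noticeably lower (observation 1)
print("mid frozen   :", peak_memory("mid"), "MiB")   # roughly unchanged (observation 2)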

So my question is: would it be possible for PyTorch to free the gradient/feature maps of these frozen intermediate layers to save some memory?

My initial thought is that if all the frozen intermediate layers were linear operations (though this is often not the case for a conv neural network), we would not need to save their gradient/feature maps, since the whole frozen block would amount to a single linear operation. But if there are non-linear operations (conv, relu, etc.) in the frozen layers, is it still necessary to store the gradient/feature maps of these non-linear frozen layers?
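
One way to inspect what actually gets stored is torch.autograd.graph.saved_tensors_hooks (available in PyTorch 1.10+). Here is a minimal sketch with made-up layer sizes, where a frozen conv+relu block sits after a trainable layer:

import torch
import torch.nn as nn

trainable = nn.Conv2d(3, 8, 3, padding=1)
frozen = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())
for p in frozen.parameters():
    p.requires_grad = False

def pack(t):
    # called once per tensor that autograd saves for the backward pass
    print("saved for backward:", tuple(t.shape))
    return t

with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
    out = frozen(trainable(torch.randn(1, 3, 16, 16)))
out.sum().backward()

Because the frozen block's input comes out of the trainable layer and therefore requires grad, its conv and ReLU still show up in the saved tensors.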

Best,
Zhuotun

Hi,

To be able to compute gradients for the previous layers, a layer that is frozen still needs to compute the gradient w.r.t. its input.
Each layer should be implemented such that it only saves what it will need to compute the backward pass for the inputs that require gradients.
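
For example (a minimal sketch with made-up sizes), a frozen linear layer still produces a gradient for its input, while its own .weight.grad stays None:

import torch
import torch.nn as nn

frozen = nn.Linear(4, 4)
for p in frozen.parameters():
    p.requires_grad = False

x = torch.randn(2, 4, requires_grad=True)
frozen(x).sum().backward()

print(x.grad is not None)          # True: gradient w.r.t. the input is still computed
print(frozen.weight.grad is None)  # True: no gradient is accumulated for the frozen weight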


Hi @albanD ,

I am coming to this thread because I am training a simple linear layer in front of a frozen LLM, but I am seeing GPU memory usage similar to training the LLM itself. Would this come from gradients stored within the LLM module that are needed to compute the gradients of the linear layer in front, even though the LLM layers are frozen with requires_grad=False?
Would they be stored in the .weight.grad attributes as they usually would be?

I somehow cannot reproduce this behaviour with the following dummy network:

import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()

        self.layer1 = nn.Linear(in_features=1024, out_features=1024)
        self.relu1 = nn.ReLU()

        self.layer2 = nn.Linear(in_features=1024, out_features=1024)
        self.relu2 = nn.ReLU()

        # Freeze the parameters of the second layer
        for param in self.layer2.parameters():
            param.requires_grad = False

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu1(x)

        x = self.layer2(x)
        x = self.relu2(x)
        return x

When I check layer2.weight.grad after loss.backward(), it is None, while layer1.weight.grad contains the gradients as expected.

It would be great if you could bring some clarity here!

Thank you in advance

Hi,

Most of the memory usage here comes from the intermediate activations that are saved for backward, not from the gradient tensors themselves. So it is indeed expected, since we need to backprop through the LLM.
You can use torch.utils.checkpoint to do activation checkpointing and reduce the memory usage at the cost of extra compute.
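
A minimal sketch of what that could look like, with a small frozen stack standing in for the LLM (the sizes and block structure are made up):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

proj = nn.Linear(1024, 1024)  # the trainable layer in front
frozen_blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
)
for p in frozen_blocks.parameters():
    p.requires_grad = False

x = torch.randn(16, 1024)
h = proj(x)  # requires grad, so backprop has to pass through the frozen stack
for block in frozen_blocks:
    # recompute this block's activations during backward instead of storing them
    h = checkpoint(block, h, use_reentrant=False)
h.sum().backward()
print(proj.weight.grad is not None)  # True: the trainable layer still gets its gradient

During the forward pass only each block's input is kept; the block's intermediate activations are recomputed when its backward runs.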
