Why Doesn’t My Tensor Offloading Strategy Reduce GPU Memory Usage in Forward Pass?

I am training a model with PyTorch and would like to offload activation tensors to CPU memory during the forward pass as soon as they are no longer needed, then reload them to GPU memory during the backward pass if they are required. I know this can be achieved with torch.autograd.graph.save_on_cpu(), but I first tried to move the tensors to the CPU manually. Why doesn’t the following code work as expected?

import torch
import torch.nn as nn

class DeepModel(nn.Module):
    def __init__(self, layer_num=40, layer_size=4096):
        super(DeepModel, self).__init__()
        self.layers = nn.ModuleList([nn.Linear(layer_size, layer_size) for _ in range(layer_num)])
        self.relu = nn.ReLU()

    def forward(self, x, offload=False):
        if offload:
            prev_tensor = x
            for idx, layer in enumerate(self.layers):
                x = self.relu(layer(prev_tensor))
                # try to move the previous activation to CPU once it is no longer needed
                prev_tensor.to('cpu', non_blocking=True)
                prev_tensor = x
        else:
            # plain forward pass with no offloading
            for layer in self.layers:
                x = self.relu(layer(x))
        return x

Here is the test output:

Testing without offloading:
Memory allocated before forward pass: 2576.62 MB
Memory allocated after forward pass: 3224.75 MB
Max memory allocated during forward pass: 3240.75 MB

Testing with offloading:
Memory allocated before forward pass: 2584.75 MB
Memory allocated after forward pass: 3224.75 MB
Max memory allocated during forward pass: 3240.75 MB
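For reference, the numbers above were produced roughly like this (a minimal sketch of the measurement script; the batch size of 1024 and the use of .sum().backward() are illustrative, not the exact code I ran):

import torch

device = torch.device('cuda')
model = DeepModel().to(device)
x = torch.randn(1024, 4096, device=device)

for offload in (False, True):
    torch.cuda.reset_peak_memory_stats(device)
    print(f"Testing {'with' if offload else 'without'} offloading:")
    print(f"Memory allocated before forward pass: {torch.cuda.memory_allocated(device) / 2**20:.2f} MB")
    out = model(x, offload=offload)
    print(f"Memory allocated after forward pass: {torch.cuda.memory_allocated(device) / 2**20:.2f} MB")
    print(f"Max memory allocated during forward pass: {torch.cuda.max_memory_allocated(device) / 2**20:.2f} MB")
    out.sum().backward()  # backward pass, which should reload any offloaded activations
    model.zero_grad(set_to_none=True)
    del out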

There’s no significant reduction in GPU memory usage. However, with the following approach using torch.autograd.graph.save_on_cpu(), I do observe a noticeable decrease in memory usage:

class DeepModel(nn.Module):
    def __init__(self, layer_num=40, layer_size=4096):
        super(DeepModel, self).__init__()
        self.layers = nn.ModuleList([nn.Linear(layer_size, layer_size) for _ in range(layer_num)])
        self.relu = nn.ReLU()

    def forward(self, x, offload=False):
        if offload:
            # activations saved for backward are packed to pinned CPU memory and
            # unpacked back to the GPU when the backward pass needs them
            with torch.autograd.graph.save_on_cpu(pin_memory=True):
                prev_tensor = x
                for idx, layer in enumerate(self.layers):
                    x = self.relu(layer(prev_tensor))
                    prev_tensor.to('cpu', non_blocking=True)
                    prev_tensor = x
        else:
            # plain forward pass with no offloading
            for layer in self.layers:
                x = self.relu(layer(x))
        return x
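For context, my understanding from the PyTorch documentation is that save_on_cpu(pin_memory=True) is essentially a pair of pack/unpack hooks for tensors saved by autograd. The sketch below is an approximation of that idea, not the actual implementation; pack_to_cpu and unpack_from_cpu are illustrative names:

import torch

def pack_to_cpu(tensor):
    # copy the saved activation into (pinned) CPU memory and remember its device
    packed = torch.empty(tensor.size(), dtype=tensor.dtype, layout=tensor.layout,
                         pin_memory=torch.cuda.is_available())
    packed.copy_(tensor)
    return (tensor.device, packed)

def unpack_from_cpu(packed):
    # move the activation back to its original device when backward needs it
    device, tensor = packed
    return tensor.to(device, non_blocking=True)

layer = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(1024, 4096, device='cuda', requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = torch.relu(layer(x))  # activations saved for backward go through pack_to_cpu
y.sum().backward()            # unpack_from_cpu restores them on the GPU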

Why doesn’t the first method reduce memory usage as expected? What could be causing this difference?