Link between requires_grad and moving tensors between CPU and GPU memory

Hi!

I am moving tensors between CPU and GPU memory with .to(device) and .cpu().
I found out that all tensors that pass through an nn.Linear layer seem to be locked in GPU memory.
That is why I wrote my own Linear layer, and I found that with requires_grad=False I get the expected GPU memory usage.
With requires_grad=True, however, I cannot swap the tensors to CPU memory:
.cpu() and deleting the reference with del tensor_var have no effect if requires_grad=True.

Can someone explain the background of this behaviour and maybe suggest a workaround that still lets me use PyTorch’s autograd?

A huge thank you in advance for every answer!

Could you explain what “locked in GPU memory” means and how you are measuring it?
Note that PyTorch caches GPU memory, so after deleting a tensor the memory will not be released immediately; this avoids synchronizing (and thus expensive) memory reallocations.
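
For illustration, a minimal sketch of that caching behaviour (the exact reserved sizes depend on the allocator’s block sizes):

import torch

x = torch.randn(1024, device='cuda')
print(torch.cuda.memory_allocated())  # bytes occupied by live tensors, e.g. 4096
print(torch.cuda.memory_reserved())   # bytes held in PyTorch's cache, typically larger

del x
print(torch.cuda.memory_allocated())  # back to 0: the tensor is gone
print(torch.cuda.memory_reserved())   # unchanged: the block stays cached for reuse

torch.cuda.empty_cache()              # explicitly return cached blocks to the driver
print(torch.cuda.memory_reserved())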


To measure the GPU memory usage, I use

stats = torch.cuda.memory_stats()
current_active_byte = stats["active_bytes.all.current"]

and calculate the difference before and after the statement of interest.
The value of current_active_byte does not decrease after tensor_var.cpu() or del tensor_var if requires_grad=True.
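
For concreteness, the before/after pattern looks roughly like this (a minimal sketch):

import torch

def active_bytes():
    return torch.cuda.memory_stats()["active_bytes.all.current"]

before = active_bytes()
tensor_var = torch.randn(1024, device="cuda")  # statement of interest
print(active_bytes() - before)                 # 4096: 1024 floats * 4 bytes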

I cannot reproduce this behavior using this code snippet:

import torch

def print_active_bytes():
    stats = torch.cuda.memory_stats()
    current_active_byte = stats["active_bytes.all.current"]
    print(current_active_byte)


# initial usage
print_active_bytes()
> 0

# vanilla tensor
x = torch.randn(1024, device='cuda')
print_active_bytes()
> 4096

del x
print_active_bytes()
> 0

# requires_grad=True
x = torch.randn(1024, device='cuda', requires_grad=True)
print_active_bytes()
> 4096

del x
print_active_bytes()
> 0
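
The difference is likely that a standalone leaf tensor has no backward graph attached, so nothing besides your own reference keeps it alive; only an operation creates a node that stores references to its inputs. A minimal sketch:

import torch

x = torch.randn(256, 128, device='cuda', requires_grad=True)
print(x.grad_fn)  # None: a leaf tensor carries no backward node

w = torch.randn(128, 512, device='cuda', requires_grad=True)
l = torch.matmul(x, w)
print(l.grad_fn)  # e.g. <MmBackward0 ...>: this node saves x and w for the backward pass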

I can reproduce it with the following code:

import torch

def active_bytes():
    stats = torch.cuda.memory_stats()
    current_active_byte = stats['active_bytes.all.current']
    return current_active_byte


# initial usage
print("Init usage {}". format(active_bytes()))

# vanilla tensor
x = torch.randn((256, 128), device='cuda')
w = torch.randn((128, 512), device='cuda')
l = torch.matmul(x, w)
print("Vanilla tensor {}". format(active_bytes()))

del x
print("Vanilla tensor: del x {}". format(active_bytes()))
del w
print("Vanilla tensor: del w {}". format(active_bytes()))
l = l.cpu()
print("Vanilla tensor: l = l.cpu() {}". format(active_bytes()))

# requires_grad=True
x = torch.randn((256, 128), device='cuda', requires_grad=True)
w = torch.randn((128, 512), device='cuda', requires_grad=True)
l = torch.matmul(x, w)
print("requires_grad=True {}". format(active_bytes()))

del x
print("requires_grad=True: del x {}". format(active_bytes()))
del w
print("requires_grad=True: del w {}". format(active_bytes()))
l = l.cpu()
print("requires_grad=True: l = l.cpu() {}". format(active_bytes()))

The output I get is

Init usage 0
Vanilla tensor 917504
Vanilla tensor: del x 786432
Vanilla tensor: del w 524288
Vanilla tensor: l = l.cpu() 0
requires_grad=True 917504
requires_grad=True: del x 917504
requires_grad=True: del w 917504
requires_grad=True: l = l.cpu() 393216
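
For reference, these numbers match the tensor sizes: x takes 256 × 128 × 4 B = 131072 B, w takes 128 × 512 × 4 B = 262144 B, and l takes 256 × 512 × 4 B = 524288 B, i.e. 917504 B in total. The final 393216 B in the requires_grad=True case are exactly x plus w, the inputs saved for the backward pass.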

In my “production” code, .cpu() does not free memory the way it does in this example, but I have not yet managed to write a snippet that reproduces that behaviour.

This is expected, as moving l to the CPU won’t detach it from the computation graph.

Add

l = l.detach()
print("requires_grad=True: l = l.detach() {}".format(active_bytes()))

after the cpu() operation; the graph should then be freed and you should see 0 active bytes again.

Note that the cuda(), cpu(), and to() operations are differentiable, so the computation graph will be kept alive and you can backpropagate through these operations.
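
A quick sketch of that (not from the original snippets):

import torch

x = torch.randn(4, device='cuda', requires_grad=True)
y = x.cpu() * 3.0      # the computation continues on the CPU
y.sum().backward()     # gradients flow back through the device transfer
print(x.grad)          # tensor([3., 3., 3., 3.], device='cuda:0')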


If I understood it correctly, once I detach the tensor, backward() no longer works, and then there would be no point in using PyTorch’s autograd at all.

Additionally, changing the second part to

# requires_grad=True
x = torch.randn((256, 128), device='cuda', requires_grad=True)
w = torch.randn((128, 512), device='cuda', requires_grad=True)
l = torch.matmul(x, w)
print("requires_grad=True {}". format(active_bytes()))

x = x.cpu()
print("requires_grad=True: x = x.cpu() {}". format(active_bytes()))
w = w.cpu()
print("requires_grad=True: w = w.cpu() {}". format(active_bytes()))
l = l.cpu()
print("requires_grad=True: l = l.cpu() {}". format(active_bytes()))

with the output

requires_grad=True 917504
requires_grad=True: x = x.cpu() 917504
requires_grad=True: w = w.cpu() 917504
requires_grad=True: l = l.cpu() 393216

shows that the inputs to the matrix multiplication are not freed.

The tensors should stay part of the computation graph so that backpropagation works.
The question is why PyTorch does not free the GPU memory of tensors that are part of the computation graph once they have been moved: the tensors are not lost, they are just stored somewhere else.

PyTorch doesn’t free these tensors because they are part of the computation graph and are needed for the backward pass.
In your example you are creating new tensors x, w, and l, which doesn’t delete the computation graph, i.e. l.mean().backward() would still work.
You won’t be able to access the .grad attribute of x and w anymore, but the gradients will still be calculated. If you assign the CPU copies to new names, e.g. x1 and w1, you will be able to print x.grad and w.grad after the backward pass.
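
A minimal sketch of that variant:

import torch

x = torch.randn(256, 128, device='cuda', requires_grad=True)
w = torch.randn(128, 512, device='cuda', requires_grad=True)
l = torch.matmul(x, w)

x1 = x.cpu()         # CPU copies under new names; x and w still name the GPU leaves
w1 = w.cpu()

l.mean().backward()  # the graph is intact, so backward still works
print(x.grad.shape)  # torch.Size([256, 128]): gradients land on the GPU leaves
print(w.grad.shape)  # torch.Size([128, 512])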

You cannot move the computation graph to the CPU after its creation by moving leaf variables to the CPU (at least I’m not aware of a method to do so).
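
As an aside, and assuming a recent PyTorch release: torch.autograd.graph.save_on_cpu lets autograd offload the tensors it saves for backward to CPU memory at graph creation time, which may come closer to what you are after. A sketch:

import torch

x = torch.randn(256, 128, device='cuda', requires_grad=True)
w = torch.randn(128, 512, device='cuda', requires_grad=True)

# Tensors saved for the backward pass are kept in pinned CPU memory
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    l = torch.matmul(x, w)

l.mean().backward()  # saved tensors are copied back to the GPU on demand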

Alright, that’s not what I had hoped for 🙂 but thank you a lot for your help!