GPU VRAM not getting cleared

I am trying to implement a federated learning experiment with one server and 10 clients. Each client gets its own copy of the ResNet18 model and its own copy of the optimizer (Adam in the code below), and the server keeps another copy of the model. I keep the client models, optimizers, and schedulers in lists like this:

net_clients, optimizer_clients, scheduler_clients = [], [], []
for client_idx in range(client_num):
    net_current = copy.deepcopy(net)
    optimizer = torch.optim.Adam(net_current.parameters(), lr=lr, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95, last_epoch=-1)
    net_clients.append(net_current.to('cpu'))
    optimizer_clients.append(optimizer)
    scheduler_clients.append(scheduler)

For each client, I run the local training,

for client_idx in source_site_idx:
    dataloader_current = dataloader_clients[client_idx]
    net_current = net_clients[client_idx]
    net_current.to(device=GPUdevice)
    net_current.train()
    optimizer_current = optimizer_clients[client_idx] #.to(GPUdevice)
    scheduler_current = scheduler_clients[client_idx]

After training is done, I assign None to both net_current and optimizer_current

I want to clear the GPU VRAM being occupied by the current client model; however, I observe that the GPU VRAM that gets occupied by the client model does not get cleared if I assign None to both net_current and optimizer_current.

I have also tried

net_clients[client_idx].cpu()

But GPU VRAM does not get cleared.

What can be done to clear the GPU VRAM by (maybe) offloading the model weights to CPU?

The memory might be held onto by the caching allocator; does torch.cuda.empty_cache() help?

@mikaylagawarecki

I tried adding torch.cuda.empty_cache() after setting net_current = None and optimizer_current = None, but it did not seem to work. The memory is still occupied.

Hi Siladittya!

In your setup loop, optimizer holds references to the various net_current.parameters(), and
scheduler holds a reference to optimizer.

The lists net_clients, optimizer_clients, and scheduler_clients hold references to the
various net_currents, optimizers, and schedulers. In particular, scheduler_clients,
through the chain of references, holds references to the various net_current.parameters().

Note that setting net_current = None and optimizer_current = None doesn’t do anything
to the python objects those names referenced (other than making the names net_current
and optimizer_current no longer refer to those objects).

When no python names refer any longer to some object, that object becomes available
for garbage collection.

So you need to set all of net_clients, optimizer_clients, and scheduler_clients
to None to release all of those chains of references so that the underlying
net_current.parameters() become available for deletion by the python garbage
collector (which then makes the associated gpu memory available for reuse).
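The reference-chain point can be illustrated in plain python, no gpu needed (the Params / Optimizer / Scheduler classes below are stand-ins, not real torch objects): an object only becomes collectable once every chain of references to it is broken.

```python
import gc
import weakref

class Params:        # stand-in for a tensor of model parameters
    pass

class Optimizer:     # stand-in for torch.optim.Adam
    def __init__(self, params):
        self.params = params

class Scheduler:     # stand-in for a lr scheduler
    def __init__(self, optimizer):
        self.optimizer = optimizer

params = Params()
probe = weakref.ref(params)   # lets us check whether params has been collected
opt = Optimizer(params)
sched = Scheduler(opt)
sched_list = [sched]          # plays the role of scheduler_clients

# dropping the direct names is not enough:
# sched_list -> sched -> opt -> params keeps params alive
params = None
opt = None
sched = None
gc.collect()
print(probe() is None)        # False: params is still reachable

sched_list = None             # break the last chain of references
gc.collect()
print(probe() is None)        # True: params has been collected
```

This is exactly why setting only net_current and optimizer_current to None frees nothing: the lists still anchor the whole chain.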

A couple of questions if you still have problems with this: What version of pytorch are you
using? What tool are you using to tell you that gpu memory is in use? Even if you think
that (a lot of) gpu memory is in use, can you still instantiate a (large) gpu tensor, or does
pytorch give you an out-of-memory error?

Best.

K. Frank


@KFrank

I am using PyTorch 2.7.1+cu128 with Python 3.13.

I am using nvidia-smi to see if the GPU memory is occupied.

I don’t want to set the contents of the list net_clients to None, as I will need them again in the next epoch, but I can set optimizer_clients and scheduler_clients to None.

I was wondering if transferring the contents of net_clients to the cpu and setting optimizer_clients and scheduler_clients to None would make the underlying net_current.parameters() available for garbage collection? Like if I do

net_current = None
optimizer_current = None
scheduler_current = None
net_clients[client_idx].cpu()

EDIT:

From the start, nvidia-smi shows that almost 15.5GB out of 16GB is occupied. When a new client model starts training, the occupied VRAM increases only slightly. I tried a smaller batch size and found similar behaviour: the initial occupied VRAM is 9.3GB, with only minor increments for the subsequent models, which suggests the cleanup is working.

So, it may not be necessary to set net_clients = None; moving the models back with net_clients[client_idx].cpu() works instead. Thanks for pointing me in the right direction, @KFrank.
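To make that per-round recipe concrete, here is a minimal self-contained sketch of the cleanup (a tiny nn.Linear stands in for ResNet18, and the loop body elides the actual training; on a CPU-only machine the gpu steps are simply skipped):

```python
import copy
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# tiny stand-in for ResNet18 so the sketch is runnable as-is
net = nn.Linear(8, 2)
net_clients = [copy.deepcopy(net).to('cpu') for _ in range(3)]
optimizer_clients = [torch.optim.Adam(m.parameters()) for m in net_clients]

for client_idx in range(len(net_clients)):
    net_current = net_clients[client_idx].to(device)   # moves params in place
    optimizer_current = optimizer_clients[client_idx]
    # ... local training ...

    # cleanup: keep the weights (offloaded to cpu) but drop the optimizer,
    # whose state dict holds references to the (gpu) parameters
    net_clients[client_idx] = net_current.to('cpu')
    optimizer_clients[client_idx] = None
    net_current = None
    optimizer_current = None

if torch.cuda.is_available():
    torch.cuda.empty_cache()   # return now-unused cached blocks to the driver

print(all(p.device.type == 'cpu' for m in net_clients for p in m.parameters()))
```

Note that dropping the optimizers also discards Adam's running moment estimates, so each round starts with fresh optimizer state; whether that is acceptable depends on the federated-learning setup.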

Hi Siladittya!

That should be fine.

This is a perfectly reasonable use case, but you will, of course, need enough gpu memory
to store the net_clients from epoch to epoch.

As long as you don’t have other references to the optimizers and schedulers, this will
let the garbage collector reclaim the associated memory.

This should work (although you will have to have enough cpu memory to hold the
net_clients and there will be some run-time cost in moving the tensors out of gpu
memory and back).

Note, although the net_clients are likely to account for the bulk of your gpu memory,
it’s possible that freeing all the other memory that you don’t need to preserve from
epoch to epoch could give you enough space that you can leave the net_clients in
gpu memory.

Just to be clear, this won’t do it. As written, it only lets the system reclaim the memory
for a single optimizer and scheduler. You need to set optimizer_clients and
scheduler_clients to None.

From the start of what?

When I launch python and import torch, nvidia-smi shows no gpu memory usage
(attributed to the python process). When I then instantiate a one-element gpu tensor,
nvidia-smi shows 148 MiB being used by python (because pytorch loads some cuda
libraries or something when you do anything with gpu tensors).
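One way to see the gap between what nvidia-smi reports and what your tensors actually use is torch's own counters: memory_allocated() counts bytes in live tensors, while memory_reserved() counts what the caching allocator is holding (nvidia-smi sees the latter plus the cuda context). A sketch; the printed numbers are only meaningful on a CUDA machine:

```python
import torch

if torch.cuda.is_available():
    x = torch.empty(1024, 1024, device='cuda')   # ~4 MiB of float32
    print(torch.cuda.memory_allocated())         # bytes in live tensors
    print(torch.cuda.memory_reserved())          # bytes cached by the allocator

    x = None                                     # tensor freed...
    print(torch.cuda.memory_allocated())         # ...so this drops back down
    print(torch.cuda.memory_reserved())          # often unchanged: still cached

    torch.cuda.empty_cache()                     # hand cached blocks back to the driver
    print(torch.cuda.memory_reserved())          # now nvidia-smi would also drop
```

So a high nvidia-smi reading does not by itself mean your tensors are leaking; check memory_allocated() first.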

So, yes, you definitely want to figure out what is consuming that memory.

This makes it sound like you are loading a large dataset (or at least large batches of data)
into your gpu memory. If you need the memory for other things, you can set to None the
references to the gpu tensors, at the cost of copying the data back to the gpu when you
need it again later. If you are loading the data from disk, you could presumably cache it
in cpu memory and copy it to the gpu as needed. You’d still have the cost of a cpu-gpu copy,
but you would avoid the redundant slow disk reads.

Best.

K. Frank