How to free GPU memory when changing architectures while training

Hello Guys,

I need your help.

I only have one GPU installed in my machine. I am currently working on architecture selection, so I test several different architectures (e.g. 10 of them) in a single run, one after the other.

for i in architectures:
    model = network(architecture=i)
    ...  # optimizer etc. are created here for the new model

    for mini_epoch in range(total_epochs):
        # train the architecture
        optimizer.zero_grad()
        loss = model(input)  # the model returns the loss directly
        loss.backward()
        optimizer.step()

My problem is that something keeps accumulating in GPU memory, so I cannot finish training all the architectures even though I only ever keep a single model at a time. Testing the architectures one at a time in separate runs works, but it is too tedious. Is there a way to release GPU memory whenever a new architecture is created as a new model?

Thanks in advance!

Hi,

Two things:

  • We have a custom allocator, so even when memory is released, you won’t see it become available in nvidia-smi, but you will be able to use it in PyTorch.
  • The memory is released only when you don’t reference it anymore. You might want to wrap the content of your inner loop in a function so that all the intermediate results go out of scope (and are thus released) between loop iterations; see the sketch after this list.
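
For illustration, a rough sketch of that pattern, reusing the names from the question (network, architectures) and treating data and total_epochs as placeholders; the SGD optimizer and the memory-statistics calls at the end are just examples:

    import torch

    def train_one(arch, data, total_epochs):
        # Everything created in here (model, optimizer, activations, loss) goes out
        # of scope when the function returns, so the allocator can reuse that memory.
        model = network(architecture=arch).cuda()                  # `network` as in the question
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # placeholder optimizer
        for _ in range(total_epochs):
            optimizer.zero_grad()
            loss = model(data)   # as in the question, the model returns the loss
            loss.backward()
            optimizer.step()
        return loss.item()       # return a plain Python number, not a CUDA tensor

    for arch in architectures:
        final_loss = train_one(arch, data, total_epochs)
        # Memory actually in use vs. memory the caching allocator keeps reserved
        # (nvidia-smi only sees the reserved amount):
        print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
        # Optionally hand the cached blocks back to the driver:
        torch.cuda.empty_cache()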

The second bullet does the job for me. Thanks!

If I want to remove a module from my model, can I just do model.layer = None? Or del model.layer?

Both will work.
The only difference is that if you access it later, in one case you will get None and in the other case you will get an error saying that there is no attribute with this name.
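
A quick illustration, with a throwaway nn.Linear standing in for the layer:

    import torch.nn as nn

    model = nn.Sequential()
    model.layer = nn.Linear(4, 4)

    model.layer = None    # the attribute still exists, but now holds None
    print(model.layer)    # -> None

    del model.layer       # the attribute is removed entirely
    # model.layer         # would raise: AttributeError: ... has no attribute 'layer'

In both cases the layer’s parameters are only actually freed once nothing else (for example the optimizer) still holds a reference to them.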

But how do I remove a param from the optimizer?

That would be trickier; I don’t think our optimizer API supports removing parameters.
That being said, if the .grad field of that Tensor is None, then the optimizer will just ignore it. And because it is not in the network anymore, it won’t be updated. So you can leave it in the optimizer.
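
A small sketch of that behaviour, with two hypothetical layers and SGD as the example optimizer:

    import torch
    import torch.nn as nn

    kept = nn.Linear(4, 4)
    removed = nn.Linear(4, 4)   # imagine this layer was just deleted from the model
    optimizer = torch.optim.SGD(
        list(kept.parameters()) + list(removed.parameters()), lr=0.1)

    # If the removed layer was trained earlier, clear its stale gradients first:
    for p in removed.parameters():
        p.grad = None

    loss = kept(torch.randn(2, 4)).sum()
    loss.backward()
    optimizer.step()            # parameters whose .grad is None are simply skipped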

I also had the same problem in my code. Luckily, I found a way to solve it:
you have to set torch.backends.cudnn.benchmark = False
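
If you want to try that suggestion, the flag just needs to be set once, before any model runs a forward pass:

    import torch

    # With benchmark=True, cuDNN tries several algorithms for every new
    # layer/input configuration and may pick ones that need more workspace memory;
    # benchmark=False skips that tuning step.
    torch.backends.cudnn.benchmark = False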