Does `zero_grad()` release GPU memory?

I have two questions about `zero_grad()` releasing GPU memory:

  1. Does `net.zero_grad()` release the GPU memory occupied by the gradients computed in the previous epoch?

  2. Suppose I have two networks, `netD` and `netG`. In the code snippet below, can I add an extra `zero_grad()` call after `optimizer.step()` to release some GPU memory before going to the next epoch? I could zero the gradients of both `netD` and `netG` at the very beginning of the loop, but what if I want to free some GPU memory for `netG` right after training `netD`?

for epoch in range(num_epochs):

    # train discriminator
    netD.zero_grad()
    ...
    loss_d.backward()
    optimizer_d.step()
    # netD.zero_grad()  <------- Can I add another zero_grad() here
    #                            to release some GPU memory?
    ...

    # train generator
    netG.zero_grad()
    ...
    loss_g0.backward()
    optimizer_g.step()
    # netG.zero_grad()  <------- Can I add another zero_grad() here
    #                            to release some GPU memory?
    ...

    # train generator again using other losses
    netG.zero_grad()
    ...
    loss_g1.backward()
    optimizer_g.step()
    # netG.zero_grad()  <------- Can I add another zero_grad() here
    #                            to release some GPU memory?
  1. Not with the default `.zero_grad()` call, since the `.grad` attributes will just be filled with zeros, not deleted. To delete the `.grad` attributes and save the memory, you would need to call `.zero_grad(set_to_none=True)` (see the sketch after this list).

  2. Yes. If you don’t need the gradients of the previously trained model anymore, you can delete them directly after they were used in `optimizer.step()`.
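
A minimal sketch of the difference, using a throwaway `nn.Linear` (the model and sizes are illustrative only; `set_to_none` requires PyTorch >= 1.7, and note that recent releases have made `set_to_none=True` the default):

import torch
import torch.nn as nn

net = nn.Linear(4, 2)
net(torch.randn(8, 4)).sum().backward()

net.zero_grad(set_to_none=False)  # .grad tensors are kept and filled with zeros
print(net.weight.grad is None)    # False -> gradient memory is still allocated

net(torch.randn(8, 4)).sum().backward()
net.zero_grad(set_to_none=True)   # .grad attributes are deleted
print(net.weight.grad is None)    # True -> gradient memory can be released

Applied to the loop above, that would mean e.g. calling `netD.zero_grad(set_to_none=True)` right after `optimizer_d.step()`.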

Thank you! Does calling `.zero_grad(set_to_none=True)` slow down training?

No, it does not slow down the training, as it’s only freeing the gradients. The next gradient calculation should be able to just reuse the memory via the CUDA caching allocator.
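
To observe this, a sketch using the CUDA memory stats (assumes a CUDA device; the layer and batch sizes are arbitrary):

import torch
import torch.nn as nn

net = nn.Linear(1024, 1024).cuda()
net(torch.randn(64, 1024, device="cuda")).sum().backward()
print(torch.cuda.memory_allocated())  # includes the .grad tensors

net.zero_grad(set_to_none=True)
print(torch.cuda.memory_allocated())  # lower: gradient memory was released
print(torch.cuda.memory_reserved())   # blocks stay cached by the allocator

# The next backward() reuses the cached blocks instead of calling cudaMalloc,
# which is why setting the grads to None does not slow training down.
net(torch.randn(64, 1024, device="cuda")).sum().backward()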
