Does an optimizer deallocate the gradients after it updates the weights of a model?

As the title says. During the backward pass, the .grad attributes of tensors with requires_grad=True are populated so that the optimizer can use these gradients later on. I'm interested in whether the optimizer deallocates the GPU memory of these gradients after it has updated the weights.

Yes: optimizer.zero_grad() uses set_to_none=True by default in newer PyTorch versions and thus deletes the .grad tensors so that their memory can be reused.
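
As a minimal sketch (assuming a recent PyTorch release where set_to_none=True is the default), you can verify this by checking a parameter's .grad attribute before and after the call:

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(4, 10)).sum().backward()
print(model.weight.grad is None)  # False: backward() allocated the gradient tensor

optimizer.step()
optimizer.zero_grad()             # set_to_none=True by default in recent releases
print(model.weight.grad is None)  # True: the .grad tensor was deleted and its memory can be reused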


In my current code I wrote:

optimizer.zero_grad()
loss.backward()
optimizer.step()

would it be reasonable to change it to:

loss.backward()
optimizer.step()
optimizer.zero_grad()

if I want to free the GPU memory used by the gradients after each training iteration? It seems that the first version would cause memory_allocated, when called during the forward pass, to also include those gradients. Am I correct?

Yes, if you delete the gradients late, i.e. after the next forward pass has already run, your memory footprint will be higher.
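
A minimal sketch of the reordered loop (assuming a CUDA device is available; the exact numbers are illustrative only):

import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(64, 4096, device=device)

for step in range(3):
    loss = model(data).sum()
    # Gradients were already freed at the end of the previous iteration,
    # so this reading no longer includes the .grad tensors.
    print(f"after forward: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()  # free .grad before the next forward pass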
