Does an optimizer deallocate the gradients after it updates the weights of a model?

As the title says. During the backward pass, the .grad attributes of tensors with requires_grad=True are populated so that the optimizer can use these gradients later on. I'm interested in whether the optimizer deallocates the GPU memory of these gradients after it has updated the weights.

Yes: optimizer.zero_grad() uses set_to_none=True by default in newer PyTorch versions and thus deletes the .grad tensors so that their memory can be reused.
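
As a minimal sketch (assuming a recent PyTorch release where set_to_none=True is the default), you can verify this by checking a parameter's .grad attribute before and after the call:

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(4, 10)).sum().backward()
print(model.weight.grad is None)  # False: backward() allocated the gradient tensor

optimizer.step()
optimizer.zero_grad()             # set_to_none=True by default in recent releases
print(model.weight.grad is None)  # True: the .grad tensor was deleted and its memory can be reused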


In my current code I wrote:

optimizer.zero_grad()
loss.backward()
optimizer.step()

would it be reasonable to change it to:

loss.backward()
optimizer.step()
optimizer.zero_grad()

if I want to free the GPU memory used by the gradients after each training iteration? It seems that the first version would cause memory_allocated, when called during the forward pass, to also include those gradients. Am I correct?

Yes, if you delete the gradients late, i.e. after the next forward pass has already run, your memory footprint will be higher.
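
A minimal sketch of the reordered loop (assuming a CUDA device is available; the exact numbers are illustrative only):

import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(64, 4096, device=device)

for step in range(3):
    loss = model(data).sum()
    # Gradients were already freed at the end of the previous iteration,
    # so this reading no longer includes the .grad tensors.
    print(f"after forward: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()  # free .grad before the next forward pass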
