As the title says. During the backward pass, tensors with requires_grad=True will have their .grad attributes populated so that the optimizer can use these gradients later on. I'm interested in whether the optimizer deallocates the GPU memory of these gradients at some point.
Yes, optimizer.zero_grad()
uses set_to_none=True
by default in newer PyTorch versions and will thus delete the .grad
tensors, allowing this memory to be reused.
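A minimal sketch of this behavior (assuming a recent PyTorch version where set_to_none=True is supported; the model and optimizer here are just placeholders):

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Forward + backward populates the .grad attributes.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
assert all(p.grad is not None for p in model.parameters())

# With set_to_none=True the .grad tensors are deleted (set to None),
# not merely filled with zeros, so their memory can be reused.
opt.zero_grad(set_to_none=True)
assert all(p.grad is None for p in model.parameters())
```

With set_to_none=False the gradients would instead be zeroed in place and would keep occupying memory.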
In my current code I wrote:
optimizer.zero_grad()
loss.backward()
optimizer.step()
would it be reasonable to change it to:
loss.backward()
optimizer.step()
optimizer.zero_grad()
if I want to clear the GPU memory used by the gradients after each training iteration? It seems that the first version might cause memory_allocated,
when called during the forward pass, to also count those .grad
tensors. Am I correct?
Yes, if you delete the gradients late, i.e. after the next forward pass has already finished, your memory footprint will be higher, since the old gradients are still alive while the new forward activations are being allocated.
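The reordered loop from the question can be sketched as follows (a toy CPU example with a placeholder model; on a GPU you could wrap the forward pass with torch.cuda.memory_allocated() to confirm the lower footprint):

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    opt.step()
    # Freeing the gradients at the end of the iteration means the next
    # forward pass does not coexist with the old .grad tensors.
    opt.zero_grad(set_to_none=True)

# After each iteration the gradients have been released.
assert all(p.grad is None for p in model.parameters())
```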