I want to know when PyTorch allocates GPU memory for all the gradients of a given
nn.Module. It seems that these gradients are set to
None before training and are only populated during the backward pass.
None or not, it seems that PyTorch has already allocated GPU memory for all of them. Is this claim correct?
How can I change this behaviour so that I can decide when the GPU memory is allocated for the gradients that are about to be computed?
No, that’s wrong, as PyTorch will not allocate memory for non-existent objects.
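You can verify this yourself. A minimal sketch (assuming a small toy module, any nn.Module works the same way): the .grad attributes are None right after construction and only become real tensors once backward() runs.

```python
import torch

model = torch.nn.Linear(4, 2)
# Right after construction, no gradient tensors exist yet,
# so no memory has been allocated for them:
assert all(p.grad is None for p in model.parameters())

out = model(torch.randn(3, 4)).sum()
out.backward()  # gradient tensors are allocated here
assert all(p.grad is not None for p in model.parameters())
```

The same check works on the GPU after moving the module with `.cuda()`.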
I observed roughly three times the expected memory usage during the forward pass (weights + 2× gradient-sized buffers instead of weights only), and my conclusion at the time was that it was caused by (1) the momentum mechanism of the optimizer, which requires the gradients from the previous training iteration (previous batch) to be stored, and (2) the fact that PyTorch might have allocated all
.grads of an
nn.Module as soon as the instance is created. Based on your answer here, is my conclusion completely wrong? I also believe that this has something to do with the ordering of the code I mentioned in another thread yesterday.
Could you help me confirm this, as I might need to report this back to my teammate?
Thanks for reading.
None objects do not take up GPU memory.
Of course, if you do not delete the
.grad attributes, they will still be in memory and the peak memory usage will increase as described in the other thread.
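As a concrete sketch of "deleting" the .grad attributes: passing set_to_none=True to zero_grad drops the gradient tensors entirely instead of zero-filling them, which releases their memory (this is a generic illustration, not your training loop).

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
model(torch.randn(3, 4)).sum().backward()
opt.step()

# set_to_none=True frees the gradient tensors instead of
# writing zeros into them, releasing their (device) memory:
opt.zero_grad(set_to_none=True)
assert all(p.grad is None for p in model.parameters())
```

With the default set_to_none=False in older PyTorch versions, the gradient tensors stay allocated between iterations.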
However, here you claim that PyTorch allocates memory for gradients even if these are non-existent, which is wrong.
You would also have to check the size of the intermediate tensors created during the forward pass, which would also use device memory until they are freed during the backward pass.
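If you want to see which intermediates autograd keeps alive, a sketch using saved_tensors_hooks (the hook just records shapes; the counting list is an illustration, not part of any API):

```python
import torch

saved_shapes = []
# saved_tensors_hooks lets us observe every tensor that autograd
# stashes during the forward pass for use in the backward pass:
with torch.autograd.graph.saved_tensors_hooks(
        lambda t: saved_shapes.append(t.shape) or t,  # pack hook
        lambda t: t):                                 # unpack hook
    x = torch.randn(8, 8, requires_grad=True)
    y = torch.relu(x @ x)
    y.sum().backward()

print(len(saved_shapes))  # > 0: intermediates were kept for backward
```

Each recorded shape corresponds to memory that stays allocated until the backward pass frees it.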
Firstly, I apologize for the wrong claim. I’m still learning this, and I was not sure whether these gradients exist after the model instance is created.
I didn’t mention that I implemented the gradient-checkpointing technique (or activation checkpointing, since the word checkpoint has other meanings) using
torch.autograd.Function. Because of this, I think the intermediate tensors are not involved in the problem.
They don’t, and you can check this by accessing the
.grad attribute of any parameter, which will return None.
Intermediate activation tensors are stored by default during the forward pass, as they are needed to calculate the gradients in the backward pass. Activation checkpointing would recompute these activations to save memory, and since you aren’t using it, Autograd will still store the intermediates, thus increasing the memory usage.
Could you elaborate more on this part? Did you mean that the intermediate data will still be stored even if I put the computation inside a
torch.no_grad() block when defining the static methods of my
torch.autograd.Function? I did implement activation checkpointing in the project.
no_grad will prevent storing intermediates, but if you are using custom
autograd.Functions you would need to explicitly store the activations via
ctx.save_for_backward. Otherwise you won’t be able to compute gradients.
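A minimal sketch of the ctx.save_for_backward pattern (a toy squaring function chosen for illustration, not your checkpointing code):

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Explicitly keep what the backward pass needs;
        # nothing else is saved for a custom Function.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out  # d(x^2)/dx = 2x

x = torch.tensor([3.0], requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)  # tensor([6.])
```

Only the tensors passed to ctx.save_for_backward stay alive until backward; in a checkpointing implementation you would save the inputs and recompute the activations instead.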
@ptrblck Lol, I think my bad grammar caused the confusion. I hope you haven’t read my reply above (already deleted). I was trying to say:
- I did implement gradient-checkpointing, so the issue should not be related to intermediate data in my case.
- I forgot to mention point 1.
I just realized that my original sentence sounded like:
- I didn’t implement something.
- The intermediate data was not involved because of 1.
Really sorry if this caused any confusion!
Back to the title: if I’m very sure that I have implemented gradient-checkpointing but I still see abnormal GPU memory caching (from
nvidia-smi) that is much larger than what I calculated for my model, what could be the cause? This is why I made some (probably wrong) assumptions that PyTorch wasn’t allocating GPU memory efficiently.
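For reference, nvidia-smi reports everything PyTorch’s caching allocator has reserved from the driver, not just the memory live tensors actually use, so it usually overstates the model’s footprint. A sketch for comparing the two (guarded so it only runs on a CUDA machine):

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    print(torch.cuda.memory_allocated())  # bytes held by live tensors
    print(torch.cuda.memory_reserved())   # bytes cached by the allocator
                                          # (this is what nvidia-smi sees)
    del x
    torch.cuda.empty_cache()  # return cached blocks to the driver
```

If memory_allocated matches your calculation but memory_reserved is much larger, the gap is allocator caching rather than tensors you forgot about.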