Why do CUDA extensions use more GPU memory?

I used a custom CUDA extension to replace some parts of my model, and it works correctly as expected. But training with the CUDA extension uses more GPU memory.
I used torch.cuda to analyze PyTorch's GPU memory allocation before and after the replacement. I found that, in addition to the GPU memory I pre-allocate for the CUDA extension with cudaMalloc, PyTorch itself allocates more GPU memory.
Half of the extra GPU memory comes from the explicit torch.empty_like() calls in the forward and backward of the CUDA extension, which pre-allocate space for the results; the other half is added when optimizer.step() is called.
I need some help locating the cause and reducing this GPU memory usage.

  1. Do the GPU memory statistics in torch.cuda (memory_allocated or memory_summary) count the GPU memory I allocate by calling cudaMalloc manually? Do they count the GPU memory allocated by calls such as torch::empty_like() in the C++ code? My guess is that the former is not counted but the latter is (see the sketch after this list).
  2. Will the GPU memory allocated by torch::empty_like() in the custom forward and backward methods in the C++ code be recycled automatically once it is no longer used? If not, how should I free this memory myself?
  3. Why does calling optimizer.step() increase GPU memory usage? I thought this method only updates the weights and should not allocate new space, but it clearly does increase the memory usage.
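
For reference, here is a minimal sketch (not my actual extension code, just an illustration of the two allocation paths I am asking about; the comments reflect my guess, not something I have verified):

```cpp
#include <torch/extension.h>
#include <cuda_runtime.h>

// Sketch only: contrasts a raw cudaMalloc buffer with a tensor created via
// torch::empty_like, assuming `input` is an existing CUDA float tensor.
void allocation_paths(const torch::Tensor& input) {
  // Path 1: raw cudaMalloc bypasses PyTorch's caching allocator, so my guess
  // is that it is not reflected in torch.cuda.memory_allocated() or
  // memory_summary(); it only shows up in driver-level numbers
  // (nvidia-smi, cudaMemGetInfo).
  void* raw = nullptr;
  cudaMalloc(&raw, 1024 * sizeof(float));

  // Path 2: torch::empty_like goes through the caching allocator, so my guess
  // is that it is counted by memory_allocated() and listed in memory_summary().
  auto buf = torch::empty_like(input);

  cudaFree(raw);  // raw allocations have to be freed manually
  // `buf` is (presumably) returned to PyTorch's cache automatically once the
  // last reference to it goes away.
}
```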
  1. This seems to be as you describe.
  2. Memory is returned to the torch cache once the tensors using it go out of scope.
  3. It depends on your optimizer, but things like the statistics tracked for adaptivity (e.g. in Adam) or momentum use additional memory. These are allocated once, so usage should not keep increasing over time (see the sketch below).
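
As a rough illustration, here is a minimal libtorch sketch under the assumption that you use Adam (the Python optimizer does the same lazy allocation on the first step):

```cpp
#include <torch/torch.h>

// Minimal sketch: Adam keeps no per-parameter state until the first step(),
// which is when the extra memory shows up.
int main() {
  auto param = torch::randn({1024, 1024},
                            torch::device(torch::kCUDA).requires_grad(true));
  torch::optim::Adam opt(std::vector<torch::Tensor>{param},
                         torch::optim::AdamOptions(1e-3));

  param.sum().backward();

  // The first step() lazily allocates exp_avg and exp_avg_sq for every
  // parameter (roughly 2x the parameter memory). Subsequent steps reuse
  // these buffers, so usage should stay flat afterwards.
  opt.step();
  return 0;
}
```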

In general, I would advise not to use cudaMalloc directly but instead to create a new torch tensor. As long as you keep a reference to it, the memory is reserved for you and can be accessed via data_ptr() (for a void*) or data_ptr<T>() (for a T*).
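
For example, a minimal sketch of that pattern (my_kernel_launch is a hypothetical placeholder for your real kernel launcher, and a float input is assumed):

```cpp
#include <torch/extension.h>

torch::Tensor my_forward(const torch::Tensor& input) {
  // Result and scratch buffers come from PyTorch's caching allocator
  // instead of a manual cudaMalloc.
  auto output = torch::empty_like(input);
  auto workspace = torch::empty({1 << 20}, input.options());

  void*  ws_ptr  = workspace.data_ptr();      // untyped pointer
  float* out_ptr = output.data_ptr<float>();  // typed pointer (assumes float)

  // my_kernel_launch(input.data_ptr<float>(), out_ptr, ws_ptr);  // hypothetical

  return output;  // `workspace` goes back to the cache when it goes out of scope
}
```

Allocating the buffers this way also means they show up in memory_summary(), which makes the before/after accounting easier to compare.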

Best regards

Thomas

Regarding the third point, maybe my description above is not clear enough.

I compared the GPU memory usage at each step between the code without the CUDA extension and the code with it. Both use exactly the same optimizer; my CUDA extension only replaces the layer modules in the model.

The former adds only about 5 MB of GPU memory during the first call to optimizer.step(), while the latter adds about 1 GB. That is a huge difference.

I don’t know why the CUDA extension causes such a difference. Can you provide some possible clues?

I would not know why there would be such a difference. Might it be something your extension does?

Thank you for your answer above. I found the reason: the optimizer uses a different gradient calculation strategy for one-dimensional parameters, which requires more GPU memory.