I used a custom CUDA extension to replace some parts of the model, and it works correctly as expected. However, training with the CUDA extension uses more GPU memory. I used the `torch.cuda` memory statistics to compare PyTorch's GPU memory allocation before and after the change, and found that, in addition to the GPU memory I pre-allocate for the CUDA extension with `cuda_malloc`, PyTorch itself allocates extra GPU memory.
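For reference, this is roughly how I collect the statistics. It is a minimal sketch: the `nn.Linear` model, the input, and the Adam optimizer here are stand-ins for my real training setup.

```python
import torch
import torch.nn as nn

def report(tag):
    # memory_allocated(): bytes currently held by live tensors
    # memory_reserved(): total bytes the caching allocator has taken from CUDA
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.1f} MiB  reserved={reserved:.1f} MiB")

model = nn.Linear(1024, 1024).cuda()              # stand-in for my real model
optimizer = torch.optim.Adam(model.parameters())  # stand-in optimizer
x = torch.randn(256, 1024, device="cuda")

report("before forward")
out = model(x)
report("after forward")
out.sum().backward()
report("after backward")
optimizer.step()
report("after optimizer.step()")
```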
Half of the over-allocated GPU memory comes from an explicit `torch.empty_like()` call in the backward of the CUDA extension, which pre-allocates space for storing the calculation results; the other half appears when `optimizer.step()` is called.
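For context, the backward follows this pattern. Below is a simplified Python analogue: my real implementation is in C++, and `torch.mul` here is only a placeholder for the custom CUDA kernel launch.

```python
import torch

class MyOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        out = torch.empty_like(x)               # pre-allocated output buffer
        torch.mul(x, weight, out=out)           # placeholder for the CUDA kernel
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        # This is the allocation I am asking about: a fresh buffer created
        # on every backward call to hold the kernel's result.
        grad_x = torch.empty_like(grad_out)
        torch.mul(grad_out, weight, out=grad_x) # placeholder for the CUDA kernel
        grad_w = (grad_out * x).sum()
        return grad_x, grad_w

x = torch.randn(1024, device="cuda", requires_grad=True)
w = torch.tensor(2.0, device="cuda", requires_grad=True)
MyOp.apply(x, w).sum().backward()
```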
I would appreciate some help locating and fixing this GPU memory problem:
- Do the GPU memory statistics in `torch.cuda` count the GPU memory I allocate by manually calling `cuda_malloc`? Do they count the GPU memory allocated by calls such as `torch::empty_like()` in the C++ code? My guess is that the former is not counted but the latter is. (See the first sketch after this list.)
- Is the GPU memory allocated by `torch::empty_like()` in the custom forward and backward functions in the C++ code automatically reclaimed when it is no longer used? If not, how should I free this GPU memory myself? (See the second sketch after this list.)
- Why does calling `optimizer.step()` increase GPU memory usage? I thought this method only updates the weights and should not allocate new space, but it does increase GPU memory usage. (See the third sketch after this list.)
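For the first question, this is how I am trying to verify my guess: `torch.cuda.mem_get_info()` asks the driver (it wraps `cudaMemGetInfo`), while `torch.cuda.memory_allocated()` only tracks PyTorch's caching allocator, so my assumption is that a raw `cuda_malloc` allocation would show up only in the former.

```python
import torch

def driver_used_mib():
    # mem_get_info() wraps cudaMemGetInfo: device-wide free/total bytes,
    # so it should see every allocation, including raw cudaMalloc.
    free, total = torch.cuda.mem_get_info()
    return (total - free) / 2**20

def allocator_mib():
    # memory_allocated() only tracks PyTorch's caching allocator.
    return torch.cuda.memory_allocated() / 2**20

base_driver, base_alloc = driver_used_mib(), allocator_mib()
t = torch.empty(256, 1024, 1024, device="cuda")  # ~1 GiB via the caching allocator
print(f"driver delta:    {driver_used_mib() - base_driver:.0f} MiB")
print(f"allocator delta: {allocator_mib() - base_alloc:.0f} MiB")
# My assumption: after my extension's cuda_malloc, only the driver delta
# would move, while the allocator delta would stay at zero.
```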
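For the second question, this is the check I am running on the Python side, on the assumption that `torch.empty_like()` in Python and `torch::empty_like()` in C++ go through the same caching allocator:

```python
import torch

def stats(tag):
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.1f} MiB  reserved={reserved:.1f} MiB")

src = torch.zeros(1024, 1024, device="cuda")
stats("before empty_like")
tmp = torch.empty_like(src)  # same caching-allocator path as torch::empty_like
stats("buffer alive")
del tmp                      # drop the last reference
stats("after del")           # allocated should drop; reserved usually does not
torch.cuda.empty_cache()     # hand cached-but-free blocks back to the driver
stats("after empty_cache")
```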
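For the third question, this is how I am measuring the increase, using a toy `nn.Linear` model and Adam as stand-ins for my setup. I wonder whether the extra memory is optimizer state (e.g. Adam's `exp_avg`/`exp_avg_sq` buffers) that is only created lazily on the first step:

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).cuda()  # toy stand-in for my model
opt = torch.optim.Adam(model.parameters())

for step in range(3):
    out = model(torch.randn(64, 4096, device="cuda"))
    out.sum().backward()
    before = torch.cuda.memory_allocated()
    opt.step()
    delta = (torch.cuda.memory_allocated() - before) / 2**20
    # If the extra memory is lazily created optimizer state (exp_avg /
    # exp_avg_sq for Adam), the delta should be nonzero only at step 0.
    print(f"step {step}: optimizer.step() delta = {delta:.1f} MiB")
    opt.zero_grad(set_to_none=True)
```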