If I add a breakpoint at the last line and use nvidia-smi to check the GPU memory consumption, the actual GPU memory consumed is 448 MB. However, if I calculate it manually, my understanding is that
the total consumed GPU memory = GPU memory for the parameters x 2 (one copy for the values, one for the gradients) + GPU memory for storing the forward and backward activations.
So the manual calculation would be 4 MB (for the input) + 64 MB x 2 (for the forward and backward activations) + << 1 MB (for the parameters), which is roughly 132 MB. There is still a big gap between 132 MB and 448 MB, and I don’t know what I am missing. Any idea on how to manually calculate the GPU memory required for a network?
However, if I uncomment the last line, the program consumes 436 MB, which is 159 MB more. If I calculate the size of the output (16x64x128x128 x 4 bytes x 2), I get 128 MB, so there is still a sizable gap. Does anybody know why? Where is this additional memory consumed?
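For reference, the manual accounting above can be written out as a short script. This is only a sketch: the 4 MB input figure is taken from the post as given, and float32 (4 bytes per element) is assumed for the 16x64x128x128 activation.

```python
# A sketch of the manual memory accounting above.
# Assumptions: float32 tensors (4 bytes/element); the 4 MB input
# figure is taken directly from the post.
def tensor_mb(*shape, bytes_per_elem=4):
    """Size of a dense tensor of the given shape, in MB."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / 2**20

input_mb = 4.0                          # stated in the post
act_mb = tensor_mb(16, 64, 128, 128)    # one forward activation: 64 MB
total = input_mb + 2 * act_mb           # forward + backward copies
print(f"activation: {act_mb:.0f} MB, estimated total: {total:.0f} MB")
```

This reproduces the roughly 132 MB estimate, which is exactly why the 448 MB reading from nvidia-smi looks surprising.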
Most likely (IIRC) this is workspace used by the convolution kernel. Also, because of the way PyTorch allocates memory, it will continue to leave blocks marked as in use from the perspective of nvidia-smi even if it is no longer using them internally. This is because CUDA’s malloc and free functions are quite slow, and it is much more efficient to cache allocated blocks in a free list. When the device runs out of memory, PyTorch calls CUDA’s free function on all free blocks, and the memory usage seen by nvidia-smi falls.
If you don’t use cuDNN, then likely yes (most operations won’t use any scratchpad space, and those that do allocate a deterministic amount that you can find in the code). But cuDNN contains many different algorithms and implementations for each operation (conv, pool, RNN, etc.), with different memory requirements, and which algorithm is chosen depends in a complicated way on the sizes of all the inputs and the values of the cuDNN flags. The memory usage you’ve computed is probably accurate if you don’t count the cached free blocks, so if you’re trying to fit a network on a particular device with a given amount of memory, that may be all you need to do.
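One way to see the caching behaviour described here is to inspect the allocator directly (a sketch, assuming a CUDA device is available; the tensor shape is just an example):

```python
import torch

def report(label):
    # memory_allocated: bytes currently in use by live tensors.
    # memory_reserved: bytes held by PyTorch's caching allocator, which is
    # roughly what nvidia-smi sees on top of the fixed context overhead.
    print(f"{label}: allocated {torch.cuda.memory_allocated() / 2**20:.1f} MB, "
          f"reserved {torch.cuda.memory_reserved() / 2**20:.1f} MB")

if torch.cuda.is_available():
    x = torch.randn(16, 64, 128, 128, device="cuda")  # ~64 MB of float32
    report("after alloc")
    del x
    report("after del")          # allocated drops; reserved does not
    torch.cuda.empty_cache()
    report("after empty_cache")  # cached blocks returned to the driver
```

After `del x` the allocator keeps the block in its free list, so nvidia-smi still counts it; only `empty_cache()` hands it back to CUDA.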
Thanks for this suggestion.
Any thoughts as to why this overhead might be a lot more than a couple hundred MB?
I checked this now exactly as you suggested, allocating a unit-sized tensor, and in my case the overhead seems to be 1229 MB! This is clearly too much.
Right before the allocation the usage was close to zero, and just after allocating this unit tensor it jumped to 1229 MB.
I’m using PyTorch v1.7, CUDA 10.1 and a Tesla V100 GPU.
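Not a diagnosis of this particular setup, but one way to check whether the jump is fixed CUDA-context overhead rather than tensor memory (a sketch, assuming a CUDA device is available):

```python
import torch

# Sketch: separate fixed context overhead from tensor allocations.
# memory_allocated/memory_reserved only cover PyTorch's own allocations;
# anything nvidia-smi reports for this process on top of `reserved` is the
# CUDA context (driver state plus the compiled kernels shipped in the
# binary), which is paid once per process and is not a leak.
if torch.cuda.is_available():
    x = torch.ones(1, device="cuda")  # forces context creation
    print(f"allocated: {torch.cuda.memory_allocated()} B")              # tiny
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.0f} MB")  # one small block
    # Compare with nvidia-smi for this PID: the difference is the context.
```

If `allocated` is a few hundred bytes and `reserved` a couple of MB while nvidia-smi shows ~1.2 GB, the gap is context overhead, not something your code allocated.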