If I add a breakpoint at the last line and use nvidia-smi to check the GPU memory consumption, the actual GPU memory consumed is 448 MB. However, if I calculate it manually, my understanding is that
the total consumed GPU memory = GPU memory for the parameters x 2 (one copy for the values, one for the gradients) + GPU memory for storing the forward and backward activations.
So the manual calculation would be 4 MB (for the input) + 64 MB x 2 (for the forward and backward activations) + << 1 MB (for the parameters), which is roughly 132 MB. There is still a big gap between 132 MB and 448 MB, and I don’t know what I am missing. Any idea on how to manually calculate the GPU memory required for a network?
However, if I uncomment the last line, the program consumes 436 MB, which is 159 MB more. If I calculate the size of the output (16x64x128x128 x 4 bytes x 2), I get 128 MB, so there is still a sizable gap. Does anybody know why? Where is this additional memory consumed?
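For reference, the manual accounting above can be written out as a short script. This is only a sketch: the 4 MB input figure is taken from the post as given, and float32 (4 bytes per element) is assumed for the 16x64x128x128 activation.

```python
# A sketch of the manual memory accounting above.
# Assumptions: float32 tensors (4 bytes/element); the 4 MB input
# figure is taken directly from the post.
def tensor_mb(*shape, bytes_per_elem=4):
    """Size of a dense tensor of the given shape, in MB."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / 2**20

input_mb = 4.0                          # stated in the post
act_mb = tensor_mb(16, 64, 128, 128)    # one forward activation: 64 MB
total = input_mb + 2 * act_mb           # forward + backward copies
print(f"activation: {act_mb:.0f} MB, estimated total: {total:.0f} MB")
```

This reproduces the roughly 132 MB estimate, which is exactly why the 448 MB reading from nvidia-smi looks surprising.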
Most likely (IIRC) this is workspace used by the convolution kernel. Also, because of the way PyTorch allocates memory, it will continue to leave blocks marked as in use from the perspective of nvidia-smi even if it is no longer using them internally. This is because CUDA’s malloc and free functions are quite slow, and it is much more efficient to cache allocated blocks in a free list. When the device runs out of memory, PyTorch calls CUDA’s free function on all free blocks, and the memory usage seen by nvidia-smi falls.
If you don’t use cuDNN, then likely yes (most operations won’t use any scratchpad space, and those that do allocate a deterministic amount that you can find in the code). But cuDNN contains many different algorithms and implementations for each operation (conv, pool, RNN, etc.), with different memory requirements, and which algorithm is chosen depends in a complicated way on the sizes of all the inputs and the values of the cuDNN flags. The memory usage you’ve computed is probably accurate if you don’t count the cached free blocks, so if you’re trying to fit a network on a particular device with a given amount of memory, that may be all you need to do.
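One way to see the caching behaviour described here is to inspect the allocator directly (a sketch, assuming a CUDA device is available; the tensor shape is just an example):

```python
import torch

def report(label):
    # memory_allocated: bytes currently in use by live tensors.
    # memory_reserved: bytes held by PyTorch's caching allocator, which is
    # roughly what nvidia-smi sees on top of the fixed context overhead.
    print(f"{label}: allocated {torch.cuda.memory_allocated() / 2**20:.1f} MB, "
          f"reserved {torch.cuda.memory_reserved() / 2**20:.1f} MB")

if torch.cuda.is_available():
    x = torch.randn(16, 64, 128, 128, device="cuda")  # ~64 MB of float32
    report("after alloc")
    del x
    report("after del")          # allocated drops; reserved does not
    torch.cuda.empty_cache()
    report("after empty_cache")  # cached blocks returned to the driver
```

After `del x` the allocator keeps the block in its free list, so nvidia-smi still counts it; only `empty_cache()` hands it back to CUDA.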
Thanks for this suggestion.
Any thoughts as to why this overhead might be a lot more than a couple hundred MB?
I checked this now exactly as you suggested, allocating a unit-sized tensor, and in my case the overhead seems to be 1229 MB! This is clearly too much.
Right before the allocation the usage was close to zero, and just after allocating this unit tensor it jumped to 1229 MB.
I’m using PyTorch v1.7, CUDA 10.1 and a Tesla V100 GPU.
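Not a diagnosis of this particular setup, but one way to check whether the jump is fixed CUDA-context overhead rather than tensor memory (a sketch, assuming a CUDA device is available):

```python
import torch

# Sketch: separate fixed context overhead from tensor allocations.
# memory_allocated/memory_reserved only cover PyTorch's own allocations;
# anything nvidia-smi reports for this process on top of `reserved` is the
# CUDA context (driver state plus the compiled kernels shipped in the
# binary), which is paid once per process and is not a leak.
if torch.cuda.is_available():
    x = torch.ones(1, device="cuda")  # forces context creation
    print(f"allocated: {torch.cuda.memory_allocated()} B")              # tiny
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.0f} MB")  # one small block
    # Compare with nvidia-smi for this PID: the difference is the context.
```

If `allocated` is a few hundred bytes and `reserved` a couple of MB while nvidia-smi shows ~1.2 GB, the gap is context overhead, not something your code allocated.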