Memory_cached and memory_allocated do not match the nvidia-smi result

Hi, I tried to understand memory_cached and memory_allocated; the docs say the unit is bytes.
I tried to match the results of torch.cuda.memory_cached() and torch.cuda.memory_allocated() against the output of nvidia-smi, but they don't match, either in IPython or in my code for evaluating a model.

In IPython3:

In [1]: import torch

In [2]: a = torch.zeros(8192, 4, device='cuda')

In [3]: c = torch.ones(123456, 2, 3, device='cuda')

In [4]: torch.cuda.max_memory_cached(0)
Out[4]: 4063232

In [5]: torch.cuda.max_memory_allocated(0)
Out[5]: 4011008

In [6]: torch.cuda.memory_allocated(0)
Out[6]: 4011008

In [7]: torch.cuda.memory_cached(0)
Out[7]: 4063232

In [8]: (8192 * 4 + 123456 * 2 * 3) * 4  # expected bytes for two float32 tensors
Out[8]: 3094016

Meanwhile, the number shown in nvidia-smi is always 1053 MB.

What causes the difference?
What's the difference between max_memory_cached/allocated and memory_cached/allocated?
If I want to measure the GPU memory a model uses during evaluation, where should I put the measuring code? Right after getting the output?

Your tensors are too small. The caching allocator allocates memory with a minimum size and a block size, so it may round each allocation up and use a bit more memory than your tensors' elements strictly require.
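A minimal sketch of the rounding (the 512-byte granularity below is an implementation detail of the current caching allocator and may differ across PyTorch versions; the printed values assume a fresh process with nothing else allocated):

import torch

x = torch.zeros(1, device='cuda')          # 4 bytes of actual data
print(torch.cuda.memory_allocated(0))      # 512: rounded up to the block granularity

del x
print(torch.cuda.memory_allocated(0))      # 0 again: the current counter goes back down
print(torch.cuda.max_memory_allocated(0))  # still 512: the max_* counters track the peak

This also shows the difference you asked about: memory_allocated/memory_cached report current usage, while the max_* variants report the high-water mark since the start of the program (or since the last reset).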

Things like the CUDA context, the CUDA RNG state, the cuDNN context, cuFFT plans, and any other GPU memory your libraries may use are not counted in the torch.cuda.* stats.
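If you want the peak usage of a model during evaluation, a common pattern is to reset the peak counter before the forward pass and read it afterwards. A sketch, where measure_peak_eval_memory, model, and batch are hypothetical names for your own objects (reset_max_memory_allocated is available in recent PyTorch releases):

import torch

def measure_peak_eval_memory(model, batch, device=0):
    # Reset the high-water mark so only this evaluation is measured.
    model.eval()
    torch.cuda.reset_max_memory_allocated(device)
    with torch.no_grad():
        out = model(batch)
    torch.cuda.synchronize(device)  # wait for all kernels before reading the counter
    peak = torch.cuda.max_memory_allocated(device)
    print('peak allocated: %.1f MB' % (peak / 1024**2))
    return out

So the measurement goes right after the forward pass; anything allocated and freed inside the pass is still captured by the peak counter.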


The nvidia-smi number includes space allocated by the CUDA driver when it loads PyTorch. This can be quite large because PyTorch includes many CUDA kernels. On a P100, I've seen an overhead of ~487 MB; on an M40, ~303 MB.

This memory isn’t reported by the torch.cuda.xxx_memory functions because it’s not allocated by the program and there isn’t a good way to measure it outside of looking at nvidia-smi.


Hi, I tried the following commands:

torch.cuda.max_memory_allocated(device=0)

torch.cuda.max_memory_cached(0)

In both cases, 0 is returned. Can anyone explain what that means?

Previously, when I ran my code, I got the following error:

RuntimeError: CUDA error: out of memory

If I run
torch.cuda.get_device_properties(device).total_memory

I get the output 34078654464.

Are you printing these stats right before the OOM error is raised, and are you sure you've already allocated some tensors on the specified device?
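For example, in a fresh process the counters stay at 0 until something is actually allocated through PyTorch on that device:

import torch

print(torch.cuda.max_memory_allocated(0))  # 0: nothing allocated yet
x = torch.ones(1000, device='cuda:0')
print(torch.cuda.max_memory_allocated(0))  # non-zero once a tensor lives on device 0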