Hi, I'm trying to understand memory_cached and memory_allocated; the docs say the unit is bytes.
I tried to match the results of torch.cuda.memory_cached() and torch.cuda.memory_allocated() against nvidia-smi, but they don't agree, either in IPython or in my code for evaluating a model.
In IPython3:
In [1]: a = torch.zeros(8192, 4, device='cuda')
In [2]: c = torch.ones(123456, 2, 3, device='cuda')
In [3]: torch.cuda.max_memory_cached(0)
Out[3]: 4063232
In [4]: torch.cuda.max_memory_allocated(0)
Out[4]: 4011008
In [5]: torch.cuda.memory_allocated(0)
Out[5]: 4011008
In [6]: torch.cuda.memory_cached(0)
Out[6]: 4063232
In [7]: (8192 * 4 + 123456 * 2 * 3) * 4
Out[7]: 3094016
Meanwhile, the number shown in nvidia-smi is always 1053 MB.
What causes the difference?
What’s the difference between max_memory_cached/allocated and memory_cached/allocated?
If I want to measure the GPU memory a model uses during evaluation, where should I put the measurement call? Right after getting the output?
Your tensors are quite small. The caching allocator has a minimum allocation size and rounds requests up to a block size, so it may reserve a bit more memory than your tensors' elements actually require; see the sketch below.
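For instance (a rough sketch; the minimum block size is an implementation detail of the caching allocator, so the exact numbers will differ across PyTorch versions and devices):

```python
import torch

torch.cuda.init()  # make sure the CUDA context exists before measuring

before = torch.cuda.memory_allocated(0)
t = torch.zeros(1, device='cuda')        # 4 bytes of float32 element data
after = torch.cuda.memory_allocated(0)

# The delta is typically a whole allocator block (e.g. 512 bytes), not 4 bytes.
print('allocated delta:', after - before)

# Deleting the tensor returns its block to PyTorch's cache, not to the driver,
# so memory_allocated() drops while memory_cached() typically stays put.
del t
print('allocated now  :', torch.cuda.memory_allocated(0))
print('cached now     :', torch.cuda.memory_cached(0))
```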
Things like the CUDA context, the CUDA RNG state, the cuDNN context, cuFFT plans, and any GPU memory used by other libraries are not counted in the torch.cuda.* stats.
The nvidia-smi number also includes space allocated by the CUDA driver when it loads PyTorch, which can be quite large because PyTorch ships many CUDA kernels. On a P100 I've seen an overhead of about 487 MB; on an M40, about 303 MB.
This memory isn't reported by the torch.cuda.xxx_memory functions because it isn't allocated by your program, and there isn't a good way to measure it other than looking at nvidia-smi.
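As for measuring a model's GPU memory use during evaluation and the max_* vs. current stats: memory_allocated()/memory_cached() report usage at the instant you call them, while max_memory_allocated()/max_memory_cached() report the high-water mark, which is usually what you want. A minimal sketch (with a stand-in model and input, and assuming a PyTorch version that has the reset_max_* helpers) could look like this:

```python
import torch

# Stand-in model and input for illustration; substitute your own.
model = torch.nn.Linear(1024, 1024).cuda().eval()
inputs = torch.randn(64, 1024, device='cuda')

# Clear the peak counters so the max_* stats reflect only the forward pass below.
torch.cuda.reset_max_memory_allocated(0)
torch.cuda.reset_max_memory_cached(0)

with torch.no_grad():
    output = model(inputs)

# memory_allocated() is the usage right now; max_memory_allocated() is the
# high-water mark since the reset above, so it is safe to query it once,
# after getting the output.
print('allocated now :', torch.cuda.memory_allocated(0))
print('peak allocated:', torch.cuda.max_memory_allocated(0))
print('peak cached   :', torch.cuda.max_memory_cached(0))
```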