Understanding GPU memory usage



I’m trying to find out why my code uses so much GPU memory.

For that, I would like to list all allocated tensors/storages created explicitly or within autograd. The closest thing I found is Soumith’s snippet to iterate over all tensors known to the garbage collector.
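For reference, here is a minimal sketch of that kind of garbage-collector scan. It is not Soumith’s original snippet: the helper names live_tensors and total_bytes are made up for illustration, and it assumes a PyTorch version where torch.is_tensor also matches what used to be autograd Variables.

```python
import gc

import torch


def live_tensors():
    """Return every tensor the Python garbage collector currently tracks."""
    tensors = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                tensors.append(obj)
        except Exception:
            # Some tracked objects raise on introspection; skip them.
            pass
    return tensors


def total_bytes(tensors):
    """Sum the payload size of the given tensors, in bytes."""
    return sum(t.numel() * t.element_size() for t in tensors)


if __name__ == "__main__":
    found = live_tensors()
    print(len(found), "tensors,", total_bytes(found), "bytes")
```

Note that this only sees tensors that still have a live Python object; it sums each tensor separately, so views sharing one storage are counted more than once.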

However, something has to be missing… For example, I run python -m pdb -c continue to break at a CUDA out-of-memory error (with or without CUDA_LAUNCH_BLOCKING=1). At that point, nvidia-smi reports around 9 GB occupied. In the snippet I sum the .numel() of every tensor found and get 17092783 elements, which at a maximum of 8 B per element gives ~130 MB. In particular, many autograd Variables (intermediate computations) are missing from the list. Can anyone give me a hint? Thanks!
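The estimate above can be checked with a few lines of arithmetic. This is only a sketch of the calculation in the post; the variable names are illustrative, and 8 B is the assumed worst case (e.g. float64 or int64 elements).

```python
# Upper bound on the memory held by the tensors found in the scan.
num_elements = 17_092_783      # sum of .numel() over all tensors found
max_bytes_per_element = 8      # assumed worst case: 64-bit elements

upper_bound_bytes = num_elements * max_bytes_per_element
upper_bound_mib = upper_bound_bytes / 2**20

print(f"{upper_bound_bytes} bytes = {upper_bound_mib:.1f} MiB")
```

That is roughly two orders of magnitude below the ~9 GB shown by nvidia-smi, so the scan clearly does not account for most of the allocations.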



It’s possible that these Variables are still alive, just not referenced from Python. The buffers can belong to Functions that called save_for_backward on inputs they need for computing the gradient. If some Variable in your code is still alive and holds a reference to the graph, it keeps all of these buffers alive as well.


Thanks for the clarification! Is there then any way to enumerate all these Tensors, ideally by querying the memory manager directly? One could also traverse the autograd graph, as in Sergey’s visualization code. That should yield the saved Tensors as well, but will it be complete?
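A rough sketch of such a graph traversal is below. It is not Sergey’s code: collect_saved_tensors is a hypothetical helper, and it assumes a recent PyTorch release in which built-in backward nodes expose their saved values as attributes prefixed with _saved_. Custom Python Functions keep theirs under saved_tensors instead and are not covered here, so the result should not be taken as complete.

```python
import torch


def collect_saved_tensors(root):
    """Walk the autograd graph from `root` and list tensors saved for backward.

    Returns a list of (node type name, attribute name, tensor) triples.
    """
    found = []
    seen = set()
    stack = [root.grad_fn]
    while stack:
        fn = stack.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        # Assumption: built-in nodes expose saved values as `_saved_*`.
        for attr in dir(fn):
            if not attr.startswith("_saved_"):
                continue
            try:
                value = getattr(fn, attr)
            except RuntimeError:
                # Raised when the buffers were already freed by backward().
                continue
            if torch.is_tensor(value):
                found.append((type(fn).__name__, attr, value))
        # Follow the edges to the nodes that produced this node's inputs.
        for next_fn, _ in fn.next_functions:
            stack.append(next_fn)
    return found


if __name__ == "__main__":
    x = torch.randn(4, requires_grad=True)
    y = (x * x).sum()
    for node, attr, t in collect_saved_tensors(y):
        print(node, attr, t.numel() * t.element_size(), "bytes")
```

The same tensor can show up under several nodes or attributes, so any byte total should deduplicate by storage (for example via data_ptr()) before summing.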