I’m trying to profile a model’s memory usage using this tutorial: Understanding GPU Memory 1: Visualizing All Allocations over Time | PyTorch. Looking at the output, almost all of the memory usage is listed as Unknown (screenshot attached). When I step through the code while watching nvidia-smi, the biggest jump in memory comes during the model’s forward pass. Does anyone have suggestions on how to debug this further? I can post my code, but my model/dataset are spread over several files; aside from the model/dataset import, I follow the code in Appendix B of the link above exactly. Is there a more methodical way to find out which parts of my model account for the largest memory costs?
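For context, the recording wrapper I’m using is essentially the Appendix B pattern from the tutorial, just with my own model and data loader dropped in (simplified sketch; `MyModel` and `make_batches` are placeholders for my own files):

```python
import torch
from torch.cuda.memory import _record_memory_history, _dump_snapshot

from my_project import MyModel, make_batches  # placeholders for my own code

device = torch.device("cuda")
model = MyModel().to(device)
optimizer = torch.optim.Adam(model.parameters())

# Start recording allocation history, as in Appendix B of the tutorial
_record_memory_history(max_entries=100_000)

for inputs, targets in make_batches(num_batches=4):  # four random batches, varying seq length
    inputs, targets = inputs.to(device), targets.to(device)
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

# Dump the snapshot for viewing at pytorch.org/memory_viz, then stop recording
_dump_snapshot("snapshot.pickle")
_record_memory_history(enabled=None)
```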
Similarly, here’s the screenshot showing memory by tensor. But since I don’t see an obvious way to associate those allocations with specific tensors in the model, I’m not sure how to use this to debug further (I’m using four random batches with different sequence lengths).
Check whether certain allocations persist across iterations. Those are probably hogging memory without being part of the training cycle.
If the stack trace shows *_backward frames, those allocations are part of backpropagation.
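If you want to attribute sizes to call sites outside the visualizer, a rough script like this can group active blocks in the snapshot pickle by their recorded stack frames. Note that the pickle layout (`segments` → `blocks` → `frames`) is an internal format read by the memory_viz tooling, so treat this as a sketch that may need tweaking for your PyTorch version:

```python
import pickle
from collections import defaultdict

# Load the snapshot dumped by torch.cuda.memory._dump_snapshot(...)
with open("snapshot.pickle", "rb") as f:
    snap = pickle.load(f)

bytes_by_site = defaultdict(int)
for segment in snap.get("segments", []):
    for block in segment.get("blocks", []):
        if block.get("state") != "active_allocated":
            continue
        frames = block.get("frames") or []
        # Use the first recorded frame as a rough "who allocated this" label
        if frames:
            site = f"{frames[0]['filename']}:{frames[0]['line']} ({frames[0]['name']})"
        else:
            site = "<no stack recorded>"
        bytes_by_site[site] += block["size"]

# Print the 20 call sites holding the most allocated memory
for site, nbytes in sorted(bytes_by_site.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{nbytes / 2**20:10.1f} MiB  {site}")
```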
Yeah, the “Unknown” memory category can make profiling frustrating. One thing that might help is breaking the forward pass into smaller chunks and calling torch.cuda.memory_allocated() or torch.cuda.memory_snapshot() at different points to narrow down where the spikes happen. Wrapping parts of your forward method with torch.autograd.profiler.profile(use_cuda=True) (or the newer torch.profiler.profile with profile_memory=True) might also give you better granularity. If your model is spread over multiple files, try isolating smaller parts and running them with dummy inputs to see how much memory each one uses; that should help tie the large tensors back to specific layers or ops.
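A minimal sketch of that checkpointing idea, using stand-in blocks (embed/encoder/head are just examples; substitute your model’s real submodules):

```python
import torch
import torch.nn as nn

def report(tag):
    # memory_allocated() counts live tensors; max_memory_allocated() is the peak since the last reset
    print(f"{tag:>9}: {torch.cuda.memory_allocated() / 2**20:8.1f} MiB live, "
          f"{torch.cuda.max_memory_allocated() / 2**20:8.1f} MiB peak")

device = torch.device("cuda")

# Stand-in blocks; replace with your model's actual submodules
embed = nn.Embedding(10_000, 512).to(device)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=4
).to(device)
head = nn.Linear(512, 10_000).to(device)

tokens = torch.randint(0, 10_000, (8, 1024), device=device)
targets = torch.randint(0, 10_000, (8, 1024), device=device)

torch.cuda.reset_peak_memory_stats()
report("start")

x = embed(tokens);   report("embed")
x = encoder(x);      report("encoder")
logits = head(x);    report("head")

loss = nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
loss.backward();     report("backward")
```

If you go the profiler route instead, prof.key_averages().table(sort_by="self_cuda_memory_usage") should give a similar per-op breakdown of which operations allocate the most.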