Training Job Stalls with no Logs & GPU Usage Spike

The original post was about a hanging script while GPU utilization stayed high.
Your issue seems to be that you are running out of memory.
Since memory usage increases in each epoch, you might be accidentally storing tensors that are still attached to the computation graph (such as the loss or the model output), which keeps the whole graph, including all intermediate tensors, alive.
If you want to store tensors for debugging or logging purposes, .detach() them first or call .item() to get the Python value.
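For illustration, here is a minimal sketch with a hypothetical toy model and training loop (the names model, criterion, optimizer, and losses are not from the original thread) showing the leaking pattern and the fix:

```python
import torch
import torch.nn as nn

# Hypothetical toy setup, only to illustrate the pattern.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for _ in range(100):
    data = torch.randn(32, 10)
    target = torch.randn(32, 1)

    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    # losses.append(loss)          # leaks: keeps the whole graph (and activations) alive
    losses.append(loss.item())     # stores only the Python float
    # or: losses.append(loss.detach())  # detached tensor, no graph attached
```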

GPU memory seems to stay the same for almost 10 epochs and then starts spiking. Is that possible when each epoch is essentially the same? Could it be that the GPU has a large initial memory allocation, while the memory actually required has been growing slowly each epoch?

That might be the case, and you would see a bump in memory usage every time the reserved memory allocation grows.
Did you find any suspicious calls that might append non-detached tensors to a list or another container?
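One way to see those bumps (a rough sketch, assuming a CUDA device is available; log_cuda_memory is a made-up helper name) is to print allocated vs. reserved memory at the end of each epoch:

```python
import torch

def log_cuda_memory(epoch, device=0):
    # memory_allocated: memory occupied by live tensors
    # memory_reserved: memory held by PyTorch's caching allocator
    alloc = torch.cuda.memory_allocated(device) / 1024**2
    reserved = torch.cuda.memory_reserved(device) / 1024**2
    print(f"epoch {epoch}: allocated {alloc:.1f} MB | reserved {reserved:.1f} MB")

# Called at the end of every epoch, e.g.:
# for epoch in range(num_epochs):
#     train_one_epoch(...)
#     log_cuda_memory(epoch)
```

If allocated memory climbs steadily while reserved memory only grows in occasional steps, that matches the behavior described above.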


Yes. Found a non-detached tensor getting accumulated.

Hi @ptrblck - is there a fix for this problem? I am facing the same issue with pytorch 1.7.1+cu101 and pytorch-lightning 1.1.6, CUDA version 10.1. I have also tried CUDA 10.2 but face the same issue when using ddp in pytorch-lightning with more than one GPU. Any pointers that can help? I too am not in a position to share code here.

One common error is to store non-detached tensors in e.g. a list, which holds a reference to the computation graph, so PyTorch cannot free it.
Based on the last comment from @ultramarine, this was also the error in this thread.
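If you cannot share code, one way to hunt for such references yourself (a rough diagnostic sketch, not an official API; find_graph_attached_tensors is a made-up helper) is to scan the garbage collector for tensors that still carry a grad_fn, e.g. once per epoch; a count that grows over epochs points to accumulated, non-detached tensors:

```python
import gc
import torch

def find_graph_attached_tensors(max_print=10):
    # Collect live tensors that are still attached to a computation graph.
    attached = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.grad_fn is not None:
                attached.append((type(obj).__name__, tuple(obj.shape)))
        except Exception:
            continue
    print(f"{len(attached)} graph-attached tensors", attached[:max_print])
```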

Just FYI I never found a solution and decided to deal with failing trainings 50% of the time. LMK if you find something that works.

Thanks for the suggestion, Patrick.

Thanks Patrick & James… I double-checked my code and I’m already calling .item() on all metrics I’m logging. The only variable I’m returning without this is the loss.
Best,
Kiran