Training gets stuck at some iteration step

I’m using PyTorch v1.1.0 and DistributedDataParallel to train some models.

The training process gets stuck at a fixed iteration step.

I noticed some weird things:

  1. Memory usage stays frozen.
  2. GPU utilization increases to 100%, but the GPU temperature isn’t very high.
  3. The CPU usage of the 4 main processes is 100%.

I don’t think it is a code bug, because I have used several different models and they all get stuck at some iteration step.
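For reference, the training loop looks roughly like this (a minimal sketch with a stand-in model and dataset, not my actual code, launched with one process per GPU):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # One process per GPU; LOCAL_RANK is set by the distributed launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Hypothetical stand-in model and dataset.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(10000, 128),
                            torch.randint(0, 10, (10000,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            # The hang happens around a fixed step in this inner loop:
            # GPU util pinned at 100%, memory frozen, no further progress.

if __name__ == "__main__":
    main()
```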


I ran into this problem before; then I installed the PyTorch nightly (version 1.2.0.dev20190628) and the problem seems to have been solved.


@GeoffreyChen777 I have exactly the same problem. Did you solve it? If so, could you please share your solution?
I tried the PyTorch nightly as suggested by @Mr.Z, but it didn’t solve my problem.

Hi, I remember I had no idea why it happened when I was using PyTorch 1.1.0.

In the past year I have never encountered this with newer versions of PyTorch. Maybe you can try the latest one.

I tried PyTorch 1.6.0, 1.7.0, 1.8.0, and 1.9.0. All have the same problem.

Hi @ukemamaster, did you ever find a solution? I think I am running into the same problem.

@breakds Somehow, yes. In my case it was an I/O (data loading) problem. I was reading my training data from an HDD on the fly (reading raw audio files, plus data augmentation/pre-processing), which was very slow. When I put my dataset on an SSD, everything was fine. See the sketch below for a way to check this.
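If you want to check whether data loading is the culprit, something like this rough sketch (profile_loader and train_step are hypothetical names, not from my actual code) times how long the loop waits on the DataLoader versus how long it spends in forward/backward:

```python
import time

def profile_loader(loader, train_step, num_batches=100):
    """Accumulate time spent waiting for batches vs. time spent computing."""
    data_time, compute_time = 0.0, 0.0
    t0 = time.perf_counter()
    for i, batch in enumerate(loader):
        t1 = time.perf_counter()   # batch ready: disk reads + augmentation done
        train_step(batch)          # forward/backward/optimizer step
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
        t0 = t2
        if i + 1 >= num_batches:
            break
    print(f"data loading: {data_time:.1f}s, compute: {compute_time:.1f}s")
```

If the data-loading time dominates, faster storage, caching pre-processed samples, or more DataLoader workers should help.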
