Training gets stuck at some iteration step

I’m using PyTorch v1.1.0 and DistributedDataParallel to train some models.

The training process gets stuck at a fixed iteration step.

I noticed some weird things:

  1. Memory usage stays frozen.
  2. GPU utilization increases to 100%, but the GPU temperature isn’t very high.
  3. The CPU usage of the 4 main processes is 100%.

I don’t think it is a code bug, because I have used several different models and they all get stuck at some iteration step.
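For reference, the training loop looks roughly like this (a minimal sketch with a stand-in model and dataset, not my actual code, launched with one process per GPU):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # One process per GPU; LOCAL_RANK is set by the distributed launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Hypothetical stand-in model and dataset.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(10000, 128),
                            torch.randint(0, 10, (10000,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            # The hang happens around a fixed step in this inner loop:
            # GPU util pinned at 100%, memory frozen, no further progress.

if __name__ == "__main__":
    main()
```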


I ran into this problem before; then I installed the PyTorch nightly (version 1.2.0.dev20190628) and the problem seems to have been solved.


@GeoffreyChen777 I have exactly the same problem. Did you solve it? If so, could you please share your solution?
I tried the PyTorch nightly as suggested by @Mr.Z, but it didn’t solve my problem.

Hi, I remember I had no idea why it happened when I was using PyTorch 1.1.0.

In the past year I have never encountered this with newer versions of PyTorch. Maybe you can try the latest one.

I tried PyTorch 1.6.0, 1.7.0, 1.8.0, and 1.9.0. All have the same problem.

Hi @ukemamaster, did you ever find a solution? I think I am running into the same problem.

@breakds Somehow, yes. In my case it was an I/O (data loading) problem. I was reading my training data from an HDD on the fly (reading raw audio files, plus data augmentation/pre-processing), which was very slow. When I put my dataset on an SSD, everything was fine. See the sketch below for a way to check this.
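If you want to check whether data loading is the culprit, something like this rough sketch (profile_loader and train_step are hypothetical names, not from my actual code) times how long the loop waits on the DataLoader versus how long it spends in forward/backward:

```python
import time

def profile_loader(loader, train_step, num_batches=100):
    """Accumulate time spent waiting for batches vs. time spent computing."""
    data_time, compute_time = 0.0, 0.0
    t0 = time.perf_counter()
    for i, batch in enumerate(loader):
        t1 = time.perf_counter()   # batch ready: disk reads + augmentation done
        train_step(batch)          # forward/backward/optimizer step
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
        t0 = t2
        if i + 1 >= num_batches:
            break
    print(f"data loading: {data_time:.1f}s, compute: {compute_time:.1f}s")
```

If the data-loading time dominates, faster storage, caching pre-processed samples, or more DataLoader workers should help.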
