Training gets stuck

Ian · December 4, 2018, 11:14am

Hi, all,

I am a new PyTorch user, and I have run into the following problem:

My training code gets stuck after a few dozen iterations (it makes no further progress even after hours of waiting).

When I press Ctrl+C to stop the training, the process does not exit.

When I check GPU usage with nvidia-smi, the GPU is still occupied and still doing computation.

Does anyone know the reason?

ZT_Tian · April 9, 2019, 12:29pm

Same problem.

Waiting for a solution.

ptrblck · April 10, 2019, 8:09am

Could you post a (small) executable code snippet so that we could debug the issue?
Also, are you using multiple workers in your DataLoader? If so, does your code run using num_workers=0?
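As a rough sketch of that check: the snippet below iterates a DataLoader with num_workers=0, so every batch is loaded in the main process and no worker subprocesses are involved. The dataset here is a hypothetical stand-in (random tensors), not the code from this thread; swap in your own Dataset and see whether the loop still hangs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data; replace with your own Dataset.
features = torch.randn(64, 8)
labels = torch.randint(0, 2, (64,))
dataset = TensorDataset(features, labels)

# num_workers=0 loads every batch in the main process,
# so no worker subprocesses are started.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

num_batches = 0
for batch_features, batch_labels in loader:
    # The training step would go here.
    num_batches += 1

print(f"iterated {num_batches} batches with num_workers={loader.num_workers}")
```

If the loop runs to completion with num_workers=0 but hangs with a larger value, the data loading workers are a likely suspect; if it hangs either way, the cause is probably elsewhere.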

ZT_Tian · April 21, 2019, 11:18am

Thanks for your reply.

My problem was solved by fixing a bug: I replaced zero with torch.zeros_like() when initializing a tensor.
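The original code is not shown here, so the following is only an illustration of the difference, assuming "zero" meant a plain Python 0. The tensor named reference is a hypothetical placeholder. torch.zeros_like() returns a zero tensor with the same shape, dtype, and device as its argument, whereas a plain 0 carries none of that information.

```python
import torch

# Hypothetical tensor that the initial value should match.
reference = torch.randn(4, 3, dtype=torch.float32)

# A plain Python zero has no shape, dtype, or device.
scalar_init = 0

# zeros_like copies shape, dtype, and device from its argument.
tensor_init = torch.zeros_like(reference)

print(type(scalar_init), tensor_init.shape, tensor_init.dtype, tensor_init.device)
```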

In my case, num_workers was 16, and I did not change it even after the problem was solved. I had tried setting it to zero, but the same problem still occurred, so I think the problem may have been caused by other bugs.
