Training gets stuck

Hi, all,

I am a new PyTorch user, and I ran into the following problem:

My training code gets stuck after a few dozen iteration steps (it makes no further progress even after hours of waiting).

When I press Ctrl+C to stop the training, the process does not stop.

When I check GPU usage with nvidia-smi, the GPU is still occupied and appears to be doing computation.

Does anyone know the reason?

Same problem.

Waiting for a solution.

Could you post a (small) executable code snippet so that we could debug the issue?
Also, are you using multiple workers in your DataLoader? If so, does your code run using num_workers=0?
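For anyone unsure what that check looks like, here is a minimal sketch. The dataset is a hypothetical stand-in built from random tensors (the original training code was not posted), so only the `num_workers=0` argument is the point. With `num_workers=0` the batches are loaded in the main process, which rules out worker subprocesses as the source of a hang.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data; swap in the real dataset here.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))

# num_workers=0 loads every batch in the main process (no worker subprocesses).
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

num_batches = 0
for step, (inputs, targets) in enumerate(loader):
    num_batches += 1
    print(f"step {step}: inputs {tuple(inputs.shape)}, targets {tuple(targets.shape)}")
```

If the loop still stalls with this setting, the data loading workers are probably not the cause and the problem is more likely in the model or training step.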

Thanks for your reply.

My problem was solved by fixing a bug: I replaced zero with torch.zeros_like() when initializing a tensor.
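The buggy code itself was not posted, so the following is only a sketch of what `torch.zeros_like()` does, with hypothetical tensor names. It creates a zero tensor that takes its shape, dtype, and device from an existing tensor, which a bare zero does not carry.

```python
import torch

# Hypothetical reference tensor; in real code this might be an activation or a gradient.
reference = torch.randn(4, 3, dtype=torch.float64)

# zeros_like copies shape, dtype, and device from the reference tensor.
state = torch.zeros_like(reference)

# Because the metadata matches, the two tensors combine without any conversion.
accumulated = state + reference

print(state.shape, state.dtype, state.device)
```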

In my case, num_workers was 16, and I did not change it even after the problem was solved. I had tried setting it to zero, but the same problem still occurred, so I think the problem may have been caused by other bugs.

ptrblck via PyTorch Forums wrote on Wed, Apr 10, 2019 at 4:19 PM:
