I am currently using DDP (NCCL backend) to train a network on a machine with 8 GPUs.
I do a validation pass after each epoch, but I don’t want to run the same validation step on all 8 GPUs. So, to run validation on only one GPU, I call torch.distributed.barrier() to hold the other processes. But each process hangs once it reaches the barrier statement.
Here is an example of the training loop:
```python
for epoch in range(opt['epochs']):
    for batch_i, (imgs, targets) in enumerate(dataloader):
        imgs = imgs.cuda(gpu)        # Variable is deprecated; plain tensors suffice
        targets = targets.cuda(gpu)  # requires_grad is False by default
        loss, outputs = model(imgs, targets)
        # ... backward pass / optimizer step omitted ...
    if epoch % opt['evaluation_interval'] == 0 and gpu == 0:
        print("\n---- Evaluating Model ----")
        evaluation_metrics = evaluate(model)
```
I have tried putting the barrier statement in four different places (marked in the code as comments), and no matter where I put it, the code hangs once it reaches that point.
In cases (1) and (2) the code runs fine on the first pass but hangs after validation. In case (3) the code never reaches that point after the validation pass. In case (4) it also hangs once validation is done.
I have also tested running the validation on all GPUs without using the barrier, and it does not hang.
Does anyone have any idea why this is happening?
I also read these two other posts: 1, 2. But I don’t think their problem is similar to mine.
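To make the failure mode concrete, here is a small stdlib-only sketch of what a hang at a barrier looks like when one participant never reaches it. I'm using threading.Barrier as a stand-in for torch.distributed.barrier() (purely illustrative; the worker function and timeout are mine, and a real distributed barrier would block indefinitely rather than time out):

```python
import threading

# Stand-in for a 2-rank distributed barrier: if any participant
# fails to reach the barrier, everyone else blocks at wait().
barrier = threading.Barrier(parties=2)
results = {}

def worker(reaches_barrier, idx):
    if reaches_barrier:
        try:
            # wait() blocks until all parties arrive; the timeout here
            # stands in for the indefinite hang seen with NCCL.
            barrier.wait(timeout=1.0)
            results[idx] = "passed"
        except threading.BrokenBarrierError:
            results[idx] = "hung (barrier broken by timeout)"
    else:
        # This "rank" skips the barrier, e.g. it took a different branch.
        results[idx] = "skipped barrier"

t1 = threading.Thread(target=worker, args=(True, 0))
t2 = threading.Thread(target=worker, args=(False, 1))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # rank 0 never gets past the barrier
```

The point of the sketch: a barrier only releases when every participant has entered it, so one rank taking a different code path is enough to block all the others.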
Thanks for reporting this issue! If barrier() is indeed not working properly, that sounds like a bug, and it would be great to open an issue at http://github.com/pytorch/pytorch/issues/ with an example code snippet that reproduces it.
That said, I’m a little confused: although a barrier at any of those points should work, I’m not sure how it helps you run validation on only one GPU. A barrier simply blocks every process until all processes have entered it; it does not by itself restrict any work to a single rank.
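To illustrate just the semantics, here is a plain-Python sketch using threading.Barrier, which behaves analogously to a distributed barrier for this purpose (the worker function and sleep stagger are mine, purely illustrative):

```python
import threading
import time

N = 4                          # think of these as 4 ranks
barrier = threading.Barrier(N)
arrivals, releases = [], []
lock = threading.Lock()

def worker(i):
    time.sleep(0.05 * i)       # ranks reach the barrier at different times
    with lock:
        arrivals.append(i)
    barrier.wait()             # blocks until all N workers have arrived
    with lock:
        releases.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every worker arrived before any worker was released.
print(arrivals, releases)
```

All the barrier guarantees is this synchronization point: every rank still executes whatever code surrounds it, which is why gating the validation on the rank (as in your `gpu == 0` check) is still needed regardless of where the barrier goes.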