I am currently using DDP (NCCL backend) to train a network on a machine with 8 GPUs.
I do a validation pass after each epoch, but I don’t want to run the same validation step on all 8 GPUs. To restrict validation to a single GPU, I am using torch.distributed.barrier(), but the processes seem to hang once they reach the barrier statement.
Here is an example of the training loop:
for epoch in range(opt['epochs']):
    #[1]
    model.train()
    #[2]
    for batch_i, (imgs, targets) in enumerate(dataloader):
        #[3]
        imgs = Variable(imgs.cuda(gpu))
        targets = Variable(targets.cuda(gpu), requires_grad=False)

        loss, outputs = model(imgs, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()

    if epoch % opt['evaluation_interval'] == 0 and gpu == 0:
        print("\n---- Evaluating Model ----")
        evaluation_metrics = evaluate(model)
    #[4]
I have tried putting the barrier statement in four different places (marked in the code as comments), and no matter where I put it, the code hangs once it reaches that point.
For placements (1) and (2), the code runs fine on the first pass but hangs after validation. For placement (3), the code never reaches that point after the validation pass. For placement (4), it also hangs once validation is done.
I have also tested running the validation on all GPUs without using the barrier, and it does not hang.
Does anyone have any idea why this is happening?
I have also read these two other posts: 1, 2. But I don’t think their problem is the same as mine.
Thanks for reporting this issue! If barrier() is indeed not working properly, it seems like this is a bug and it would be great to create an issue over at http://github.com/pytorch/pytorch/issues/ with an example code snippet that reproduces the issue.
That said, I’m a little confused: while a barrier at any of your marked points should work, I’m not sure how it helps you use only one GPU for validation. A barrier just blocks all processes until every process has entered it.
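For what it’s worth, here is a minimal sketch of that behaviour; the gloo backend and the hard-coded local TCP address are just for illustration, not part of your setup:

    import time
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # Each process joins the same process group.
        dist.init_process_group(
            backend="gloo",
            init_method="tcp://127.0.0.1:23456",
            rank=rank,
            world_size=world_size,
        )
        # Rank 0 pretends to do some extra work (e.g. validation).
        if rank == 0:
            time.sleep(3)
        # Every rank blocks here until *all* ranks have called barrier().
        dist.barrier()
        print(f"rank {rank} passed the barrier")
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(4,), nprocs=4)

All four processes print their message only after rank 0 has finished sleeping and entered the barrier.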
I am also confused about this. My thought process is just that it seems like a waste of power to do the same validation step on all GPUs at the same time.
Right now, my validation is coded to be done by a single process (and consequently a single GPU), so there wouldn’t be any performance gain from running it across multiple GPUs.
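Concretely, the pattern I was hoping for is roughly the sketch below (same names as in my snippet above): only rank 0 runs evaluate(), and every process, including rank 0 once it finishes, meets at the barrier before the next epoch.

    if epoch % opt['evaluation_interval'] == 0:
        if gpu == 0:
            print("\n---- Evaluating Model ----")
            # Only the rank-0 process runs validation.
            evaluation_metrics = evaluate(model)
        # Every rank waits here until all ranks (rank 0 included) arrive,
        # so the other processes idle while rank 0 validates.
        torch.distributed.barrier()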
I will try to reproduce the error and post an issue on Monday, since currently I don’t have access to the machine.
I guess in the end I will have to just run validation on all GPUs.
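If I do end up validating on all GPUs, I suppose each process could evaluate its own shard of the validation set and the results could then be averaged, roughly like this (assuming evaluate() returns a single scalar metric, which may not match my actual code):

    if epoch % opt['evaluation_interval'] == 0:
        # Every rank runs validation, ideally on its own shard of the val set.
        metric = torch.tensor(evaluate(model), device=f"cuda:{gpu}")
        # Sum the per-rank results and divide by the world size to get the mean.
        torch.distributed.all_reduce(metric, op=torch.distributed.ReduceOp.SUM)
        metric /= torch.distributed.get_world_size()
        if gpu == 0:
            print(f"---- Validation metric: {metric.item():.4f} ----")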
It would be best if, in the issue description, you could provide a minimal example script that calls barrier() the way you intend to here and reproduces the hang. Thank you!