torch.distributed.barrier() hangs in DDP

Manuel_Alejandro_Dia · March 11, 2021, 11:04pm

Hi everyone.

I am currently using DDP (NCCL backend) to train a network on a machine with 8 GPUs.
I do a validation pass after each epoch, but I don’t want to run the same validation step on all 8 GPUs. To use only one GPU for validation, I am calling torch.distributed.barrier(), but the process hangs once it reaches the barrier statement.

Here is an example of the training loop:

for epoch in range(opt['epochs']):
    #[1]
    model.train()
    #[2]

    for batch_i, (imgs, targets) in enumerate(dataloader):
        #[3]
        # Variable is deprecated; tensors can be moved to the GPU directly
        imgs = imgs.cuda(gpu)
        targets = targets.cuda(gpu)

        loss, outputs = model(imgs, targets)
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()
        
    scheduler.step()
    
    if epoch % opt['evaluation_interval'] == 0 and gpu == 0:
        print("\n---- Evaluating Model ----")

        evaluation_metrics = evaluate(model)

    #[4]

I have tried putting the barrier statement in four different places (marked in the code as comments), and no matter where I put it, the code hangs once it reaches that point.

For cases (1) and (2), the code executes well on the first pass, but after validation it hangs. For case (3), the code never reaches that point after the validation pass. For case (4), it also hangs once the validation is done.

I have also tested running the validation on all GPUs without using the barrier, and it does not hang.

Does anyone have any idea why this is happening?

I read these two other posts: 1, 2. But I think their problem is not similar to mine.

Any help would be greatly appreciated! Thanks!

rvarm1 · March 12, 2021, 9:46pm

Thanks for reporting this issue! If barrier() is indeed not working properly, it seems like this is a bug and it would be great to create an issue over at http://github.com/pytorch/pytorch/issues/ with an example code snippet that reproduces the issue.

I’m a little confused, though: while a barrier at any of your marked points should work, I’m not sure how it helps you use only one GPU for validation. A barrier just blocks each process until all processes have entered it.
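To illustrate the barrier semantics described above, here is a minimal thread-based sketch. It is only an analogy: it uses Python's standard threading.Barrier rather than torch.distributed, and the helper name run_workers, the worker count, and the timeout are assumptions made for the example. The point is that a barrier releases only when every participant arrives, so a participant that never reaches it leaves the others stuck (here they time out instead of hanging forever).

```python
import threading


def run_workers(num_workers, skip_barrier_rank=None, timeout=0.5):
    """Start num_workers threads that meet at a barrier.

    If skip_barrier_rank is set, that worker never calls wait(),
    so the remaining workers cannot be released.
    """
    barrier = threading.Barrier(num_workers)
    results = {}

    def worker(rank):
        if rank == skip_barrier_rank:
            # This worker never enters the barrier
            results[rank] = "skipped"
            return
        try:
            # Blocks until all num_workers parties have called wait()
            barrier.wait(timeout=timeout)
            results[rank] = "passed"
        except threading.BrokenBarrierError:
            # Raised when the timeout expires before everyone arrives
            results[rank] = "timed out"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_workers(4))                       # every worker arrives
    print(run_workers(4, skip_barrier_rank=0))  # worker 0 never arrives
```

torch.distributed.barrier() gives the same guarantee across processes, which is why it synchronizes ranks but does not by itself restrict work to one GPU.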

Manuel_Alejandro_Dia · March 12, 2021, 10:21pm

I am also confused about this. My thought process is just that it seems like a waste of power to do the same validation step on all GPUs at the same time.

Right now, my validation is coded to be done by a single process (and consequently a single GPU), so there wouldn’t be any performance gain from running it across multiple GPUs.

I will try to reproduce the error and post an issue on Monday, since currently I don’t have access to the machine.

I guess in the end I will have to just run validation on all GPUs.

Manuel_Alejandro_Dia · March 13, 2021, 12:01am

What information would you recommend I include in the issue, @rvarm1?

Thanks!

rvarm1 · March 15, 2021, 11:57pm

It would be best if the issue description included a minimal example script in which you call barrier() the way you do here and it fails. Thank you!

Manuel_Alejandro_Dia · March 16, 2021, 9:58am

Issue created!

Sorry it took me an extra day.

mrshenli · March 17, 2021, 8:51pm

This is resolved; please see the discussion in “Using torch.distributed.barrier() makes the whole code hang” (Issue #54059, pytorch/pytorch on GitHub).

Manuel_Alejandro_Dia · March 18, 2021, 9:22am

As @rvarm1 suggested in the GitHub issue, the problem is solved by using the local model when running the validation, not the DDP-wrapped one.

So instead of using:

evaluation_metrics = evaluate(model)

I should use:

evaluation_metrics = evaluate(model.module)

Thanks!
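For anyone landing here later, the fix can be sketched as a small self-contained script. This is a hedged illustration, not the original poster's code: evaluate() is a hypothetical stand-in, validate_on_rank0 and run_single_process_demo are names invented for the example, and it uses the gloo backend on CPU with world_size=1 so it runs without GPUs. As far as I understand the linked issue, the forward pass of the DDP wrapper can trigger collective communication, so calling it on only one rank leaves the ranks out of step; evaluating model.module avoids that.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def evaluate(model: nn.Module) -> float:
    """Hypothetical stand-in for the thread's evaluate(): mean output on a fixed batch."""
    model.eval()
    with torch.no_grad():
        out = model(torch.ones(4, 3))
    model.train()
    return out.mean().item()


def validate_on_rank0(ddp_model: DDP, rank: int):
    """Run validation on rank 0 only, then bring all ranks back in step."""
    metrics = None
    if rank == 0:
        # Evaluate the wrapped local module, not the DDP wrapper
        metrics = evaluate(ddp_model.module)
    # Every rank calls the barrier exactly once, whether or not it validated
    dist.barrier()
    return metrics


def run_single_process_demo():
    """CPU-only demo with a single process (assumption: gloo backend, world_size=1)."""
    store_file = os.path.join(tempfile.mkdtemp(), "store")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{store_file}",
        rank=0,
        world_size=1,
    )
    try:
        ddp_model = DDP(nn.Linear(3, 2))
        return validate_on_rank0(ddp_model, rank=dist.get_rank())
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(run_single_process_demo())
```

In the training loop from the first post, the same pattern would replace the evaluation branch: rank 0 calls evaluate(model.module), and the barrier sits outside the if so that all processes reach it.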