Specify device_ids in barrier()

Hello!
For some reason, I’m getting this warning:

[W ProcessGroupNCCL.cpp:1569] Rank 6 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

This is triggered on model.cuda(device_id)

The problem is that the code sometimes really does hang!
I was wondering if there is an explanation for what's going on and, ideally, a solution to this issue.

Thank you very much.


I am assuming you are using a distributed launch. The warning message is self-explanatory: it seems that in each process the other GPUs are still visible. Ideally, on local_rank X, you want only GPU X to be visible. There are some workarounds:

  1. Set os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"] in your main worker function, before any CUDA call
  2. Use torch.distributed.barrier(device_ids=[int(os.environ["LOCAL_RANK"])])

In your case, rank 6 is using GPU 0 for the barrier, but it should use GPU 6.
You also have to be careful about how you set the device in your script: the device you set (for example with torch.cuda.set_device) should also be equal to LOCAL_RANK, as in the sketch below.
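For reference, here is a minimal sketch of how both workarounds fit into a worker function, assuming the script is launched with torchrun (which sets LOCAL_RANK); the toy model and exact initialization are illustrative, not your actual code.

import os
import torch
import torch.distributed as dist

def main_worker():
    # torchrun / torch.distributed.launch sets LOCAL_RANK for each process
    local_rank = int(os.environ["LOCAL_RANK"])

    # Workaround 1: expose only this process's GPU. Note that the GPU is then
    # visible as cuda:0 inside the process, so you would use device 0 below.
    # os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"]

    dist.init_process_group(backend="nccl")

    # Bind this process to its GPU so NCCL knows the rank -> GPU mapping
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 10).cuda(local_rank)  # toy model for illustration

    # Workaround 2: tell barrier() explicitly which device this rank uses
    dist.barrier(device_ids=[local_rank])

if __name__ == "__main__":
    main_worker()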


Thanks @amirhf for your answer.
This indeed solved the issue.

Recently, I also saw such issues with PyTorch 1.9, but with PyTorch 1.7.1 there is no such issue.
