How to force PyTorch to use NCCL?

Hi. I’m currently working with Facebook’s Detectron2 framework, based on PyTorch 1.7.0, and I noticed that training uses Gloo by default, which seems prone to the runtime error: connection closed by peer. I’d like to give NCCL a try. What should I do to enforce this?

Hi @Hawk,

You can simply pass backend="nccl" to dist.init_process_group to enable training with the NCCL backend (assuming that NCCL is installed on your system and PyTorch was built with NCCL support).
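Here is a minimal sketch of that call, in case it helps. The helper names `check_nccl_usable` and `init_with_nccl` are made up for illustration. The sketch assumes the default env:// initialization, so MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE must already be set by whatever launcher you use. It fails loudly instead of silently falling back to Gloo.

```python
def check_nccl_usable(cuda_available: bool, nccl_available: bool) -> str:
    """Return "nccl" if it can be used, otherwise raise (hypothetical helper)."""
    if not cuda_available:
        raise RuntimeError("NCCL needs CUDA devices, but none are visible")
    if not nccl_available:
        raise RuntimeError("this PyTorch build was compiled without NCCL support")
    return "nccl"


def init_with_nccl() -> str:
    """Initialize the default process group with NCCL (hypothetical helper)."""
    # Imported lazily so the check above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    backend = check_nccl_usable(
        torch.cuda.is_available(),
        dist.is_nccl_available(),
    )
    # Assumption: env:// initialization, i.e. MASTER_ADDR, MASTER_PORT,
    # RANK and WORLD_SIZE are read from the environment.
    dist.init_process_group(backend=backend)
    # Report the backend that is actually in use for the default group.
    return dist.get_backend()
```

Call init_with_nccl() once per process, before any collective operation. This only applies to scripts that call init_process_group themselves; if a framework's own launcher sets up the process group, the backend has to be changed there instead.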

That’s not quite accurate. Detectron2 uses both NCCL and Gloo under the hood.