Cannot run distributed training across machines with DDP

Yes, I was using NCCL before.
Now I have tried Gloo and the program works fine. Thank you for your advice.

Note

When I first tried Gloo, I got an “address family mismatch” error, the same as in discuss.64753. I solved this by specifying GLOO_SOCKET_IFNAME.

# bash
export GLOO_SOCKET_IFNAME=eno2np1
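
In case it helps to show what I mean, here is a minimal sketch of how I set this from inside the script instead of the shell (the MASTER_ADDR / MASTER_PORT values below are placeholders, not my real setup; RANK and WORLD_SIZE come from the launcher):

# python
import os
import torch.distributed as dist

# Pin Gloo to the NIC both machines can reach; must be set before init_process_group.
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eno2np1")
os.environ.setdefault("MASTER_ADDR", "192.168.1.10")  # placeholder: IP of the rank-0 machine
os.environ.setdefault("MASTER_PORT", "29500")         # placeholder: any free port

dist.init_process_group(
    backend="gloo",
    rank=int(os.environ["RANK"]),            # provided by the launcher, e.g. torchrun
    world_size=int(os.environ["WORLD_SIZE"]),
)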

Inspired by this, I also tried specifying NCCL_SOCKET_IFNAME, but that did not fix the NCCL problem.

# bash
export NCCL_SOCKET_IFNAME=eno2np1
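
To gather more information, I also turned on NCCL's own logging before initializing the process group (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables; this is only a debugging sketch, not a fix):

# python
import os
import torch.distributed as dist

# Set these before init_process_group so NCCL picks them up.
os.environ["NCCL_DEBUG"] = "INFO"             # have NCCL print its init/transport decisions
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus the log on network setup
os.environ["NCCL_SOCKET_IFNAME"] = "eno2np1"  # same NIC that worked for Gloo

dist.init_process_group(backend="nccl")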

So I still cannot use NCCL at the moment, and Gloo's lack of support for some collectives limits what my code can do (see the sketch below for the kind of workaround I have been falling back to). Could you please give me some advice on how to solve this problem?
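
For example, here is a rough workaround I have been considering, just a sketch assuming the missing op is reduce_scatter and the tensors are 1-D with equal chunk sizes: emulate it with all_reduce, which Gloo does support, and then keep only the local slice.

# python
import torch
import torch.distributed as dist

def reduce_scatter_via_all_reduce(output, input_list):
    """Emulate reduce_scatter on Gloo: all-reduce the concatenated chunks, keep our slice.

    Assumes 1-D tensors with equal chunk sizes; uses more bandwidth than a real
    reduce_scatter because every rank receives the full reduced buffer.
    """
    full = torch.cat(input_list)
    dist.all_reduce(full, op=dist.ReduceOp.SUM)   # Gloo supports all_reduce
    rank = dist.get_rank()
    n = output.numel()
    output.copy_(full[rank * n:(rank + 1) * n])

Workarounds like this add extra communication, though, so I would much rather get NCCL itself working across the two machines.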