I am using the torch.distributed package to train my model across several nodes. I want to train on the CPU, so I use gloo as the backend.
To do distributed training with this package, I need to set some common environment variables such as GLOO_SOCKET_IFNAME and MASTER_ADDR. I used the ifconfig command to print the network interfaces, and the output looks like this.
How should I set these environment variables so that python -m torch.distributed.launch runs correctly? Thanks!
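For reference, here is a minimal sketch of how these variables fit together when the process group is initialized through the env:// method. The helper names build_gloo_env and init_gloo are hypothetical, and the interface name, address, and port are placeholders, not values taken from the ifconfig output above.

```python
import os


def build_gloo_env(ifname, master_addr, master_port, rank, world_size):
    """Collect the variables used by gloo and by the env:// rendezvous."""
    return {
        "GLOO_SOCKET_IFNAME": ifname,     # interface gloo should bind to
        "MASTER_ADDR": master_addr,       # address of the rank 0 node
        "MASTER_PORT": str(master_port),  # free port on the rank 0 node
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def init_gloo(env):
    """Export the variables, then create the default CPU process group."""
    os.environ.update(env)
    import torch.distributed as dist  # imported lazily; requires PyTorch

    dist.init_process_group(backend="gloo", init_method="env://")
    return dist


# Placeholder values only: substitute your own interface, address, and port.
example_env = build_gloo_env("eth0", "192.168.1.10", 29500, rank=0, world_size=2)
```

When the script is started with python -m torch.distributed.launch, the launcher sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from its own --master_addr, --master_port, --node_rank, --nnodes, and --nproc_per_node arguments, so in that case probably only GLOO_SOCKET_IFNAME needs to be exported in the shell on each node.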
Thank you, mrshenli. I tried setting GLOO_SOCKET_IFNAME to eth0 and MASTER_ADDR to the inet6 address string, but it returned a connection error (timeout). Then I found that gn0 might be an InfiniBand interface, so I set GLOO_SOCKET_IFNAME to gn0 and MASTER_ADDR to the IPv4 address.
From the code, I think MASTER_ADDR can be set to an IPv6 address. Thanks for your help!
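In case it helps anyone comparing the two attempts, below is a small sketch for checking which kind of address a MASTER_ADDR value is before launching. It uses only the standard ipaddress module; the function name classify_master_addr is hypothetical and the sample addresses are placeholders. One thing to keep in mind is that link-local IPv6 addresses (those starting with fe80::) normally need an interface scope such as %eth0 to be reachable, which may or may not be related to the timeout I saw.

```python
import ipaddress


def classify_master_addr(addr):
    """Return 'ipv4', 'ipv6', or 'hostname' for a MASTER_ADDR value."""
    # Drop an IPv6 scope suffix such as "%eth0" before parsing.
    host = addr.split("%", 1)[0]
    try:
        parsed = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP address, so treat it as a host name.
        return "hostname"
    return "ipv6" if parsed.version == 6 else "ipv4"
```

For example, classify_master_addr("192.168.1.10") gives "ipv4", while classify_master_addr("fe80::1%eth0") gives "ipv6".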