We are deploying PyTorch on a GPU cluster, but the distributed process group cannot be established.
We have two machines, and on each machine PyTorch runs inside a Docker container. Because the containers use the "bridge" network mode, a process in the container on machine A cannot connect directly to a process in the container on machine B, so initialization of the process group fails.
If we change the network mode to "host", everything works. However, we have to use "bridge" mode. Is there any way to solve this problem?
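To make the failure concrete, here is a minimal sketch of how the connectivity could be checked from inside one container, using only the Python standard library. The function name `can_reach` and the host and port in the usage comment are placeholders for illustration; they are not taken from our setup or from PyTorch.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Hypothetical usage from the container on machine B, where 10.0.0.1:23456
# stands in for the address used in the tcp:// init method:
#   print(can_reach("10.0.0.1", 23456))
```

This only tests the rendezvous port. As far as we understand, the backend may open additional connections between processes, so a successful check here would not by itself prove that the process group can form.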
PyTorch version: 0.4.1
Backend: gloo
Script: the ImageNet example from https://github.com/pytorch/examples/tree/master/imagenet
Distributed mode: DistributedDataParallel, with one process per GPU.
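For reference, the sketch below shows how each process could derive its arguments for `torch.distributed.init_process_group` under this one-process-per-GPU layout. The helper names (`build_init_kwargs`, `init_worker`), the address `10.0.0.1:23456`, the GPU count, and the rank formula are our own illustration and assumptions, not code from the imagenet example.

```python
def build_init_kwargs(master_host, master_port, num_nodes, gpus_per_node,
                      node_rank, local_gpu, backend="gloo"):
    """Build keyword arguments for torch.distributed.init_process_group.

    Assumes one process per GPU and the same GPU count on every node.
    """
    return {
        "backend": backend,
        # Every process must be able to reach this address over TCP.
        "init_method": "tcp://{}:{}".format(master_host, master_port),
        "world_size": num_nodes * gpus_per_node,
        # Global rank: processes on node 0 come first, then node 1, ...
        "rank": node_rank * gpus_per_node + local_gpu,
    }


def init_worker(kwargs):
    """Join the process group; normally blocks until all ranks have joined."""
    # Imported lazily so the helper above can be used without PyTorch.
    import torch.distributed as dist
    dist.init_process_group(**kwargs)


# Hypothetical values: 2 machines with 4 GPUs each, second GPU on machine B.
kwargs = build_init_kwargs("10.0.0.1", 23456, num_nodes=2, gpus_per_node=4,
                           node_rank=1, local_gpu=1)
print(kwargs["init_method"], kwargs["world_size"], kwargs["rank"])
```

In our case the address in `init_method` is the one that the container on the other machine cannot reach in "bridge" mode.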