Can distributed PyTorch be deployed in clusters with Docker containers using the "bridge" network?

YuxiaoXu · October 10, 2018, 10:04am

We are working on deploying PyTorch on a GPU cluster, but we find that the distributed process group cannot be established.

Say we have two machines, and on each machine we create a Docker container in which PyTorch runs. Since the containers use the “bridge” network mode, the process in the container on machine A cannot connect directly to the one in the container on machine B, so initialization of the process group fails.
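One way to confirm this symptom from inside a container is a plain TCP reachability check against the address and port where rank 0 is expected to listen. The snippet below is only a sketch using the Python standard library; the helper name `can_connect` and the address and port are hypothetical placeholders, not values from our setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Hypothetical values: replace with the address and port of the rank 0 process.
    master_addr, master_port = "192.168.1.10", 23456
    print("reachable:", can_connect(master_addr, master_port))
```

If this returns False when run inside the container on machine B against the process on machine A, the process group has no chance of being established, whatever the PyTorch settings are.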

If we change the network mode to “host”, everything works. But we have to use “bridge” mode. Is there any way to solve this problem?

PyTorch version: 0.4.1
Backend: gloo
Script: downloaded from https://github.com/pytorch/examples/tree/master/imagenet
Distributed mode: we use DistributedDataParallel, and each GPU is driven by its own process.
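For reference, here is roughly how each process joins the group in a setup like this. It is a minimal sketch, not the exact code from the imagenet example: the helper names `init_kwargs` and `join_group` and the address and port are hypothetical, while `torch.distributed.init_process_group` with a `tcp://` init method is the actual PyTorch call.

```python
def init_kwargs(master_addr, master_port, rank, world_size, backend="gloo"):
    """Build the keyword arguments for torch.distributed.init_process_group."""
    return {
        "backend": backend,
        # Every process must be able to reach this address over TCP.
        "init_method": "tcp://{}:{}".format(master_addr, master_port),
        "rank": rank,
        "world_size": world_size,
    }


def join_group(master_addr, master_port, rank, world_size):
    # Imported lazily so the helper above can be inspected without PyTorch installed.
    import torch.distributed as dist

    # Rendezvous with the other processes; this is the step that fails
    # when the containers cannot reach each other.
    dist.init_process_group(**init_kwargs(master_addr, master_port, rank, world_size))


if __name__ == "__main__":
    # Hypothetical values: one process per GPU, two processes in total.
    print(init_kwargs("192.168.1.10", 23456, rank=0, world_size=2))
```

Each process calls `join_group` with its own rank and the same master address, port, and world size.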

cyberjoac · December 10, 2018, 2:27pm

Hi, did you succeed in getting it to work with bridge mode?
Thanks

YuxiaoXu · December 11, 2018, 1:39am

No. It seems there is no way to do it other than changing the network mode. We have to switch to the host network or an overlay network.