Run multi-node training inside docker

CDhere · December 5, 2022, 8:36am

Hi! I have some questions about the recommended way of doing multi-node training from inside docker. Concretely, all my experiments run in a docker container on each node, and launching with torch.distributed.launch or torchrun is straightforward when I only need distributed training on a single node. However:

If I need multi-node training, can I simply call torch.distributed.launch inside each docker container separately, the same as without docker?
What settings do I need when building the docker image and bringing up the docker container (e.g., network settings)?
Are there any other recommended practices when working with docker?

Thanks a lot!

pritamdamania87 · December 5, 2022, 7:51pm

This should work as long as you set up network connectivity across the docker containers so that they can talk to each other.
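
One way to check that connectivity, as a minimal sketch: from inside the container on a worker node, try to open a TCP connection to the address and port that the rank-0 node will use for rendezvous. The helper name can_reach is hypothetical and is not part of PyTorch or docker; it only uses the Python standard library.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Something has to be listening on that port for the check to succeed, so it is most useful while the rank-0 process is already waiting for the other nodes. A call might look like can_reach("10.0.0.1", 29500), where both the address and the port are placeholders for your own setup.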

CDhere · December 6, 2022, 12:31am

Thank you! I just tried it, and it works the same as usual as long as the docker network is set up properly. I used --network=host when launching the container, which makes it even more seamless.
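
For anyone landing here later, a rough sketch of what the per-node launch could look like with --network=host. This is not the exact command from my setup: the image name, script name, address, port, and the helper build_launch_command are all hypothetical, and --gpus all assumes the NVIDIA container toolkit is installed on the host.

```python
import shlex


def build_launch_command(node_rank, nnodes, nproc_per_node, master_addr,
                         master_port=29500,
                         image="my-training-image",  # hypothetical image name
                         script="train.py"):         # hypothetical entry script
    """Return the `docker run ... torchrun ...` argv for one node."""
    return [
        "docker", "run", "--rm",
        "--gpus", "all",          # assumption: NVIDIA container toolkit present
        "--network=host",         # share the host network stack, as in the thread
        image,
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",      # 0 on the master node, 1..N-1 elsewhere
        f"--master_addr={master_addr}",  # address of the rank-0 node
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Print one command per node for a hypothetical 2-node, 8-GPU-per-node job.
    for rank in range(2):
        print(shlex.join(build_launch_command(rank, 2, 8, "10.0.0.1")))
```

The only thing that differs between nodes is --node_rank; every node points at the same master address and port. With host networking the container uses the host's interfaces directly, so no port mapping is needed for the rendezvous port.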