Run multi-node training inside Docker

Hi! I have some questions about the recommended way to do multi-node training from inside Docker. Concretely, all my experiments run in a Docker container on each node, and distributed training on a single node is straightforward with torch.distributed.launch or torchrun. However:

  1. If I need multi-node training, can I simply call torch.distributed.launch inside each container separately, the same as I would without Docker?
  2. What settings do I need when building the Docker image and bringing up the container (e.g., network settings)?
  3. Are there any other recommended practices when working with Docker?

Thanks a lot!

This should work as long as you set up network connectivity between the Docker containers so that they can talk to each other.
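One way to sanity-check that connectivity is to try opening a TCP connection from a worker container to the address and port the rank-0 node will use for rendezvous. The snippet below is only a minimal sketch using the Python standard library. The helper name `can_reach` and the example address and port in the comment are assumptions for illustration, not part of PyTorch, and the check only succeeds while something is actually listening on that port (for example, after the rank-0 process has started).

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call from inside a worker container (hypothetical address and port):
# can_reach("10.0.0.1", 29500)
```

If this returns False while the rank-0 process is running, the container network or a firewall is the likely place to look before debugging the training script itself.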


Thank you! I just tried it, and it works the same as usual as long as the Docker network is set up properly. I used --network=host when launching the container, which makes it even more seamless.
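For anyone landing here later, a rough sketch of what the per-node launch could look like is below. It only builds and prints the command lines, so nothing is executed. The image name `my-training-image`, the script `train.py`, the address `10.0.0.1`, the port, and the node and GPU counts are all placeholder assumptions, and `--gpus all` is an extra assumption for GPU nodes that was not part of my actual setup description above.

```python
import shlex


def torchrun_command(node_rank, nnodes, nproc_per_node, master_addr,
                     master_port=29500, script="train.py"):
    """Build the torchrun argument list for one node (script name is a placeholder)."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


def docker_command(image, inner_command):
    """Wrap a command in `docker run` using the host network, as described above."""
    # --gpus all is an assumption for GPU nodes; drop it for CPU-only runs.
    return ["docker", "run", "--rm", "--network=host", "--gpus", "all",
            image, *inner_command]


if __name__ == "__main__":
    # Hypothetical two-node job with 8 processes per node.
    for rank in range(2):
        inner = torchrun_command(rank, nnodes=2, nproc_per_node=8,
                                 master_addr="10.0.0.1")
        print(shlex.join(docker_command("my-training-image", inner)))
```

Each printed line is meant to be run on the node with the matching rank. With the host network, the container shares the node's network stack, so the master address and port behave the same as they would without Docker.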