Hi! I have some questions about the recommended way to do multi-node training from inside Docker. Concretely, all my experiments run in a Docker container on each node, and this is straightforward with torchrun when I only need distributed training on a single node. However:
- If I need multi-node training, can I simply call torch.distributed.launch inside each Docker container separately, the same as without Docker?
- What settings do I need when building the Docker image and starting the container (e.g., networking)?
- Are there any other recommended practices when working with Docker?
Thanks a lot!