Hi! I have some questions about the recommended way of doing multi-node training from inside Docker. Concretely, all my experiments run in a Docker container on each node, and this is straightforward with torch.distributed.launch or torchrun when I only need distributed training on a single node. However:
- If I need multi-node training, can I simply call torch.distributed.launch inside each Docker container separately, the same as without Docker?
- What settings do I need when building the Docker image and bringing up the Docker container (e.g., network settings)?
- Are there any other recommended practices when working with Docker?
Thanks a lot!