PyTorch elastic and a local set of machines

Hi all!

I have set up some machines with GPUs on my local network (some have one GPU, some have several, and there are three different GPU models).
I would like to turn these machines into a small cluster so my team can run multi-node training, and I am now considering different ways to do it.
One option is torch.distributed.elastic, but I do not understand how it works.
I found a couple of examples, but all of them target a cloud provider with Kubernetes, which is not my case.

So, a few questions (let's assume I do not want to install a Kubernetes cluster yet):

  1. What is c10d? I have not been able to find anything about it by googling.
  2. The quick-start examples just pass everything to torchrun, but what about a virtualenv or conda env? Is it possible to run a Docker image like this?
  3. It is not clear to me what an ElasticAgent is and how to use it.

cc @aivanou @Kiuk_Chung

Torchelastic (torch.distributed.elastic) gives you the ability to execute a single distributed job. For example, if you have two machines and you need to execute a distributed job on them, you would log in to both machines and run the following command:

# machine 1 (train.py stands for your training script)
python -m torch.distributed.run --rdzv_id my_id --rdzv_backend c10d --rdzv_endpoint IP_MACHINE1:29400 --nnodes 2 --nproc_per_node 2 train.py

# machine 2
python -m torch.distributed.run --rdzv_id my_id --rdzv_backend c10d --rdzv_endpoint IP_MACHINE1:29400 --nnodes 2 --nproc_per_node 2 train.py
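For each worker process it spawns, the launcher (torchrun / torch.distributed.run) sets environment variables such as RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, and the launched script reads them. A minimal sketch of such a script (the helper name is my own; a real job would also call torch.distributed.init_process_group here):

```python
import os

def get_elastic_env():
    """Read the per-worker variables that torchrun sets in the environment.

    Defaults let the script also run standalone (single worker) for testing.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }

if __name__ == "__main__":
    # In a real training script you would now call
    # torch.distributed.init_process_group("nccl"); torchrun has already
    # set MASTER_ADDR / MASTER_PORT for the rendezvous.
    env = get_elastic_env()
    print(f"worker {env['rank']}/{env['world_size']} "
          f"(local rank {env['local_rank']})")
```

With `--nnodes 2 --nproc_per_node 2`, four such workers start in total, with ranks 0 through 3.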

In general, if you want to create a cluster among N machines that can be used for many jobs, you would need to set up something like SLURM (see the Slurm Workload Manager documentation).

  1. What is c10d? It stands for Caffe2 and ATen; we should probably rename it, since it is confusing for users.
  2. The virtualenv you would need to handle yourself; with TorchX it is possible to use Docker images.
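Handling the virtualenv yourself usually just means activating the environment before launching the worker, on every machine. A small wrapper sketch (the env path `~/envs/train`, the script name `train.py`, and the `launch_cmd` helper are placeholders, not part of torchrun):

```shell
#!/usr/bin/env sh
# Wrapper sketch: activate the per-machine Python environment, then launch
# the elastic worker. Run the same wrapper on each node.

launch_cmd() {
    # Build the torchrun command line; $1 is the rendezvous endpoint
    # (the same host:port on every node).
    echo "torchrun --rdzv_id my_id --rdzv_backend c10d --rdzv_endpoint $1 --nnodes 2 --nproc_per_node 2 train.py"
}

# . ~/envs/train/bin/activate      # virtualenv: activate before launching
# conda activate train             # ...or a conda env instead
# eval "$(launch_cmd IP_MACHINE1:29400)"
echo "would run: $(launch_cmd IP_MACHINE1:29400)"
```

The key point is that each machine only needs a working Python environment with PyTorch installed and network access to the rendezvous endpoint; torchrun itself does not manage environments for you.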

We are currently working on TorchX (see the TorchX documentation), which should let you execute your jobs easily.
I can help you set up the cluster, and maybe you will find TorchX helpful. Write me on Slack: csivanou at gmail dot com