I have a set of machines with GPUs on my local network (some have one GPU, some have several, and there are three different GPU models).
I would like to turn these machines into a kind of cluster so that my team can use multi-node training.
I am now considering different ways to do this.
One of these options is torch.distributed.elastic, but I do not understand how it works.
I found a couple of examples, but all of them target a cloud provider with Kubernetes, which is not my case.
So, some questions (let’s assume I do not want to set up a Kubernetes cluster yet):
- What is c10d? I have been unable to find anything about it on Google.
- The quick start examples just invoke torchrun, but what about a conda env? Is it possible to run a Docker image this way?
- It is not clear to me what ElasticAgent is or how to use it.