Example of torch.distributed


I am trying to use the torch.distributed package on a cluster, but I can’t find relevant docs or an example to start with.

I am aware of this link, but I think the API has changed considerably from what is described there. For example, the function get_rank is now defined in torch.distributed.collectives. Another thing is I am unable to locate the pytorch_exec script to launch my program.

Please point me in the right direction.

Thanks and regards,

As mentioned elsewhere, torch.distributed is not ready for usage yet; we are changing a lot of things. We’ll announce when it’s ready.

Is there any progress on this? I am looking to use PyTorch distributed on a cluster.

You can have a look at the imagenet example, which contains options to train in distributed mode.
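Before adapting the full ImageNet example, it may help to check that the process group machinery works at all. Below is a minimal sketch that initializes a single-process group with the gloo backend; the MASTER_ADDR/MASTER_PORT values are placeholders you would normally get from your cluster launcher, and in a real multi-node job rank and world_size would differ per process.

```python
import os
import torch.distributed as dist

# Placeholder rendezvous settings for a single local process;
# a real cluster launcher would supply these per node.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# One-process "cluster": rank 0 of world_size 1, gloo (CPU) backend.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# get_rank now lives directly on torch.distributed.
print("rank:", dist.get_rank())
print("world size:", dist.get_world_size())

dist.destroy_process_group()
```

Once this runs, the same init_process_group call (with per-process rank and the real world_size) is what the distributed mode of the ImageNet example builds on.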

Great, thanks very much!