Get local size in distributed training

Is there any way to get the number of processes per node in distributed training?
In Horovod we could use hvd.local_size(), but I found no alternative in the torch.distributed module.

There is no concept of local versus remote processes; there is only the total number of processes, available through torch.distributed.get_world_size().

To understand whether we need to add this: what do you need it for?

For example, I want to use all of a node's CPUs for data loading without oversubscribing them.
Thus I want to divide num_workers by the local size.
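A possible workaround, sketched below: launchers in the torchrun/torch.distributed.launch style typically export a LOCAL_WORLD_SIZE environment variable, which can stand in for hvd.local_size(). The environment-variable name and the fallback of 1 are assumptions about the launcher, not a torch.distributed API.

```python
import os

def get_local_size():
    # LOCAL_WORLD_SIZE is assumed to be set by the launcher
    # (e.g. torchrun); fall back to 1 for single-process runs.
    return int(os.environ.get("LOCAL_WORLD_SIZE", "1"))

def workers_per_process(total_cpus, local_size):
    # Split the node's CPUs evenly across the local processes,
    # keeping at least one DataLoader worker per process.
    return max(1, total_cpus // local_size)

num_workers = workers_per_process(os.cpu_count() or 1, get_local_size())
```

With 16 CPUs and 4 processes per node, each process would then use num_workers=4 for its DataLoader.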

@pietern Any solution for this? Thanks.