I read the official docs, and they left me totally confused.
What are the definitions of ‘world_size’ and ‘rank’ in torch.distributed.init_process_group()?
Regarding the argument ‘world_size’, is it the total number of devices across all machines, or just the number of machines?
Regarding the argument ‘rank’, is it an index for each machine, or an index for each device?
For instance, if I have 2 machines with 4 GPUs each, what values of ‘world_size’ and ‘rank’ do I have to set when I call torch.distributed.init_process_group()?
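
To make the example concrete, here is a rough sketch of the arithmetic. It assumes the common convention of one process per GPU, which is an assumption and not something stated above. Under that convention, world_size counts processes across all machines and rank is a unique index per process. The helper global_rank and the names num_nodes, gpus_per_node, node_rank and local_rank are hypothetical, chosen only for illustration.

```python
# Sketch only: assumes one process per GPU (a common convention, not the only one).
# All names below are hypothetical and used for illustration.

def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Map (machine index, GPU index on that machine) to a unique process index."""
    return node_rank * gpus_per_node + local_rank


num_nodes = 2        # machines in the example
gpus_per_node = 4    # GPUs per machine in the example

# Total number of processes across all machines.
world_size = num_nodes * gpus_per_node

# Rank of every process, grouped by machine.
ranks = [
    [global_rank(node, gpu, gpus_per_node) for gpu in range(gpus_per_node)]
    for node in range(num_nodes)
]

for node, node_ranks in enumerate(ranks):
    print(f"machine {node}: ranks {node_ranks} (world_size={world_size})")

# Each process would then pass its own values, for example:
# torch.distributed.init_process_group(
#     backend="nccl", init_method="env://",
#     world_size=world_size, rank=rank,
# )
```

Under that assumption, each of the 8 processes would pass world_size=8 and its own rank (0 to 7) to torch.distributed.init_process_group(). With a different process layout, such as one process per machine, the numbers would differ.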