What’s the definition of ‘world_size’ and ‘rank’ in torch.distributed.init_process_group()?
Regarding the argument ‘world_size’, is it the total device count across all machines, or just the number of machines?
Regarding the argument ‘rank’, is it an index for each machine, or an index for each device?
For instance, if I have 2 machines with 4 GPUs each, what values of ‘world_size’ and ‘rank’ do I have to set when I call torch.distributed.init_process_group()?
The concepts of world_size and rank are defined on processes (hence the name process_group). If you would like to create 8 processes, then the world_size should be 8, and the ranks for them should range from 0 to 7. It is up to the application to determine how to place processes on machines. In the above cluster (2 machines, and 4 GPUs each), the best setup would be creating 4 processes on each machine, with each exclusively working on a different GPU.
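To make the one-process-per-GPU layout concrete, here is a rough sketch in plain Python that works out world_size and a rank for every process in that cluster. The names ProcessSlot, plan_ranks, and init_kwargs are made up for illustration, and so are the "nccl" backend choice and the default port 29500. Giving each machine a contiguous block of ranks is one common convention, not something init_process_group requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSlot:
    """One process in the job (hypothetical helper type, not a PyTorch class)."""
    node_rank: int   # which machine, 0-based
    local_rank: int  # which GPU on that machine, 0-based
    rank: int        # global rank, unique across all processes


def plan_ranks(num_nodes: int, procs_per_node: int) -> list[ProcessSlot]:
    """Assign contiguous global ranks, one process per GPU (assumed convention)."""
    return [
        ProcessSlot(node, local, node * procs_per_node + local)
        for node in range(num_nodes)
        for local in range(procs_per_node)
    ]


def init_kwargs(slot: ProcessSlot, num_nodes: int, procs_per_node: int,
                master_addr: str, master_port: int = 29500) -> dict:
    """Keyword arguments a given process could pass to init_process_group."""
    return {
        "backend": "nccl",  # assumption: GPU training; "gloo" is another option
        "init_method": f"tcp://{master_addr}:{master_port}",
        "world_size": num_nodes * procs_per_node,  # total processes, not machines
        "rank": slot.rank,
    }


# 2 machines with 4 GPUs each gives 8 processes in total.
slots = plan_ranks(num_nodes=2, procs_per_node=4)
```

Each process would then call torch.distributed.init_process_group(**init_kwargs(slot, 2, 4, master_addr)) with its own slot, where master_addr is the address of the machine hosting rank 0, and would typically pin itself to the GPU given by slot.local_rank.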
“… the best setup would be creating 4 processes on each machine, …”
Hey @HuangLED, in this case, the world_size should be 8, and the ranks should range from 0-3 on the first machine and 4-7 on the second machine. This page might help explain: