World_size and rank in torch.distributed.init_process_group()

Hi there,

I read the official docs, and they totally confuse me.

What’s the definition of ‘world_size’ and ‘rank’ in torch.distributed.init_process_group()?

Regarding the argument ‘world_size’, is it the total device count across all machines, or just the total machine count?

Regarding the argument ‘rank’, is it an index for each machine, or an index for each device?

For instance, if I have 2 machines with 4 GPUs per machine, what values of ‘world_size’ and ‘rank’ do I have to set when I call torch.distributed.init_process_group()?

The concepts of world_size and rank are defined on processes (hence the name process_group). If you would like to create 8 processes, then world_size should be 8, and their ranks should range from 0 to 7. It is up to the application to determine how to place processes on machines. In the above cluster (2 machines with 4 GPUs each), the best setup would be creating 4 processes on each machine, with each process working exclusively on a different GPU.
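To make the counting concrete, here is a minimal sketch that only enumerates the processes for the cluster in the question. It does not call PyTorch at all. The names `NUM_MACHINES`, `GPUS_PER_MACHINE`, and `layout` are made up for this illustration, and the contiguous placement (consecutive ranks on the same machine) is just one possible choice, since placement is up to the application.

```python
NUM_MACHINES = 2       # assumption: the cluster from the question
GPUS_PER_MACHINE = 4   # assumption: one process per GPU

# world_size counts processes, not machines.
WORLD_SIZE = NUM_MACHINES * GPUS_PER_MACHINE


def layout(num_machines, gpus_per_machine):
    """Return one (rank, machine, gpu) tuple per process.

    Assumes consecutive ranks are placed on the same machine.
    """
    placement = []
    for rank in range(num_machines * gpus_per_machine):
        machine = rank // gpus_per_machine  # which machine hosts this process
        gpu = rank % gpus_per_machine       # which local GPU it uses
        placement.append((rank, machine, gpu))
    return placement


if __name__ == "__main__":
    print(f"world_size = {WORLD_SIZE}")
    for rank, machine, gpu in layout(NUM_MACHINES, GPUS_PER_MACHINE):
        print(f"rank {rank}: machine {machine}, GPU {gpu}")
```

Every one of the 8 processes would pass the same `world_size` and its own unique `rank` to `init_process_group()`.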


“… the best setup would be creating 4 processes on each machine, …”

In this case, should world_size be 4, and should each process on every machine have a rank from 0 to 4?

Then, with this setup, how do I define the rank of these two machines?

“… the best setup would be creating 4 processes on each machine, …”

Hey @HuangLED, in this case, the world_size should be 8, and the ranks should range from 0-3 on the first machine and 4-7 on the second machine. This page might help explain:
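Separately from the linked page, the rank assignment described in this reply can be sketched as below. This is an illustration under stated assumptions, not the only way to do it: `machine_index`, `local_rank`, `global_rank`, and `init_worker` are hypothetical names, the `tcp://10.0.0.1:29500` address is a placeholder for whatever address reaches the first machine, and the `nccl` backend assumes each process has a CUDA GPU.

```python
NUM_MACHINES = 2       # assumption: the cluster from the question
GPUS_PER_MACHINE = 4   # assumption: one process per GPU


def global_rank(machine_index, local_rank, gpus_per_machine=GPUS_PER_MACHINE):
    """Rank to pass to init_process_group: unique across all machines."""
    return machine_index * gpus_per_machine + local_rank


def init_worker(machine_index, local_rank):
    """Join the 8-process group from a single worker process."""
    # Imported here so the rank arithmetic above can be tried without PyTorch.
    import torch
    import torch.distributed as dist

    # Pin this process to its own GPU on the local machine.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://10.0.0.1:29500",  # placeholder address of machine 0
        world_size=NUM_MACHINES * GPUS_PER_MACHINE,  # 8 processes in total
        rank=global_rank(machine_index, local_rank),
    )


if __name__ == "__main__":
    for machine in range(NUM_MACHINES):
        ranks = [global_rank(machine, lr) for lr in range(GPUS_PER_MACHINE)]
        print(f"machine {machine}: ranks {ranks}")
    # machine 0: ranks [0, 1, 2, 3]
    # machine 1: ranks [4, 5, 6, 7]
```

Note that the machines themselves have no rank in `init_process_group()`. Only processes do, so the machine index appears here solely as an input for computing each process's rank.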
