What is the difference between rank and local-rank?

The docs confuse me quite a lot; could you please tell me:

  1. What is the difference between the two?
  2. When must I use rank, and when must I use local_rank?

Hi @AlexLuya,

In the context of multi-node training, you have:

  • local_rank, the rank of the process on its local machine (node).
  • rank, the global rank of the process across all nodes.

To illustrate, let's say you have 2 nodes (machines) with 2 GPUs each; you will have a total of 4 processes (p1…p4):

            |   Node 1  |   Node 2  |
            | p1  | p2  | p3  | p4  |
local_rank  | 0   | 1   | 0   | 1   |
rank        | 0   | 1   | 2   | 3   |
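To make the mapping above concrete, here is a minimal sketch in plain Python (no torch required) of how a launcher typically derives the global rank from the node index and the local rank; `global_rank` is a hypothetical helper name, not a PyTorch API:

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank = node index * processes per node + local rank."""
    return node_rank * nproc_per_node + local_rank

# Reproduce the 2-node x 2-GPU table above:
for node in range(2):
    for local in range(2):
        print(f"node={node} local_rank={local} -> rank={global_rank(node, local, 2)}")
```

Running this prints ranks 0–3 in the same order as the table, showing why p3 (node 1, local_rank 0) has global rank 2.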

@spanev, thanks. If p3 wants to send something to p4:

  1. it can use either local_rank or rank,
  2. but for performance, it should use local_rank.

Am I right about the above two?

  3. Why not just always use rank and let the library decide which send method (cross-process or cross-node) to use? In what case must a developer use local_rank?

I’m assuming you’re referring to local_rank mentioned here: https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py

  1. You should use rank and not local_rank when using torch.distributed primitives (send/recv etc). local_rank is passed to the training script only to indicate which GPU device the training script is supposed to use.
  2. You should always use rank.
  3. local_rank is supplied to the developer to indicate that a particular instance of the training script should use the “local_rank” GPU device. For illustration, in the example above provided by @spanev, p1 is passed local_rank 0 indicating it should use GPU device id 0.
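To sketch the typical division of labor described above: local_rank picks the GPU device on the local machine, while rank addresses peers in torch.distributed calls. The snippet below is a hedged illustration, not a definitive implementation; the only executable part is a small hypothetical helper, `pick_device`, while the torch.distributed usage is shown in comments because it requires a multi-process launch. Note that depending on the launcher version, local_rank arrives either as a `--local_rank` script argument or as a `LOCAL_RANK` environment variable.

```python
def pick_device(local_rank: int, cuda_available: bool) -> str:
    # local_rank only selects which GPU this process should use on its
    # own machine; it is never used to address peers in send/recv.
    return f"cuda:{local_rank}" if cuda_available else "cpu"

# Inside a training script started by a distributed launcher, the
# pattern would look roughly like this (sketch, not runnable here):
#
#   local_rank = int(os.environ["LOCAL_RANK"])   # which local GPU to use
#   rank       = int(os.environ["RANK"])         # global identity
#   torch.cuda.set_device(local_rank)            # device choice: local_rank
#   dist.send(tensor, dst=peer_rank)             # addressing peers: rank

print(pick_device(1, True))   # cuda:1
print(pick_device(0, False))  # cpu
```

So to the performance question above: local_rank does not make point-to-point communication faster; the backend decides the transport from the global ranks, which is exactly why rank is the right identifier for send/recv.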