Node, rank, local_rank

Hi,

In torch.distributed:
node means the machine (computer) ID in the network.
rank means the global rank, i.e. the process ID across the whole network.
local_rank means the process ID on the local machine (computer).

Is my understanding above correct?

In my code,
local_rank can be read from args.local_rank if torch.distributed is used. Otherwise local_rank doesn't exist in args, right?

How can I get the node ID and rank in my code?

Hi, your understanding is correct. Here are the definitions we also refer to in documentation (torch.distributed.run (Elastic Launch) — PyTorch master documentation)
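To make the relationship between the three terms concrete, here is a minimal sketch. It assumes every node runs the same number of processes; the function name global_rank and the parameter nproc_per_node are only illustrative, not part of the torch.distributed API.

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a process, assuming each node runs nproc_per_node processes."""
    return node_rank * nproc_per_node + local_rank


# Example: 2 nodes (machines) with 4 processes each gives global ranks 0..7.
for node_rank in range(2):
    for local_rank in range(4):
        rank = global_rank(node_rank, 4, local_rank)
        print(f"node={node_rank} local_rank={local_rank} rank={rank}")
```

With this layout, local_rank restarts at 0 on every node, while rank is unique across the whole job.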

I assume you are using torch.distributed.launch which is why you are reading from args.local_rank. If you don’t use this launcher then the local_rank will not exist in args.
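If it helps, one way to make args.local_rank always exist is to declare the argument yourself with a default. This is only a sketch: the default of -1 is an assumed convention here for "not started by the launcher", and parse_args is an illustrative helper name.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank=<n> to every process it spawns.
    # Assumption: -1 is used here to signal "not started by the launcher".
    parser.add_argument("--local_rank", type=int, default=-1)
    # parse_known_args ignores any other arguments this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Started by the launcher: the flag is present.
print(parse_args(["--local_rank=1"]).local_rank)
# Started as a plain script: the attribute still exists and holds the default.
print(parse_args([]).local_rank)
```

Passing argv=None makes the function read the real command line, which is what you would do in a training script.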

As of torch 1.9 we have an improved and updated launcher, torch.distributed.run (Elastic Launch), in which you can read local_rank and rank from environment variables. If you need the ID of the node your code is running on, you could perhaps identify it through its hostname.
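A rough sketch of reading these values is below. RANK, LOCAL_RANK and WORLD_SIZE are set by the new launcher; GROUP_RANK is, as far as I know, the rank of the worker group, which only equals the node rank when one worker group runs per node, so treat that as an assumption. The helper name get_dist_info and the fallback defaults (used when the script is not started by the launcher) are illustrative choices.

```python
import os
import socket


def get_dist_info(env=None):
    """Collect rank information from launcher-style environment variables."""
    env = os.environ if env is None else env
    return {
        # Fallbacks below are assumptions for a plain single-process run.
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        # Assumption: one worker group per node, so GROUP_RANK acts as the node rank.
        "node_rank": int(env.get("GROUP_RANK", 0)),
        # The hostname identifies the machine regardless of the launcher.
        "hostname": socket.gethostname(),
    }


info = get_dist_info()
print(info)
```

Run as a plain script this prints the fallback values; under the launcher each process prints its own rank and local_rank.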

@H-Huang,
Thank you!

Yes, I am using torch.distributed.launch. Thank you for your reply!