Node, rank, local_rank


In torch.distributed:
node means the machine (computer) id in the network.
rank (i.e. global rank) means the process id across the whole network.
local_rank means the process id on the local machine (computer).

Is my understanding above correct?

in my code,
local_rank can be read from args.local_rank if torch.distributed is used; otherwise local_rank doesn’t exist in args, right?

How could I get the node id and rank in my code?

Hi, your understanding is correct. Here are the definitions we also refer to in the documentation ((Elastic Launch) — PyTorch master documentation).

I assume you are using torch.distributed.launch, which is why you are reading from args.local_rank. If you don’t use this launcher, then local_rank will not exist in args.
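For context, a minimal sketch of how a script launched with torch.distributed.launch typically receives local_rank: the launcher passes it as a --local_rank command-line argument, so the script must declare it with argparse. The default of 0 here is an assumption so the script can also run standalone.

```python
# Sketch: torch.distributed.launch passes --local_rank on the command line.
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # default=0 is an assumption, letting the script run without the launcher
    parser.add_argument("--local_rank", type=int, default=0)
    return parser

# simulating the launcher passing --local_rank=2 to this process
args = build_parser().parse_args(["--local_rank", "2"])
print(args.local_rank)
```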

As of torch 1.9 we have an improved and updated launcher ((Elastic Launch) — PyTorch master documentation), with which you can read local_rank and rank from the environment variables. If you need the id of the node your code is on, you can identify it through its hostname.
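A minimal sketch of that approach, assuming the script is started with the torch >= 1.9 launcher (torchrun), which sets the RANK and LOCAL_RANK environment variables; the fallback defaults of 0 are an assumption for running the script standalone:

```python
# Sketch: read rank/local_rank from the environment variables set by torchrun,
# and identify the node via its hostname.
import os
import socket

# fall back to single-process defaults when not started by the launcher (assumption)
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# the hostname identifies which machine (node) this process is running on
node = socket.gethostname()
print(f"rank={rank} local_rank={local_rank} node={node}")
```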

@H-Huang ,
Thank you!

Yes, I am using torch.distributed.launch. Thank you for your reply!