Get environment variables dynamically

When using torchrun with elasticity, nodes can join or leave the group.

I want to get the current state of the environment, and I found torch.distributed.get_world_size() and torch.distributed.get_rank().
I am not sure, but these two functions seem to return values that reflect the current state.
However, I can’t find a way to get the current number of nodes.

My questions are

  1. Do get_world_size() and get_rank() return values that reflect the current state?
  2. Is there a function that returns the number of nodes?

Thanks!

According to the documentation, whenever a node joins or leaves, all workers restart with new RANK and WORLD_SIZE values, so the values a worker reads at startup should match the current membership.
I also found the GROUP_RANK and GROUP_WORLD_SIZE environment variables.

In conclusion, I will use the environment variables below:

  • local: LOCAL_RANK, LOCAL_WORLD_SIZE
  • global: RANK, WORLD_SIZE
  • node: GROUP_RANK, GROUP_WORLD_SIZE
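
For anyone who wants to read these in one place, here is a minimal sketch, not an official API. The names TorchrunEnv and read_torchrun_env are made up for illustration, and the fallback values (rank 0, size 1) are my own assumption for the case where the script is started without torchrun. Since workers are restarted on a membership change, reading the variables once at startup should be enough.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TorchrunEnv:
    """Hypothetical container for the variables listed above."""
    local_rank: int        # rank of this worker within its node
    local_world_size: int  # number of workers on this node
    rank: int              # global rank of this worker
    world_size: int        # total number of workers
    group_rank: int        # rank of this node (worker group)
    group_world_size: int  # number of nodes (worker groups)


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> TorchrunEnv:
    """Read the variables from `env`.

    Assumption: when a variable is missing (e.g. not launched by torchrun),
    fall back to a single-process layout: rank 0, size 1.
    """
    def get_int(name: str, default: int) -> int:
        return int(env.get(name, default))

    return TorchrunEnv(
        local_rank=get_int("LOCAL_RANK", 0),
        local_world_size=get_int("LOCAL_WORLD_SIZE", 1),
        rank=get_int("RANK", 0),
        world_size=get_int("WORLD_SIZE", 1),
        group_rank=get_int("GROUP_RANK", 0),
        group_world_size=get_int("GROUP_WORLD_SIZE", 1),
    )
```

Calling read_torchrun_env() at the top of the training script gives the layout for the current run; group_world_size is then the number of nodes asked about in question 2.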