Why local_rank and not global rank

Why do we use the local rank in distributed training for logging and saving models, instead of the global rank? My understanding is that with the local rank we would log and save once per node, whereas with the global rank it would happen only once across all nodes.

Yes, the rank needs to be the global rank in the case of multi-node distributed training. Are you referring to a particular tutorial/example which uses the local rank in a multi-node setting? It should have a line that converts the local rank to a global one, like global_rank = node_rank * ngpus_per_node + gpu, where node_rank is the index of the node and gpu is the device id on that node.
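A minimal sketch of that conversion and how the global rank would then gate logging/saving (variable names like node_rank, ngpus_per_node, and local_rank are illustrative, not taken from any particular example):

```python
import torch.distributed as dist

def init_process(node_rank: int, local_rank: int, ngpus_per_node: int, world_size: int):
    # Global rank: which process this is across *all* nodes,
    # derived from the node index and the device index on that node.
    global_rank = node_rank * ngpus_per_node + local_rank

    dist.init_process_group(
        backend="nccl",
        init_method="env://",   # expects MASTER_ADDR / MASTER_PORT to be set
        world_size=world_size,
        rank=global_rank,
    )

    # Gate logging/checkpointing on the *global* rank so it happens
    # exactly once across the whole job, not once per node.
    if global_rank == 0:
        print("only this process logs and saves the checkpoint")
    return global_rank
```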

Yes, one example I'm working with is NVIDIA's Deep Learning Examples repo here: they are using the local rank, and that can sometimes cause errors, especially when two nodes are trying to write at the same time.

I think the goal with that code is to take a checkpoint on each node, so that nodes can individually reload checkpoints if training crashes and there is a need to resume.

For example, if the checkpoint is taken only on global rank 0 and that node crashes, the training state would be lost, whereas if it is taken on all nodes, there is a higher likelihood that a copy remains available.
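If per-node checkpoints are the goal, the gate would be on the local rank instead. A sketch under that assumption (the file naming and the arguments are illustrative); writing to a per-node filename also avoids the concurrent-write collisions mentioned above when nodes share a filesystem:

```python
import torch

def save_node_checkpoint(model, optimizer, epoch, local_rank: int, node_rank: int):
    # One process per node (local rank 0) writes a checkpoint, so every node
    # keeps its own copy and can resume even if another node's copy is lost.
    if local_rank == 0:
        state = {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        # Per-node filename so nodes writing to a shared filesystem
        # do not clobber each other's files.
        torch.save(state, f"checkpoint_node{node_rank}.pt")
```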

Aha, I understand your point, but then how do we account for the bugs addressed here?
Also, in my case I am using Slurm, so all nodes save checkpoints in a centralized fashion to my training folder, so there's no need for that.
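For the Slurm case with a shared training folder, the global rank is already exposed by Slurm's environment variables, so saving can be gated on it directly. A minimal sketch, assuming a shared filesystem (the checkpoint path is just an example):

```python
import os
import torch

# Slurm exposes the global task rank and the per-node rank directly.
global_rank = int(os.environ["SLURM_PROCID"])   # rank across all nodes
local_rank = int(os.environ["SLURM_LOCALID"])   # rank within this node

def maybe_save(state: dict, path: str = "training_dir/checkpoint.pt"):
    # With a shared filesystem, only one process in the whole job needs
    # to write, so gate on the global rank to avoid concurrent writes.
    if global_rank == 0:
        torch.save(state, path)
```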

I am just wondering if my understanding is correct that in most cases we need to use the global_rank, not the local_rank, for logging and saving.