I run evaluation after every training epoch, on rank_0 only.
During evaluation I observed (through nvidia-smi) that the other ranks/GPUs (1, 2, 3) keep running at 100% load.
My questions:
Is it possible that the other ranks continue training the next epoch without waiting for rank_0 to finish evaluation?
If (1) is true, is it OK to leave it like this (will rank_0 process its part of the next epoch after it finishes evaluation)? Or is it better to set a barrier so that the other ranks wait for rank_0 to finish evaluation?
Is it possible that the other ranks continue training the next epoch without waiting for rank_0 to finish evaluation?
Yes, it is possible, because all communication/synchronization happens in the backward pass. The other ranks will proceed with their next forward pass and local backward pass, and then block on the AllReduce operation in the backward pass until rank_0 catches up.
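This blocking behavior can be simulated without any GPUs or DDP at all. Below is a framework-free sketch using Python threads, where a `threading.Barrier` stands in for the gradient-AllReduce rendezvous (all names are illustrative, not DDP's API): the other ranks run ahead into the next epoch's forward/backward, but no rank gets past the "AllReduce" until rank 0 finishes its evaluation and arrives.

```python
import threading
import time

WORLD_SIZE = 4
allreduce = threading.Barrier(WORLD_SIZE)  # stands in for the gradient AllReduce
events = []
lock = threading.Lock()

def log(msg):
    with lock:
        events.append(msg)

def worker(rank):
    # Epoch N has just finished; rank 0 alone runs evaluation.
    if rank == 0:
        log("rank0: evaluating")
        time.sleep(0.3)  # evaluation takes a while
        log("rank0: eval done")
    # All ranks start epoch N+1: forward + local backward proceed freely...
    log(f"rank{rank}: forward/backward of next epoch")
    # ...then every rank blocks here until ALL ranks reach the AllReduce.
    allreduce.wait()
    log(f"rank{rank}: allreduce done")

threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Ranks 1-3 typically reach their forward/backward long before rank 0 finishes
# evaluating, but no rank can complete the AllReduce until rank 0 arrives.
eval_done = events.index("rank0: eval done")
first_allreduce = min(i for i, e in enumerate(events) if "allreduce done" in e)
assert first_allreduce > eval_done
```

Correctness is preserved for the same reason as in real DDP: the collective cannot complete until every participant has reached it.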
In case (1) is true, is it OK to leave it like this (will rank_0 process its part of the next epoch after it finishes evaluation)?
Yes, it should be OK to leave it this way. The other ranks just block on the AllReduce until rank_0 finishes evaluation and runs its subsequent backward pass. It shouldn't affect correctness.
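If you do want the other ranks to idle explicitly instead of running ahead, a `dist.barrier()` right after rank 0's evaluation is the usual pattern. A minimal sketch (the function and variable names here are illustrative, not a fixed API; the demo at the bottom runs single-process on CPU with the gloo backend just to show the call sequence):

```python
import os
import torch.distributed as dist

def end_of_epoch_eval(rank, evaluate_fn, use_barrier=False):
    """Run evaluation on rank 0 only, optionally fencing all ranks afterward."""
    if rank == 0:
        evaluate_fn()
    if use_barrier:
        # Without this, the other ranks run ahead into the next epoch and block
        # at the first gradient AllReduce; with it, they idle here instead.
        dist.barrier()

# Single-process demo (gloo backend, CPU); in real training this runs per rank.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

ran = []
end_of_epoch_eval(rank=0, evaluate_fn=lambda: ran.append(True), use_barrier=True)

dist.destroy_process_group()
```

Either way the math is the same; the barrier mainly changes where the other ranks wait (idle at the barrier vs. blocked inside the AllReduce).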
During evaluation I observed (through nvidia-smi) that the other ranks/GPUs (1, 2, 3) keep running at 100% load.
Yes, when block-waiting on the AllReduce, the GPU shows as busy in nvidia-smi (the pending collective spin-waits on the device), even though no real computation is running.