From the definitions of the two functions, I would expect all_gather to take more time and memory: it has more communication to do (to the non-root ranks), and the non-root ranks also have to store the gathered result. So if we're on a 4-GPU machine and doing all_gather on CPU, the machine should use roughly 4 times as much CPU memory for the gathered result as a plain gather would.
Experimentally, however, that doesn't seem to be the case: the memory and runtime usage of the two are similar. Does anyone know why?
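For reference, this is roughly how I'm measuring it (a simplified sketch; the tensor size, the world size of 4, and using psutil RSS are just what I happened to pick):

```python
# bench.py -- launch with: OP=gather      torchrun --nproc_per_node=4 bench.py
#             and with:    OP=all_gather  torchrun --nproc_per_node=4 bench.py
import os
import time

import psutil
import torch
import torch.distributed as dist


def main():
    dist.init_process_group("gloo")        # CPU collectives
    rank = dist.get_rank()
    world = dist.get_world_size()

    # ~400 MB of float32 per rank, big enough to show up clearly in RSS.
    local = torch.randn(100 * 1024 * 1024)

    op = os.environ.get("OP", "all_gather")
    start = time.perf_counter()
    if op == "gather":
        # Only the destination rank pre-allocates the world_size output buffers.
        bufs = [torch.empty_like(local) for _ in range(world)] if rank == 0 else None
        dist.gather(local, gather_list=bufs, dst=0)
    else:
        # Every rank pre-allocates world_size output buffers.
        bufs = [torch.empty_like(local) for _ in range(world)]
        dist.all_gather(bufs, local)
    dist.barrier()

    rss_gib = psutil.Process().memory_info().rss / 1024**3
    print(f"rank {rank} | {op:<10} | {time.perf_counter() - start:.2f}s | RSS {rss_gib:.2f} GiB")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```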
> So if we're on a 4-GPU machine and doing all_gather on CPU, the machine should use roughly 4 times as much CPU memory for the gathered result as a plain gather would.
I think the premise of "we're on a 4-GPU machine and doing all_gather on CPU" is confusing. Should it be "doing all_gather on GPU"?
My mental model is that all_gather can be viewed as parallelized gathers, one rooted at each rank, so the memory and runtime usage is probably similar.
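Conceptually (this is only to illustrate the mental model, not how gloo/NCCL actually implement the collective), something like:

```python
import torch.distributed as dist


def all_gather_as_gathers(output_list, tensor, group=None):
    """Illustration only: all_gather is semantically a gather rooted at every rank."""
    rank = dist.get_rank(group)
    world = dist.get_world_size(group)
    for dst in range(world):
        # On the dst rank we receive into output_list; every other rank only sends.
        dist.gather(tensor,
                    gather_list=output_list if rank == dst else None,
                    dst=dst,
                    group=group)
```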
We run training on GPU, but the data we're gathering is very big and would quickly OOM on CUDA, so we gather it on CPU instead before computing metrics.
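Roughly, the setup looks like this (simplified sketch; `gather_for_metrics` and the separate gloo group are just how I've written it here, not our exact code):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")              # training collectives stay on GPU
cpu_group = dist.new_group(backend="gloo")   # separate gloo group for CPU tensors


def gather_for_metrics(preds_gpu):
    # Move the big predictions off the GPU first, then gather on CPU.
    preds_cpu = preds_gpu.detach().cpu()
    out = [torch.empty_like(preds_cpu) for _ in range(dist.get_world_size())]
    dist.all_gather(out, preds_cpu, group=cpu_group)
    return torch.cat(out)
```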
If all_gather is parallelized gathers, I can understand that the runtime would be similar. But wouldn't the memory still be different, since with gather only the root rank stores the gathered result, whereas with all_gather every rank stores it?
Or is it that, because we're calling all_gather on CPU, PyTorch knows under the hood that everything lives on the same machine, so it's effectively still doing just a gather (e.g. backing the copies with shared memory)?
And in that case, if we ran all_gather on 2 GPUs vs. 4 GPUs, would we see a memory difference instead?
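One way I was thinking of checking the shared-memory theory is to compare USS (which excludes pages shared with other processes) against RSS right after the collective, assuming psutil's memory_full_info is available on the machine:

```python
import psutil
import torch.distributed as dist

# Assumes a process group is already initialized, e.g. dropped into the
# benchmark script above, right after the collective.
info = psutil.Process().memory_full_info()
print(f"rank {dist.get_rank()}: RSS {info.rss / 1024**3:.2f} GiB, "
      f"USS {info.uss / 1024**3:.2f} GiB")
# If the gathered buffers were backed by memory shared across the ranks'
# processes, USS would stay much lower than RSS on the non-root ranks.
```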