From the definitions of the two functions, I would expect all_gather to take more time and memory than gather: it has more communication to do (to the non-root ranks), and the non-root ranks also have to store the all_gathered result. So on a 4-GPU machine doing all_gather on CPU, the machine's CPU memory usage should be about 4 times that of doing a gather.
However, experimentally that does not seem to be the case: the memory usage and runtime of the two are similar. Does anyone know why?
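
To make the expectation concrete, here is a rough back-of-the-envelope sketch of the arithmetic I have in mind. It is not a benchmark and it does not call torch.distributed at all: it only counts the bytes of the output buffers that each rank would hold, assuming all ranks live on one machine and every rank contributes a tensor of the same size. The helper name `gathered_bytes` and the 4 MiB tensor size are made up for illustration, and anything the backend allocates internally is ignored.

```python
def gathered_bytes(world_size: int, tensor_bytes: int, collective: str) -> int:
    """Total bytes of gathered output buffers summed over all ranks.

    Hypothetical helper for illustration only. It ignores the input
    tensors and any temporary buffers the backend may allocate.
    """
    # A rank that receives the result holds one tensor per participating rank.
    per_receiver = world_size * tensor_bytes

    if collective == "gather":
        receivers = 1           # only the root rank stores the result
    elif collective == "all_gather":
        receivers = world_size  # every rank stores the full result
    else:
        raise ValueError(f"unknown collective: {collective!r}")

    return receivers * per_receiver


world_size = 4                 # 4 ranks on one machine, as in the question
tensor_bytes = 4 * 1024 ** 2   # assumed 4 MiB per rank, purely illustrative

gather_total = gathered_bytes(world_size, tensor_bytes, "gather")
all_gather_total = gathered_bytes(world_size, tensor_bytes, "all_gather")

print("gather     output bytes:", gather_total)
print("all_gather output bytes:", all_gather_total)
print("expected ratio:", all_gather_total // gather_total)
```

Under these assumptions the output buffers for all_gather add up to world_size times those for gather, which is where the "4 times" figure comes from.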