Is all_gather supposed to take more time/memory than gather?

From the definitions of the two functions, I would expect all_gather to take more time and memory than gather: it has more communication to do (to the non-root ranks), and every non-root rank also has to store the full gathered result. So on a 4-GPU machine, if we run all_gather on CPU, the machine's CPU memory usage should be about 4 times that of a gather.
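To make that expectation concrete, here is a rough sketch of the arithmetic I have in mind. It is not a measurement: it only counts the gathered output buffers, assumes every rank contributes the same number of bytes, assumes rank 0 is the root, and assumes all 4 ranks are processes on the same machine sharing its CPU memory. The helper name `gathered_output_bytes` and the 1 MB size are made up for illustration.

```python
def gathered_output_bytes(world_size, bytes_per_rank, collective):
    """Return the gathered-output bytes held by each rank (hypothetical helper).

    Only output buffers are counted; input tensors and any temporary
    buffers used by the communication backend are ignored.
    """
    full_result = world_size * bytes_per_rank
    if collective == "gather":
        # Only the root (assumed to be rank 0) stores the gathered result.
        return [full_result if rank == 0 else 0 for rank in range(world_size)]
    if collective == "all_gather":
        # Every rank stores its own copy of the gathered result.
        return [full_result] * world_size
    raise ValueError(f"unknown collective: {collective}")


world_size = 4              # one process per GPU on a 4-GPU machine
bytes_per_rank = 1_000_000  # illustrative size of each rank's contribution

gather_total = sum(gathered_output_bytes(world_size, bytes_per_rank, "gather"))
all_gather_total = sum(gathered_output_bytes(world_size, bytes_per_rank, "all_gather"))

print("gather total bytes:    ", gather_total)
print("all_gather total bytes:", all_gather_total)
print("ratio:                 ", all_gather_total // gather_total)
```

Under these assumptions the ratio equals the number of ranks, which is where the "4 times" figure comes from.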

However, experimentally that does not seem to be the case: the memory usage and runtime of the two are similar. Does anyone know why?