Difference between torch.cuda.synchronize() and dist.barrier()

Taejune · January 28, 2023, 8:17am

Hi, I’m wondering what the main difference is between torch.cuda.synchronize() and dist.barrier().
I know that the former blocks the CPU thread until the previously queued GPU work is done,
and the latter makes processes wait until every process reaches dist.barrier().

So I think those two have the same purpose… what’s the difference?

If I’m wrong, please feel free to correct me.

Thanks!

ptrblck · January 28, 2023, 8:40am

torch.cuda.synchronize() synchronizes the current device: it waits until all GPU work queued on that device is finished, blocking the host from advancing in the meantime.
dist.barrier() is used in a distributed setup and blocks each process until every process in the group has entered this function. Even if all GPU work is already done in one process, that process would still wait at the barrier until all other processes reach it before advancing.
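
To make the distinction concrete, here is a minimal sketch of how the two calls are often used side by side. The helper names `timed_step` and `wait_for_all_ranks` are hypothetical, made up for this illustration, and the code assumes it may also be run on a machine without a GPU or without an initialized process group, in which case the corresponding call is simply skipped.

```python
import time

import torch
import torch.distributed as dist


def timed_step(fn):
    """Run fn() and return (result, wall-clock seconds) for THIS process only."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Drain GPU work queued earlier so it is not counted in the timing.
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    if use_cuda:
        # Kernels launch asynchronously; block the host until they finish.
        torch.cuda.synchronize()
    return out, time.perf_counter() - start


def wait_for_all_ranks():
    """Block until every process in the default group reaches this call."""
    # Hypothetical guard: skip the barrier when no process group is set up,
    # e.g. in a plain single-process run.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()


# Per-process concern: how long did the work on my device take?
result, seconds = timed_step(lambda: torch.ones(4).sum())

# Cross-process concern: do not continue until all ranks have arrived.
wait_for_all_ranks()
```

In this sketch, torch.cuda.synchronize() only concerns the host and the device of the calling process, while dist.barrier() only concerns the processes in the group. In a real multi-process run, the process group would first have to be created with dist.init_process_group().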

Taejune · January 28, 2023, 12:15pm

Thanks for the quick and clear answer!
So if I use dist.barrier() in a distributed setup, does it also act like torch.cuda.synchronize()?
What is the right usage of each function?

Thanks a lot.