Hi everyone! I have a piece of PyTorch code that runs in a single-machine distributed setting. The code contains
all_reduce operations to gather predictions from each GPU and compute metrics, without any noticeable slowdown in training speed. I recently added an extra bit of code for something new I'm trying, where I need to broadcast a probability sampled from a uniform distribution (a single number) to all GPUs, so that all of them have the same probability value. I added it inside the training loop, and it looks like this:
```python
if dist.get_rank() == 0:
    prob = torch.rand(1).cuda()
else:
    prob = torch.zeros(1).cuda()
dist.broadcast(prob, 0)
```
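For what it's worth, one workaround I've been considering is to skip the broadcast entirely and have every rank derive the same value from a seed shared across ranks (e.g. the training step). This is just a sketch; `sample_shared_prob` and the seeding scheme are my own made-up names, not anything from my actual code:

```python
import random

def sample_shared_prob(step: int, base_seed: int = 1234) -> float:
    # Every rank seeds an independent RNG with the same (base_seed + step),
    # so all ranks sample an identical value with no communication at all.
    rng = random.Random(base_seed + step)
    return rng.random()

# Two ranks at the same training step draw the same probability:
assert sample_shared_prob(step=0) == sample_shared_prob(step=0)
```

But I'd still like to understand why the broadcast itself is so costly.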
After adding this, training becomes significantly slower! Any ideas why? Thank you!