Distributed broadcast operation slows down training speed

Hi everyone! I have some PyTorch code that runs in a single-machine distributed setting. The code uses all_gather and all_reduce operations to gather predictions from each GPU and to compute metrics, respectively, without any noticeable slowdown in training speed. I recently added an extra bit of code for something new I'm trying, where I need to broadcast a probability sampled from a uniform distribution (a single number) to all GPUs, so that all of them have the same value. I added it inside the training loop, and it looks like this:

if dist.get_rank() == 0:
    prob = torch.rand(1).cuda()   # rank 0 samples the value
else:
    prob = torch.zeros(1).cuda()  # other ranks allocate a placeholder
dist.broadcast(prob, 0)           # copy rank 0's value to every rank

After adding this, the code becomes significantly slower! Any ideas why? Thank you!
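
For reference, the metric-gathering code that was already in place (and runs without any slowdown) follows this common pattern. This is a simplified sketch with illustrative names, not my exact code:

import torch
import torch.distributed as dist

def gather_predictions(preds, world_size):
    # Collect each rank's prediction tensor onto every rank.
    gathered = [torch.zeros_like(preds) for _ in range(world_size)]
    dist.all_gather(gathered, preds)
    return torch.cat(gathered)

def reduce_metric(value):
    # Sum a scalar metric tensor across ranks, e.g. a correct-prediction count.
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value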

I think in your example the creation of prob when rank != 0 should be prob = torch.zeros(1, device=rank).
Your code:

def worker(gpu, *args):
    ...
    if gpu == 0:
        prob = torch.rand(1).cuda() #.to(device=gpu)
        print(gpu, prob)
    else:
        prob = torch.zeros(1).cuda() # .to(device=gpu)
        print(gpu, prob)

# output
0 tensor([0.1045], device='cuda:0')
1 tensor([0.], device='cuda:0') # <- also on cuda:0

Without setting the current device, .cuda() allocates on cuda:0 in every process. Specifying the device explicitly fixes the placement:

def worker(gpu, *args):
    ...
    if gpu == 0:
        prob = torch.rand(1, device=gpu)
        print(gpu, prob)
    else:
        prob = torch.zeros(1, device=gpu)
        print(gpu, prob)

# output
0 tensor([0.8787], device='cuda:0')
1 tensor([0.], device='cuda:1')

Hi David! Thank you for your comment. I thought of that too, but the PyTorch documentation got me confused. In particular, the torch.Tensor.cuda — PyTorch 1.9.0 documentation says this about the device argument:
device (torch.device) – The destination GPU device. Defaults to the current CUDA device.

So I thought that if you don't specify it, the default would be the current CUDA device rather than always 'cuda:0'?

That depends on whether you have set the current device:

def worker(gpu, *args):
    ...
    torch.cuda.set_device(gpu) # <- This line makes the difference
    if gpu == 0:
        prob = torch.rand(1).cuda()
        print(gpu, prob)
    else:
        prob = torch.zeros(1).cuda()
        print(gpu, prob)

# output
0 tensor([0.6916], device='cuda:0')
1 tensor([0.], device='cuda:1') # <- on correct device
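
Putting it together for your original snippet: a minimal sketch of the fix, assuming one process per GPU with the rank doubling as the device index (the wrapping function name is just illustrative):

import torch
import torch.distributed as dist

def training_step(rank):
    torch.cuda.set_device(rank)        # make cuda:<rank> the current device
    if dist.get_rank() == 0:
        prob = torch.rand(1).cuda()    # sampled on rank 0's own GPU
    else:
        prob = torch.zeros(1).cuda()   # placeholder on this rank's own GPU
    dist.broadcast(prob, src=0)        # every rank receives rank 0's value
    return prob

Alternatively, passing device= explicitly (as in my earlier example) gives the same result without changing the current device.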

This is super helpful! Thank you so much! I'll keep you updated on the effect the fix has on training speed. 🙂