Distributed broadcast didn't send tensor to other GPUs

Hi there,

I am trying to send a tensor from GPU 0 to all the other GPUs using the broadcast function, on a single machine with 2 GPUs, but the function does not seem to be working for me. You can find a toy example below. Can someone tell me what I did wrong here?

Thanks a lot for your help.

import torch
import torch.distributed as dist

def broadcast_tensor():
    dist.init_process_group(backend='nccl')

    # group = dist.new_group(list(range(2)))
    # global rank (same as the local rank on a single node)
    local_rank = dist.get_rank()

    if local_rank == 0:
        # source tensor on rank 0; an integer list defaults to dtype torch.int64
        t = torch.tensor([1,2,3]).to(local_rank)
    else:
        # receive buffer on the other rank; torch.empty(3) defaults to torch.float32
        t = torch.empty(3).to(local_rank)

    dist.broadcast(t, src=0)
    print('local rank: ', local_rank, ' tensor', t)

    dist.destroy_process_group()

    return t

if __name__ == '__main__':
    t = broadcast_tensor()

The printed messages are the following:

local rank: 0 tensor tensor([1, 2, 3], device='cuda:0')
local rank: 1 tensor tensor([1.4013e-45, 0.0000e+00, 2.8026e-45], device='cuda:1')
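One thing I noticed: the values printed on rank 1 look like denormal floats. If I reinterpret the bytes of the int64 source tensor as float32 (a quick standalone check on CPU, nothing distributed), I get exactly those numbers, which makes me wonder whether a dtype mismatch between torch.tensor([1,2,3]) (int64) and torch.empty(3) (float32) is involved:

import torch

# Reinterpret the raw bytes of the int64 tensor as float32 values.
src = torch.tensor([1, 2, 3])  # default dtype is torch.int64
print(src.view(torch.float32))
# On my (little-endian) machine this prints:
# tensor([1.4013e-45, 0.0000e+00, 2.8026e-45, 0.0000e+00, 4.2039e-45, 0.0000e+00])

The first three entries match what rank 1 printed after the broadcast.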

To provide more info, I ran the script like this:

torchrun --standalone --nnodes=1 --nproc-per-node=2 test_dist_broadcast.py

I have found a version that does distribute the tensor to all GPU devices, but I have to create non-empty tensors on each GPU first. Does anyone know why this is the case? Many online examples I found create empty tensors on the non-src ranks first. I am using the official torch GPU Docker image, version 2.1.0.

def broadcast_tensor():
    dist.init_process_group(backend="nccl")

    local_rank = dist.get_rank()
    # non-empty tensors on every rank; torch.randint defaults to torch.int64
    t = torch.randint(0, 5, size=(3,), device=local_rank)
    print('local rank: ', local_rank, 'Before broadcast, tensor: ', t)

    dist.broadcast(t, 0)
    print('local rank: ', local_rank, 'After broadcast, tensor: ', t)
    dist.destroy_process_group()
    return t

The printed messages are:

local rank: 0 Before broadcast, tensor: tensor([0, 0, 4], device='cuda:0')
local rank: 1 Before broadcast, tensor: tensor([3, 3, 1], device='cuda:1')
local rank: 0 After broadcast, tensor: tensor([0, 0, 4], device='cuda:0')
local rank: 1 After broadcast, tensor: tensor([0, 0, 4], device='cuda:1')
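Based on the reinterpretation check above, here is a minimal sketch of what I suspect the empty-tensor version should look like; the explicit dtype=torch.int64 on the non-src rank is my guess at the fix, not something I have verified:

import torch
import torch.distributed as dist

def broadcast_tensor():
    dist.init_process_group(backend='nccl')
    local_rank = dist.get_rank()

    if local_rank == 0:
        t = torch.tensor([1,2,3]).to(local_rank)  # dtype torch.int64
    else:
        # Guess: the receive buffer must match the source dtype and shape;
        # torch.empty(3) defaults to torch.float32, so request int64 explicitly.
        t = torch.empty(3, dtype=torch.int64).to(local_rank)

    dist.broadcast(t, src=0)
    print('local rank: ', local_rank, ' tensor', t)
    dist.destroy_process_group()
    return t

Is the dtype mismatch really the explanation here?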