I am trying to send a tensor from GPU0 to all the other GPUs using the broadcast function, on a single machine with 2 GPUs, but this function does not seem to work for me. You can find a toy example below. Can someone tell me what I did wrong here?
Thanks a lot for your help.
import torch
import torch.distributed as dist

def broadcast_tensor():
    dist.init_process_group(backend='nccl')
    # group = dist.new_group(list(range(2)))
    local_rank = dist.get_rank()
    if local_rank == 0:
        # rank 0 holds the values to broadcast
        t = torch.tensor([1, 2, 3]).to(local_rank)
    else:
        # other ranks create an empty buffer to receive the broadcast
        t = torch.empty(3).to(local_rank)
    dist.broadcast(t, src=0)
    print('local rank: ', local_rank, ' tensor', t)
    dist.destroy_process_group()
    return t

if __name__ == '__main__':
    t = broadcast_tensor()
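In case it matters, I run this as a two-process job on the single machine, e.g. with torchrun (the script name here is just a placeholder):

torchrun --nproc_per_node=2 broadcast_test.py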
The printed messages are the following:
local rank: 0 tensor tensor([1, 2, 3], device='cuda:0')
local rank: 1 tensor tensor([1.4013e-45, 0.0000e+00, 2.8026e-45], device='cuda:1')
I have found a version that does distribute the tensor to all GPU devices, but I have to create non-empty tensors on each GPU first; a rough sketch of it follows the output below. Does anyone know why this is the case? Many online examples I found first create empty tensors on the non-src ranks. I am using the official PyTorch GPU Docker image, version 2.1.0.
local rank: 0 Before broadcast, tensor: tensor([0, 0, 4], device='cuda:0')
local rank: 1 Before broadcast, tensor: tensor([3, 3, 1], device='cuda:1')
local rank: 0 After broadcast, tensor: tensor([0, 0, 4], device='cuda:0')
local rank: 1 After broadcast, tensor: tensor([0, 0, 4], device='cuda:1')
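For reference, the working version looks roughly like the sketch below; the only intended difference from the first script is that every rank constructs a non-empty integer tensor before the broadcast (the per-rank initial values, e.g. the randint call on the non-source rank, are just placeholders I picked for illustration):

import torch
import torch.distributed as dist

def broadcast_tensor():
    dist.init_process_group(backend='nccl')
    local_rank = dist.get_rank()
    if local_rank == 0:
        # rank 0 holds the values to broadcast
        t = torch.tensor([0, 0, 4]).to(local_rank)
    else:
        # other ranks also start from a non-empty integer tensor;
        # the initial values are arbitrary and get overwritten by the broadcast
        t = torch.randint(0, 5, (3,)).to(local_rank)
    print('local rank: ', local_rank, ' Before broadcast, tensor: ', t)
    dist.broadcast(t, src=0)
    print('local rank: ', local_rank, ' After broadcast, tensor: ', t)
    dist.destroy_process_group()
    return t

if __name__ == '__main__':
    t = broadcast_tensor()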