I have a setup with two GPUs using DistributedDataParallel, running the latest PyTorch nightly build.
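For context, my two worker processes are spawned roughly like this (a simplified sketch; the actual code wraps the model in DistributedDataParallel and stores the rank on the class as self.rank, and the rendezvous address is just a placeholder):

import torch
import torch.distributed as tdist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # one process per GPU; NCCL is the backend used for CUDA tensors
    tdist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",  # placeholder address
                             rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the DDP model, run inference, then the gather shown below ...

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)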
I try to gather the results onto rank 0, but I always end up with two copies of the results from GPU 1:
items = torch.from_numpy(items[:10]).to(self.rank)  # .to(rank) is necessary because of NCCL; [:10] just to keep the output readable
print(f"Rank {self.rank} items:")
print(items)
if self.rank == 0:
    items_gather_list = [torch.zeros_like(items)] * 2
tdist.barrier()
print("GATHER")
tdist.gather(items,
             gather_list=items_gather_list if self.rank == 0 else None,
             dst=0)
if self.rank == 0:
    print("LEN GATER LIST", len(items_gather_list))
    print(items_gather_list[0])
    print(" ")
    print(items_gather_list[1])
This is the output:
Rank 1 items:
Rank 0 items:
tensor([ 1, 1073, 1589, 1453, 1997, 1075, 115, 1041, 1133, 756],
device='cuda:1')
tensor([ 44, 1723, 1039, 1321, 1065, 281, 1364, 1830, 296, 1228],
device='cuda:0')
GATHER
GATHER
LEN GATER LIST 2
tensor([ 1, 1073, 1589, 1453, 1997, 1075, 115, 1041, 1133, 756],
device='cuda:0')
tensor([ 1, 1073, 1589, 1453, 1997, 1075, 115, 1041, 1133, 756],
device='cuda:0')
What am I doing wrong?