According to the docs (Distributed communication package - torch.distributed — PyTorch 1.9.0 documentation), the Gloo backend doesn't support reduce on CUDA tensors; for CUDA tensors it only supports all_reduce and broadcast.

However, I followed the tutorial (Writing Distributed Applications with PyTorch — PyTorch Tutorials 1.9.0+cu102 documentation) and implemented a simple run function, shown below:

```
import os
import torch
import torch.distributed as dist

def run4(rank, size):
    """ run4: CUDA reduction. """
    n_gpus = torch.cuda.device_count()
    # Ranks share GPUs round-robin when world_size > n_gpus.
    t = torch.ones(1).cuda(rank % n_gpus)
    for _ in range(1):
        c = t.clone()
        # dist.all_reduce(c, dist.ReduceOp.SUM)
        dist.reduce(c, dst=0, op=dist.ReduceOp.SUM)
        t.set_(c)
    print('[{}] After reduction: rank {} has data {}, backend is {}'.format(os.getpid(), rank, t, dist.get_backend()))
```

I ran it with a world_size of 4 on a machine with only 2 GPUs. Here is the result:

```
[30425] After reduction: rank 2 has data tensor([2.], device='cuda:0'), backend is gloo
[30424] After reduction: rank 1 has data tensor([3.], device='cuda:1'), backend is gloo
[30426] After reduction: rank 3 has data tensor([1.], device='cuda:1'), backend is gloo
[30423] After reduction: rank 0 has data tensor([4.], device='cuda:0'), backend is gloo
```

Based on the results above, I assume reduce worked on the GPU, since my tensors are on CUDA devices. Am I right? However, I noticed that all the ranks participated in the reduce algorithm, meaning that the tensor values on ranks 1 and 2 also changed during the reduction (rank 3 still holds 1). It seems the reduction algorithm is quite naive: with 4 processes, it appears to run 3 rounds.

1st round, add rank 3 to rank 2.

2nd round, add rank 2 to rank 1.

3rd round, add rank 1 to rank 0.
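To make my guess concrete, here is a small pure-Python simulation of the rounds above. It is only a model of the pattern I think I'm seeing, not Gloo's actual implementation; `chain_reduce_sum` is a name I made up for illustration, and it assumes the destination is rank 0.

```python
def chain_reduce_sum(values):
    """Simulate my guessed chain reduction toward rank 0.

    values[i] is the tensor value held by rank i before the reduce.
    Returns what each rank would hold afterwards under this guess.
    """
    buffers = list(values)
    # Walk from the last rank down to rank 1; each rank adds its
    # (already accumulated) value into the rank just below it.
    for src in range(len(buffers) - 1, 0, -1):
        buffers[src - 1] += buffers[src]
    return buffers


# Four ranks, each starting with 1.0, as in my run above.
print(chain_reduce_sum([1.0, 1.0, 1.0, 1.0]))  # [4.0, 3.0, 2.0, 1.0]
```

The simulated values match what my ranks printed (rank 0 has 4, rank 1 has 3, rank 2 has 2, rank 3 has 1), which is why I suspect this kind of scheme.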

So when I print the final result, not only does rank 0 hold the desired reduced sum, but the values on ranks 1 through (world_size - 2) have changed as well. Is this the expected result of reduce? I thought ranks 1 through (world_size - 2) would not store the intermediate results.