Does GLOO support reduce ops on CUDA tensors?

From the docs (Distributed communication package - torch.distributed — PyTorch 1.9.0 documentation),
it seems that no backend supports send and recv with CUDA tensors. But I still tried to test it. :sweat_smile:
So, here is my run function:

import torch
import torch.distributed as dist


def run1(rank, size):
    """run1: Simple synchronous P2P send/recv on CUDA tensors."""
    tensor = torch.zeros(1).cuda(rank)  # one tensor per rank, on that rank's GPU
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        dist.send(tensor=tensor, dst=1)
    else:
        # Receive the tensor from process 0
        dist.recv(tensor=tensor, src=0)
    print("Rank {} has data {}, with addr {}".format(rank, tensor[0], tensor.data_ptr()))

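For context, the surrounding harness in my distributed.py is essentially the one from the PyTorch point-to-point tutorial; here is a rough sketch (the -f/-b argument parsing is omitted, and the master address/port below are just placeholders):

import os
import torch.multiprocessing as mp
import torch.distributed as dist

def init_process(rank, size, fn, backend):
    """Set up the process group, then run the chosen function."""
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    backend = "gloo"  # or "nccl"; my script reads this from the -b flag
    mp.set_start_method("spawn")
    processes = []
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run1, backend))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()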
With 2 processes and the NCCL backend, I seem to get correct results:

root@298562e873aa:/opt/sw_home/pytorch-distributed# python distributed.py -f 1 -b nccl
Rank 1 has data 1.0, with addr 139654443565056
Rank 0 has data 1.0, with addr 139731618758656
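
If I read the docs correctly, NCCL point-to-point needs NCCL >= 2.7, so I also printed the build info to double-check (a small sketch; the exact return format of torch.cuda.nccl.version() may vary between PyTorch versions):

import torch
import torch.distributed as dist

print("NCCL available:", dist.is_nccl_available())
print("Gloo available:", dist.is_gloo_available())
# Depending on the PyTorch build, this may be an int such as 2708 or a (major, minor, patch) tuple.
print("NCCL version:", torch.cuda.nccl.version())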

However, with 2 processes and the Gloo backend, I get runtime errors:

root@298562e873aa:/opt/sw_home/pytorch-distributed# python distributed.py -f 1 -b gloo
Process Process-2:
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/sw_home/pytorch-distributed/distributed.py", line 98, in init_process
    fn(rank, size)
  File "/opt/sw_home/pytorch-distributed/distributed.py", line 41, in run1
    dist.recv(tensor=tensor, src=0)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 850, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.13]:31389
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/sw_home/pytorch-distributed/distributed.py", line 98, in init_process
    fn(rank, size)
  File "/opt/sw_home/pytorch-distributed/distributed.py", line 36, in run1
    dist.send(tensor=tensor, dst=1)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 805, in send
    default_pg.send([tensor], dst, tag).wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [172.17.0.13]:5547: Bad address

So I’m not sure: with CUDA tensors, can we use send and recv with NCCL or Gloo?

The actual reason I’m doing these experiments is that I’m running on a machine without CUDA UVA support, so for copies between GPUs it only supports cudaMemcpyPeer(Async), not cudaMemcpy(Async). But as far as I can tell from the Gloo source code, there is no cudaMemcpyPeer usage at all, only cudaMemcpyAsync. Thus I’m not sure whether PyTorch DDP with Gloo on CUDA tensors will work as expected on this machine.
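
In the meantime, the only workaround I can think of for the Gloo case is staging through CPU tensors, since the docs do list CPU send/recv as supported for Gloo. A minimal sketch of that idea (it adds extra host/device copies, which is exactly what I’d like to avoid):

def run1_cpu_staged(rank, size):
    """Same P2P exchange, but staged through CPU tensors so Gloo can handle it."""
    tensor = torch.zeros(1).cuda(rank)
    if rank == 0:
        tensor += 1
        # Copy to host before sending; Gloo supports send/recv on CPU tensors.
        dist.send(tensor=tensor.cpu(), dst=1)
    else:
        cpu_buf = torch.zeros(1)   # CPU receive buffer
        dist.recv(tensor=cpu_buf, src=0)
        tensor.copy_(cpu_buf)      # copy the result back to the GPU
    print("Rank {} has data {}".format(rank, tensor[0]))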