According to the doc (Distributed communication package - torch.distributed — PyTorch 1.9.0 documentation), it seems that no backend supports send and recv with CUDA tensors. But I still tried to test it.
So, here is my run1 function:
def run1(rank, size):
    """run1: Simple P2P synchronously."""
    tensor = torch.zeros(1).cuda(rank)
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        dist.send(tensor=tensor, dst=1)
    else:
        # Receive the tensor from process 0
        # tensor += 10
        # dist.send(tensor=tensor, dst=1)
        dist.recv(tensor=tensor, src=0)
        # dist.recv(tensor=tensor)
    print("Rank {} has data {}, with addr {}".format(rank, tensor[0], tensor.data_ptr()))
With 2 processes and the NCCL backend, it seems I get correct results:
root@298562e873aa:/opt/sw_home/pytorch-distributed# python distributed.py -f 1 -b nccl
Rank 1 has data 1.0, with addr 139654443565056
Rank 0 has data 1.0, with addr 139731618758656
However, with 2 processes and the Gloo backend, I get runtime errors:
root@298562e873aa:/opt/sw_home/pytorch-distributed# python distributed.py -f 1 -b gloo
Process Process-2:
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/sw_home/pytorch-distributed/distributed.py", line 98, in init_process
    fn(rank, size)
  File "/opt/sw_home/pytorch-distributed/distributed.py", line 41, in run1
    dist.recv(tensor=tensor, src=0)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 850, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.13]:31389
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/sw_home/pytorch-distributed/distributed.py", line 98, in init_process
    fn(rank, size)
  File "/opt/sw_home/pytorch-distributed/distributed.py", line 36, in run1
    dist.send(tensor=tensor, dst=1)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 805, in send
    default_pg.send([tensor], dst, tag).wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [172.17.0.13]:5547: Bad address
So, I'm not sure: with CUDA tensors, can we use send / recv with NCCL or Gloo? The actual reason I'm doing these experiments is that I'm running on a machine without CUDA UVA support, so it only supports cudaMemcpyPeer(Async), not cudaMemcpy(Async). But as far as I checked in the Gloo source code, there is no cudaMemcpyPeer usage at all, only cudaMemcpyAsync. Thus I'm not sure whether PyTorch DDP with Gloo and CUDA tensors will work as expected.