Torch.distributed: Send and Gather on GPU

Hello!

I want to write a distributed program and run it on a cluster with several multi-GPU nodes that is managed with Slurm.

The program should have one master process that sends different data to the other processes (the equivalent of MPI_Send / MPI_Recv) and then collects the results (the equivalent of MPI_Gather). A rough sketch of what I have in mind follows below.
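To make the pattern concrete, here is a minimal sketch of the master/worker flow I want, written against torch.distributed and assuming the MPI backend launched with mpirun; the tensor shapes and the `chunk * 2` "work" are just placeholders:

```python
import torch
import torch.distributed as dist

# With the MPI backend, rank and world size come from the MPI launcher.
dist.init_process_group(backend="mpi")
rank = dist.get_rank()
world_size = dist.get_world_size()

if rank == 0:
    # Master: send a different tensor to each worker (like MPI_Send) ...
    for dst in range(1, world_size):
        dist.send(torch.full((1024,), float(dst)), dst=dst)
    # ... then collect one result per rank (like MPI_Gather).
    results = [torch.zeros(1024) for _ in range(world_size)]
    dist.gather(torch.zeros(1024), gather_list=results, dst=0)
else:
    # Worker: receive its chunk (like MPI_Recv), do some work, gather back.
    chunk = torch.zeros(1024)
    dist.recv(chunk, src=0)
    dist.gather(chunk * 2, dst=0)
```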

Could you please tell me whether my task can be solved using torch.distributed? In the official docs (https://pytorch.org/docs/stable/distributed.html) I found only question marks for the send/recv operations on GPU with the MPI backend.

I also tried Horovod but found no wrappers around send/recv functions.

The question marks mean that it depends on whether your MPI distribution is compiled with CUDA support. If it is, send/recv of GPU tensors works directly. If it isn't, you'll have to copy GPU tensors to the CPU before passing them to send/recv (see torch.Tensor.cpu).
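Roughly, the fallback without CUDA-aware MPI looks like this (a sketch only; the two-rank setup, shape, and device id are placeholders):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")

if dist.get_rank() == 0:
    gpu_tensor = torch.randn(1024, device="cuda:0")
    # MPI is not CUDA-aware here: stage through host memory before sending.
    dist.send(gpu_tensor.cpu(), dst=1)
elif dist.get_rank() == 1:
    buf = torch.zeros(1024)          # receive into a CPU buffer ...
    dist.recv(buf, src=0)
    gpu_tensor = buf.to("cuda:0")    # ... then move it back to the GPU
```

With a CUDA-aware MPI build you can skip the `.cpu()` / `.to("cuda:0")` round trip and pass the GPU tensors to send/recv directly.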