Azure ML supports Horovod, but I’d like to keep things as simple as possible (but not simpler), so I’m thinking of using DistributedDataParallel instead… has anyone done this successfully?
DistributedDataParallel (built on torch.distributed) has the same purpose and result as Horovod.
The primary difference lies in how you launch a distributed run. With Horovod you go through MPI (and launch with mpirun), whereas with
torch.distributed you can launch the processes yourself, independently, and have them find each other through any one of the supported initialization methods (see https://pytorch.org/docs/stable/distributed.html#tcp-initialization).
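A minimal sketch of that launch pattern, assuming one process per GPU that you start yourself (the rendezvous address, environment variables, and toy model here are all placeholders, not anything prescribed by the docs):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Each independently launched process reads its own rank/world size
    # (hypothetical convention: exported before launch, as torchrun does).
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # TCP initialization: every rank rendezvouses at rank 0's address.
    # 10.0.0.1:23456 is a placeholder for your rank-0 host and a free port.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://10.0.0.1:23456",
        rank=rank,
        world_size=world_size,
    )

    # Pin this process to one GPU, build a toy model on it.
    local_device = rank % torch.cuda.device_count()
    torch.cuda.set_device(local_device)
    model = torch.nn.Linear(10, 10).cuda()

    # DDP all-reduces gradients across ranks during backward().
    ddp_model = DDP(model, device_ids=[local_device])

    # ... normal training loop using ddp_model ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

No mpirun involved: you run this script once per GPU (locally or across nodes), and the processes find each other through the TCP init method.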
Horovod only works with NCCL2 AFAIK (and therefore CUDA tensors). In
torch.distributed we also have a Gloo backend in case you want to run collective operations against CPU tensors.
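For illustration, a CPU-only collective with the Gloo backend might look like this (shown with world_size=1 so it runs in a single process; a real job would launch several ranks, and MASTER_ADDR/MASTER_PORT values here are placeholders):

```python
import os

import torch
import torch.distributed as dist

# Rendezvous info for the env:// init method (placeholder address/port).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Gloo backend: collectives work on plain CPU tensors, no CUDA needed.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(3)  # a CPU tensor
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sums the tensor across all ranks

dist.destroy_process_group()
```

With more than one rank, every process would end up holding the element-wise sum of all ranks' tensors after the all_reduce.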