DistributedDataParallel with MPI and torch.distributed with MPI+multiprocessing

DDP starts fine with the gloo backend, but with MPI it breaks with the following error:
RuntimeError: MPI process group does not support multi-GPU collectives
This is the case when torch.distributed is used without multiprocessing, i.e. a single process drives all the GPUs on the node. The error is raised from python3.8/site-packages/torch/autograd/__init__.py, line 97, in backward. With gloo the same setup starts fine but breaks after a couple of epochs.
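Roughly, the single-process setup that hits this looks like the sketch below (the model, data, and sizes are just placeholders to show the structure):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per node, launched via mpirun; PyTorch built with MPI support.
dist.init_process_group(backend='mpi')  # with 'gloo' this starts fine

# Placeholder model and data, just to show the structure.
model = nn.Linear(10, 10).cuda()
# A single process drives all visible GPUs, so DDP gets multiple device ids.
ddp_model = DDP(model, device_ids=list(range(torch.cuda.device_count())))

x = torch.randn(20, 10).cuda()
loss = ddp_model(x).sum()
loss.backward()  # with MPI this is where the multi-GPU collectives error is raised
```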

With multiprocessing, where each process is given its own GPU id, gloo again starts fine and breaks after a couple of epochs, while MPI does not even get through dist.init_process_group; it aborts with "Local abort before MPI_INIT completed completed successfully". This happens even without torch.nn.parallel.DistributedDataParallel, since the process group has to be initialized before the model is defined.
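The multiprocessing variant is roughly the sketch below (again a placeholder model and data); the init_process_group call is where MPI aborts:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size, backend):
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    # With backend='mpi' this call itself aborts
    # ("Local abort before MPI_INIT completed completed successfully");
    # with backend='gloo' it succeeds but training dies after a couple of epochs.
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)
    model = nn.Linear(10, 10).cuda(rank)   # placeholder model
    ddp_model = DDP(model, device_ids=[rank])

    x = torch.randn(20, 10).cuda(rank)     # placeholder data
    ddp_model(x).sum().backward()
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size, 'mpi'), nprocs=world_size)
```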

So the only backend that seems to be a viable option here is NCCL.
Does DDP on GPUs really not work with any other backend?

Thanks