Is this a correct way to combine MPI and NCCL in distributed training?

Hello everyone. I use mpirun to launch the processes and the NCCL backend with DDP. Is this a correct way to combine MPI and NCCL? I'd appreciate it if anybody can help me! Thanks in advance!
Here is my sample code:

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def dist_train(rank, size):
    # mpirun exports the per-node rank, which maps each process to one GPU
    local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
    if args.gpu:
        # bind this process to its GPU before any CUDA work happens
        torch.cuda.set_device(local_rank)

    # set torch device
    device = torch.device("cuda" if args.gpu and torch.cuda.is_available() else "cpu")
    model = ...  # build the model here, then move it to the device
    model = model.to(device)
    model = DDP(model, device_ids=[local_rank] if args.gpu else None)

    '''training code......'''

def init_process(rank, size, fn, backend='gloo'):
    # master_ip:port must be the address of rank 0, reachable from every node
    dist.init_process_group(backend, init_method='tcp://master_ip:port', rank=rank, world_size=size)
    fn(rank, size)

world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
init_process(world_rank, world_size, dist_train, backend='nccl')
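For reference, this is a minimal sketch of how I read the rank information out of the environment variables that OpenMPI's mpirun exports to every process (the helper name `get_mpi_env` and the defaults for single-process runs are my own choices, not part of any library API):

```python
import os

def get_mpi_env():
    """Read OpenMPI-provided rank info, falling back to single-process defaults."""
    world_size = int(os.environ.get('OMPI_COMM_WORLD_SIZE', 1))
    world_rank = int(os.environ.get('OMPI_COMM_WORLD_RANK', 0))
    local_rank = int(os.environ.get('OMPI_COMM_WORLD_LOCAL_RANK', 0))
    return world_rank, world_size, local_rank

# Simulate what mpirun would export for the second process on a node:
os.environ.update({
    'OMPI_COMM_WORLD_SIZE': '4',
    'OMPI_COMM_WORLD_RANK': '1',
    'OMPI_COMM_WORLD_LOCAL_RANK': '1',
})
print(get_mpi_env())  # -> (1, 4, 1)
```

The returned `(world_rank, world_size, local_rank)` tuple is exactly what `init_process_group` and `device_ids` need.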

My running command is: mpirun -np ${totals} -H ${slots} ${COMMON_MPI_PARAMETERS} python

This should be fine. Did you see any error?