Using PyTorch's multiprocessing along with the distributed package

I am trying to spawn a couple of processes using PyTorch's multiprocessing module within an OpenMPI distributed back-end. What I have is the following code:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank_local, rank, world_size, maingp):
    print("I WAS SPAWNED ", rank_local, " OF ", rank)
    
    tensor = torch.zeros(1)
    tensor += 1
    
    if rank == 0:
        tensor += 100
        dist.send(tensor, dst=1)
    else:
        print("I am spawn: ", rank, "and my tensor value before receive: ", tensor[0])
        dist.recv(tensor, src=0)
        print("I am spawn: ", rank, "and my tensor value after  receive: ", tensor[0])
    

if __name__ == '__main__':
    
    # Initialize Process Group
    dist.init_process_group(backend="mpi", group_name="main")
    maingp = None #torch.distributed.new_group([0,1])
    mp.set_start_method('spawn')    
    
    # get current process information
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    
    # Establish Local Rank and set device on this node
    mp.spawn(run, args=(rank, world_size, maingp), nprocs=1)

I run this code with OpenMPI as follows:

mpirun -n 2 python code.py

So my understanding is that mpirun creates two processes with ranks [0, 1], and each of these processes spawns a new process with local rank 0. Now if I try to communicate between these two sub-processes of the main processes, I get a traceback with the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/usama/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/usama/code/test/code.py", line 19, in run
    dist.send(tensor, dst=1)
  File "/home/usama/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 666, in send
    _check_default_pg()
  File "/home/usama/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 191, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

My question is: how do I make these sub-processes able to communicate, i.e. the [0, 0] process sending something to the [1, 0] process? Any ideas?

I have asked this on Stack Overflow as well, but to no avail!

The error above suggests that init_process_group was not called in the process that tries to use the distributed package. I think the following lines need to be moved into the run method, since that is the entry point for the spawned process:

# Initialize Process Group
dist.init_process_group(backend="mpi", group_name="main")

BTW, the way you call init_process_group does not look correct to me. You will need to either provide rank and world_size, or provide an initialized store; the former is easier.
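
A minimal sketch of that fix, assuming a single machine: each spawned worker calls init_process_group itself (with explicit rank and world_size) before using send/recv. The gloo backend and the tcp://127.0.0.1:29500 rendezvous address here are illustrative stand-ins so the sketch runs without an MPI build; with an MPI-enabled build under mpirun you would use backend="mpi" instead:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # The process group must be initialized in the spawned process itself;
    # a group created in the parent is not inherited by mp.spawn children.
    dist.init_process_group(
        backend="gloo",                        # stand-in for "mpi" in this sketch
        init_method="tcp://127.0.0.1:29500",   # illustrative rendezvous address
        rank=rank,
        world_size=world_size,
    )
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 100
        dist.send(tensor, dst=1)
    else:
        dist.recv(tensor, src=0)
        print("rank", rank, "received", tensor[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    # mp.spawn passes the worker index as the first argument to run
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```

Note that once init_process_group runs inside each worker, send/recv sees a default process group and the AssertionError goes away.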