I am using MPI as the backend for distributed PyTorch across multiple nodes.
The launch command:
mpirun --hostfile hostfile -n 2 python *.py
I revised the initialization code so that it looks like the initialization in PyTorch's distributed_test.py file.
The initialization code:
os.environ['MASTER_ADDR'] = '172.31.7.117'
os.environ['MASTER_PORT'] = '23456'
dist.init_process_group(init_method='env://', backend='mpi')
group = dist.new_group([i for i in range(dist.get_world_size())])
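For reference, the initialization above could be wrapped roughly as follows. This is only a minimal sketch, not the code from distributed_test.py: the helper names set_rendezvous_env and init_mpi_group are hypothetical, and it assumes the PyTorch build reports MPI support through dist.is_mpi_available(). With the MPI backend, rank and world size come from the MPI runtime started by mpirun, so they are not passed explicitly.

```python
import os


def set_rendezvous_env(addr, port):
    """Set MASTER_ADDR and MASTER_PORT for init_method='env://'.

    Hypothetical helper. Environment values must be strings, so the
    port is converted explicitly.
    """
    os.environ["MASTER_ADDR"] = str(addr)
    os.environ["MASTER_PORT"] = str(port)


def init_mpi_group():
    """Initialize the default process group with the MPI backend.

    Hypothetical helper. torch is imported lazily so that
    set_rendezvous_env can be used on its own.
    """
    import torch.distributed as dist

    # Fail early with a clear message if this build lacks MPI support.
    if not dist.is_mpi_available():
        raise RuntimeError("This PyTorch build was compiled without MPI support")

    # Rank and world size are supplied by the MPI runtime (mpirun).
    dist.init_process_group(init_method="env://", backend="mpi")
    return dist.get_rank(), dist.get_world_size()
```

In the script started by mpirun, calling set_rendezvous_env('172.31.7.117', 23456) and then init_mpi_group() would return this process's rank and the world size, which is a quick way to confirm that every node joined the group.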
I am sure that Open MPI is installed correctly, and PyTorch was built from source with MPI support.
I would really appreciate any help.