Hi,
I am trying to scale training across several nodes on an HPC cluster where jobs are submitted via SLURM. I want to use the gloo backend because the Open MPI build I am using does not support fast intra-node memory operations through GPU-Direct, so gradients have to be copied to main memory before calling all_reduce, which is costly.
Single-node training with the gloo backend works fine when I point MASTER_ADDR to 127.0.0.1, which is a local IP address on each node. However, I am struggling to initialize the distributed process group using a shared file system (NFS). Can somebody provide a working code snippet for group initialization over a shared file system with gloo on a single node (not just the init part, but also the process creation part)? Or suggest how I should modify the example below so that it uses the shared file system for initialization?
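import os
import time

import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    """ Placeholder for the distributed function, which is defined elsewhere. """

def init_processes(rank, size, fn, backend='tcp'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    processes = []
    start = time.time()
    # Start one process per rank; each joins the group and then calls run.
    for rank in range(size):
        p = Process(target=init_processes, args=(rank, size, run, 'gloo'))
        p.start()
        processes.append(p)
    # Wait for every rank to finish before measuring the elapsed time.
    for p in processes:
        p.join()
    end = time.time()
    print("Elapsed time: %.6f" % (end - start))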
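
For reference, one way the example above might be adapted to file-based initialization is sketched below. It is not a confirmed working setup for this cluster: it assumes a PyTorch version whose `init_process_group` accepts an `init_method` of the form `file://<path>`, and that the path lies on a file system visible to every participating process. The helpers `file_init_method` and `launch`, the placeholder `run`, and the command-line argument are hypothetical names introduced only for illustration.

```python
import os
import sys
from multiprocessing import Process


def file_init_method(shared_path):
    """Build a file:// init_method URL from a path on the shared file system."""
    return "file://" + os.path.abspath(shared_path)


def run(rank, size):
    """Placeholder workload (hypothetical); replace with the real function."""


def init_processes(rank, size, fn, shared_path, backend="gloo"):
    """Join the process group through a shared file, then call fn."""
    # Imported lazily so the URL helper above can be used without torch.
    import torch.distributed as dist

    # No MASTER_ADDR / MASTER_PORT here: the ranks rendezvous via the file.
    dist.init_process_group(
        backend,
        init_method=file_init_method(shared_path),
        rank=rank,
        world_size=size,
    )
    fn(rank, size)
    dist.destroy_process_group()


def launch(size, shared_path):
    """Start `size` local processes that all rendezvous through shared_path."""
    # A leftover file from an earlier run can break the rendezvous,
    # so remove it before any rank starts.
    if os.path.exists(shared_path):
        os.remove(shared_path)
    processes = []
    for rank in range(size):
        p = Process(target=init_processes, args=(rank, size, run, shared_path))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumed usage: python script.py /path/on/nfs/sharedfile
    launch(size=2, shared_path=sys.argv[1])
```

On several nodes the same `init_processes` could in principle be reused, provided each process gets a globally unique rank and `size` is the total number of processes across all nodes; how those values are obtained from SLURM is left open here.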