How to initialize distributed process group with gloo backend using shared file system?

Rahim16 · April 29, 2018, 10:28pm

Hi,
I am trying to scale training across several nodes on an HPC cluster where jobs are submitted via SLURM. I want to use the gloo backend because the Open MPI implementation I am using does not support fast intra-node memory operations through GPUDirect, so gradients have to be copied to main memory before calling all_reduce, which is costly.

Single-node training with the gloo backend works fine when I point MASTER_ADDR to 127.0.0.1, the loopback address on each node. However, I am struggling to initialize the distributed process group using a shared file system (NFS). Can somebody provide a working code snippet for group initialization through a shared file system with gloo (not just the init part, but also the process creation part) on a single node? Or suggest how I should modify the example below so that it uses a shared file system for initialization?

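import os
import time

import torch.distributed as dist
from torch.multiprocessing import Process


def run(rank, size):
    """ Placeholder for the distributed function each process executes. """
    pass


def init_processes(rank, size, fn, backend='tcp'):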
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []

    start = time.time()
    for rank in range(size):
        p = Process(target=init_processes, args=(rank, size, run, 'gloo'))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

    end = time.time()
    print("Elapsed time: %.6f" % (end - start))
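
For reference, here is a rough sketch of what a file-based variant of the example might look like. It is not a confirmed solution. It assumes a PyTorch version in which the gloo backend accepts a `file://` init_method, and a shared file system that supports fcntl locking. The helper names `file_init_method` and `launch`, the all_reduce workload in `run`, and any path passed in are made up for illustration.

```python
import os
from multiprocessing import Process


def file_init_method(sync_file):
    """Build a file:// URL for init_process_group from a filesystem path."""
    # The path has to point at the same file for every process,
    # e.g. a location on the NFS mount when running on several nodes.
    return "file://" + os.path.abspath(sync_file)


def run(rank, size):
    """Illustrative workload: sum one tensor across all ranks."""
    import torch
    import torch.distributed as dist

    t = torch.ones(1) * rank
    dist.all_reduce(t)  # the default reduction op is SUM
    print("rank %d of %d sees sum %s" % (rank, size, t.item()))


def init_processes(rank, size, fn, init_method, backend="gloo"):
    """Initialize the process group through a shared file, then run fn."""
    # Imported here so file_init_method stays usable without torch loaded.
    import torch.distributed as dist

    # No MASTER_ADDR / MASTER_PORT needed: the rendezvous goes through the file.
    dist.init_process_group(
        backend, init_method=init_method, rank=rank, world_size=size
    )
    fn(rank, size)


def launch(size, sync_file):
    """Start `size` local processes that rendezvous through sync_file."""
    # A leftover file from an earlier run can break the rendezvous,
    # so remove it once, before any process is started.
    if os.path.exists(sync_file):
        os.remove(sync_file)
    init_method = file_init_method(sync_file)
    processes = []
    for rank in range(size):
        p = Process(target=init_processes,
                    args=(rank, size, run, init_method))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

On a single node this would be started with something like `launch(2, "/path/on/nfs/sync_file")`, where the path is a placeholder. For a multi-node SLURM job, one would presumably skip `launch` and call `init_processes` once per task, taking the rank and world size from the environment variables SLURM_PROCID and SLURM_NTASKS. The stale file would then have to be removed only once, before the job steps start.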