I am trying to scale training across several nodes on an HPC cluster where jobs are submitted via SLURM. I want to use the Gloo backend because the Open MPI build I am using does not support fast intra-node memory operations through GPUDirect, so gradients must be copied to main memory before calling all_reduce, which is costly.
Single-node training with the Gloo backend works fine when I point MASTER_ADDR to 127.0.0.1, the loopback address on each node. However, I am struggling to initialize the distributed process group using a shared file system (NFS). Can somebody provide a working code snippet for group initialization over a shared file system with Gloo (not just the init part, but also the process-creation part) on a single node? Or suggest how I should modify the example below so that it uses the shared file system for initialization?
import os
import time

import torch.distributed as dist
from torch.multiprocessing import Process


def init_processes(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []
    start = time.time()
    for rank in range(size):
        # `run` is the per-rank training function (defined elsewhere)
        p = Process(target=init_processes, args=(rank, size, run, 'gloo'))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    end = time.time()
    print("Elapsed time: %.6f" % (end - start))