How to write the right code for distributed training?

Hi, developers:
I am trying to train a network on two machines, each with 8 GPUs. I compiled PyTorch from source with NCCL2. I export the master address and port in the environment like this:

export MASTER_ADDR=$(host `hostname` | awk '{print $(NF)}')
export MASTER_PORT=8080
export WORLD_SIZE=2

Then I rewrote the initialization code:

import torch.distributed as dist
import os
# init
dist.init_process_group(backend='gloo', rank=int(os.environ['OMPI_COMM_WORLD_RANK']), world_size=int(os.environ['WORLD_SIZE']))

# main
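
For reference, a minimal sketch of checking the rendezvous settings before calling init_process_group. It assumes the default env:// initialization, which reads MASTER_ADDR and MASTER_PORT from the environment, and that the rank comes either from RANK or from the OMPI_COMM_WORLD_RANK variable that Open MPI's mpirun sets. The helper name resolve_dist_settings is hypothetical and not part of torch.distributed.

```python
import os


def resolve_dist_settings(env=None):
    """Return (rank, world_size) from the environment, or raise with a clear message.

    Hypothetical helper for illustration; not part of torch.distributed.
    """
    env = os.environ if env is None else env

    # env:// initialization needs these to locate the master process.
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))

    # Prefer RANK; fall back to the variable that Open MPI's mpirun sets.
    for key in ("RANK", "OMPI_COMM_WORLD_RANK"):
        if env.get(key) is not None:
            rank = int(env[key])
            break
    else:
        raise RuntimeError("no rank found: set RANK or launch with mpirun")

    world_size = int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise RuntimeError("rank %d is outside [0, %d)" % (rank, world_size))
    return rank, world_size
```

The result could then be passed on as dist.init_process_group(backend='gloo', rank=rank, world_size=world_size). Since that call typically waits until all world_size processes have joined, a missing process or a wrong rank tends to show up as a silent hang rather than an error, which is why failing early here can help.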

The main code is:

def main(rank):
    ... # some code
    model = torch.nn.parallel.DistributedDataParallel(model.cuda(rank), device_ids=[0,1,2,3,4,5,6,7])
    ... # some code 
    for i, (input, target) in enumerate(train_loader):
        input_var = torch.autograd.Variable(input.cuda(rank))
        target_var = torch.autograd.Variable(target.cuda(rank))
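
As far as I understand, DistributedDataParallel expects the wrapped module to already live on device_ids[0], so model.cuda(rank) combined with device_ids=[0, ..., 7] only lines up when rank is 0. A minimal sketch of two consistent mappings follows. The function names and the gpus_per_node parameter are hypothetical, ranks are assumed to be assigned contiguously per machine, and only the index arithmetic is shown so that it runs without GPUs.

```python
def devices_one_process_per_node(gpus_per_node):
    """One process drives every GPU on its machine (the layout in the post).

    The module goes on the first listed device, regardless of the global rank.
    """
    device_ids = list(range(gpus_per_node))
    return device_ids[0], device_ids


def devices_one_process_per_gpu(rank, gpus_per_node):
    """One process per GPU: map the global rank to a local device index."""
    local = rank % gpus_per_node
    return local, [local]


# Two machines with 8 GPUs each, one process per machine (world size 2):
# the process with rank 1 still places the model on its local GPU 0.
module_device, device_ids = devices_one_process_per_node(8)

# Same hardware, one process per GPU (world size 16):
# global rank 9 is the second GPU on the second machine.
module_device_b, device_ids_b = devices_one_process_per_gpu(9, 8)
```

With either layout the wrapping call would look roughly like DistributedDataParallel(model.cuda(module_device), device_ids=device_ids), with the input batches moved to the same module_device.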

Finally, I run the code and it hangs without any error. Running it under gdb --args python gives this output:

GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
Reading symbols from ./distributed-pytorch/distributed-pytorch-0.4.0/anaconda2/bin/python...done.
(gdb) r
Starting program: ./distributed-pytorch/distributed-pytorch-0.4.0/anaconda2/bin/python
[Thread debugging using libthread_db enabled]
Detaching after fork from child process 32217.
Detaching after fork from child process 32607.
Detaching after fork from child process 32609.

Is there something wrong in my code? Thanks.


This might be related to this thread and you can check this comment for debugging steps.

Thanks. The p2pBandwidthLatencyTest runs without hangs or errors. Could this still be an NCCL2 bug?

    # If I initialize it with two IP addresses, it fails.
    address = ['', '']
    # If I use a single IP address, it works.
    address = ''
    size = 2
    processes = []
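    for rank in range(size):
        p = Process(target=init_processes, args=(address, rank, size, dist_main))
        p.start()  # launch the worker; without this call nothing runs
        processes.append(p)  # keep a handle so the loop below can reach it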
    for p in processes: