How to use Distributed Data Parallel properly

I’m relatively new to PyTorch, but have good experience with Keras and TensorFlow. I followed this article, DistributedDataParallel, to use DDP in my own training script. But for some reason, I always end up with "process 0 terminated with exit status 1".

Here’s what my DDP-related functions look like:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Explicitly setting seed to make sure that models created in two processes
    # start from same random weights and biases.
    torch.manual_seed(42)

def cleanup():
    dist.destroy_process_group()

def run_demo(fn, *args):
    mp.spawn(fn,
             args = (args[0],args[1], args[2], args[3], args[4]),
             nprocs = 1, # Also tried 2 , but no difference
             join = True)

And here’s what my train function looks like:

def train(model, X, batch_size = 32, epochs = 75, gradient_acc = 0):
    setup(1, 2)
    device = model.get_default_device()
    model = model.to(device, non_blocking = True)
    ddp_model = DDP(model, device_ids = [0]) # Only one GPU

    ddp_model.hidden_enc = ddp_model.init_hidden_enc()
    ddp_model.hidden_dec = ddp_model.init_hidden_dec()
    for ep in range(epochs):
        loss_br = 0; nb_batch_steps = 0

        for step, batch in enumerate( data_loader ):
            batch = batch.to(device, non_blocking = True)
            nb_batch_steps += 1
            loss = ddp_model(batch)

I’m calling the run_demo function in this way:

if __name__ == "__main__":
    run_demo(train, model, 
         holder[:], 32, 
         75,3 )

I can tell that some process is failing, which is why that error is raised, but I’m not sure how to fix it. If I call my train function directly, without going through run_demo, the training code never executes and the program seems to hang indefinitely.

I’m on Google Colab, with a single GPU.

P.S. The output of my lscpu command is:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:            0
CPU MHz:             2300.000
BogoMIPS:            4600.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat arch_capabilities

Any help is highly appreciated. Thanks!


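There are at least two things off in this example:

  • mp.spawn calls the specified function with the process index (the local rank) as its first argument, followed by the contents of args. Your train function has no rank parameter, so every argument you pass is shifted by one position (this is likely what causes the first error).
  • You’re calling setup(1, 2) even when you run with a single process. That registers the process as rank 1 in a world of 2, so init_process_group waits for a rank 0 that never starts. This will cause a hang followed by a timeout after 30 minutes.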
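
To see the off-by-one concretely without launching any processes, here is a small pure-Python sketch. It relies only on the fact that mp.spawn invokes fn(i, *args), where i is the process index. bind_like_spawn and train_fixed are hypothetical names written for this illustration (they are not part of PyTorch), and the string placeholders stand in for the real model and data.

```python
import inspect


def train(model, X, batch_size=32, epochs=75, gradient_acc=0):
    """Same signature as the train function in the question (body omitted)."""


def train_fixed(rank, model, X, batch_size=32, epochs=75, gradient_acc=0):
    """Hypothetical fix: accept the rank that mp.spawn prepends."""


def bind_like_spawn(fn, rank, args):
    """Hypothetical helper: bind arguments the way mp.spawn calls fn(rank, *args)."""
    return inspect.signature(fn).bind(rank, *args)


# The five arguments forwarded by run_demo in the question (placeholders here).
user_args = ("model", "X", 32, 75, 3)

try:
    bind_like_spawn(train, 0, user_args)
    outcome = "bound without error"
except TypeError as exc:
    # Six positional values arrive for a five-parameter function.
    outcome = f"TypeError: {exc}"

print(outcome)

# With a leading rank parameter, every value lands where it was intended.
bound = bind_like_spawn(train_fixed, 0, user_args)
print(dict(bound.arguments))
```

The original signature is rejected because the rank takes the model's slot and pushes everything else along by one; adding rank as the first parameter is enough to line the arguments up again.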

Good luck!

