How to use Distributed Data Parallel properly

I’m relatively new to PyTorch, but I have good experience with Keras and TensorFlow. I followed this article: DistributedDataParallel to use DDP in my own training script, but for some reason I always end up getting `process 0 terminated with exit status 1`.

Here’s what my DDP-related functions look like:

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Explicitly setting seed to make sure that models created in two processes
    # start from same random weights and biases.
    torch.manual_seed(42)
def cleanup():
    dist.destroy_process_group()
def run_demo(fn, *args):
    mp.spawn(fn,
             args = (args[0], args[1], args[2], args[3], args[4]),
             nprocs = 1, # Also tried 2, but no difference
             join = True)

And here’s what my train function looks like:

def train(model, X, batch_size = 32, epochs = 75, gradient_acc = 0):
    setup(1, 2)
    device = model.get_default_device()
    model =, non_blocking = True)
    ddp_model = DDP(model, device_ids = [0]) # Only one GPU

    ddp_model.hidden_enc = ddp_model.init_hidden_enc()
    ddp_model.hidden_dec = ddp_model.init_hidden_dec()
    for ep in range(epochs):
        loss_br = 0; nb_batch_steps = 0

        for step, batch in enumerate( data_loader ):
            batch =, non_blocking = True)
            nb_batch_steps += 1
            loss = ddp_model(batch)

I’m calling the run_demo function in this way:

if __name__ == "__main__":
    run_demo(train, model,
         holder[:], 32,
         75, 3)

I can make out that some process in the system is failing, and that’s why it’s raising that error, but I’m not sure how to rectify the issue. If I call my train function directly, without going through run_demo, the code never executes and the program seems to go into an infinite loop.

I’m on Google Colab, with a single GPU.

P.S: My lscpu command results in:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:            0
CPU MHz:             2300.000
BogoMIPS:            4600.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat arch_capabilities

Any help is highly appreciated. Thanks !


There are at least two things off in this example:

  • mp.spawn calls the specified function with the local rank as first argument. This means that the arguments you pass are off by 1 (this is likely what causes the first error).
  • You’re calling setup(1, 2) even though you run with a single process. init_process_group will wait for the second rank to join, which causes a hang followed by a timeout after 30 minutes.
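To illustrate both fixes, here is a minimal, self-contained sketch. It runs on CPU with the gloo backend and uses a toy nn.Linear in place of the real model; `worker` and `run_demo` are illustrative stand-ins, not the poster’s actual code:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # world_size must match the number of processes actually spawned,
    # otherwise this call blocks waiting for ranks that never start.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():

def worker(rank, world_size, epochs):
    # mp.spawn prepends the local rank as the FIRST argument,
    # so `rank` must NOT appear in the args tuple passed to spawn.
    setup(rank, world_size)
    model = nn.Linear(4, 2)          # toy model for illustration
    ddp_model = DDP(model)           # CPU/gloo: no device_ids
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ddp_model(torch.randn(8, 4)).sum()

def run_demo(fn, world_size, *args):
    # nprocs == world_size; each child receives (rank, world_size, *args).
    mp.spawn(fn, args=(world_size, *args), nprocs=world_size, join=True)

if __name__ == "__main__":
    run_demo(worker, 1, 2)   # world_size=1, epochs=2

With nprocs=1 and world_size=1, init_process_group returns immediately instead of waiting for a second rank, and the rank that spawn injects lines up with the worker’s first parameter.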

Good luck!
