[BUG?] DistributedDataParallel cannot be destroyed

My code involves two stages, like:
model_a -> do something and model_b -> do something
I use DistributedDataParallel (DDP) rather than DP to accelerate them, so I do something like this:

model_a -> model_a = setup DDP and DDP(model_a) -> model_b -> setup DDP and model_b = DDP(model_b)

This causes a problem, because you cannot start DDP twice (i.e. initialize the process group twice) in one process.

So I use dist.destroy_process_group(). But when I call this function, the program hangs forever.

Here is my code for setting up and destroying DDP:

import os

import torch
import torch.distributed as dist


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    #dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Explicitly setting seed to make sure that models created in two processes
    # start from same random weights and biases.
    torch.manual_seed(42)


def cleanup():
    # tear down the process group
    dist.destroy_process_group()

I’ve tried both the gloo and nccl backends.

This is with PyTorch 1.1, Python 3.6, CUDA 9.0.

Thank you

model_a -> model_a = setup DDP and DDP(model_a) -> model_b -> setup DDP and model_b = DDP(model_b)

I’m not sure I follow this completely. Does model_b use the output of model_a? Could you share some code showing how model_a and model_b are initialized and trained using DDP?

Is it possible to create a single model with model_a and model_b as submodules and then use that as part of DDP?
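
For example, something along these lines (just a rough sketch; TwoStageModel is an illustrative name, and it assumes model_b consumes model_a's output):

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TwoStageModel(nn.Module):
    def __init__(self, model_a, model_b):
        super(TwoStageModel, self).__init__()
        self.model_a = model_a
        self.model_b = model_b

    def forward(self, x):
        # stage 1 feeds stage 2; adjust if model_b does not take model_a's output directly
        return self.model_b(self.model_a(x))

# setup(rank, world_size)                       # initialize the process group once
# model = DDP(TwoStageModel(model_a, model_b))  # a single DDP wrapper for both stages

That way init_process_group is only ever called once per process.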

Sorry for the late reply. I mean that I set up DDP for model_a and DDP for model_b at the same time. It looks like this:

setup()
model_a = DDP(model_a)
setup()
model_b = DDP(model_b)

This will cause an error.

So I need to change it to something like this:

setup()
model_a = DDP(model_a)
cleanup()
setup()
model_b = DDP(model_b)

But after the cleanup, the program blocks. :joy:

I tried the following program (which is a bit similar to what you were doing), but couldn’t reproduce the issue:

import os

import torch
import torch.distributed as dist
import torch.nn as nn

from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Explicitly setting seed to make sure that models created in two processes
    # start from same random weights and biases.
    torch.manual_seed(42)


def cleanup():
    dist.destroy_process_group()

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

setup(0, 1)
print("F1")
model_a = DDP(ToyModel())
print("F2")
cleanup()
print("F3")
setup(0, 1)
print("F4")
model_b = DDP(ToyModel())
print("F5")

The program prints:

F1
F2
F3
F4
F5

Could you share more details about your environment (OS, Python version, etc.)? Also, do you know which line the program blocks on after the cleanup?
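
If it helps, one way to find the exact line is to dump a traceback after a timeout using the standard-library faulthandler module. This is just a debugging sketch built around the repro above, not something from your code:

import faulthandler

# Print the stack traces of all threads if the program is still running after 60s,
# then exit; the dump will show exactly where the program is stuck.
faulthandler.dump_traceback_later(timeout=60, exit=True)

setup(0, 1)
model_a = DDP(ToyModel())
cleanup()          # if this hangs, the dumped traceback points at the blocked call
setup(0, 1)
model_b = DDP(ToyModel())

faulthandler.cancel_dump_traceback_later()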