Model declaration in distributed data parallel

Hi all. I’m trying to build a simple distributed data parallel training program with one GPU per process. First I followed https://pytorch.org/tutorials/intermediate/dist_tuto.html and made some modifications.

def run(rank, size):
    ...
    model = Net()
    ...

if __name__ == '__main__':
    ...
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        ...

However, after reading the example https://github.com/pytorch/examples/tree/master/mnist_hogwild, I found that the model is passed to each process as an argument:

if __name__ == '__main__':
    ...
    model = Net().to(device)
    ...
    for rank in range(args.num_process):
        p = mp.Process(target=train, args=(rank, args, model, device, dataloader_kwargs))
        ...

So I’m just wondering: where should the model be declared, inside or outside these processes? And when and how do the gradients get synchronized?

If you don’t care about doing Hogwild, the second example you list doesn’t apply to you. There, everything is declared in the primary process because the model weights are meant to be shared (physically, through shared memory) between the worker processes.
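
For reference, a minimal sketch of that Hogwild pattern, assuming a toy Net and CPU training (the Net, worker count, and hyperparameters are placeholders): the share_memory() call is what moves the weights into shared memory before the workers are started.

import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

def train(rank, model):
    # Every worker updates the same shared parameters, lock-free (Hogwild).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(10):
        x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

if __name__ == '__main__':
    model = Net()
    model.share_memory()  # put the parameters into shared memory before spawning workers
    processes = []
    for rank in range(4):
        p = mp.Process(target=train, args=(rank, model))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()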

If you simply want a single process per GPU and don’t need physical weight sharing, then you should declare everything in the subprocesses. Think of it as if you were running this on different machines instead of multiple processes on a single machine: you’d likewise have to declare everything in the processes launched on those machines rather than in a single launcher process.
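
Something like the sketch below, assuming two GPUs on one machine (the toy Net, the address/port, and the NCCL backend are placeholders); the key point is that the model is built and wrapped inside each subprocess.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.multiprocessing import Process

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

def run(rank, size):
    # Each process builds its own replica; DDP broadcasts rank 0's weights at
    # construction and averages gradients across ranks during backward().
    torch.cuda.set_device(rank)
    model = Net().cuda()
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    x = torch.randn(8, 10).cuda()
    y = torch.randint(0, 2, (8,)).cuda()
    loss = F.cross_entropy(ddp_model(x), y)
    optimizer.zero_grad()
    loss.backward()   # gradients are all-reduced (averaged) across processes here
    optimizer.step()  # every rank applies the same averaged update

def init_process(rank, size, fn, backend='nccl'):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == '__main__':
    size = 2
    processes = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()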

Thanks a lot for your reply! So Hogwild is not a must for distributed data parallel; even though the models are declared in different processes, they can still synchronize their parameters and gradients in some way. Is that right?
I tried saving the models at the end of these processes with the following code, but I found that the saved models’ parameters differ. Does this mean I did something wrong?

def run(rank, size):
    ...
    model = Net()
    ...
    for epoch in range(args.epochs):
        for batch_idx, (input, target) in enumerate(train_loader):
            output = model(input)
            loss = criterion(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    ...
    torch.save({...}, f'checkpoint-process-{rank}.pth')

You need to wrap your model with nn.DistributedDataParallel; see https://pytorch.org/tutorials/intermediate/ddp_tutorial.html for a tutorial.

I did do this; sorry for omitting it from the code above. It looks like this:

...
torch.cuda.set_device(rank)
...
model = Net().cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank)
...

However, the parameters I got still differ.

I tried saving the model right after the forward pass, and the parameters of the saved models are the same.

def run(rank, size):
    ...
    for epoch in range(args.epochs):
        output = model(input)
        loss = criterion(output, target)

        torch.save({...}, f'checkpoint-process-{rank}-epoch-{epoch}.pth')

        ...

    ...

I’m still checking how the parameters and gradients get synchronized in the source code.
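
From what I can tell so far, what DDP does during backward is roughly equivalent to the manual all-reduce below (just a conceptual sketch; the real implementation buckets gradients and overlaps communication with computation):

import torch.distributed as dist

def average_gradients(model, world_size):
    # Roughly what DDP does after backward: sum each gradient across ranks,
    # then divide by the world size, so every rank ends up with the same
    # averaged gradient before optimizer.step().
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

If that’s right, the parameters should stay identical across ranks, because every rank starts from the same broadcast weights and then applies the same averaged gradients.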

Update: just following the ImageNet example keeps the parameters in sync. I was checking the model’s parameters the wrong way.
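
For anyone hitting the same thing, here is one way to double-check; this is only a sketch, assuming two ranks and that the checkpoint dict stores the model’s state_dict under a hypothetical 'state_dict' key (the file names follow the pattern above).

import torch

# Load both checkpoints on CPU and compare tensor by tensor.
ckpt0 = torch.load('checkpoint-process-0.pth', map_location='cpu')
ckpt1 = torch.load('checkpoint-process-1.pth', map_location='cpu')

for name, t0 in ckpt0['state_dict'].items():
    t1 = ckpt1['state_dict'][name]
    if not torch.allclose(t0, t1):
        print(f'Mismatch in {name}')
print('Done comparing checkpoints.')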