Loss.backward hangs in a DistributedDataParallel setting

Hello, I am trying to train a network using DDP. The architecture of the network is such that it consists of two sub-networks (a, b), and depending on the input either only a, only b, or both a and b get executed. Things work fine on a single GPU, but when scaling the network to 2 or more GPUs the backward pass just hangs. Any help is appreciated. Thanks.

Below is the minimal reproducible example.

import numpy as np
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.optim as optim
import torch.nn.functional as F

class NetA(nn.Module):
    def __init__(self):
        super().__init__()
        self.a1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.a2 = nn.ReLU(inplace=True)
        self.a3 = nn.Conv2d(64, 32, kernel_size=3, padding=1)
        self.a4 = nn.MaxPool2d(kernel_size=8, stride=8)
        self.a5 = nn.Linear(8192, 1)

    def forward(self, data):
        if data.shape[0] == 0:
            return torch.zeros(1).cuda()  # placeholder output for an empty batch
        x = self.a4(self.a3(self.a2(self.a1(data))))
        x = self.a5(torch.flatten(x, start_dim=1))
        return x

class NetB(nn.Module):
    def __init__(self):
        super().__init__()
        self.b1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.b2 = nn.ReLU(inplace=True)
        self.b3 = nn.Conv2d(64, 32, kernel_size=3, padding=1)
        self.b4 = nn.MaxPool2d(kernel_size=4, stride=4)
        self.b5 = nn.Linear(2048, 1)

    def forward(self, data):
        if data.shape[0] == 0:
            return torch.zeros(1).cuda()  # placeholder output for an empty batch
        x = self.b4(self.b3(self.b2(self.b1(data))))
        x = self.b5(torch.flatten(x, start_dim=1))
        return x

def main2():
    rank = int(os.environ['RANK'])
    dist.init_process_group(backend='nccl', init_method='env://')
    num_gpus = torch.cuda.device_count()
    torch.cuda.set_device(rank % num_gpus)
    neta = NetA().cuda()
    netb = NetB().cuda()
    ddp_neta = torch.nn.parallel.DistributedDataParallel(
        neta, device_ids=[rank % num_gpus])
    ddp_netb = torch.nn.parallel.DistributedDataParallel(
        netb, device_ids=[rank % num_gpus])
    opt_a = optim.SGD(ddp_neta.parameters(), lr=0.001)
    opt_b = optim.SGD(ddp_netb.parameters(), lr=0.001)
    print('Finetuning the network... on gpu ', rank)
    for i in range(0, 20):
        opt_a.zero_grad()
        opt_b.zero_grad()
        f = np.random.rand()
        if f < 0.33:
            # both sub-networks get a non-empty batch
            out_a = ddp_neta(torch.randn(4, 3, 128, 128).to(rank))
            out_b = ddp_netb(torch.randn(2, 3, 32, 32).to(rank))
        elif f < 0.66:
            # NetB receives an empty batch
            out_b = ddp_netb(torch.randn(0, 3, 32, 32).to(rank))
            out_a = ddp_neta(torch.randn(6, 3, 128, 128).to(rank))
        else:
            # NetA receives an empty batch
            out_a = ddp_neta(torch.randn(0, 3, 128, 128).to(rank))
            out_b = ddp_netb(torch.randn(3, 3, 32, 32).to(rank))
        loss_a = F.softplus(out_a).mean()
        loss_b = F.softplus(out_b).mean()
        (loss_a + loss_b).backward()  # hangs here on 2 or more GPUs
        opt_a.step()
        opt_b.step()
        print(i, loss_a, loss_b)

if __name__ == '__main__':
    main2()
Any suggestions on how to fix this? Do I need to use dist.all_gather and/or dist.all_reduce to run the above snippet on multiple GPUs? I found this link https://github.com/pytorch/pytorch/issues/23425 and tried moving the if condition into the forward of a wrapper module containing both NetA and NetB. However, it still seems to hang at the backward step.
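For reference, the wrapper I tried looks roughly like this (a sketch; the name `NetAB` and the exact routing are mine, not from the linked issue):

```python
import torch
import torch.nn as nn

class NetAB(nn.Module):
    """Hypothetical wrapper: owns both sub-networks so DDP sees one module
    and the input-dependent branching happens inside a single forward."""
    def __init__(self, neta, netb):
        super().__init__()
        self.neta = neta
        self.netb = netb

    def forward(self, data_a, data_b):
        # run each sub-network only when its batch is non-empty
        if data_a.shape[0] > 0:
            out_a = self.neta(data_a)
        else:
            out_a = torch.zeros(1, device=data_a.device)
        if data_b.shape[0] > 0:
            out_b = self.netb(data_b)
        else:
            out_b = torch.zeros(1, device=data_b.device)
        return out_a, out_b
```

The whole `NetAB` instance would then be wrapped in a single DistributedDataParallel call instead of wrapping NetA and NetB separately.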

Thanks for the help

Did you also use the find_unused_parameters=True argument as described in the issue and the docs?

find_unused_parameters (bool) – Traverse the autograd graph from all tensors contained in the return value of the wrapped module’s forward function. Parameters that don’t receive gradients as part of this graph are preemptively marked as being ready to be reduced. Note that all forward outputs that are derived from module parameters must participate in calculating loss and later the gradient computation. If they don’t, this wrapper will hang waiting for autograd to produce gradients for those parameters. Any outputs derived from module parameters that are otherwise unused can be detached from the autograd graph using torch.Tensor.detach. (default: False)

Yes. I tried with find_unused_parameters=True and still see the same hanging behaviour


What are you trying to achieve here: train 2 models (a and b) at the same time? You generate a new value f on each node (f can differ between node 1 and node 2, right? so the code paths chosen by nodes 1 and 2 can be different) and then you try to sync the gradients (via DDP). If so, node 1 can run
out_b = ddp_netb(torch.randn(0,3, 32, 32).to(rank))
out_a = ddp_neta(torch.randn(6,3,128,128).to(rank))
while node 2 could run
out_a = ddp_neta(torch.randn(0,3,128,128).to(rank))
out_b = ddp_netb(torch.randn(3,3,32,32).to(rank))

In the case data.shape[0] == 0 you return a constant zero, so none of the parameters are used on one node while they are used and synced on another. I believe this can lead to issues with DDP.

Did you intend to run the same steps in each iteration of the training loop on node 0 and node 1, e.g. by syncing the value of f between them?

Hi @agolynski,

Thanks for the reply. The goal is not to sync the value of ‘f’ across GPUs; which branch runs is conditioned on the input mini-batch.

More context on what we are trying to do: we want to train a network that takes either a single tensor (‘A’ of shape Nx3x128x128) or a pair of tensors (‘A’ and ‘B’ of shapes Nx3x128x128 and Mx3x32x32), which are passed to NetA and NetB respectively. For example, tensor ‘A’ would be images and tensor ‘B’ could correspond to bounding-box crops generated by an off-the-shelf object detector (in our case we do not alter or tune the parameters of the object detector). If no objects are detected in tensor ‘A’, then ‘B’ would be empty, and we tried representing that as a tensor of size 0 along dimension 0. NetA and NetB do not share any parameters and are independent of each other.

One possibility we are aware of is to run NetB only when tensor B is non-empty on all GPUs. So if we are running the model with DDP on 8 GPUs, we would do the backward pass for NetB only when tensor B is non-empty on all 8 of them, which happens infrequently and would increase training time unreasonably. We want to find a way to sync gradients only among the GPUs on which tensor B is non-empty. Hope it is clear what we are trying to do. Thanks.
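One workaround that avoids extra communication (a sketch of my own construction, not an official API; `netb_loss` and the zero-weighted dummy pass are assumptions) is to always run NetB on every rank, feeding a dummy batch when B is empty and multiplying that contribution by zero. Every rank then produces gradients for every NetB parameter (exact zeros on the empty ranks), so DDP's allreduce has matching participants and the real gradients from the non-empty ranks are still averaged in:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def netb_loss(netb, b, dummy_shape=(1, 3, 32, 32)):
    """Compute NetB's loss so that every rank touches every NetB parameter.

    When b is empty, run NetB on a zero dummy batch and scale the result
    by 0: the loss value is unchanged, but autograd still produces (zero)
    gradients for all of NetB's parameters, so DDP's gradient allreduce
    does not deadlock waiting on the empty ranks.
    """
    if b.shape[0] > 0:
        return F.softplus(netb(b)).mean()
    dummy = torch.zeros(dummy_shape, device=next(netb.parameters()).device)
    return 0.0 * netb(dummy).sum()
```

Note the averaging denominator then includes the empty ranks too, which slightly scales down NetB's effective gradient; whether that matters depends on how often B is empty.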