DataParallel model stuck on loss.backward()

This is a separate issue potentially related to: Sending a tensor to multiple GPUs

I am training a DataParallel module on two GPUs. The training works as it should when:
a) Training on a single GPU, where the model is not wrapped by the DataParallel Module, regardless of batch size.
b) Training with both GPUs available, but with batch size = 1 so the data is sent to only one GPU.

However, when I increase the batch size to >1 and utilize both GPUs, the program gets stuck on the backward pass, loss.backward(). The rest of the training code (somewhat streamlined) can be found below.

Is this a hardware issue (as suggested in a comment here: Sending a tensor to multiple GPUs), or something to do with autograd? All variables have gradients, so I don't think that is the issue.

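    for batch_idx, (x, y, graph, subject) in enumerate(self.train_loader):
        # Update the wrapped model's subject and graph when the subject changes.
        if model.module.subject != subject:
            model.module.subject = subject
            model.module.graph = graph

        output = model(x.to(self.device))
        # y is one-hot encoded; convert it to class indices for nll_loss.
        target = torch.argmax(y, dim=1)

        optimizer.zero_grad()
        loss = F.nll_loss(output, target, weight=self.w)
        loss.backward()  # hangs here when batch size > 1 on two GPUs
        optimizer.step()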
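
One way to confirm that the process really is blocked inside loss.backward(), and not somewhere in the data loader, is to arm a watchdog around the call. The sketch below uses only Python's standard faulthandler module. The helper name run_with_hang_watchdog and the timeout value are assumptions made for illustration and are not part of the training code above. It shows where the Python threads are stuck, not why the hang happens.

```python
import faulthandler
import sys


def run_with_hang_watchdog(step_fn, timeout_s=60.0, file=None):
    """Call step_fn(); if it has not returned after timeout_s seconds,
    dump the Python stack of every thread to `file` (stderr by default)."""
    out = sys.stderr if file is None else file
    # Arm the watchdog. faulthandler uses its own watchdog thread, so the
    # dump is still written while the main thread is blocked.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)
    try:
        return step_fn()
    finally:
        # Disarm so a step that finishes in time prints nothing.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage inside the training loop:
#     run_with_hang_watchdog(loss.backward, timeout_s=120)
```

If the dump shows the main thread sitting in backward() on every attempt, that would be consistent with a stall in the multi-GPU gradient reduction rather than in the Python-side training code.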

I have a similar problem, which I wrote about here. However, I think my problem has nothing to do with hardware, because I am running this code across two GPUs without any problem.

Facing the same problem

Hey @Vijay_Viswanath could you please share a repro? BTW, which version of PyTorch are you using?

I am facing the same issues. Do we have a solution for this?

Facing the same problem with DataParallel on a multi-GPU setup with A40 GPUs.
The code runs fine with these settings on 1080 Ti, 2080 Ti, and Titan Xp GPUs with an identical environment setup. @semihcanturk Did you figure out a fix?