Invalid Gradient during Backward Pass

I got this error when doing the backward pass:

RuntimeError: Function ThAddmmBackward returned an invalid gradient at index 1 - expected device 1 but got 0

This occurs when executing loss.backward(), where

ad_loss = torch.nn.BCELoss().cuda()
class_loss = torch.nn.CrossEntropyLoss().cuda()
a_loss = ad_loss(ad_out, ad_label)
c_loss = class_loss(s_output, s_target_var)
loss = c_loss + a_loss

The PyTorch version is 1.0.0a0+14004cb, which is the current version on the master branch. The Python version is 2.7.15, and I am using 2 GPUs. Any ideas are appreciated. Thanks in advance!

Hi,

Do you use DataParallel? Could you give a small code sample that we could run to reproduce the problem, please?
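
Something along these lines is usually enough: a self-contained script with a tiny stand-in module, assuming two GPUs (all the names and sizes below are placeholders, just to show the shape of a reproduction):

import torch
import torch.nn as nn

# Tiny stand-in module with two heads, mirroring the 'output' / 'ad_out' dict
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(10, 4)   # classification head
        self.ad = nn.Linear(10, 1)   # adversarial head

    def forward(self, x):
        return {'output': self.fc(x),
                'ad_out': torch.sigmoid(self.ad(x)).squeeze(1)}

model = nn.DataParallel(Net(), device_ids=[0, 1]).cuda()
x = torch.randn(8, 10).cuda()
class_target = torch.randint(0, 4, (8,)).cuda()
ad_target = torch.cat([torch.zeros(4), torch.ones(4)]).cuda()

out = model(x)
loss = nn.CrossEntropyLoss()(out['output'], class_target) \
       + nn.BCELoss()(out['ad_out'], ad_target)
loss.backward()

If a script like this raises the same error on your machine, we can run it directly and debug from there.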

Hi albanD,

Thank you for your reply. I do use DataParallel with:

model = torch.nn.DataParallel(model, device_ids=args.gpus).cuda()

And this is the code snippet that computes the losses and runs the backward pass:

out_dict = model(torch.cat((s_input_var, t_input_var), dim=0))
output, ad_out = out_dict['output'], out_dict['ad_out']
split_size = args.batch_size // len(args.gpus)
ad_label = []
for i in range(len(args.gpus)):
    ad_label.append(torch.zeros([split_size]))
ad_label.append(torch.ones([split_size]))
ad_label = torch.cat(ad_label, dim=0)
ad_label = torch.autograd.Variable(ad_label.cuda(async=True))
a_loss = ad_loss(ad_out, ad_label)
ad_losses.update(a_loss.item(), s_input.size(0)+t_input.size(0))
s_output = output[0:args.batch_size, :]
c_loss = class_loss(s_output, s_target_var)
_, s_pred = torch.max(s_output, dim=1)

# measure accuracy and record loss
prec1 = accuracy(s_pred.data.cpu().long(), s_target.cpu().long(), lmap)
c_losses.update(c_loss.item(), s_input.size(0))
ad_losses.update(a_loss.item(), s_input.size(0))
top1.update(prec1, s_input.size(0))

loss = c_loss + a_loss
# compute gradient and do SGD step
optimizer.zero_grad()

loss.backward()

But since the model definition is in a separate file, I’m not sure how to give you executable code.

This is not really easy-to-read code… A smaller example that reproduces just this issue would be more helpful.
I would check whether torch.cat((s_input_var, t_input_var), dim=0) actually returns enough samples for the data parallel to work with. Or is it explicitly done so that s_* runs on one GPU and t_* runs on the other, and you always have 2 GPUs?
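
For reference, DataParallel simply splits the input along dim 0 and sends one contiguous chunk to each listed device, roughly like this (a simplified illustration, not the actual scatter implementation):

import torch

# A batch built as cat((source, target), dim=0) is split positionally:
# with 2 GPUs, the first half goes to cuda:0 and the second half to cuda:1,
# regardless of which rows are source samples and which are target samples.
batch = torch.cat((torch.zeros(4, 3), torch.ones(4, 3)), dim=0)
chunk_for_gpu0, chunk_for_gpu1 = torch.chunk(batch, 2, dim=0)

So which samples end up on which GPU is determined purely by their position in the concatenated batch.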

Hi albanD,

I’m new to PyTorch. After printing s_input_var and t_input_var, it seems they are both on the same GPU, with device='cuda:0'. Do they need to be on different GPUs? They are computed in the following way:

s_input, s_target = source_iter.next()
t_input, _ = target_iter.next()
s_target = s_target.cuda(async=True)
s_input_var = torch.autograd.Variable(s_input).cuda()
t_input_var = torch.autograd.Variable(t_input).cuda()
s_target_var = torch.autograd.Variable(s_target)
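
For reference, I checked with simple prints like these (the last line is just an extra sanity check on the model’s parameters):

print(s_input_var.device)               # prints cuda:0
print(t_input_var.device)               # prints cuda:0
print(next(model.parameters()).device)  # DataParallel keeps the parameters on device_ids[0]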

Thanks very much!

Hi,
I have a similar issue. The situation you asked about, where different parts run on different GPUs, is exactly my case (I don’t use data parallelism).
If I use different GPUs and then concatenate, does the loss function need to be changed? (I get the same exception.)
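
To make the question concrete, is something like the sketch below the right way to do it, or does the loss itself need to change? (Made-up layer names and sizes; my actual code is in the post linked below.)

import torch
import torch.nn as nn

# Hypothetical two-branch setup with one branch per GPU (no DataParallel)
branch_a = nn.Linear(10, 4).to('cuda:0')
branch_b = nn.Linear(10, 4).to('cuda:1')
criterion = nn.CrossEntropyLoss()

x = torch.randn(6, 10)
target = torch.randint(0, 4, (12,)).to('cuda:0')

a = branch_a(x.to('cuda:0'))                     # output lives on cuda:0
b = branch_b(x.to('cuda:1'))                     # output lives on cuda:1
merged = torch.cat((a, b.to('cuda:0')), dim=0)   # move both onto one device before cat
loss = criterion(merged, target)
loss.backward()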

Please see this post for the complete code:
https://discuss.pytorch.org/t/runtimeerror-function-catbackward-returned-an-invalid-gradient-at-index-1-expected-device-1-but-got-0/33958

Thanks for the help.