Invalid Gradient during Backward Pass

I got this error when doing the backward pass:

RuntimeError: Function ThAddmmBackward returned an invalid gradient at index 1 - expected device 1 but got 0

This occurs when executing loss.backward(), where

ad_loss = torch.nn.BCELoss().cuda()
class_loss = torch.nn.CrossEntropyLoss().cuda()
a_loss = ad_loss(ad_out, ad_label)
c_loss = class_loss(s_output, s_target_var)
loss = c_loss + a_loss

The PyTorch version is 1.0.0a0+14004cb, which is the current version on the master branch. The Python version is 2.7.15, and I am using 2 GPUs. Any ideas are appreciated. Thanks in advance!


Do you use DataParallel? Could you give a small code sample that we could run to reproduce the problem, please?
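
For reference, a self-contained script of the kind being asked for could look roughly like the sketch below. Everything in it is an assumption made for illustration: `TwoHeadNet` is a hypothetical stand-in for the real model, the tensor shapes are arbitrary, and the adversarial labels (1 for source rows, 0 for target rows) are a guess at the layout. It only mirrors the two losses and the `loss.backward()` call from the question, and it falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn


class TwoHeadNet(nn.Module):
    """Hypothetical stand-in model: one classification head, one adversarial head."""

    def __init__(self, in_features=8, num_classes=3):
        super().__init__()
        self.features = nn.Linear(in_features, 16)
        self.classifier = nn.Linear(16, num_classes)
        self.discriminator = nn.Linear(16, 1)

    def forward(self, x):
        h = torch.relu(self.features(x))
        return {
            "output": self.classifier(h),
            # BCELoss expects probabilities, so squash to (0, 1).
            "ad_out": torch.sigmoid(self.discriminator(h)).squeeze(1),
        }


def run_once(batch_size=4):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # DataParallel uses all visible GPUs; without a GPU it just calls the module.
    model = nn.DataParallel(TwoHeadNet()).to(device)

    # Random data standing in for one source batch and one target batch.
    s_input = torch.randn(batch_size, 8, device=device)
    t_input = torch.randn(batch_size, 8, device=device)
    s_target = torch.randint(0, 3, (batch_size,), device=device)
    # Assumed label layout: 1 for source rows, 0 for target rows.
    ad_label = torch.cat((torch.ones(batch_size), torch.zeros(batch_size))).to(device)

    out = model(torch.cat((s_input, t_input), dim=0))
    c_loss = nn.CrossEntropyLoss()(out["output"][:batch_size], s_target)
    a_loss = nn.BCELoss()(out["ad_out"], ad_label)
    loss = c_loss + a_loss
    loss.backward()  # the call that raised the error in the question
    return loss.item()


if __name__ == "__main__":
    print("loss:", run_once())
```

If a script this small runs cleanly on 2 GPUs, the pieces of the real training loop can be added back one at a time until the error reappears.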

Hi albanD,

Thank you for your reply. I do use DataParallel, with:

model = torch.nn.DataParallel(model, device_ids=args.gpus).cuda()

And this is the code snippet for computing loss and doing backward prop:

out_dict = model(torch.cat((s_input_var, t_input_var), dim=0))
output, ad_out = out_dict['output'], out_dict['ad_out']
split_size = args.batch_size // len(args.gpus)
ad_label = []
for i in range(len(args.gpus)):
ad_label =, dim=0)
ad_label = torch.autograd.Variable(ad_label.cuda(async=True))
a_loss = ad_loss(ad_out, ad_label)
ad_losses.update(a_loss.item(), s_input.size(0)+t_input.size(0))
s_output = output[0:args.batch_size, :]
c_loss = class_loss(s_output, s_target_var)
_, s_pred = torch.max(s_output, dim=1)

# measure accuracy and record loss
prec1 = accuracy(, s_target.cpu().long(), lmap)
c_losses.update(c_loss.item(), s_input.size(0))
ad_losses.update(a_loss.item(), s_input.size(0))
top1.update(prec1, s_input.size(0))

loss = c_loss + a_loss
# compute gradient and do SGD step


But since the model definition is in a separate file, I’m not sure how to give you executable code.

This code is not really easy to read… A smaller example that reproduces just this issue would be more helpful.
I would check that the concatenation of s_input_var and t_input_var along dim=0 actually returns enough samples for DataParallel to work with. Or is it done explicitly so that the s_* inputs run on one GPU and the t_* inputs run on the other, and you always have 2 GPUs?
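
To make that question concrete, here is a small CPU-only sketch of how a concatenated batch gets divided. It assumes the model input is the source and target batches joined along dim 0, and it uses `torch.chunk` as an approximation of how DataParallel scatters a batch across replicas; the helper name `split_like_dataparallel` and the sizes are made up for illustration.

```python
import torch


def split_like_dataparallel(s_batch, t_batch, num_replicas):
    """Concatenate source and target along dim 0, then split into per-replica chunks.

    torch.chunk is used here as a CPU approximation of the scatter step.
    """
    combined = torch.cat((s_batch, t_batch), dim=0)
    return combined.chunk(num_replicas, dim=0)


batch_size, num_gpus = 4, 2  # illustrative values

# Tag source rows with 0 and target rows with 1 so their destination is visible.
s_input_var = torch.zeros(batch_size, 3)
t_input_var = torch.ones(batch_size, 3)

chunks = split_like_dataparallel(s_input_var, t_input_var, num_gpus)
for replica_index, chunk in enumerate(chunks):
    kinds = sorted(set(chunk[:, 0].tolist()))
    print("replica", replica_index, "rows:", chunk.size(0), "tags:", kinds)
```

With equal source and target batch sizes and 2 replicas, the first chunk holds only source rows and the second only target rows, so any per-replica logic that expects a mix of both would see just one kind.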

Hi albanD,

I’m new to PyTorch. After I printed out s_input_var and t_input_var it seems that they are both on the same GPU with device='cuda:0'. Do they need to be on different GPUs? They are calculated in the following way:

s_input, s_target =
t_input, _ =
s_target = s_target.cuda(async=True)
s_input_var = torch.autograd.Variable(s_input).cuda()
t_input_var = torch.autograd.Variable(t_input).cuda()
s_target_var = torch.autograd.Variable(s_target)

Thanks very much!
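
As far as I understand, having both inputs on cuda:0 is what DataParallel expects, since it splits the batch across the GPUs itself. For anyone reading this on a newer setup, the snippet above can be written without `Variable` (tensors and Variables were merged in PyTorch 0.4) and with `non_blocking=True` in place of `async=True` (`async` is a reserved word from Python 3.7 on). The sketch below uses random tensors as hypothetical stand-ins for the loader output, and prints each device so a mismatch is easy to spot.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for one batch from the source and target loaders.
s_input, s_target = torch.randn(4, 8), torch.randint(0, 3, (4,))
t_input = torch.randn(4, 8)

# non_blocking=True replaces the old async=True keyword argument.
s_input_var = s_input.to(device, non_blocking=True)
t_input_var = t_input.to(device, non_blocking=True)
s_target_var = s_target.to(device, non_blocking=True)

# Print where each tensor lives before it is fed to the model or a loss.
for name, tensor in [
    ("s_input_var", s_input_var),
    ("t_input_var", t_input_var),
    ("s_target_var", s_target_var),
]:
    print(name, tensor.device)
```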

I have a similar issue. The situation you asked about, where different parts of the model run on different GPUs, is exactly my case (I don’t use data parallelism).
If I run the parts on different GPUs and concatenate their outputs, does the loss function need to be changed? (I get the same exception.)

Please see this post for complete code.

Thanks for the help.
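
On the concatenation question, a general pattern (not necessarily the cause of the exception above) is that the loss function itself stays the same, but every tensor passed to `torch.cat` and to the loss has to be on one device. The sketch below assumes a hypothetical two-branch model; `branch_a`, `branch_b`, and `head` are illustrative names, and it falls back to the CPU when fewer than 2 GPUs are available.

```python
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise fall back to the CPU so the sketch still runs.
if torch.cuda.device_count() >= 2:
    dev_a, dev_b = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev_a = dev_b = torch.device("cpu")

# Hypothetical two-branch setup: each branch lives on its own device.
branch_a = nn.Linear(8, 4).to(dev_a)
branch_b = nn.Linear(8, 4).to(dev_b)
head = nn.Linear(8, 3).to(dev_a)

x = torch.randn(5, 8)
out_a = branch_a(x.to(dev_a))
out_b = branch_b(x.to(dev_b))

# torch.cat needs all inputs on one device, so move out_b explicitly first.
# .to() is differentiable, so gradients still flow back to branch_b.
merged = torch.cat((out_a, out_b.to(dev_a)), dim=1)
logits = head(merged)

# The target must be on the same device as the logits.
target = torch.randint(0, 3, (5,)).to(logits.device)
loss = nn.CrossEntropyLoss()(logits, target)
loss.backward()

print("branch_b grad device:", branch_b.weight.grad.device)
```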