How do I run backward for distributed joint training?

Hi all, I am trying to distribute several models (say 8) across 8 GPU devices. For each model models[i], I compute its loss and accumulate the sum as follows:

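for i in range(args.num_models):
    device = "cuda:%d" % i
    # Move the batch to this model's GPU (tensor names `input` and `target` are assumed)
    input_var = input.to(device)
    target_var = target.to(device)
    criterion = nn.CrossEntropyLoss().to(device)
    models[i] = models[i].to(device)
    # Each loss lives on a different device and is added into one running sum
    loss += criterion(models[i](input_var), target_var)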


But calling backward on the summed loss raises this error:

RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected device cuda:1 but got cuda:0

Any suggestions? Thanks!

Try moving each loss to the same device before accumulating them.
Also, which PyTorch version are you using? I cannot reproduce this issue locally, and I thought it had been fixed by now.
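
A minimal sketch of that suggestion, not a drop-in replacement for your training loop: the tiny nn.Linear models, the random batch, and the names num_models, devices, and main_device are all placeholders I made up for illustration. It falls back to CPU when fewer GPUs than models are available, so it should run on any machine.

```python
import torch
import torch.nn as nn

# Hypothetical setup: small linear models stand in for the real ones.
num_models = 2
n_gpus = torch.cuda.device_count()

# One device per model; fall back to CPU if there are not enough GPUs.
devices = [
    torch.device("cuda:%d" % i) if n_gpus >= num_models else torch.device("cpu")
    for i in range(num_models)
]
main_device = devices[0]  # all losses are gathered here

models = [nn.Linear(4, 3).to(d) for d in devices]
criterion = nn.CrossEntropyLoss()

# Placeholder batch: 8 samples, 4 features, 3 classes.
inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))

# Start the running sum on the device where the losses will be gathered.
loss = torch.zeros((), device=main_device)
for model, device in zip(models, devices):
    output = model(inputs.to(device))
    loss_i = criterion(output, targets.to(device))
    # Move the per-model loss to the common device before accumulating.
    loss = loss + loss_i.to(main_device)

# Gradients flow back through .to() to each model on its own device.
loss.backward()
```

The .to() call is differentiable, so each model still receives its gradients on its own GPU; only the scalar losses are copied to main_device.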

Thank you so much! Problem solved :)