"arguments are located on different GPUs" at backward pass

I have seen other posts about this error, but their situations are different from mine.

After spending a whole night debugging, I located the error, but I can’t fix it (I can’t figure out why it’s happening).

Here is the minimal code to reproduce it:

from easydl import *

feature_extractor = nn.Linear(10, 10)
classifier = nn.Linear(10, 10)
net = nn.Sequential(feature_extractor, classifier)
net = nn.DataParallel(net)

discriminator = nn.Sequential(
    # place 1
    GradientReverseModule(lambda step: aToBSheduler(step, 0.0, 1.0, gamma=10, max_iter=10000)),
)
discriminator = nn.DataParallel(discriminator)

op = optim.SGD(net.parameters(),lr=1)

for _ in range(2):
    with OptimizerManager(op):
        im_source = Variable(torch.from_numpy(np.random.rand(36, 10).astype(np.float32))).cuda()
        im_target = Variable(torch.from_numpy(np.random.rand(36, 10).astype(np.float32))).cuda()
        outs_source = net.forward(im_source)
        outs_target = net.forward(im_target)
        d_source = discriminator(outs_source)
        d_target = discriminator(outs_target)
        if len(sys.argv) > 1:
            # place 2
            loss = torch.sum(outs_source) + torch.sum(outs_target) + torch.sum(d_source) + torch.sum(d_source)
        else:
            # place 3
            loss = torch.sum(outs_source) + torch.sum(outs_target)
        loss = loss * loss.detach()
        loss.backward()

The error happens at this line: loss.backward().

There are 3 places marked in the code above.

I have made 2 observations:

  1. If the code at place 1 is removed, no error is reported.
  2. Otherwise, if I use place 2, I get the error “arguments are located on different GPUs”; if I use place 3, no error is reported.

The code at place 1 is documented here. In short, it serves as an identity mapping in the forward pass and reverses the gradient in the backward pass. The scheduler gradually changes the coefficient applied in the backward pass. Documentation for aToBSheduler is here.
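
To make the behaviour described above concrete, here is a minimal framework-free sketch of a gradient-reversal step with a scheduled coefficient. The names ramp_coefficient, reverse_forward and reverse_backward are hypothetical, and the ramp formula is an assumption (a sigmoid-shaped schedule often used with gradient reversal), not necessarily what aToBSheduler computes. In PyTorch the same idea would normally be written as a custom torch.autograd.Function; plain lists are used here only to keep the sketch self-contained.

```python
import math


def ramp_coefficient(step, a, b, gamma=10.0, max_iter=10000):
    """Hypothetical schedule: equals a at step 0 and approaches b as step grows.

    Assumed formula, not confirmed to match easydl's aToBSheduler.
    """
    progress = step / max_iter
    return a + (2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0) * (b - a)


def reverse_forward(values):
    """Forward pass: identity mapping, values pass through unchanged."""
    return list(values)


def reverse_backward(grad_output, coeff):
    """Backward pass: flip the sign of each incoming gradient and scale it by coeff."""
    return [-coeff * g for g in grad_output]


if __name__ == "__main__":
    # Print how the coefficient and the reversed gradient evolve over training steps.
    for step in (0, 1000, 5000, 10000):
        coeff = ramp_coefficient(step, 0.0, 1.0)
        print(step, round(coeff, 4), reverse_backward([1.0, -2.0], coeff))
```

With a = 0.0 and b = 1.0, the layer in this sketch passes no gradient back at the very first step and then reverses it with increasing strength as training proceeds.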

How can I fix this if I want to keep the code at place 1?