Is the loss function parallelized when using DataParallel?

el3ment · May 22, 2017, 5:19pm

When using DataParallel to wrap my module, do I need to do anything to also parallelize the loss functions?

For example, let’s say that I have a large batch size and large output tensors, and I compute MSE against a target. This operation would benefit from splitting the batch across multiple GPUs, but I’m not sure if the following code does that:

model = MyModule()
model = nn.parallel.DataParallel(model, device_ids=range(args.number_gpus))
model.cuda()

output = model(data)
criterion = nn.MSELoss()
criterion.cuda()
loss = criterion(output, target)
loss.backward()
optimizer.step()

This is a simplification based on the ImageNet example.

In my case, I have a much bigger custom loss module that includes some calls to a VGG network to estimate perceptual loss, and I’m not sure if I am maximizing performance. I tried computing loss as part of the forward function in MyModule, but this led to recursion errors during the backward step.

smth · May 28, 2017, 5:27pm

You can wrap the loss function inside a DataParallel too, if you’d like.

el3ment · May 28, 2017, 5:39pm

Would that result in one unnecessary scatter/gather?

smth · May 28, 2017, 5:48pm

It would. If you’re worried about that, you can put your DataParallel around your model + loss function together.
But depending on how many parameters you have in your fully connected layer, it might not work out in terms of speed.
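
A minimal sketch of the "DataParallel around model + loss" idea, for illustration only. ModelWithLoss is a hypothetical wrapper name, and nn.Linear stands in for MyModule; neither comes from this thread. The loss is returned as a 1-element tensor so that the per-GPU values gather into a vector. On a machine without multiple GPUs the wrapper simply calls the module once.

```python
import torch
import torch.nn as nn


class ModelWithLoss(nn.Module):
    """Hypothetical wrapper: runs the model and the criterion in one forward pass."""

    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, data, target):
        output = self.model(data)
        # Return a 1-element tensor so per-GPU losses gather into a vector.
        return self.criterion(output, target).unsqueeze(0)


model = nn.Linear(8, 4)  # stand-in for MyModule (assumption)
wrapped = nn.DataParallel(ModelWithLoss(model, nn.MSELoss()))
if torch.cuda.is_available():
    wrapped = wrapped.cuda()

data, target = torch.randn(16, 8), torch.randn(16, 4)
losses = wrapped(data, target)  # one entry per replica; a single entry on CPU or one GPU
loss = losses.mean()            # reduce to a scalar before calling backward
loss.backward()
```

With this layout only the small loss vector is gathered, rather than the full output tensor. Whether it is actually faster depends on the model, as noted above.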

dragen · January 22, 2018, 8:33am

Hi, if I compute the loss inside forward and run on 2 GPUs, the forward function will return [loss1, loss2]. Should I sum them (loss1 + loss2) and then call backward on the result,
or call loss1.backward() and loss2.backward() separately?

smth · January 22, 2018, 3:31pm

The forward function will return output = torch.cat([loss1, loss2]), so you can do output.backward(torch.ones(2)).

GriffinLiang · May 10, 2018, 9:29am

If output is [loss1, loss2], can I get the final loss as loss = output.sum() and then call loss.backward()?

henrique · December 30, 2018, 9:14am

Why wouldn’t you just get the mean like most loss functions do on regular batches?
i.e.

loss = criterion(output, target).mean()
loss.backward()
This seems to work fine.
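
A small CPU-only sketch of why taking the mean is a reasonable choice, assuming the batch is split into equal chunks and the criterion uses its default mean reduction. Here the two replicas are simulated with chunk() rather than real GPUs, and the tensor shapes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.MSELoss()  # default reduction="mean"
output = torch.randn(8, 3)
target = torch.randn(8, 3)

# Simulate what two replicas would each compute: the loss on one half of the batch.
per_replica = torch.stack([
    criterion(o, t)
    for o, t in zip(output.chunk(2), target.chunk(2))
])

full_batch = criterion(output, target)

# Mean of the per-replica losses matches the loss on the whole batch.
mean_matches = torch.allclose(per_replica.mean(), full_batch)
# Sum of the per-replica losses is the whole-batch loss scaled by the number of replicas.
sum_is_scaled = torch.allclose(per_replica.sum(), 2 * full_batch)
print(mean_matches, sum_is_scaled)
```

So mean() keeps the loss (and hence the gradient) on the same scale as single-GPU training, while sum() scales it by the number of replicas. If the chunks are unequal in size, the mean of per-replica losses is only an approximation of the whole-batch loss.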

halahup · April 9, 2019, 10:53pm

What does output.backward(torch.ones(2)) do? Could you please elaborate?

smth · April 10, 2019, 5:32am

We are passing a gradient of ones to backward. Usually, if the loss is a scalar and you call loss.backward(), it is implied that you mean loss.backward(torch.ones(1)). Because the loss here actually has two elements, calling output.backward() with no argument raises an error asking for gradients.
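
A short CPU-only sketch of this behaviour. The function two_losses is a made-up stand-in for the gathered [loss1, loss2] vector. It shows that passing ones to backward gives the same gradient as summing first, and that calling backward with no argument on a two-element tensor fails.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)


def two_losses(w):
    # Stand-in for the gathered [loss1, loss2] vector (illustrative only).
    return torch.stack([(w ** 2).sum(), (3 * w).sum()])


# Option 1: pass a gradient of ones, one entry per loss element.
two_losses(w).backward(torch.ones(2))
grad_ones = w.grad.clone()

# Option 2: reduce to a scalar with sum(), then call backward with no argument.
w.grad = None
two_losses(w).sum().backward()
grad_sum = w.grad.clone()

same = torch.allclose(grad_ones, grad_sum)
print(same)

# Calling backward() with no argument on a non-scalar tensor raises an error.
raised = False
try:
    two_losses(w).backward()
except RuntimeError as err:
    raised = True
    print("backward() without a gradient failed:", err)
```

Both options compute the gradient of loss1 + loss2, so output.backward(torch.ones(2)) and output.sum().backward() are interchangeable here. On GPU, the ones tensor would need to live on the same device as output, for example via torch.ones_like(output).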

Dexter_JU · August 2, 2019, 3:28pm

I have the same question as the posters above.
Can I get the final loss with loss = output.sum() and then call loss.backward()? (I saw some blog posts doing it that way.)
Is that different from what you suggested here?

sunshineatnoon · September 3, 2019, 6:40pm

Same question here. Did you find any difference between using output.sum().backward() and output.backward(torch.ones(2))?