How to use the backward functions for multiple losses?

Hi,

I am playing with the DCGAN code in the PyTorch examples.

Replacing errD_real.backward() and errD_fake.backward() with a single errD.backward() after Line 236 makes the training fail (I get nonsense output),

where errD = errD_real + errD_fake. So errD.backward() is apparently not equivalent to errD_real.backward() followed by errD_fake.backward().

I think both should give the same gradients, but in fact they do not. Does anyone know where the issue comes from? If a model has multiple losses, do we really need to call backward() for each loss rather than sum them up and call total_loss.backward()?

To be specific, a change like the following makes DCGAN training fail:

# train with real
output = netD(input)
errD_real = criterion(output, label)
# errD_real.backward()        <-- removed
D_x = output.data.mean()

# train with fake
noise.data.resize_(batch_size, nz, 1, 1)
noise.data.normal_(0, 1)
fake = netG(noise)
label.data.fill_(fake_label)
output = netD(fake.detach())
errD_fake = criterion(output, label)
# errD_fake.backward()        <-- removed
D_G_z1 = output.data.mean()

errD = errD_real + errD_fake
errD.backward()               # a single backward on the summed loss
optimizerD.step()

Does anyone know where the issue comes from?

I am wondering the same thing. Even while updating the generator, if I have more terms in the generator loss apart from errG in the DCGAN example, should I call backward() on each component separately, or combine them and call backward() once?

What exactly happens when backward() is called multiple times? Does it calculate the gradients with respect to the same weights twice, or does the second call accumulate its gradients on top of those computed in the first call?

Looking for answers!

Cheers,
Nabarun

I imagine the error that is thrown comes from trying to backprop Variables that belong to different graph structures. errD_real belongs to the graph of netD(input), whereas errD_fake belongs to netD(netG(noise)) (though it has been detached to only affect netD), so you're trying to take derivatives w.r.t. different inputs of different functions.

Hello @Nabarun_Goswami,

to try to clear this up: in the DCGAN example you have (think of these as mathematical functions; I left out everything that is not relevant)

loss = criterion(netD(real, params)) + criterion(netD(fake, params))

Spelling out the chain rule for the gradient of the loss w.r.t. the params:

∇_params loss = ∇_params netD(real, params) · ∇_netD loss(netD(real, params)) + ∇_params netD(fake, params) · ∇_netD loss(netD(fake, params)),

note how ∇_params netD is evaluated at two different points, namely (real, params) and (fake, params).

The way backpropagation works is to evaluate the gradients at the points of the last forward pass.
In theory, you could also copy the network, make the parameters shared, and then just add the losses to achieve the same effect: the backprop at real would go through one copy and the one at fake through the other.
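
To make that concrete, here is a minimal toy sketch (my own example, not the DCGAN code): the "two copies with shared parameters" amount to the same weight tensor being used at two evaluation points, and backward adds the two per-point contributions into one .grad:

import torch
import torch.nn.functional as F

W = torch.zeros(1, 3, requires_grad=True)           # the shared parameters
real, fake = torch.ones(2, 3), -torch.ones(2, 3)    # two evaluation points

# the same W appears in both terms, like two copies sharing their weights
loss = F.linear(real, W).sum() + F.linear(fake, W).sum()
loss.backward()
print(W.grad)  # contribution at `real` plus contribution at `fake` (here they cancel to 0)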

Now, this is exactly why (I imagine, I didn’t design it) pytorch actually adds to the .grad on backward to allow the following:

  1. You zero the gradients. (Ha, I have forgotten to do that often enough myself.)
  2. You evaluate netD and criterion at the point real.
  3. You backprop to compute derivatives at the point real (= the last evaluated point). The .grads are added to the zeros from step 1.
  4. You evaluate netD and criterion at the point fake.
  5. You backprop to compute derivatives at the point fake (=the last evaluated point). The .grads are added to the .grads you had from step 3.

You have now computed the gradient of loss, but manually split it into the two summands.
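
As a self-contained sketch of steps 1-5 (with a toy stand-in for netD, the criterion, and the labels, not the actual DCGAN code):

import torch
import torch.nn as nn

netD = nn.Linear(3, 1)                                   # toy discriminator
criterion = nn.BCEWithLogitsLoss()
optimizerD = torch.optim.SGD(netD.parameters(), lr=0.01)
real, fake = torch.randn(4, 3), torch.randn(4, 3)
real_label, fake_label = torch.ones(4, 1), torch.zeros(4, 1)

optimizerD.zero_grad()                                   # step 1: zero the gradients
criterion(netD(real), real_label).backward()             # steps 2+3: real-term grads land in .grad
criterion(netD(fake), fake_label).backward()             # steps 4+5: fake-term grads are added
optimizerD.step()                                        # one step along the summed gradient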

If you just added the two parts to the loss and did backward, the netD would not know about step 2 anymore because step 4 overwrote things.

As seen in the Wasserstein GAN code and friends, you can also pass a tensor holding -1 to .backward to emulate terms that are subtracted from the loss.
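
For example (a toy sketch with stand-in losses, not the actual WGAN code):

import torch

params = torch.ones(3, requires_grad=True)
errD_real = (params ** 2).sum()   # stand-ins for the two critic terms
errD_fake = params.sum()

one = torch.tensor(1.0)
errD_real.backward(one)    # accumulates +d(errD_real)/d(params)
errD_fake.backward(-one)   # accumulates -d(errD_fake)/d(params), i.e. a subtracted term
print(params.grad)         # 2*params - 1 = tensor([1., 1., 1.])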

Now, if you just call .backward twice, there are two possibilities:

  • with keep_graph=True (or keep_variables=True in pytorch <= 0.1.12) in the first call, you do the same as in steps 3 and 5: you backprop twice, computing derivatives at the last evaluated point both times. The .grads are added to the .grads you already had, so you end up with twice the gradient at the last evaluated point.
  • without keep_graph=True in the first call, pytorch may free the forward pass's intermediate results once they are processed, and the second call gives an error meaning "the forward info is gone; you used it and didn't tell me to keep it".
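
A small demonstration of both cases, using today's name retain_graph for the argument (my own toy example):

import torch

x = torch.ones(3, requires_grad=True)
loss = (x ** 2).sum()

loss.backward(retain_graph=True)   # keep the forward buffers alive
loss.backward()                    # adds the same gradient again
print(x.grad)                      # tensor([4., 4., 4.]), i.e. twice d(loss)/dx

try:
    loss.backward()                # buffers were freed by the second call
except RuntimeError as err:
    print('third backward fails:', err)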

I hope this helps you decide what fits your project best.

Best regards

Thomas

I think calling backward once on errD_real + errD_fake should be the same as calling it twice separately. The problem is mainly attributable to reusing the same label tensor for the real and the fake data (https://github.com/pytorch/examples/blob/master/dcgan/main.py, L219, L232). If you create two label tensors, the problem should be resolved. Nevertheless, calling backward twice is recommended, as it can save some GPU memory.
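
A sketch of that fix with a toy discriminator (the names here are mine, not the example's):

import torch
import torch.nn as nn

netD = nn.Sequential(nn.Linear(3, 1), nn.Sigmoid())   # toy stand-in for the DCGAN discriminator
criterion = nn.BCELoss()
real_batch, fake_batch = torch.rand(4, 3), torch.rand(4, 3)

real_labels = torch.ones(4, 1)    # two separate tensors,
fake_labels = torch.zeros(4, 1)   # so nothing is filled in place between the two terms

errD = criterion(netD(real_batch), real_labels) + criterion(netD(fake_batch), fake_labels)
errD.backward()                   # a single backward now matches the two-backward version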

This (the claim above that the second forward overwrites what the first term needs) is not correct. PyTorch will not overwrite previous values. Whether or not the same module/parameter is involved has no bearing on how the dynamic graph is built: each use creates separate links in the graph, and the variables saved for the backward pass are stored separately. See the following simple example:

>>> import torch
>>> import torch.nn as nn
>>> from torch.autograd import Variable
>>> l1 = nn.Linear(3, 3)
>>> l1.weight.data.fill_(0)
>>> l1.bias.data.fill_(0)
>>> x = Variable(torch.ones(2, 3))
>>>
>>> # backward one loss only
>>> loss1 = (l1(x) - 1).abs().sum()
>>> loss1.backward()
>>> l1.weight.grad
Variable containing:
-2 -2 -2
-2 -2 -2
-2 -2 -2
[torch.FloatTensor of size 3x3]

>>> 
>>> # backward the other loss only
>>> l1.weight.grad = None
>>> loss2 = (l1(x) + 1).abs().sum()
>>> loss2.backward()
>>> l1.weight.grad
Variable containing:
 2  2  2
 2  2  2
 2  2  2
[torch.FloatTensor of size 3x3]

>>> 
>>> # backward both losses together
>>> l1.weight.grad = None
>>> loss1 = (l1(x) - 1).abs().sum()
>>> loss2 = (l1(x) + 1).abs().sum()
>>> (loss1+loss2).backward()
>>> l1.weight.grad
Variable containing:
 0  0  0
 0  0  0
 0  0  0
[torch.FloatTensor of size 3x3]

Indeed, upon rereading, my paragraph above ("If you just added the two parts to the loss and did backward, the netD would not know about step 2 anymore…") seems to be misleading (but I cannot edit that post anymore); thank you for pointing that out.

Indeed, what doing the backward in two pieces saves is having to keep two computation graphs alive at the same time; it has nothing to do with the network weights.

Best regards

Thomas

Given that the changes made by the OP to DCGAN don’t change the flow of the gradients, what could be the reason for the nonsensical outputs he obtained?

Good catch!
It's the in-place operation label.data.fill_(fake_label) that breaks the real-part sub-graph: errD_real's backward still needs the original (real) label values, but by the time the summed errD.backward() runs, they have been overwritten.
Adding the link for reference.
https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd
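
A tiny reproduction of the mechanism (my own example; the multiplication saves its operand for backward, just as the criterion saves the label):

import torch

x = torch.ones(3, requires_grad=True)
t = torch.full((3,), 2.0)

loss = (x * t).sum()    # autograd saves t, since d(loss)/dx = t
t.data.fill_(5.0)       # .data sidesteps the version check, like label.data.fill_ above
loss.backward()
print(x.grad)           # tensor([5., 5., 5.]): the stale graph silently used the new values

# a plain t.fill_(5.0) would instead raise a RuntimeError at backward time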

So what is the real reason for this issue? I ran a test:

#%% test
import torch
import torch.nn as nn
#%%
layer = nn.Linear(3, 3)
layer.weight.data.fill_(0.1)
layer.weight.grad = None

# case 1: forward and backward each loss right away
x1 = torch.ones((3, 3), requires_grad=True)
y1 = layer(x1)
loss1 = torch.sum(y1)
loss1.backward()
print(x1.grad)
print(layer.weight.grad)

x2 = torch.ones((3, 3), requires_grad=True)
x2.data.fill_(0.2)
y2 = layer(2 * x2)
loss2 = torch.sum(y2)
loss2.backward()
print(x2.grad)
print(layer.weight.grad)  # accumulated on top of the first backward
#%%
'''
tensor([[0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000]])
tensor([[3., 3., 3.],
        [3., 3., 3.],
        [3., 3., 3.]])
tensor([[0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000]])
tensor([[4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000]])
'''
#%%
layer = nn.Linear(3, 3)
layer.weight.data.fill_(0.1)

# case 2: run both forwards first, then backward each loss separately
x1 = torch.ones((3, 3), requires_grad=True)
x2 = torch.ones((3, 3), requires_grad=True)
x2.data.fill_(0.2)

y1 = layer(x1)
y2 = layer(2 * x2)

loss1 = torch.sum(y1)
loss1.backward()
print(x1.grad)
print(layer.weight.grad)

loss2 = torch.sum(y2)
loss2.backward()
print(x2.grad)
print(layer.weight.grad)
#%%
'''
tensor([[0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000]])
tensor([[3., 3., 3.],
        [3., 3., 3.],
        [3., 3., 3.]])
tensor([[0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000]])
tensor([[4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000]])
'''
#%%
layer = nn.Linear(3, 3)
layer.weight.data.fill_(0.1)

# case 3: sum the two losses and call backward once
x1 = torch.ones((3, 3), requires_grad=True)
x2 = torch.ones((3, 3), requires_grad=True)
x2.data.fill_(0.2)

y1 = layer(x1)
y2 = layer(2 * x2)

loss1 = torch.sum(y1)
loss2 = torch.sum(y2)
loss = loss1 + loss2
loss.backward()

print(x1.grad)
print(x2.grad)
print(layer.weight.grad)
#%%
'''
tensor([[0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000]])
tensor([[0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000]])
tensor([[4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000]])
'''

Good catch!
So I think that is the correct answer. Thanks.

Pointing out that keep_graph is now called retain_graph.

Sorry, but may I ask why it saves memory?

Another question: do you mean the code at the referenced link (L219, L232) is wrong because it can cause this problem (different results between backward twice and backward once)? But why…? Why can the same label tensor cause this difference?

The last thing, and more important: I didn't use any in-place operations in my code like label.fill_(fake_label), but I still encountered the problem that the training result of calling backward twice is different from adding the losses together and calling backward once.

The latter causes NaN gradients and NaN losses during training, and I don't know why…

Each time, I feed one pair (two different inputs with the same label) to my network. To be clear:

# method 1
out1 = model(input1)
loss1 = criterion(out1,label)
loss1.backward()
optimizer.step()
optimizer.zero_grad()

out2 = model(input2)
loss2 = criterion(out2,label)
loss2.backward()
optimizer.step()
optimizer.zero_grad()

# method 2
out1 = model(input1)
out2 = model(input2)
loss1 = criterion(out1,label)
loss2 = criterion(out2,label)
lossAdd = loss1 + loss2
lossAdd.backward()
optimizer.step()
optimizer.zero_grad()

In my experiment, method 1 and method 2 give different training results, and method 2 causes NaN problems after several epochs (training does not converge). May I ask why this happens?

Is the reason that I used the same label tensor? Or is it because input1 and input2 are too different from each other, so their losses cannot be added together?

Very much looking forward to your reply…

@han @Nabarun_Goswami @tymokvo @tom @Cysu @SimonW @Sten_Sootla @Weifeng @Atcold @yingda.yin

Sorry to ping so many people on such an old post, but I have really been stuck here for more than a month and badly need help…

Not a professional opinion, but: what your method 1 does is "compute grad 1 → optimize along grad 1 → compute grad 2 → optimize along grad 2", so when the second forward runs, the model parameters have already been updated once. Your method 2 is "compute grad 1 → compute grad 2 and add it to grad 1 to form grad_total → optimize along grad_total"; there is only one optimization step, along the combined gradient, instead of two separate steps. This difference is like having two datasets: method 1 alternately feeds a batch from one dataset at a time, while method 2 mixes two batches, one from each dataset, and feeds them together.
As for why backwarding twice saves memory: once a loss has been backwarded, the intermediate tensors saved during its forward pass are freed (that is why you cannot backward twice through one forward), whereas if you accumulate the losses and backward them together, those buffers are still held while the second forward step runs.
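
A toy check of the first point (my own sketch): summing the losses gives exactly the gradient of calling backward twice without stepping in between, so the difference in method 1 really is the optimizer.step() between the two backwards:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
input1, input2 = torch.randn(4, 3), torch.randn(4, 3)
label = torch.randn(4, 1)

model.zero_grad()                                  # two backwards, no step in between
criterion(model(input1), label).backward()
criterion(model(input2), label).backward()
grad_twice = model.weight.grad.clone()

model.zero_grad()                                  # one backward on the summed loss
(criterion(model(input1), label) + criterion(model(input2), label)).backward()
grad_once = model.weight.grad.clone()

print(torch.allclose(grad_twice, grad_once))       # True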

Hello @tom,

I have a quick follow-up comment on your point above that doing the backward in two pieces saves keeping two computation graphs alive at once.

I took the code that @SimonW provided and tested its memory consumption by moving both the input tensor x and the network onto a CUDA device and then simply calling torch.cuda.memory_allocated(device). For me, it turns out that the code with

loss = (l1(x) - 1).abs().sum() + (l1(x) + 1).abs().sum()
loss.backward()

consumes 3.0 kB in total (also accounting for x and the network being moved onto the device), whereas

loss_first_term = (l1(x) - 1).abs().sum()
loss_first_term.backward()

loss_second_term = (l1(x) + 1).abs().sum()
loss_second_term.backward()

consumes 3.5 kB… This seems at odds with your statement, or did you mean it in terms of running time?

Depending on what you are doing in l1, there will likely be lower-order effects. For this function, you'd probably assign l1(x) to an intermediate and be done with it in one step…
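
For instance (a sketch reusing l1 and x from the snippets above):

t = l1(x)                                  # one forward through the layer
loss = (t - 1).abs().sum() + (t + 1).abs().sum()
loss.backward()                            # only one graph through l1 needs to be kept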