How to use the backward functions for multiple losses?

Hi,

I am playing with the DCGAN code in the PyTorch examples.

Replacing errD_real.backward() and errD_fake.backward() with a single errD.backward() after Line 236 makes the training fail (I get nonsense output),

where errD = errD_real + errD_fake. So errD.backward() is apparently not equivalent to errD_real.backward() followed by errD_fake.backward().

I think both should give the same gradients, but in fact they do not. Does anyone know where the issue comes from? If a model has multiple losses, do we really need to call backward() for each loss rather than sum them up and call total_loss.backward()?

To be specific, a change like the following makes DCGAN training fail:

# train with real
output = netD(input)
errD_real = criterion(output, label)
# errD_real.backward()        <-- removed
D_x = output.data.mean()

# train with fake
noise.data.resize_(batch_size, nz, 1, 1)
noise.data.normal_(0, 1)
fake = netG(noise)
label.data.fill_(fake_label)
output = netD(fake.detach())
errD_fake = criterion(output, label)
# errD_fake.backward()        <-- removed
D_G_z1 = output.data.mean()

errD = errD_real + errD_fake
errD.backward()               # a single backward on the summed loss
optimizerD.step()

Does anyone know where the issue comes from?

I am wondering the same thing. Even while updating the generator, if I have more terms in the generator loss apart from errG in the DCGAN example, should I call backward() on each component separately, or combine them and call backward() once?

What exactly happens when backward() is called multiple times? Does it calculate the gradients with respect to the same weights twice, or does the second call accumulate its gradients on top of those computed in the first call?

Looking for answers!

Cheers,
Nabarun

I imagine the error that is thrown comes from trying to backprop Variables that belong to different graph structures. errD_real belongs to the graph of netD(input), whereas errD_fake belongs to netD(netG(noise)) (though it has been detached to only affect netD), so you're trying to take derivatives w.r.t. different inputs of different functions.

Hello @Nabarun_Goswami,

to try to clear this up: in the DCGAN example you have (think of these as mathematical functions; I left out everything that is not relevant)

loss = criterion(netD(real, params)) + criterion(netD(fake, params))

Spelling out the chain rule for the gradient of the loss w.r.t. the params:

∇_params loss = ∇_params netD(real, params) · ∇_netD loss(netD(real, params)) + ∇_params netD(fake, params) · ∇_netD loss(netD(fake, params)),

note how ∇_params netD is evaluated at two different points, namely (real, params) and (fake, params).

The way backpropagation works is to evaluate the gradients at the points of the last forward pass.
In theory, you could also copy the network, make the parameters shared, and then just add the losses to achieve the same effect: the backprop at real would go through one copy and the one at fake through the other.
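
To make that concrete, here is a minimal toy sketch (my own example, not the DCGAN code): the "two copies with shared parameters" amount to the same weight tensor being used at two evaluation points, and backward adds the two per-point contributions into one .grad:

import torch
import torch.nn.functional as F

W = torch.zeros(1, 3, requires_grad=True)           # the shared parameters
real, fake = torch.ones(2, 3), -torch.ones(2, 3)    # two evaluation points

# the same W appears in both terms, like two copies sharing their weights
loss = F.linear(real, W).sum() + F.linear(fake, W).sum()
loss.backward()
print(W.grad)  # contribution at `real` plus contribution at `fake` (here they cancel to 0)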

Now, this is exactly why (I imagine, I didn’t design it) pytorch actually adds to the .grad on backward to allow the following:

  1. You zero the gradients. (Ha, I have forgotten to do that often enough myself.)
  2. You evaluate netD and criterion at the point real.
  3. You backprop to compute derivatives at the point real (= the last evaluated point). The .grads are added to the zeros from step 1.
  4. You evaluate netD and criterion at the point fake.
  5. You backprop to compute derivatives at the point fake (=the last evaluated point). The .grads are added to the .grads you had from step 3.

You have now computed the gradient of loss, but manually split it into the two summands.
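
As a self-contained sketch of steps 1-5 (with a toy stand-in for netD, the criterion, and the labels, not the actual DCGAN code):

import torch
import torch.nn as nn

netD = nn.Linear(3, 1)                                   # toy discriminator
criterion = nn.BCEWithLogitsLoss()
optimizerD = torch.optim.SGD(netD.parameters(), lr=0.01)
real, fake = torch.randn(4, 3), torch.randn(4, 3)
real_label, fake_label = torch.ones(4, 1), torch.zeros(4, 1)

optimizerD.zero_grad()                                   # step 1: zero the gradients
criterion(netD(real), real_label).backward()             # steps 2+3: real-term grads land in .grad
criterion(netD(fake), fake_label).backward()             # steps 4+5: fake-term grads are added
optimizerD.step()                                        # one step along the summed gradient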

If you just added the two parts to the loss and did backward, the netD would not know about step 2 anymore because step 4 overwrote things.

As seen in the Wasserstein GAN code and friends, you can also pass a tensor holding -1 to .backward to emulate terms that are subtracted from the loss.
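
For example (a toy sketch with stand-in losses, not the actual WGAN code):

import torch

params = torch.ones(3, requires_grad=True)
errD_real = (params ** 2).sum()   # stand-ins for the two critic terms
errD_fake = params.sum()

one = torch.tensor(1.0)
errD_real.backward(one)    # accumulates +d(errD_real)/d(params)
errD_fake.backward(-one)   # accumulates -d(errD_fake)/d(params), i.e. a subtracted term
print(params.grad)         # 2*params - 1 = tensor([1., 1., 1.])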

Now, if you just call .backward twice, there are two possibilities:

  • with keep_graph=True (or keep_variables=True in pytorch <= 0.1.12) in the first call, you do the same as in steps 3 and 5: you backprop twice, computing derivatives at the last evaluated point both times. The .grads are added to the .grads you already had, so you end up with twice the gradient at the last evaluated point.
  • without keep_graph=True in the first call, pytorch may free the forward pass's intermediate results once they are processed, and the second call gives an error meaning "the forward info is gone; you used it and didn't tell me to keep it".
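
A small demonstration of both cases, using today's name retain_graph for the argument (my own toy example):

import torch

x = torch.ones(3, requires_grad=True)
loss = (x ** 2).sum()

loss.backward(retain_graph=True)   # keep the forward buffers alive
loss.backward()                    # adds the same gradient again
print(x.grad)                      # tensor([4., 4., 4.]), i.e. twice d(loss)/dx

try:
    loss.backward()                # buffers were freed by the second call
except RuntimeError as err:
    print('third backward fails:', err)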

I hope this helps you decide what fits your project best.

Best regards

Thomas

I think calling backward once on errD_real + errD_fake should be the same as calling it twice separately. The problem is mainly attributable to reusing the same label tensor for the real and the fake data (https://github.com/pytorch/examples/blob/master/dcgan/main.py, L219, L232). If you create two label tensors, the problem should be resolved. Nevertheless, calling backward twice is recommended, as it can save some GPU memory.
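
A sketch of that fix with a toy discriminator (the names here are mine, not the example's):

import torch
import torch.nn as nn

netD = nn.Sequential(nn.Linear(3, 1), nn.Sigmoid())   # toy stand-in for the DCGAN discriminator
criterion = nn.BCELoss()
real_batch, fake_batch = torch.rand(4, 3), torch.rand(4, 3)

real_labels = torch.ones(4, 1)    # two separate tensors,
fake_labels = torch.zeros(4, 1)   # so nothing is filled in place between the two terms

errD = criterion(netD(real_batch), real_labels) + criterion(netD(fake_batch), fake_labels)
errD.backward()                   # a single backward now matches the two-backward version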

This (the claim above that the second forward overwrites what the first term needs) is not correct. PyTorch will not overwrite previous values. Whether or not the same module/parameter is involved has no bearing on how the dynamic graph is built: each use creates separate links in the graph, and the variables saved for the backward pass are stored separately. See the following simple example:

>>> import torch
>>> import torch.nn as nn
>>> from torch.autograd import Variable
>>> l1 = nn.Linear(3, 3)
>>> l1.weight.data.fill_(0)
>>> l1.bias.data.fill_(0)
>>> x = Variable(torch.ones(2, 3))
>>>
>>> # backward one loss only
>>> loss1 = (l1(x) - 1).abs().sum()
>>> loss1.backward()
>>> l1.weight.grad
Variable containing:
-2 -2 -2
-2 -2 -2
-2 -2 -2
[torch.FloatTensor of size 3x3]

>>> 
>>> # backward the other loss only
>>> l1.weight.grad = None
>>> loss2 = (l1(x) + 1).abs().sum()
>>> loss2.backward()
>>> l1.weight.grad
Variable containing:
 2  2  2
 2  2  2
 2  2  2
[torch.FloatTensor of size 3x3]

>>> 
>>> # backward both losses together
>>> l1.weight.grad = None
>>> loss1 = (l1(x) - 1).abs().sum()
>>> loss2 = (l1(x) + 1).abs().sum()
>>> (loss1+loss2).backward()
>>> l1.weight.grad
Variable containing:
 0  0  0
 0  0  0
 0  0  0
[torch.FloatTensor of size 3x3]

Indeed, upon rereading, my paragraph above ("If you just added the two parts to the loss and did backward, the netD would not know about step 2 anymore…") seems to be misleading (but I cannot edit that post anymore); thank you for pointing that out.

Indeed, what doing the backward in two pieces saves is having to keep two computation graphs alive at the same time; it has nothing to do with the network weights.

Best regards

Thomas

Given that the changes made by the OP to DCGAN don’t change the flow of the gradients, what could be the reason for the nonsensical outputs he obtained?

Good catch!
It's the in-place operation label.data.fill_(fake_label) that breaks the real-part sub-graph: errD_real's backward still needs the original (real) label values, but by the time the summed errD.backward() runs, they have been overwritten.
Adding the link for reference.
https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd
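
A tiny reproduction of the mechanism (my own example; the multiplication saves its operand for backward, just as the criterion saves the label):

import torch

x = torch.ones(3, requires_grad=True)
t = torch.full((3,), 2.0)

loss = (x * t).sum()    # autograd saves t, since d(loss)/dx = t
t.data.fill_(5.0)       # .data sidesteps the version check, like label.data.fill_ above
loss.backward()
print(x.grad)           # tensor([5., 5., 5.]): the stale graph silently used the new values

# a plain t.fill_(5.0) would instead raise a RuntimeError at backward time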

So what is the real reason for this issue? I ran a test:

#%% test
import torch
import torch.nn as nn
#%%
layer = nn.Linear(3, 3)
layer.weight.data.fill_(0.1)
layer.weight.grad = None

# case 1: forward and backward each loss right away
x1 = torch.ones((3, 3), requires_grad=True)
y1 = layer(x1)
loss1 = torch.sum(y1)
loss1.backward()
print(x1.grad)
print(layer.weight.grad)

x2 = torch.ones((3, 3), requires_grad=True)
x2.data.fill_(0.2)
y2 = layer(2 * x2)
loss2 = torch.sum(y2)
loss2.backward()
print(x2.grad)
print(layer.weight.grad)  # accumulated on top of the first backward
#%%
'''
tensor([[0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000]])
tensor([[3., 3., 3.],
        [3., 3., 3.],
        [3., 3., 3.]])
tensor([[0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000]])
tensor([[4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000]])
'''
#%%
layer = nn.Linear(3, 3)
layer.weight.data.fill_(0.1)

# case 2: run both forwards first, then backward each loss separately
x1 = torch.ones((3, 3), requires_grad=True)
x2 = torch.ones((3, 3), requires_grad=True)
x2.data.fill_(0.2)

y1 = layer(x1)
y2 = layer(2 * x2)

loss1 = torch.sum(y1)
loss1.backward()
print(x1.grad)
print(layer.weight.grad)

loss2 = torch.sum(y2)
loss2.backward()
print(x2.grad)
print(layer.weight.grad)
#%%
'''
tensor([[0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000]])
tensor([[3., 3., 3.],
        [3., 3., 3.],
        [3., 3., 3.]])
tensor([[0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000]])
tensor([[4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000]])
'''
#%%
layer = nn.Linear(3, 3)
layer.weight.data.fill_(0.1)

# case 3: sum the two losses and call backward once
x1 = torch.ones((3, 3), requires_grad=True)
x2 = torch.ones((3, 3), requires_grad=True)
x2.data.fill_(0.2)

y1 = layer(x1)
y2 = layer(2 * x2)

loss1 = torch.sum(y1)
loss2 = torch.sum(y2)
loss = loss1 + loss2
loss.backward()

print(x1.grad)
print(x2.grad)
print(layer.weight.grad)
#%%
'''
tensor([[0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000],
        [0.3000, 0.3000, 0.3000]])
tensor([[0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000],
        [0.6000, 0.6000, 0.6000]])
tensor([[4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000],
        [4.2000, 4.2000, 4.2000]])
'''

Good catch!
So I think that is the correct answer. Thanks.

Pointing out that keep_graph is now called retain_graph.

Sorry, but may I ask why it saves memory?

Another question: do you mean the code at the referenced link (L219, L232) is wrong because it can cause this problem (different results between backward twice and backward once)? But why…? Why can the same label tensor cause this difference?

The last thing, and more important: I didn't use any in-place operations in my code like label.fill_(fake_label), but I still encountered the problem that the training result of calling backward twice is different from adding the losses together and calling backward once.

The latter causes NaN gradients and NaN losses during training, and I don't know why…

Each time, I feed one pair (two different inputs with the same label) to my network. To be clear:

# method 1
out1 = model(input1)
loss1 = criterion(out1,label)
loss1.backward()
optimizer.step()
optimizer.zero_grad()

out2 = model(input2)
loss2 = criterion(out2,label)
loss2.backward()
optimizer.step()
optimizer.zero_grad()

# method 2
out1 = model(input1)
out2 = model(input2)
loss1 = criterion(out1,label)
loss2 = criterion(out2,label)
lossAdd = loss1 + loss2
lossAdd.backward()
optimizer.step()
optimizer.zero_grad()

In my experiment, method 1 and method 2 give different training results, and method 2 causes NaN problems after several epochs (training does not converge). May I ask why this happens?

Is the reason that I used the same label tensor? Or is it because input1 and input2 are too different from each other, so their losses cannot be added together?

Very much looking forward to your reply…

@han @Nabarun_Goswami @tymokvo @tom @Cysu @SimonW @Sten_Sootla @Weifeng @Atcold @yingda.yin

Sorry to ping so many people on such an old post, but I have really been stuck here for more than a month and badly need help…

Not a professional opinion, but: what your method 1 does is "compute grad 1 → optimize along grad 1 → compute grad 2 → optimize along grad 2", so when the second forward runs, the model parameters have already been updated once. Your method 2 is "compute grad 1 → compute grad 2 and add it to grad 1 to form grad_total → optimize along grad_total"; there is only one optimization step, along the combined gradient, instead of two separate steps. This difference is like having two datasets: method 1 alternately feeds a batch from one dataset at a time, while method 2 mixes two batches, one from each dataset, and feeds them together.
As for why backwarding twice saves memory: once a loss has been backwarded, the intermediate tensors saved during its forward pass are freed (that is why you cannot backward twice through one forward), whereas if you accumulate the losses and backward them together, those buffers are still held while the second forward step runs.
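
A toy check of the first point (my own sketch): summing the losses gives exactly the gradient of calling backward twice without stepping in between, so the difference in method 1 really is the optimizer.step() between the two backwards:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
input1, input2 = torch.randn(4, 3), torch.randn(4, 3)
label = torch.randn(4, 1)

model.zero_grad()                                  # two backwards, no step in between
criterion(model(input1), label).backward()
criterion(model(input2), label).backward()
grad_twice = model.weight.grad.clone()

model.zero_grad()                                  # one backward on the summed loss
(criterion(model(input1), label) + criterion(model(input2), label)).backward()
grad_once = model.weight.grad.clone()

print(torch.allclose(grad_twice, grad_once))       # True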

Hello @tom,

I have a quick follow-up comment on your point above that doing the backward in two pieces saves keeping two computation graphs alive at once.

I took the code that @SimonW provided and tested its memory consumption by moving both the input tensor x and the network onto a CUDA device and then simply calling torch.cuda.memory_allocated(device). For me, it turns out that the code with

loss = (l1(x) - 1).abs().sum() + (l1(x) + 1).abs().sum()
loss.backward()

consumes 3.0 kB in total (also accounting for x and the network being moved onto the device), whereas

loss_first_term = (l1(x) - 1).abs().sum()
loss_first_term.backward()

loss_second_term = (l1(x) + 1).abs().sum()
loss_second_term.backward()

consumes 3.5 kB… This seems at odds with your statement, or did you mean it in terms of running time?

Depending on what you are doing in l1, there will likely be lower-order effects. For this function, you'd probably assign l1(x) to an intermediate and be done with it in one step…
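
For instance (a sketch reusing l1 and x from the snippets above):

t = l1(x)                                  # one forward through the layer
loss = (t - 1).abs().sum() + (t + 1).abs().sum()
loss.backward()                            # only one graph through l1 needs to be kept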