How are optimizer.step() and loss.backward() related?

What do you mean by the following sentence: “this assumes the different loss functions are used to compute grads for the same parameters”?

And what does a GAN-like situation refer to?

I am a little confused about when to use (loss1 + loss2).backward() versus loss1.backward(retain_graph=True) followed by loss2.backward(). Why is one more efficient than the other? Are these two methods mathematically equivalent?
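For concreteness, here is a minimal sketch of how I understand the two patterns (model, x, loss1, and loss2 are just placeholders I made up; in both versions optimizer.step() is called only once, after all gradients have accumulated):

```python
import torch

# Toy setup (placeholders): one linear layer and two losses sharing the same graph.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)

def compute_losses():
    out = model(x)            # single forward pass shared by both losses
    loss1 = out.mean()
    loss2 = (out ** 2).mean()
    return loss1, loss2

# Pattern A: sum the losses and do one backward pass through the shared graph.
optimizer.zero_grad()
loss1, loss2 = compute_losses()
(loss1 + loss2).backward()
optimizer.step()

# Pattern B: two backward passes; retain_graph=True keeps the saved buffers
# alive for the second call, and loss2's gradients accumulate into the same .grad.
optimizer.zero_grad()
loss1, loss2 = compute_losses()
loss1.backward(retain_graph=True)
loss2.backward()
optimizer.step()
```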

According to Sam Bobel’s answer on Stack Overflow (“What does the parameter retain_graph mean in the Variable’s backward() method?”), these two methods are not mathematically equivalent if one uses an adaptive gradient optimizer like Adam.
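To make sure I am comparing the right things, here is a sketch of the interleaved variant I think that answer is contrasting with the summed loss: a separate forward/backward/step per loss (again with placeholder losses, and Adam assumed). With an adaptive optimizer the running moment estimates change at every step(), which is where I understand the claimed non-equivalence to come from.

```python
import torch

# Placeholder setup, as above, but with Adam and one optimizer.step() per loss.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 4)

# First update: driven by loss1 alone.
optimizer.zero_grad()
loss1 = model(x).mean()
loss1.backward()
optimizer.step()        # Adam updates its first/second moment estimates here

# Second update: driven by loss2 alone, computed after the first step.
optimizer.zero_grad()
loss2 = (model(x) ** 2).mean()
loss2.backward()
optimizer.step()        # moments are updated again, now from loss2's gradient alone
```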

Do you agree with Sam Bobel?
