How are optimizer.step() and loss.backward() related?

What do you mean by the following sentence: “this assumes the different loss functions are used to compute grads for the same parameters”?

And what does a GAN-like situation refer to?

I am a little confused about when to use (loss1 + loss2).backward() versus loss1.backward(retain_graph=True) followed by loss2.backward(). Why is one more efficient than the other? Are these two methods mathematically equivalent?
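For concreteness, here is a minimal sketch of how I understand the two patterns (model, x, loss1, and loss2 are just placeholders I made up; in both versions optimizer.step() is called only once, after all gradients have accumulated):

```python
import torch

# Toy setup (placeholders): one linear layer and two losses sharing the same graph.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)

def compute_losses():
    out = model(x)            # single forward pass shared by both losses
    loss1 = out.mean()
    loss2 = (out ** 2).mean()
    return loss1, loss2

# Pattern A: sum the losses and do one backward pass through the shared graph.
optimizer.zero_grad()
loss1, loss2 = compute_losses()
(loss1 + loss2).backward()
optimizer.step()

# Pattern B: two backward passes; retain_graph=True keeps the saved buffers
# alive for the second call, and loss2's gradients accumulate into the same .grad.
optimizer.zero_grad()
loss1, loss2 = compute_losses()
loss1.backward(retain_graph=True)
loss2.backward()
optimizer.step()
```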

According to Sam Bobel’s answer on Stack Overflow (“What does the parameter retain_graph mean in the Variable’s backward() method?”), these two methods are not mathematically equivalent if one uses an adaptive gradient optimizer like Adam.
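To make sure I am comparing the right things, here is a sketch of the interleaved variant I think that answer is contrasting with the summed loss: a separate forward/backward/step per loss (again with placeholder losses, and Adam assumed). With an adaptive optimizer the running moment estimates change at every step(), which is where I understand the claimed non-equivalence to come from.

```python
import torch

# Placeholder setup, as above, but with Adam and one optimizer.step() per loss.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 4)

# First update: driven by loss1 alone.
optimizer.zero_grad()
loss1 = model(x).mean()
loss1.backward()
optimizer.step()        # Adam updates its first/second moment estimates here

# Second update: driven by loss2 alone, computed after the first step.
optimizer.zero_grad()
loss2 = (model(x) ** 2).mean()
loss2.backward()
optimizer.step()        # moments are updated again, now from loss2's gradient alone
```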

Do you agree with Sam Bobel?
