Suppose I have loss = w1*loss1 + w2*loss2, where I defined a learnable weight for each loss:

Weightloss1 = torch.FloatTensor([1]).clone().detach().requires_grad_(True)
Weightloss2 = torch.FloatTensor([1]).clone().detach().requires_grad_(True)
opt_1 = torch.optim.Adam(model.parameters(), ...)
opt_2 = torch.optim.Adam([Weightloss1, Weightloss2], ...)

# training
while True:
    model.train()
    for X, Y in train_set:
        pred_Y = model(X)
        loss_1 = Weightloss1 * model.loss_fn_1(pred_Y, Y)
        loss_2 = Weightloss2 * model.loss_fn_2(pred_Y, Y)
        loss = (loss_1 + loss_2) / 2
        opt_1.zero_grad()
        opt_2.zero_grad()
        loss.backward(retain_graph=True)
        opt_1.step()
        opt_2.step()

My question is: when I call loss.backward(retain_graph=True), will PyTorch compute gradients w.r.t. w1 and w2 in addition to the model parameters? If so, how can I access them?

Also, does the order of the step() calls matter? I don't believe it does, though.

Yes, all trainable parameters will receive gradients, which you can access via their .grad attribute. Using retain_graph=True is not needed in this case.
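
For illustration, here is a minimal sketch of where the gradients end up after backward(). The linear model and MSE loss are stand-ins, not the model from the question:

import torch

model = torch.nn.Linear(4, 1)
Weightloss1 = torch.tensor([1.0], requires_grad=True)

X = torch.randn(8, 4)
Y = torch.randn(8, 1)

loss_1 = Weightloss1 * torch.nn.functional.mse_loss(model(X), Y)
loss_1.backward()

# After backward(), every leaf tensor with requires_grad=True holds its gradient:
print(Weightloss1.grad)           # gradient w.r.t. the loss weight
print(model.weight.grad.shape)    # gradients w.r.t. the model parameters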

Thanks. For the previous code, if I additionally add another loss, loss_3, for the params [Weightloss1, Weightloss2] which I set to update in opt_2, something like

If I call opt_2.step(), will it use the gradients w.r.t. the weights collected from both loss and loss_3 to update the tunable parameters [Weightloss1, Weightloss2]?

Gradients are accumulated into the .grad attribute of all trainable parameters that were used in the computation. The optimizer's step() method will therefore use whatever gradients have already been accumulated in its parameters, regardless of how many backward() calls contributed to them.
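
As a rough sketch (with toy scalar losses standing in for your loss and loss_3), calling backward() on two different losses before step() simply sums their contributions into .grad:

import torch

Weightloss1 = torch.tensor([1.0], requires_grad=True)
Weightloss2 = torch.tensor([1.0], requires_grad=True)
opt_2 = torch.optim.Adam([Weightloss1, Weightloss2], lr=0.01)

# Two toy losses that both depend on the loss weights (stand-ins for loss and loss_3).
loss = (Weightloss1 * 2.0 + Weightloss2 * 3.0) / 2
loss_3 = (Weightloss1 - Weightloss2) ** 2

opt_2.zero_grad()
loss.backward()              # writes d(loss)/dW into .grad
loss_3.backward()            # adds d(loss_3)/dW on top of it
print(Weightloss1.grad)      # sum of both contributions
opt_2.step()                 # uses the accumulated .grad values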