Accessing gradients after calling loss.backward()

Suppose I have loss = w1 * loss1 + w2 * loss2, where I defined a learnable weight for each loss:

Weightloss1 = torch.FloatTensor([1]).clone().detach().requires_grad_(True)
Weightloss2 = torch.FloatTensor([1]).clone().detach().requires_grad_(True)

opt_1 = torch.optim.Adam(model.parameters(), ...)
opt_2 = torch.optim.Adam([Weightloss1, Weightloss2], ...)

# training
while True:
    model.train()
    for X, Y in train_set:
        pred_Y = model(X)
        loss_1 = Weightloss1 * model.loss_fn_1(pred_Y, Y)
        loss_2 = Weightloss2 * model.loss_fn_2(pred_Y, Y)
        loss = torch.div(torch.add(loss_1,loss_2), 2)

        opt_1.zero_grad()
        opt_2.zero_grad()
        loss.backward(retain_graph=True)

        opt_1.step()
        opt_2.step()

My question is: when I call loss.backward(retain_graph=True), will PyTorch calculate gradients w.r.t. Weightloss1 and Weightloss2 in addition to the model parameters? If so, how can I access them?

Also, does the order of the step() calls matter? I do not believe it does.

Yes, all trainable parameters that were used in the forward pass will receive gradients, which you can access via their .grad attribute. Using retain_graph=True is not needed in this case.
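
For example (a minimal sketch reusing model, Weightloss1, and Weightloss2 from your snippet), you can inspect the gradients right after the backward call:

    loss.backward()
    print(Weightloss1.grad)              # gradient of loss w.r.t. Weightloss1
    print(Weightloss2.grad)              # gradient of loss w.r.t. Weightloss2
    for name, param in model.named_parameters():
        print(name, param.grad.norm())   # gradients of the model parameters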

Thanks! For the previous code, suppose I additionally add another loss, loss_3, for the parameters [Weightloss1, Weightloss2] that opt_2 updates, something like

    pred_Y = model(X)
    loss_1 = Weightloss1 * model.loss_fn_1(pred_Y, Y)
    loss_2 = Weightloss2 * model.loss_fn_2(pred_Y, Y)
    loss = torch.div(torch.add(loss_1,loss_2), 2)
    loss_3 = loss_fn_3(Weightloss1 + Weightloss2, target_Weightloss)
    opt_1.zero_grad()
    opt_2.zero_grad()
    loss.backward()
    loss_3.backward()
    opt_1.step()
    opt_2.step()

If I call opt_2.step(), will it use the gradients collected w.r.t. [Weightloss1, Weightloss2] from both loss and loss_3 to update those tunable parameters?

Yes, gradients are accumulated into the .grad attribute of all trainable parameters that were used in the computation. The optimizer’s step() method will thus use the gradients that have already been accumulated in the corresponding parameters.
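
To make the accumulation visible, here is a sketch using the variables from your snippet; the prints only illustrate how the second backward call adds on top of the first:

    opt_1.zero_grad()
    opt_2.zero_grad()
    loss.backward()              # writes d(loss)/d(Weightloss1) into Weightloss1.grad
    print(Weightloss1.grad)      # contribution from loss only
    loss_3.backward()            # adds d(loss_3)/d(Weightloss1) on top of it
    print(Weightloss1.grad)      # now the sum of both contributions
    opt_1.step()
    opt_2.step()                 # uses the accumulated (summed) gradients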