How is loss.backward() calculated in PyTorch Lightning?

I understand that when training_step() returns the loss, the automatic optimization code
(link) takes care of calling loss.backward().

Can someone tell me what the difference would be in the loss.backward() call under automatic optimization for the following two training_step() scenarios:

Scenario 1:

def training_step(self, batch: list, batch_idx: int):
    x, y = batch
    output = self.model(x)  # assumes the model is an attribute of this LightningModule
    loss = self.loss_func(output, y)

    return loss

Scenario 2:

def training_step(self, batch: list, batch_idx: int):
    x, y = batch
    output = self.model(x)  # assumes the model is an attribute of this LightningModule
    loss = self.loss_func(output, y)
    metric = self.metric(output, y)

    train_log = {"loss": loss, "metric": metric}

    return train_log

My worry is that loss.backward() in the second scenario backpropagates through both loss and metric instead of just loss.
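For context, this worry can be checked in plain PyTorch: backward() only follows the computation graph of the tensor it is called on, so computing an extra metric from the same output changes nothing unless backward() is also called on it. A minimal sketch with hypothetical tensors:

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

output = w * x
loss = output.sum()            # what backward() is called on
metric = (output ** 2).sum()   # extra metric, never backpropagated

loss.backward()
# w.grad is d(sum(w*x))/dw = x, completely unaffected by `metric`
print(w.grad)
assert torch.allclose(w.grad, x)
```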

I opened the pytorch-lightning source files in my conda environment to understand how the automatic optimization works when I return a dictionary instead of a Tensor, but it didn't lead to much.

Any help/hint is appreciated. Thanks!

Since this issue is PyTorch-Lightning-specific you might want to cross-post the question into their discussion board.

Sure, I will post it there as well. Thanks!

I would like to know: in your experience, did model training yield different results with scenario 2 compared with scenario 1?

When you return a dict from the training_step() method, PyTorch Lightning looks specifically for the "loss" key in the dict and uses only that tensor for the backward pass during automatic optimization.
It's not stated very clearly, but you can also find it here in the docs.
So there shouldn't be any difference with regard to optimization between your two scenarios.
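Conceptually, the handling of training_step's return value boils down to something like the following sketch (a simplification for illustration, not Lightning's actual source; the function name extract_loss is made up):

```python
import torch

def extract_loss(training_step_output):
    """Mimic how automatic optimization picks the tensor to backprop."""
    if isinstance(training_step_output, torch.Tensor):
        return training_step_output          # scenario 1: a bare loss tensor
    if isinstance(training_step_output, dict):
        return training_step_output["loss"]  # scenario 2: only the "loss" key
    raise ValueError("training_step must return a Tensor or a dict with a 'loss' key")

# In both scenarios the very same tensor reaches loss.backward();
# any extra entries such as "metric" are ignored for backprop.
loss = torch.tensor(1.0, requires_grad=True)
assert extract_loss(loss) is loss
assert extract_loss({"loss": loss, "metric": torch.tensor(0.5)}) is loss
```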

@NilsB , thanks for the amazing find. This is exactly what I wanted to know.

So, as you said and as mentioned in your link, there should be no difference in training between scenario 1 and scenario 2.