Dear all,
My data is shaped as (9, n_features, n_timesteps). I would like to slice it into 9 parts and feed each part into a simple CNN.
For normal usage:
- Does it make a difference if I backpropagate the 9 losses separately through each CNN, versus summing them into a single loss and backpropagating once?
For negative correlation learning:
- Negative correlation learning encourages diversity among the networks. Would it make a difference if I backpropagate 9 losses versus just 1 loss here?
Thank you!
Best
Hi,
Theoretically, the gradient of a summed loss equals the sum of the gradients of the separate terms (the terms that go into the summed loss).
In PyTorch, too, gradients are accumulated in the .grad attributes across several backward() calls unless you call zero_grad() in between,
so the result should in theory be the same, though you might see small differences due to floating-point precision.
In my experience, those differences are not large enough to matter.
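To make this concrete, here is a minimal sketch (assuming a toy linear model in place of your CNN, and a made-up MSE objective) comparing one backward() on the summed loss against 9 separate backward() calls whose gradients accumulate:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)   # stand-in for the CNN
x = torch.randn(9, 4)           # 9 slices, 4 features each
y = torch.randn(9, 1)

# Case 1: sum the 9 losses, then a single backward call.
model.zero_grad()
loss = sum(F.mse_loss(model(x[i:i + 1]), y[i:i + 1]) for i in range(9))
loss.backward()
grad_summed = model.weight.grad.clone()

# Case 2: 9 separate backward calls; gradients accumulate in .grad
# because zero_grad() is not called in between.
model.zero_grad()
for i in range(9):
    F.mse_loss(model(x[i:i + 1]), y[i:i + 1]).backward()
grad_accumulated = model.weight.grad.clone()

# The two gradients agree up to floating-point precision.
print(torch.allclose(grad_summed, grad_accumulated))
```

Note that neither case calls optimizer.step() in between, which is what keeps the two equivalent.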
Backpropagating 9 losses vs. 1 loss:
I assume what you essentially want to know is which is better, 9 optimizer steps (one after each backward call) or 1 optimizer step (after a single backward call).
This depends on the task at hand; generally, more frequent optimizer steps are preferred because they can speed up training.
So this isn't a PyTorch-specific question. You might want to read up more on negative correlation learning to get a better idea.
Feel free to post further queries.
Best,
S