Dear all,
My data is shaped as (9, n_features, n_timesteps). I would like to slice it into 9 parts and feed each part into a simple CNN.
For normal usage:
- Does it make a difference if I backpropagate the 9 losses separately through each CNN, versus summing them into a single loss and backpropagating once?
For negative correlation learning:
- Negative correlation learning encourages diversity among the networks. Would it make a difference if I backpropagate 9 losses versus just 1 loss here?
Thank you!
Best
Hi,
Theoretically, the gradient of a summed loss equals the sum of the gradients of the separate terms (the terms that go into the summed loss).
In PyTorch, too, gradients are accumulated in the .grad attributes across several backward() calls unless you call zero_grad() in between,
so the result should in theory be the same, though you might see small differences due to floating-point precision.
In my experience, those differences are not large enough to matter.
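To make this concrete, here is a minimal sketch (assuming a toy linear model in place of your CNN, and a made-up MSE objective) comparing one backward() on the summed loss against 9 separate backward() calls whose gradients accumulate:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)   # stand-in for the CNN
x = torch.randn(9, 4)           # 9 slices, 4 features each
y = torch.randn(9, 1)

# Case 1: sum the 9 losses, then a single backward call.
model.zero_grad()
loss = sum(F.mse_loss(model(x[i:i + 1]), y[i:i + 1]) for i in range(9))
loss.backward()
grad_summed = model.weight.grad.clone()

# Case 2: 9 separate backward calls; gradients accumulate in .grad
# because zero_grad() is not called in between.
model.zero_grad()
for i in range(9):
    F.mse_loss(model(x[i:i + 1]), y[i:i + 1]).backward()
grad_accumulated = model.weight.grad.clone()

# The two gradients agree up to floating-point precision.
print(torch.allclose(grad_summed, grad_accumulated))
```

Note that neither case calls optimizer.step() in between, which is what keeps the two equivalent.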
Backpropagating 9 losses vs. 1 loss:
I assume what you essentially want to know is which is better, 9 optimizer steps (one after each backward call) or 1 optimizer step (after a single backward call).
This depends on the task at hand; generally, more frequent optimizer steps are preferred because they can speed up training.
So this isn't a PyTorch-specific question. You might want to read up more on negative correlation learning to get a better idea.
Feel free to post further queries.
Best,
S