It is possible for these two approaches to produce results that differ
by a small floating-point round-off error. The sequence of additions
could be performed in differing orders that would be mathematically
equivalent, but could differ (by a small round-off error) numerically.
Neither approach is better than the other, and they are, in some reasonable
sense, equivalent, even if they produce results that differ.
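The effect is easy to demonstrate in isolation. Here is a small illustrative sketch (plain Python, not from either approach under discussion) that sums the same list of numbers in two mathematically equivalent orders:

```python
import random

# Generate a fixed set of values so the demo is reproducible.
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

# Two mathematically equivalent summation orders.
forward = sum(values)
backward = sum(reversed(values))

print(forward)
print(backward)
# The two sums may not be bit-identical, because floating-point
# addition is not associative; any difference is tiny round-off.
print(abs(forward - backward))
```

The difference, when it appears, is on the order of the machine epsilon scaled by the magnitude of the sum, which is exactly the kind of discrepancy that can nudge two otherwise identical training runs onto slightly different paths.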
I would phrase it like this: By some unlucky happenstance it is possible
that the round-off error in one approach nudges the training into a less
desirable path than in the other. This is a little unlikely, but possible.
But if you rerun your training multiple times starting with, for example,
different random parameter initializations, or select your training batches
with different random samplings from your training set, the two approaches
should produce statistically equivalent sets of results.
Whether some specific bit of round-off error happens to nudge you down a
better or worse training path will get averaged away over multiple training runs.
Thank you for replying. Everything's quite clear now. I also want to think of it that way, but the difference in the first results was quite extreme (a weird unlucky moment, I guess). It took a whole day to rerun my project, and the project also has a stochastic augmentation element to it. But after getting 3 results for each method, I couldn't see any obvious pattern. Both methods have the same performance range. Again, thank you for your time. I really appreciate it.