It is possible for these two approaches to produce results that differ
by a small floating-point round-off error. The sequence of additions
could be performed in differing orders that would be mathematically
equivalent, but could differ (by a small round-off error) numerically.
Neither approach is better than the other, and they are, in some reasonable
sense, equivalent, even if they produce results that differ.
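The effect is easy to demonstrate in isolation. Here is a small illustrative sketch (plain Python, not from either approach under discussion) that sums the same list of numbers in two mathematically equivalent orders:

```python
import random

# Generate a fixed set of values so the demo is reproducible.
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

# Two mathematically equivalent summation orders.
forward = sum(values)
backward = sum(reversed(values))

print(forward)
print(backward)
# The two sums may not be bit-identical, because floating-point
# addition is not associative; any difference is tiny round-off.
print(abs(forward - backward))
```

The difference, when it appears, is on the order of the machine epsilon scaled by the magnitude of the sum, which is exactly the kind of discrepancy that can nudge two otherwise identical training runs onto slightly different paths.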
I would phrase it like this: By some unlucky happenstance it is possible
that the round-off error in one approach nudges the training into a less
desirable path than in the other. This is a little unlikely, but possible.
But if you rerun your training multiple times starting with, for example,
different random parameter initializations, or select your training batches
with different random samplings from your training set, the two approaches
should produce statistically equivalent sets of results.
Whether some specific bit of round-off error happens to nudge you down a
better or worse training path will get averaged away over multiple training runs.
Thank you for replying. Everything's quite clear now. I also want to think of it that way, but the difference in the first results was quite extreme (a weird unlucky moment, I guess). It took a whole day to rerun my project, and the project also has a stochastic augmentation element to it. But after getting 3 results for each method, I couldn't see any obvious pattern. Both methods have the same performance range. Again, thank you for your time. I really appreciate it.