Compute the whole gradient of a mini-batch using the accumulated gradient of small mini-batches


My question is whether the full gradient of a mini-batch can be computed by accumulating the gradients of smaller mini-batches (I call this technique "batch partition"; I do not know if that is the common name). As explained here, Methods 1 and 2 should give the same results, right? I ran some tests, which you can find in this colab. As you can see, the results are different. I also tested a modified version that divides the accumulated gradient by the number of partitions of the mini-batch, but it does not help.

Is this behaviour expected? If it is not, can you please give me a hint about what I am doing wrong? If it is expected, can you help me understand why? As far as I understand, the results should be the same...


The default loss reduction seems to be mean [docs]. Perhaps setting it to sum and dividing by the total batch size in both cases will give you the same gradient?
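A minimal sketch of that suggestion, in case it helps. The model, data, and partition sizes below are hypothetical and are not taken from the colab; the sketch also assumes the model has no batch-dependent layers (such as batch normalization) and no randomness in the forward pass (such as dropout), since either would make the two methods differ for other reasons.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: a small linear model and a random batch of 8 samples.
# float64 is used only to keep rounding differences small.
model = nn.Linear(4, 1).double()
x = torch.randn(8, 4, dtype=torch.float64)
y = torch.randn(8, 1, dtype=torch.float64)
loss_fn = nn.MSELoss(reduction="sum")  # sum instead of the default "mean"
total = x.shape[0]                     # total batch size


def flat_grads(module):
    """Return a copy of all parameter gradients as one flat tensor."""
    return torch.cat([p.grad.flatten().clone() for p in module.parameters()])


# Method 1: a single backward pass over the full batch.
model.zero_grad()
(loss_fn(model(x), y) / total).backward()
full_grad = flat_grads(model)

# Method 2: accumulate gradients over partitions of unequal size (3 + 5 samples).
# Each partial loss is divided by the TOTAL batch size, not the partition size.
model.zero_grad()
for xb, yb in zip(x.split([3, 5]), y.split([3, 5])):
    (loss_fn(model(xb), yb) / total).backward()  # .backward() adds into .grad
accumulated_grad = flat_grads(model)

print(torch.allclose(full_grad, accumulated_grad))
```

Under these assumptions the two gradients should agree up to floating-point rounding, even though the partitions have different sizes.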

It should be the same, right? I mean, take the numbers 4, 7, 12 and 15. The mean of all of them is the same whether you compute the sum of all of them and divide by the number of elements, or you first compute the mean of the first two numbers, then the mean of the last two numbers, and finally the mean of both results.
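The arithmetic above can be checked with a few lines of plain Python. This is only an illustrative sketch; the `mean` helper and variable names are made up here. It also shows that the equality relies on the two groups having the same size: with unequal groups, the mean of the group means has to be weighted by group size to recover the overall mean.

```python
values = [4, 7, 12, 15]


def mean(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


# Mean over all four numbers: 38 / 4.
overall = mean(values)

# Equal-sized groups: mean of [4, 7] and mean of [12, 15], then their mean.
equal_split = mean([mean(values[:2]), mean(values[2:])])

# Unequal groups: [4] and [7, 12, 15]. The plain mean of the two group means
# no longer matches the overall mean.
unequal_split = mean([mean(values[:1]), mean(values[1:])])

# Weighting each group mean by its group size recovers the overall mean.
weighted_split = (1 * mean(values[:1]) + 3 * mean(values[1:])) / len(values)

print(overall, equal_split, unequal_split, weighted_split)
```

So averaging per-partition means is safe only when every partition has the same number of samples.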

In addition, I am accumulating the gradients: the loss is computed and backpropagated for each small mini-batch, as explained here.

Sorry for insisting, but I would like to know whether this behavior is expected and, if not, how I can solve the issue. I hope someone can help me. Thanks.