My doubt is about the possibility of computing the gradient of a whole mini-batch from the accumulated gradients of smaller mini-batches (I call this technique batch partitioning; I do not know if that is the common name). As is explained here, Methods 1 and 2 should give the same results, right? I did some tests that you can find in this colab, and as you can see the results are different. In addition, I tested a modified version that averages the accumulated gradient by the number of partitions of the mini-batch, but it does not help. See the sketch below for what I mean by the two methods.
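Here is a minimal, self-contained sketch of the two methods as I understand them. The toy model, data shapes, and partition count are made up for illustration and are not my actual colab code; I am also assuming a mean-reduced loss. The division by the number of partitions corresponds to the averaged variant I mentioned:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # default reduction="mean"
x = torch.randn(8, 10)
y = torch.randn(8, 1)

# Method 1: one backward pass over the whole mini-batch.
model.zero_grad()
criterion(model(x), y).backward()
grad_full = model.weight.grad.clone()

# Method 2: accumulate gradients over two equal partitions.
model.zero_grad()
for xb, yb in zip(x.chunk(2), y.chunk(2)):
    # Scaling each partition's loss by 1/num_partitions should make the
    # accumulated gradient equal the full-batch gradient when the loss
    # is mean-reduced.
    (criterion(model(xb), yb) / 2).backward()
grad_accum = model.weight.grad.clone()

print(torch.allclose(grad_full, grad_accum, atol=1e-6))
```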
Is this behaviour expected? If it is not, can you please give me a hint about what I am doing wrong? If it is expected, can you help me understand why? As far as I understood, the results should be the same…
It should be the same, right? I mean, if you have the numbers 4, 7, 12, 15, the mean of all of them is the same whether you sum them all and divide by the number of elements, or you first compute the mean of the first two, then the mean of the last two, and then the mean of those two means.
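In Python terms, just restating the arithmetic above:

```python
nums = [4, 7, 12, 15]

# Mean of all four numbers at once.
mean_all = sum(nums) / len(nums)              # 9.5

# Mean of the first half, mean of the second half, then the mean of both.
mean_first = sum(nums[:2]) / 2                # 5.5
mean_last = sum(nums[2:]) / 2                 # 13.5
mean_of_means = (mean_first + mean_last) / 2  # 9.5

print(mean_all == mean_of_means)  # True
```

(Note that this only holds because both halves have the same number of elements; with unequal partitions the mean of means would differ from the overall mean.)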
In addition, I am accumulating the gradients; the loss is computed and backpropagated for each small mini-batch, as is explained here.
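Concretely, my accumulation loop looks roughly like this (the names `model`, `criterion`, `optimizer`, and `partitions` stand in for my actual objects):

```python
def accumulated_step(model, criterion, optimizer, partitions):
    """One optimizer step using gradients accumulated over partitions."""
    optimizer.zero_grad()
    for xb, yb in partitions:
        loss = criterion(model(xb), yb)
        loss.backward()   # .grad buffers accumulate across partitions
    optimizer.step()      # a single update with the accumulated gradient
```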