My doubt is related to the possibility of computing the whole gradient of a mini-batch by accumulating the gradients of smaller mini-batches (I call this technique "batch partition"; I do not know whether there is a common name for it). As explained here, Methods 1 and 2 should give the same results, right? I ran some tests that you can find in this colab, and as you can see the results are different. In addition, I tested a modified version that divides the accumulated gradient by the number of partitions of the mini-batch, but that does not help either.
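For concreteness, here is a minimal sketch of the two methods I am comparing. I am assuming PyTorch and a mean-reduced loss; `model`, `loss_fn`, and the data are placeholders, not my actual colab code:

```python
import torch

# Toy setup (placeholder model and data; MSELoss uses reduction="mean").
torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
x = torch.randn(8, 10)
y = torch.randn(8, 1)

# Method 1: gradient of the whole mini-batch in a single backward pass.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = [p.grad.clone() for p in model.parameters()]

# Method 2: accumulate gradients over k equal partitions of the mini-batch.
k = 4
model.zero_grad()
for xb, yb in zip(x.chunk(k), y.chunk(k)):
    # Dividing by k compensates for the mean being taken over the
    # smaller partition instead of over the whole mini-batch.
    (loss_fn(model(xb), yb) / k).backward()
accum_grad = [p.grad.clone() for p in model.parameters()]

# My understanding is that these should match up to floating-point error;
# in my colab the analogous check fails.
for g1, g2 in zip(full_grad, accum_grad):
    print(torch.allclose(g1, g2, atol=1e-6))
```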
Is this behaviour expected? If it is not, can you please give me a hint about what I am doing wrong? If it is expected, can you help me understand why? As far as I understood, the results should be the same…