Huge difference in gradient when performing backpropagation sample by sample vs. on the whole batch

Hi,
I noticed that when working with a "big" network such as VGG16, there is a large difference between the gradient magnitude I get when performing backpropagation sample by sample and the one I get when computing it on the whole batch at once.

I don't expect this difference, since I'm using the cross-entropy loss function with the default reduction='mean' option. Even if the option were set to 'sum', since I used a batch size of 30 I would have expected a difference of roughly that order of magnitude (~30x), whereas the observed gap between the two cases is a factor of ~100,000.

Even a batch of 2 elements shows a huge difference with respect to a batch of a single element.

Why do I observe such a gap between the two cases? Could it be related to the presence of nn.BatchNorm2d?
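
For concreteness, here is a minimal sketch of this kind of comparison (the model, batch size, and the 64x64 inputs are illustrative, not my exact setup):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16_bn

torch.manual_seed(0)
model = vgg16_bn(num_classes=10)
model.train()  # BN normalizes with batch statistics in training mode
criterion = nn.CrossEntropyLoss()  # default reduction='mean'

# Illustrative data; 64x64 instead of 224x224 keeps the sketch cheap.
x = torch.randn(30, 3, 64, 64)
y = torch.randint(0, 10, (30,))

# (1) Gradient from the whole batch in a single backward pass.
model.zero_grad()
criterion(model(x), y).backward()
batch_grad = model.features[0].weight.grad.clone()

# (2) Mean of the per-sample gradients, accumulated one sample at a time.
model.zero_grad()
for i in range(len(x)):
    loss = criterion(model(x[i:i + 1]), y[i:i + 1])
    (loss / len(x)).backward()  # accumulate the running mean
sample_grad = model.features[0].weight.grad.clone()

# Without BatchNorm these would match; with it they can differ a lot,
# since each single-sample pass normalizes with different statistics.
print((batch_grad - sample_grad).abs().max())
```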

Personally, I would expect that with reduction='mean' you would have to take the mean over the gradients you get from the individual backpropagations.
If you use VGG, I’m assuming this is VGG without batch norm?
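
As a quick sanity check of that expectation, here is a minimal sketch on a model without batch norm (a plain linear layer, purely for illustration), where the two gradients do agree:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 3)
criterion = nn.CrossEntropyLoss()  # reduction='mean'
x, y = torch.randn(4, 8), torch.randint(0, 3, (4,))

# Whole-batch gradient.
model.zero_grad()
criterion(model(x), y).backward()
g_batch = model.weight.grad.clone()

# Mean of the per-sample gradients.
grads = []
for i in range(4):
    model.zero_grad()
    criterion(model(x[i:i + 1]), y[i:i + 1]).backward()
    grads.append(model.weight.grad.clone())
g_mean = torch.stack(grads).mean(0)

print(torch.allclose(g_batch, g_mean, atol=1e-6))  # True
```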

Best regards

Thomas


Hi Tom, thank you for your comment! I was also expecting this.
My VGG actually has batch norm in its architecture, so I think that is the issue.
One question: does BatchNorm give good results with a very small batch size (different from 1), like batch_size=2?

Does BatchNorm give good results with a very small batch size (different from 1), like batch_size=2?

No. BatchNorm needs rather large batches because it essentially assumes that the current batch statistics are a reasonable estimate of the overall dataset statistics.
People have tried to mitigate this by modifying BN's training mode to "update the statistics, then use them as in eval mode" (sketched below), but this is quite finicky, in particular at the beginning of training.
This is why everyone is glad that transformers use LayerNorm instead. 🙂
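
Very roughly, that "use the running statistics in training mode" variant could look something like this (an illustration only, subclassing BatchNorm2d; not a tested drop-in replacement):

```python
import torch
import torch.nn as nn

class RunningStatsBN2d(nn.BatchNorm2d):
    """BatchNorm2d variant: update running stats, then normalize with them."""
    def forward(self, x):
        if self.training:
            with torch.no_grad():
                # Per-channel batch statistics over (N, H, W).
                mean = x.mean(dim=(0, 2, 3))
                # Unbiased variance, matching BN's running-stat convention.
                var = x.var(dim=(0, 2, 3), unbiased=True)
                m = self.momentum if self.momentum is not None else 0.1
                self.running_mean.mul_(1 - m).add_(mean, alpha=m)
                self.running_var.mul_(1 - m).add_(var, alpha=m)
        # Normalize with the (just-updated) running stats, as eval mode would.
        return nn.functional.batch_norm(
            x, self.running_mean, self.running_var,
            self.weight, self.bias, training=False, eps=self.eps,
        )
```

For tiny batches, nn.GroupNorm is another common alternative, since like LayerNorm it does not depend on batch statistics at all.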

Best regards

Thomas
