Why does merging all losses in a batch into one value make sense?

I built a vanilla GRU net to solve a binary classification problem, and I train it with a batch size greater than 1.

Let’s say my NN model got a correct result on sample-0 and an incorrect result on sample-1 in the same batch during training.

When torch.nn.BCELoss (for example) is used to calculate the loss, it simply reduces all of the losses to a scalar (not a matrix), and then this scalar loss (sum or mean) is propagated backward through all the neurons of the net, and every weight is adjusted according to this one scalar.

But in my view, the correct classification behavior should be encouraged and the incorrect behavior punished, instead of blindly adjusting all the weights according to a single loss value.

What magic does PyTorch do in the background?

I’m a freshman in NN and DL. Thanks for your patience!


To get an intuition, I would suggest that you work through the mathematical equations of how these losses are calculated.

The loss of a mini-batch is an estimate of the expectation of the loss with respect to the true data-generating distribution. In this manner, mini-batches of larger size provide (in general) a more accurate estimate of true gradient descent, which would take the entire dataset into account for each step. Clearly it’s pretty inefficient to compute over the entire dataset, hence the mini-batch approach. It isn’t “blindly” adjusting the weights according to a single value; it’s taking samples from the population and seeing what overall direction would minimize the loss. This is more of a math/optimization question than a PyTorch-specific question. By default, PyTorch takes the mean of the loss over the batch elements, not the sum.
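To see the reduction concretely, here is a minimal sketch (the probabilities and targets are made-up numbers): `BCELoss` with `reduction='none'` keeps one loss per batch element, and the default `reduction='mean'` is just their average.

```python
import torch

probs = torch.tensor([0.9, 0.2])    # model outputs after sigmoid
targets = torch.tensor([1.0, 1.0])  # sample-0 nearly correct, sample-1 badly wrong

per_sample = torch.nn.BCELoss(reduction='none')(probs, targets)
mean_loss = torch.nn.BCELoss(reduction='mean')(probs, targets)  # the default

# the scalar loss is exactly the mean of the per-element losses,
# and the wrong prediction contributes a much larger term
assert torch.isclose(mean_loss, per_sample.mean())
assert per_sample[1] > per_sample[0]
```

So the per-sample information is not thrown away before the reduction; it is all present in the computation graph that `backward()` walks through.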

You can experiment with a single batch element and you will see that it might take more epochs to converge and may give worse performance on the test set. Gradients from a single example are noisy. Statistics tends to reward more samples, and I would argue that deep learning is a kind of applied statistics. The relationship between batch size, learning rate, and optimizer is unfortunately a matter of trial and error, in my experience. There are papers giving optimal choices for simple/distinct scenarios, but it’s hard to generalize concretely. For classification I would say try to use the biggest batch size you can. Also, in your example the correct classification would be rewarded with a small gradient, while the incorrect prediction would be punished with a large gradient, in terms of the norm.

Thanks for your help!

Indeed I have experimented with a single sample per batch (i.e. batch size = 1), and I got a similar training result.
Of course it needs much more time to run through all the data, because it cannot take advantage of the parallelism of the GPU.

As to the speed of convergence you mentioned, I have not observed an obvious difference compared to a larger batch size.

I have this doubt because the loss of a batch is a single scalar value; it cannot serve as a flag indicating which classification (prediction) was good and which was bad.

For example, suppose there is only one neuron in our net, and its output has a positive derivative with respect to its weight. If it classifies sample-0 correctly, the weight of this neuron should be increased (or at least left unchanged).
But at the same time, in the same batch, it gets an incorrect result on sample-1, which produces a larger loss value. Then the weight would be decreased regardless of whether each classification was correct. Isn’t this the opposite of what we want?

Can you give me an example of your network? Is it basically just logistic regression (i.e. a linear layer with n inputs and 1 output, followed by a sigmoid operation, trained with binary cross entropy)?
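On the doubt above: reducing to a scalar does not mean the per-sample directions are lost. Because gradients are linear, the gradient of the batch-mean loss is exactly the mean of the per-sample gradients, so the correct sample still pushes the weight in its own direction; it is merely averaged with the others. A minimal sketch verifying this with autograd (the inputs, labels, and weight are made-up numbers):

```python
import torch

def loss_fn(w, x, y):
    """BCE loss of a single-weight logistic model on one sample."""
    return torch.nn.functional.binary_cross_entropy(torch.sigmoid(w * x), y)

w = torch.tensor([0.5], requires_grad=True)
samples = [(torch.tensor([2.0]), torch.tensor([1.0])),    # sample-0
           (torch.tensor([-1.0]), torch.tensor([0.0]))]   # sample-1

# gradient of each sample's loss on its own
per_sample_grads = []
for x, y in samples:
    w.grad = None
    loss_fn(w, x, y).backward()
    per_sample_grads.append(w.grad.clone())

# gradient of the batch-mean loss (what BCELoss's default reduction does)
w.grad = None
batch_loss = sum(loss_fn(w, x, y) for x, y in samples) / len(samples)
batch_loss.backward()

# the batch gradient is the mean of the per-sample gradients
expected = sum(per_sample_grads) / len(per_sample_grads)
assert torch.allclose(w.grad, expected)
```

So if the gradient from sample-0 is near zero (already correct) and the one from sample-1 is large, the averaged update is dominated by the mistake, which is exactly the "punish the incorrect behavior" effect you were asking about.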