So, when the first half of each and every one of my training mini-batches is from class C0 and the second half from class C1, my neural net trains up to 60% accuracy and then crashes to a flat 50%! I finally figured out it is because my batches are ordered: for instance, they look like (C0, C0, C0, C1, C1, C1). To resolve this, the input tensor to the model must be shuffled, e.g. (C0, C1, C0, C1, C1, C0). Why would this matter? Doesn't the weight update step stem from the average of the losses of the datapoints in a mini-batch? If that's true, why would the order of classes within a training mini-batch matter so much?
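For concreteness, the in-batch shuffle described above can be sketched like this (with hypothetical `inputs`/`targets` tensors standing in for one ordered mini-batch):

```python
import torch

# Hypothetical stand-ins for one ordered mini-batch whose
# labels come out as (C0, C0, C0, C1, C1, C1).
inputs = torch.randn(6, 4)
targets = torch.tensor([0, 0, 0, 1, 1, 1])

# Shuffle inputs and targets in unison before the forward pass,
# giving an interleaved order such as (C0, C1, C0, C1, C1, C0).
perm = torch.randperm(inputs.size(0))
inputs, targets = inputs[perm], targets[perm]
```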
Because the estimate of the expected gradients is biased if your data is ordered.
Have a look at Chapter 8 - Deep Learning book:
It is also crucial that the minibatches be selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent. We also wish for two subsequent gradient estimates to be independent from each other, so two subsequent minibatches of examples should also be independent from each other.
I can't explain it as well as Ian Goodfellow does, but I understand it from the point of view that you force your model toward a minimum for the current class. All weight updates aim toward the minimum of class0 until you finally provide samples of class1. The minimum for class1 might be slightly different, and if your parameters are already in a "class0 valley", it might be hard for your model to escape it.
That's probably a very shallow view of the training and loss function, but it helps me visualize what could be happening.
@ptrblck, thanks for replying, but I think you misunderstood my question. My DataLoader does NOT load all class0 first and then all class1. Each mini-batch contains class0 AND class1, something like this: (C0, C0, C0, C1, C1, C1). The weights should be updated based on the average loss. However, this doesn't result in proper training; rather, the batches must be shuffled, e.g. (C0, C1, C0, C1, C1, C0). Is PyTorch creating mini-mini-batches from the mini-batches I feed to the model?
Could this have anything to do with number of GPUs (2)?
That might be the reason if you are using BatchNorm as well as 2 GPUs. As far as I know, batch norm statistics are computed locally per GPU. Is that the case for you?
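A quick way to see why the combination matters: `nn.DataParallel` scatters the input along the batch dimension, so an ordered batch sends one class to each GPU, and each replica's BatchNorm then computes its statistics from a single class. A minimal sketch, simulating the per-GPU split with `chunk` (the class means below are made-up illustration values, no GPUs required):

```python
import torch

# Simulate an ordered batch of 64 samples: the first 32 from
# class 0 (features near 0), the last 32 from class 1 (near 10).
torch.manual_seed(0)
c0 = torch.randn(32, 8)          # class-0 features
c1 = torch.randn(32, 8) + 10.0   # class-1 features
batch = torch.cat([c0, c1], dim=0)

# nn.DataParallel splits the input along dim 0 across replicas,
# roughly like chunk(): GPU 0 would see only class 0, GPU 1 only
# class 1, so each replica's BatchNorm normalizes with one-class
# statistics.
gpu0_split, gpu1_split = batch.chunk(2, dim=0)

# After an in-batch shuffle, both splits see a mix of classes and
# their statistics approximate the full-batch statistics.
perm = torch.randperm(batch.size(0))
shuf0, shuf1 = batch[perm].chunk(2, dim=0)
```

The means of `gpu0_split` and `gpu1_split` sit near the two class means, while the shuffled splits both land near the overall mean, which matches the observation that shuffling within the batch fixes the training.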
Another reason is that shuffling the data and then taking a random set of samples as the mini-batch offers some "implicit regularization". This effect is considerably stronger when the mini-batch sizes are very small (e.g. 8, 16, or 32).
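The usual way to get those random mini-batches is to let the loader do the shuffling; a minimal sketch with a toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy balanced dataset: 100 samples with alternating labels 0/1.
ds = TensorDataset(torch.arange(100).float().unsqueeze(1),
                   torch.arange(100) % 2)

# shuffle=True draws a fresh random order every epoch, so each
# small mini-batch is approximately an independent random subset.
loader = DataLoader(ds, batch_size=16, shuffle=True)
xb, yb = next(iter(loader))
```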
I am using batchnorm
If you are curious enough to test this hypothesis, try removing batchnorm from your network and training again to see what happens.
@miladiouss, what's your batch_size? If it's, say, 32 and you are distributing the data across 2 GPUs, the effective batch_size per GPU would be 16. Since that per-GPU batch is small, the statistics of the BatchNorm layers are highly biased toward each mini-batch. This might be the reason.
Try other norms, like SwitchNorm, in case of a small batch_size.
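SwitchNorm isn't shipped with core PyTorch, so as a sketch of the same idea with a built-in layer (a different technique, named plainly): `nn.GroupNorm` normalizes over channel groups within each sample, so its statistics don't depend on the (per-GPU) batch size at all:

```python
import torch
import torch.nn as nn

# GroupNorm computes statistics per sample over channel groups,
# so it behaves identically for batch size 1 or 16 and is
# unaffected by how DataParallel splits the batch across GPUs.
norm = nn.GroupNorm(num_groups=8, num_channels=64)

out_small = norm(torch.randn(1, 64, 4, 4))   # batch size 1 is fine
out_large = norm(torch.randn(16, 64, 4, 4))
```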
Thanks @Vikas_Sangwan and everyone else. My batch size is 64, with 2 classes and 2 GPUs, and it seems the issue comes from batch normalization. For now, I am shuffling my mini-batches so each GPU gets the same number of samples from each class. I will try other normalization methods and post an update at some point. Thanks for the help, everyone.