'model.eval()' vs 'with torch.no_grad()'

@ptrblck Wonderful, I had never thought about the effectiveness of shuffling the training data in that way.

So if we shuffle the data, we need a batch size large enough to give good estimates of the mean and variance for the running statistics, right?

If I have limited GPU memory, I can keep the performance by accumulating gradients for the weight updates and using synchronized batch norm (as in this comment). My question is: what should I do about batch norm if I only have 1 GPU?
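To make sure I understand the gradient accumulation part, here is a rough sketch of what I mean (toy model, loss, and data just for illustration; the `accumulation_steps` value is arbitrary). As far as I can tell, batch norm still computes its statistics over each small micro-batch, which is exactly what I am unsure about:

```python
import torch
import torch.nn as nn

# Toy model containing a BatchNorm layer (just for illustration)
model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10), nn.Linear(10, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 4  # effective batch size = micro_batch_size * accumulation_steps

# Fake micro-batches of size 8 standing in for a real DataLoader
data = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(data):
    outputs = model(inputs)  # BatchNorm stats are still computed on this micro-batch only
    # Scale the loss so the accumulated gradient matches one large batch
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```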

Another option I have found is Group Norm, but pretrained models using Group Norm are not very common.
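For example, if I went that route, I assume I could swap the BatchNorm layers for `nn.GroupNorm` with a helper along these lines (just a sketch, `num_groups=32` is an arbitrary choice and has to divide the channel count), though I would then lose the pretrained batch norm weights:

```python
import torch.nn as nn

def replace_bn_with_gn(module, num_groups=32):
    """Recursively replace BatchNorm2d layers with GroupNorm (sketch only)."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            # num_groups must evenly divide the number of channels
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            replace_bn_with_gn(child, num_groups)
    return module
```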