I am training ResNet-50 with a large batch size and the Adam optimizer, and I have a reason for this: model quality. For my loss function, the larger the batch size, the better the resulting model. As far as I know, Adam uses about 4x more GPU memory than plain SGD. Still, I want to use Adam together with a large batch size (128 or more). Currently the most I can run without failure is a batch size of 32 with SGD. Is there any way in PyTorch to use CPU memory (for example, saving tensors to the CPU and restoring them when needed) so that I can train with a large batch size and a more complex optimizer?