I want to train VGG and ResNet from scratch, but these models are too big to fit on a single GPU: I always get CUDA out-of-memory errors with batch sizes of 128 or 256. Is there any way I can train these models? Should I use multiple GPUs, with each GPU processing a smaller batch?
If you have multiple GPUs, you could use e.g. DistributedDataParallel to split the batch across devices, so that each model replica (one per GPU) processes a smaller per-device batch.
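A minimal DDP sketch, assuming you launch it with torchrun (which sets `LOCAL_RANK` etc. for each process); the random `TensorDataset` here just stands in for your real data:

```python
import os
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torchvision.models.resnet50().cuda(local_rank),
                device_ids=[local_rank])
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Dummy data standing in for a real dataset; the sampler shards it
    # across processes, so each GPU only sees its own slice of the batch.
    dataset = TensorDataset(torch.randn(512, 3, 224, 224),
                            torch.randint(0, 1000, (512,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for data, target in loader:
            data = data.cuda(local_rank, non_blocking=True)
            target = target.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()  # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched e.g. as `torchrun --nproc_per_node=4 train.py`, a global batch of 128 becomes 32 samples per GPU.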
Alternatively, you could lower your batch size or use torch.utils.checkpoint to trade compute for memory.
[Recommended] For PyTorch >= 0.4, you can use torch.utils.checkpoint. It runs the model piece by piece: during the forward pass only the inputs to each segment are stored, and the intermediate activations inside a segment are recomputed during backward, trading extra compute for memory.
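A minimal sketch using checkpoint_sequential on VGG-16's convolutional trunk; the segment count of 4 and the dummy input are just for illustration:

```python
import torch
import torchvision
from torch.utils.checkpoint import checkpoint_sequential

model = torchvision.models.vgg16().cuda()
# Input must require grad so the checkpointed segments build a graph.
x = torch.randn(32, 3, 224, 224, device="cuda", requires_grad=True)

# Run model.features (an nn.Sequential) in 4 checkpointed segments:
# only segment boundaries are kept, the rest is recomputed in backward.
features = checkpoint_sequential(model.features, 4, x)
out = model.classifier(torch.flatten(model.avgpool(features), 1))
out.sum().backward()  # activations inside each segment are recomputed here
```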
[Impaired performance] You can also lower your batch size and accumulate gradients: run multiple forward/backward passes on small micro-batches, then take a single optimizer step (sketch below). The downside is that BatchNorm layers still compute their statistics over each small micro-batch; if you use multiple GPUs, NVIDIA Apex's SyncBatchNorm can help correct this by synchronizing the statistics across devices.
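A sketch of gradient accumulation with dummy data; `accum_steps` and the micro-batch size of 16 are arbitrary choices here:

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 8  # 8 micro-batches of 16 ~ effective batch size of 128
optimizer.zero_grad()
for step in range(accum_steps):
    data = torch.randn(16, 3, 224, 224, device="cuda")    # dummy micro-batch
    target = torch.randint(0, 1000, (16,), device="cuda")
    loss = criterion(model(data), target)
    (loss / accum_steps).backward()  # scale so accumulated grads average out
optimizer.step()  # one step for the whole effective batch
```

Note that BatchNorm still only sees 16 samples per forward pass, which is why the synchronized statistics mentioned above can matter.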
[Impaired performance] Training the model in mixed precision is another way to cut memory usage: fp16 activations take half the space of fp32, at some cost in numerical range (sketch below).
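A minimal mixed-precision sketch using torch.cuda.amp (available since PyTorch 1.6; older setups used NVIDIA Apex amp for the same purpose):

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(64, 3, 224, 224, device="cuda")   # dummy batch
target = torch.randint(0, 1000, (64,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():          # ops run in fp16 where safe
    loss = criterion(model(data), target)
scaler.scale(loss).backward()            # scale loss to avoid fp16 underflow
scaler.step(optimizer)                   # unscales grads, then steps
scaler.update()
```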