I want to train VGG and ResNet from scratch. These models are too big to fit on a single GPU, and I always get CUDA out of memory errors for batch sizes of 128 or 256. Is there any way I could train these models? Should I use multiple GPUs, with each GPU processing a smaller batch?
Thanks in advance for any solutions.
If you have multiple GPUs, you could use e.g.
DistributedDataParallel to chunk the batch so that each model replica (and device) processes a smaller batch.
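A minimal sketch of the DistributedDataParallel setup (the tiny `Linear` model, tensor shapes, and port are placeholders for illustration; a real run would use your VGG/ResNet and launch one process per GPU, e.g. via `torchrun`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Under torchrun these env vars are set per process; the defaults
    # below let the sketch also run standalone as a single CPU process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")  # arbitrary free port
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    if not dist.is_initialized():
        dist.init_process_group(backend)
    rank = dist.get_rank()

    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(10, 2).to(device)  # stand-in for VGG/ResNet
    ddp_model = DDP(model, device_ids=[rank] if torch.cuda.is_available() else None)

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    # Each process only sees its own shard: with N GPUs and a per-process
    # batch of 32, the effective global batch size is 32 * N.
    x = torch.randn(32, 10, device=device)
    y = torch.randn(32, 2, device=device)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()  # gradients are all-reduced across processes here
    opt.step()
    return loss.item()

if __name__ == "__main__":
    main()
```

In a real training script you would also use a `DistributedSampler` in your `DataLoader` so each rank gets a disjoint slice of the dataset.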
Alternatively, you could lower your batch size or use
torch.utils.checkpoint to trade compute for memory.
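For `torch.utils.checkpoint`, a small self-contained sketch using `checkpoint_sequential` (the toy `Sequential` model here is just a stand-in for a deep network like VGG; `use_reentrant=False` assumes a reasonably recent PyTorch):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Stand-in for a deep model; the layer sizes are made up for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

x = torch.randn(16, 64, requires_grad=True)

# Run the forward pass in 2 segments: intermediate activations inside
# each segment are discarded after the forward pass and recomputed
# during backward, trading extra compute for lower peak memory.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
```

The memory savings grow with model depth, since only the segment boundaries (plus the segment currently being recomputed) need to keep activations alive.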
@ptrblck, @Eta_C can you elaborate more on checkpoint, and do you have an example of how to code it?
Here is a notebook with an example (it's quite old by now, but should still show the proper usage).