Train large models on GPUs

I want to train VGG and ResNet from scratch, but these models are too big to fit on a single GPU. I always get CUDA out-of-memory errors with batch sizes of 128 or 256. Is there any way I could train these models? Should I use multiple GPUs, with each GPU processing a smaller batch?

Thanks in advance for any solutions.

If you have multiple GPUs, you could use e.g. DistributedDataParallel to chunk the batch so that each model replica (and device) processes a smaller batch size.
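For illustration, here is a minimal DistributedDataParallel sketch. So that it runs anywhere, it launches a single process on CPU with the `gloo` backend; in real multi-GPU training you would launch one process per GPU (e.g. via `torchrun`) and use the `nccl` backend, with each process loading its own shard of the data so the per-GPU batch stays small:

```python
# Minimal DDP sketch: single process, CPU, "gloo" backend for portability.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Wrap the model; DDP averages gradients across processes during backward.
model = DDP(nn.Linear(10, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# In a real run this would be the per-process shard of the batch.
x, y = torch.randn(8, 10), torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
loss_value = loss.item()

dist.destroy_process_group()
```

With N processes, each one sees 1/N of the global batch, so a global batch of 128 only needs 128/N samples' worth of memory per device.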

Alternatively, you could lower your batch size or use torch.utils.checkpoint to trade compute for memory.

  • [Recommended] For PyTorch >= 0.4, you can use torch.utils.checkpoint. It computes the model piece by piece, recomputing intermediate activations during the backward pass instead of storing them.

  • [Impaired performance] You can also lower your batch size and run multiple forward passes per backward step (gradient accumulation). If you use multiple GPUs, NVIDIA Apex's SyncBatchNorm may help correct the batch-norm statistics that would otherwise be computed on small per-GPU batches.

  • [Impaired performance] Training in mixed precision is another way to reduce memory usage.
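The "multiple forward, one backward" idea is gradient accumulation. A minimal sketch, using a toy linear model as a stand-in for your network: run several small micro-batches, scale each loss by the number of accumulation steps so the gradients average correctly, and only then take one optimizer step.

```python
# Gradient accumulation: several small forward/backward passes, one
# optimizer step, simulating a larger batch in the same amount of memory.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                 # stand-in for VGG/ResNet
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4                          # 4 micro-batches of 8 ~ one batch of 32

opt.zero_grad()
for step in range(accum_steps):
    x, y = torch.randn(8, 10), torch.randn(8, 2)   # stand-in micro-batch
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()      # gradients accumulate in .grad
opt.step()                               # one update for the whole "big" batch
opt.zero_grad()
```

Note that BatchNorm statistics are still computed per micro-batch, which is why the SyncBatchNorm suggestion above matters when the per-step batch gets very small.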

@ptrblck, @Eta_C can you elaborate more on checkpoint, and do you have an example of how to code it?


Here is a notebook showing an example usage (it's quite old by now, but should still show the proper usage). :wink:
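As a quick reference, a minimal sketch of torch.utils.checkpoint on a toy model (the blocks here are placeholders for the expensive stages of your network; `use_reentrant=False` is the variant recommended in recent PyTorch versions):

```python
# Checkpointed segments do not store activations during forward; they are
# recomputed during backward, trading extra compute for lower memory use.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(10, 64), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.head = nn.Linear(64, 2)

    def forward(self, x):
        # Checkpoint the expensive middle blocks piece by piece.
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)

model = Net()
x = torch.randn(8, 10, requires_grad=True)
loss = model(x).sum()
loss.backward()   # block1/block2 activations are recomputed here
```

The more blocks you wrap, the less activation memory the forward pass holds at once, at the cost of roughly one extra forward computation per checkpointed segment.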
