Options to run a model that uses too much memory for the GPU?

I have a model that will use more memory than is available on the GPU.

  • one option could be to train it on the CPU, which sounds slow, though not impossible
  • one option could be to split the model across multiple GPUs, but I’d rather avoid that, for various reasons
  • an option that occurs to me is something like:
    • run part of the forward pass on the GPU
    • move the intermediate results of that part into main memory and free them from GPU memory (keeping only that part's outputs)
    • rinse and repeat for the next part
    • in the backward pass, calculate the gradients flowing back to these outputs
    • move each set of intermediate results back onto the GPU in turn, and pass the relevant gradients through them
    • doable?
    • faster than just running everything on the CPU? (not faster in dev time, I imagine…)
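
For what it's worth, the scheme in the list above (park intermediate results in main memory during the forward pass, bring them back for the backward pass) can be sketched with the `torch.autograd.graph.save_on_cpu` context manager, assuming a PyTorch version that ships it (1.10 or later, as far as I know). The model, layer sizes, and batch below are made up for illustration, and whether this beats pure CPU training depends on how much time the host/device copies take.

```python
import torch
import torch.nn as nn

# Use the GPU when there is one; the sketch also runs on a CPU-only machine.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in model; the sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
).to(device)

x = torch.randn(32, 128, device=device)

# Tensors that autograd saves for the backward pass are copied to main
# memory inside this block, then copied back to `device` when backward
# actually needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=(device == "cuda")):
    loss = model(x).sum()

loss.backward()
grads_ok = all(p.grad is not None for p in model.parameters())
```

The forward and backward results should match an ordinary run; only the place where the saved tensors live in between changes.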


How about using DataParallel? You can split the training batches across multiple GPUs.

I would recommend first finding the smallest batch size that you can use with only 1 GPU. For example, if the smallest batch you can run on a single GPU is N=8, then utilizing 4 GPUs allows you to increase the batch size to N=32.
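
A minimal sketch of that setup, assuming `torch.nn.DataParallel` and a made-up stand-in model (the layer sizes are hypothetical). The total batch is the per-GPU batch times the number of visible GPUs, and the wrapper splits each batch along dimension 0:

```python
import torch
import torch.nn as nn

per_gpu_batch = 8                             # N from the example above
n_gpus = max(torch.cuda.device_count(), 1)    # treat a CPU-only machine as one device
total_batch = per_gpu_batch * n_gpus          # with 4 GPUs this would be 32

model = nn.Linear(128, 10)                    # hypothetical stand-in model
x = torch.randn(total_batch, 128)
if torch.cuda.is_available():
    model = model.cuda()
    x = x.cuda()

# DataParallel replicates the model on every visible GPU and scatters
# the batch across them; with no GPU it simply calls the wrapped module.
model = nn.DataParallel(model)
out = model(x)
```

Note that every GPU still holds a full copy of the model, so this raises the usable batch size but does not help if the model alone does not fit on one GPU.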

Ok. How would I use DataParallel to split training across a single GPU?

No, there is no point in doing that if you only have 1 GPU. In one of your options you were talking about multiple GPUs, so I assumed you had access to multiple GPUs.

Have you tried torch.utils.checkpoint? That will trade off a bit of speed for memory :slight_smile:
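
A rough sketch of how that could look, assuming a PyTorch version where `checkpoint` accepts the `use_reentrant` argument (1.11 or later, I believe). The blocks and sizes are hypothetical:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical model, written as a list of blocks so each one can be checkpointed.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(4)]
).to(device)

x = torch.randn(16, 256, device=device, requires_grad=True)

h = x
for block in blocks:
    # Activations inside `block` are not kept after the forward pass;
    # they are recomputed from the block's input during backward.
    h = checkpoint(block, h, use_reentrant=False)

h.sum().backward()
```

Only the inputs to each block stay in memory between the forward and backward passes, at the cost of running each block's forward computation a second time.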

Thanks! Will take a look :slight_smile: