How to offload/prefetch feature maps in PyTorch?

I have a 4 GB Quadro M1200 GPU, and I have long wondered how to train large neural nets on it. One idea is to create small batches and treat 3 of them as a single batch.
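To make the small-batch idea concrete, here is a minimal sketch of what I mean, assuming it amounts to gradient accumulation: call `backward()` on each small batch and only step the optimizer every 3 batches. The tiny `nn.Linear` model, the random data, and the names `accum_steps` and `small_batches` are placeholders I made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(8, 2).to(device)  # stand-in for a much larger network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accum_steps = 3  # number of small batches treated as one batch
# Hypothetical data: 6 small batches of 4 samples each.
small_batches = [(torch.randn(4, 8), torch.randn(4, 2)) for _ in range(6)]

optimizer_steps = 0
optimizer.zero_grad()
for i, (x, y) in enumerate(small_batches, start=1):
    x, y = x.to(device), y.to(device)
    # Divide by accum_steps so the accumulated gradient is an average, not a sum.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # gradients are added to .grad across small batches
    if i % accum_steps == 0:
        optimizer.step()       # one update per 3 small batches
        optimizer.zero_grad()
        optimizer_steps += 1
```

This only reduces the memory needed per step by shrinking the batch; it does not help when a single sample's feature maps already exceed GPU memory.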

I came across the post "How to Train a Very Large and Deep Model on One GPU?". A short summary: during the forward pass, the GPU keeps the feature maps in its own memory, and they take up roughly 50-70% of GPU memory. The proposed solution is to move the feature maps to CPU memory during the forward pass and move them back to GPU memory when they are needed during the backward pass.
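To illustrate the mechanism as I understand it, here is a rough sketch, assuming autograd's saved-tensor hooks (`torch.autograd.graph.saved_tensors_hooks`) are a reasonable place to do the move. The functions `pack_to_cpu` and `unpack_from_cpu` and the `offloaded` list are names I made up, the model is a toy, and the copies are synchronous with no prefetching, so this is not the method from the post.

```python
import torch
from torch import nn
from torch.autograd.graph import saved_tensors_hooks

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
offloaded = []  # shapes of tensors that went through the hook (illustration only)

def pack_to_cpu(t):
    # Called in the forward pass whenever autograd saves a tensor for backward.
    offloaded.append(tuple(t.shape))
    return t.device, t.to("cpu")  # remember the original device, keep a CPU copy

def unpack_from_cpu(packed):
    # Called in the backward pass when the saved tensor is needed again.
    original_device, cpu_tensor = packed
    return cpu_tensor.to(original_device)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
x = torch.randn(4, 16, device=device)

# Only tensors saved inside this context go through the hooks.
with saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = model(x).sum()

loss.backward()  # saved tensors are copied back to `device` on demand
```

Note that the hooks see every saved tensor, which can include weights as well as activations, so a real version would probably need to filter. PyTorch also ships `torch.autograd.graph.save_on_cpu(pin_memory=True)`, which appears to do the same kind of pack/unpack; I have not measured how either option performs on a 4 GB card.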

Is there an existing way to implement this in PyTorch, or would I have to build it from scratch?

You can ignore the fact that the later layers need their feature maps back sooner than the earlier layers. Implementation is the focus of this post; optimizations can come after that.