Very Large model on single GPU

Hi, I have a very large model that consists of several modules, and I can't reduce the mini-batch size. I also only have one GPU. What are my options here? On this forum I found out about gradient accumulation; will it help me? I'd prefer to keep doing normal SGD updates. Is there a way to move some tensors and variables to the CPU between modules, and then during backpropagation either transfer them back to the GPU one by one, or run the backward pass on the CPU?



I think the easiest way would be to use torch.utils.checkpoint to trade memory for compute.
Manually pushing the values to the host and back to the device would probably make your training really slow.
Accumulating gradients won’t help here, since each forward pass with the full mini-batch would still need the same amount of memory.
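
A minimal sketch of activation checkpointing with `torch.utils.checkpoint.checkpoint_sequential` (the model and sizes below are made up for illustration): activations inside each checkpointed segment are freed after the forward pass and recomputed during backward, trading extra compute for a lower peak memory footprint.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical "large" model: a stack of identical linear blocks.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
)

x = torch.randn(4, 1024, requires_grad=True)

# Split the sequential model into 4 segments; only the segment boundaries
# keep their activations, everything in between is recomputed in backward.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```

For non-sequential architectures you can wrap individual sub-module calls with `torch.utils.checkpoint.checkpoint` instead.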


Check out Large Model Support for PyTorch:


