Hi, I have a very big model that consists of different modules, and I can't reduce the mini-batch size. I also only have one GPU, so what are my options here? On this forum I found out about gradient accumulation. Will it help me? I would prefer to keep the normal SGD computation, though. Is there any way to transfer some tensors and variables to the CPU between modules, then transfer them back to the GPU one by one when I want to backpropagate, or to do the backpropagation on the CPU?
I think the easiest way would be to use torch.utils.checkpoint to trade memory for compute.
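A minimal sketch of what that could look like (the model and module names here are made up for illustration): checkpointed sub-modules don't store their intermediate activations during the forward pass; they are recomputed during backward, which lowers peak memory at the cost of extra compute.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class BigModel(nn.Module):
    """Hypothetical model with a large block wrapped in checkpointing."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.block2 = nn.Linear(128, 10)

    def forward(self, x):
        # Activations inside block1 are recomputed in backward
        # instead of being kept in GPU memory.
        # use_reentrant=False is the variant recommended in recent
        # PyTorch releases.
        x = checkpoint(self.block1, x, use_reentrant=False)
        return self.block2(x)

model = BigModel()
x = torch.randn(4, 128)
out = model(x)
out.sum().backward()
```

For purely sequential models, `torch.utils.checkpoint.checkpoint_sequential` can split the model into segments for you.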
Manually pushing the values to the host and back to the device would probably make your training really slow.
Accumulating gradients won’t help here: it only saves memory if you split the batch into smaller chunks and run them one at a time, and since you can’t reduce the mini-batch size, each forward/backward pass would still need the same memory.
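That said, if you do want to experiment with offloading saved activations to the CPU, recent PyTorch versions (1.10+) provide `torch.autograd.graph.save_on_cpu`, which does the host/device transfers for you automatically; expect a slowdown, as mentioned above. A minimal sketch with a made-up model:

```python
import torch
import torch.nn as nn

# Hypothetical small model; in practice this would be your big model on GPU.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
x = torch.randn(8, 64)

# Tensors saved for backward are moved to the CPU during the forward
# pass and copied back to their original device during backward.
# On a CUDA machine, pin_memory=True speeds up the transfers.
with torch.autograd.graph.save_on_cpu(pin_memory=False):
    out = model(x)

out.sum().backward()
```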