I’m currently training a Faster R-CNN model. Normal training consumes ~1900 MiB of GPU memory, but when I try to resume training from a checkpoint with torch.load, the model takes over 3000 MiB, with identical settings specified in the config file.
Yes. I always follow the best practice to save and load the state_dict.
I found a related issue here. It says torch.cuda.empty_cache() might help, but in my case I still get an OOM.
By the way, I’m using PyTorch 0.3.1.
Yes. With map_location=lambda storage, loc: storage, the tensors in the checkpoint start out in CPU memory.
However, I suspect load_state_dict internally casts the tensors to the device of the corresponding model parameters, and checkpoint still holds references to the cast tensors. I haven’t really traced it down.
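The loading pattern discussed above can be sketched as follows. The Linear model and the model.ckpt file name are placeholders standing in for the Faster R-CNN checkpoint in the thread; the map_location lambda and the del + empty_cache() cleanup are the parts that matter:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the Faster R-CNN model in the thread.
model = nn.Linear(4, 2)
torch.save({'state_dict': model.state_dict()}, 'model.ckpt')

# Map every storage to CPU so torch.load does not allocate a second,
# GPU-resident copy of the weights alongside the live model.
checkpoint = torch.load('model.ckpt',
                        map_location=lambda storage, loc: storage)
model.load_state_dict(checkpoint['state_dict'])

# Drop the last reference to the checkpoint, then release cached blocks
# back to the driver (only meaningful when a GPU is present).
del checkpoint
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```

With this, the checkpoint tensors never touch the GPU; load_state_dict copies their values into the already-allocated parameters.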
Hi roytseng, I face the same problem on 0.3.1, and it is still blocking me. I have tried del + torch.cuda.empty_cache(), but it doesn’t work in my case. I noticed your comment “pytorch 0.3.1 has a bug on this, it’s fix in master”: could you explain more? Links to the original issues or commits would be really helpful (I have checked the commits about optimizers but could not find it). Thanks in advance!
I don’t really remember whether the bug in the optimizer’s load_state_dict is related to the memory-usage increase (I guess it’s not). However, I’m sure that bug has been fixed in PyTorch 0.4.
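For reference, a common workaround from that era for optimizer state ending up on the wrong device after load_state_dict was to move the state tensors to the parameters’ device by hand. A minimal sketch, where the Linear model, SGD optimizer, and file name are all illustrative (the loop is a no-op here on CPU):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one training step so the optimizer actually has state tensors
# (the momentum buffers) to round-trip through a checkpoint.
model(torch.randn(1, 4)).sum().backward()
opt.step()

# Save and reload the optimizer state via a CPU-mapped checkpoint.
torch.save({'optimizer': opt.state_dict()}, 'opt.ckpt')
ckpt = torch.load('opt.ckpt', map_location='cpu')
opt.load_state_dict(ckpt['optimizer'])

# Workaround: push every optimizer state tensor onto the same device
# as the model parameters.
device = next(model.parameters()).device
for state in opt.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.to(device)
```

Recent PyTorch versions cast optimizer state to the parameters’ device inside load_state_dict, so the manual loop is only needed on old releases like 0.3.x.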
Hi all,
Have you been able to fix the problem?
I am experiencing the same problem and I am using PyTorch 0.5, so the newer versions do not seem to solve it. I am using:
if load:
    checkpoint = torch.load('./model.ckpt')
    startEpoch = checkpoint['StartEpoch']
    model.load_state_dict(checkpoint['state_dict'])
    del checkpoint
    torch.cuda.empty_cache()
but the problem persists. The OOM occurs at the call to
loss.backward()
However, the model is the same, and I am not loading anything other than the model and the epoch number. I have to reduce the batch size, which is very annoying.
Thanks,
Dani.
Worked for me!
It seems to have saved around 500 MiB in my case!
I was able to save but not load in PyTorch. The checkpoint definitely took up valuable GPU memory.
This thread is really old, but I have to report that this can still be an issue with PyTorch 1.8.1. For example, I had to lower the model’s batch size from 64 to 40 just to be able to resume from a checkpoint, which is quite a performance hit. This method works beautifully and is still relevant, cheers.
Edit: I can also confirm that map_location='cpu' works.
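The string form mentioned in the edit above is equivalent to the lambda storage, loc: storage idiom: every tensor in the checkpoint stays on the CPU. A tiny self-contained sketch (the file name and tensor are made up):

```python
import torch

# map_location='cpu' keeps every checkpoint tensor on the CPU,
# so loading never allocates GPU memory for the saved weights.
torch.save({'w': torch.arange(3.0)}, 'tiny.ckpt')
ckpt = torch.load('tiny.ckpt', map_location='cpu')
print(ckpt['w'].device)  # cpu
```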
Still an issue. My model takes 31000 MiB, and I have a 32 GB GPU. If I start training from scratch, everything works; if I resume from a checkpoint, I get an OOM.