That’s right. When there are multiple processes on one GPU that each use a PyTorch-style caching allocator there are corner cases where you can hit OOMs, but it’s very unlikely if all processes are allocating memory frequently (it happens when one proc’s cache is sitting on a bunch of unused memory and another is trying to malloc but doesn’t have anything left in its cache to free; if the first one were allocating at all it would hit the limit and know to free its cache). It could be improved, but it’s a lot better than frameworks that commandeer your whole GPU even if they’re only using 100MB…
I have run into a related issue while using the experimental Windows version. In my training phase, CUDA allocates about 4GB for mini-batches while I optimize my params. Then, when I am done and want to predict on a separate dataset using the same mini-batch size, a fresh 4GB is allocated.
To be more precise, when I am done training and nothing but the model should remain on the GPU, I can set a breakpoint and issue these commands (all memory readings come from nvidia-smi):
T = torch.rand(1000,1000000).cuda()  # Now memory reads 8GB (i.e. a further 4GB was allocated, so the training 4GB was NOT considered ‘free’ by the caching allocator, even though it was being reused during training)
del T  # Still 8GB (as expected)
T = torch.rand(1000,1000000).cuda()  # Still 8GB as expected; the caching allocator is reusing the same space as the first T above
So it looks like the 4GB from training are still taking up space on the GPU, even though they should be freed. But later they are being reused (when retraining the same model). I.e. they can be reused for the same purpose but not for arbitrary tensors - which makes no sense to me, of course.
Is there a way to manually force the caching allocator to free some GPU memory? Or, since the caching allocator doesn’t seem to think the space is actually free, can I move my model to the CPU with model.cpu() and then ask torch to free everything it has on the GPU?
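For what it's worth, here is a minimal sketch of what that could look like, assuming a recent PyTorch release where torch.cuda.empty_cache(), torch.cuda.memory_allocated() and torch.cuda.memory_reserved() are available. The helper names release_gpu_memory and gpu_memory_mib are made up for this example. Note that empty_cache() can only return cached blocks that no live tensor is using, so anything still referenced from Python stays on the GPU.

```python
import gc

import torch
import torch.nn as nn


def gpu_memory_mib():
    """Return (allocated, reserved) MiB for the current device, or zeros without CUDA."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    allocated = torch.cuda.memory_allocated() / 2**20  # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**20    # live tensors plus the allocator's cache
    return allocated, reserved


def release_gpu_memory(model: nn.Module) -> nn.Module:
    """Hypothetical helper: move the model to the CPU and release cached GPU blocks."""
    model = model.cpu()           # parameters and buffers now live in host RAM
    gc.collect()                  # drop unreachable Python objects that may still hold tensors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    return model


device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)
print("before:", gpu_memory_mib())
model = release_gpu_memory(model)
print("after: ", gpu_memory_mib())
```

The numbers nvidia-smi shows correspond roughly to the reserved figure plus the CUDA context itself, which is not released by any of this.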
Good call, thanks. We already set our input variables in predict() to volatile=True. My impression is that GPU memory left committed from the training is being ‘hoarded’, and it is that memory that I would like to clear / free / repurpose. (I actually tried setting volatile=False on all my variables in the predict method, but that didn’t fix the memory ‘leak’.)
I realise where I was making a mistake. My model has an LSTM and I’m supposed to pass on a new, empty variable as the hidden state. If I pass on an existing variable, such as the hidden state from the previous timestep, the model backprops all the way back to the first epoch on every epoch of training. This is precisely why the GPU memory kept exploding after every epoch. I now have something like this, and it works fine.
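A rough sketch of the idea, not the poster's actual code: the toy model, shapes and the detach_hidden helper below are all made up for illustration. It shows the variant where the hidden state's values are carried across batches but cut off from the old graph with .detach(); passing None (or a fresh zero tensor) as the hidden state on every batch, as described above, avoids the problem as well.

```python
import torch
import torch.nn as nn


def detach_hidden(hidden):
    """Return the hidden state with its history cut off from the autograd graph."""
    if isinstance(hidden, torch.Tensor):
        return hidden.detach()
    return tuple(detach_hidden(h) for h in hidden)  # nn.LSTM returns an (h, c) tuple


torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

hidden = None  # None makes nn.LSTM start from a zero state
for step in range(3):
    x = torch.randn(4, 5, 8)  # (batch, sequence, features), random stand-in data
    y = torch.randn(4, 1)
    if hidden is not None:
        hidden = detach_hidden(hidden)  # keep the values, drop the old graph
    out, hidden = lstm(x, hidden)
    loss = nn.functional.mse_loss(head(out[:, -1]), y)
    optimizer.zero_grad()
    loss.backward()  # only reaches back to the start of this batch
    optimizer.step()
```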
I have a similar problem to @nikhilweee. I have tried to clean garbage with calls to both torch.cuda.empty_cache() and torch.cuda.ipc_collect(). This works great in my CNN training loop. Reserved memory stays constant for any number of batches. But in the validation loop, the memory use climbs after each iteration. I found that the difference is the loss.backward() statement in the training loop is cleaning out the garbage somehow. Since this isn’t in validation, it just keeps piling up in spite of the calls to torch.cuda.ipc_collect().
Anybody know what’s going on in the backward method?
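One possible explanation, offered as an assumption since the validation code isn't shown: by default loss.backward() frees the intermediate buffers the graph saved during the forward pass, whereas a forward pass that is never followed by backward keeps its graph alive for as long as anything (an accumulated loss tensor, a list of outputs) still references it. A sketch of a validation loop that avoids building the graph at all, using torch.no_grad() and .item(); the validate function and the toy data are made up for the example:

```python
import torch
import torch.nn as nn


def validate(model, batches, criterion):
    """Average the loss over batches without recording an autograd graph."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():  # no graph is built, so nothing is kept for a backward pass
        for x, y in batches:
            loss = criterion(model(x), y)
            total += loss.item() * x.size(0)  # .item() stores a Python float, not a tensor
            count += x.size(0)
    return total / max(count, 1)


model = nn.Linear(3, 1)
batches = [(torch.randn(4, 3), torch.randn(4, 1)) for _ in range(2)]
print(validate(model, batches, nn.MSELoss()))
```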
Hi, torch.cuda.empty_cache() really helps. I recently came across this problem as well. Normally the task needs 1GB of GPU memory, but usage steadily climbed to 5GB. If torch.cuda.empty_cache() was not called, GPU memory usage stayed at 5GB. After calling it, usage dropped to 1-2GB.
I am training an RL project with PyTorch 0.4.1. I am still confused and cannot find the reason for this. I used TF before and there was no such issue.
I was able to free probably all the GPU memory used by tensors by using the following sequence:
model.to('cpu')  # move the model to host RAM, where you probably have more memory
model_RAM_copy = model.state_dict()
delete_tensors()  # this function goes through the model's children and deletes .weight, e.g. del model.layer1.weight; I do not care about .bias, they are much smaller
model = newModel().to('cuda:0')  # recreate the model structure from scratch
criterion = …
optimizer = …
lr_scheduler = …
model.load_state_dict(model_RAM_copy)  # load_state_dict is a method of the model and loads in place
In fact, my code is a little longer, and I think there is some redundancy; I just didn’t take the time to optimize it. Anyway, it works in terms of memory usage (I can see a beautiful line of GPU memory usage going up and down within some stable limits). I am sure this can be done more smoothly, but I had been looking for a solution for a few days and ended up with this. It does not look pretty, but it works. I am not sure about backpropagation correctness or about preserving learning rate statistics. The network learns pretty OK, but it is possible that I am losing some parameters along the way.