Cuda memory leak on training - reproducible example

I’ve looked through the docs and forums, but haven’t figured out why the code below is leaking memory during training. It does not leak when I only call forward, and I am deleting all intermediate variables, calling gc.collect(), and calling torch.cuda.empty_cache(), but I still accumulate 5GB of garbage after the first batch.

Any tips on how to troubleshoot this further? The notebook should be fully reproducible and uses random data. Many thanks!
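For reference, the per-batch cleanup and tracking I’m doing follows roughly this pattern (a minimal sketch with a placeholder linear model and random batches, not the actual notebook code):

```python
# Minimal sketch of the per-batch memory tracking; the tiny model and the
# random batches are placeholders standing in for the real notebook.
import gc
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(1000, 10).to(device)                 # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

mb = lambda b: b / 1024 ** 2
prev = torch.cuda.memory_allocated(device)
print(f"Allocated Memory: {mb(prev)} MB")

for _ in range(5):
    x = torch.randn(64, 1000, device=device)           # random data
    y = torch.randint(0, 10, (64,), device=device)

    out = model(x)
    loss = criterion(out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # delete intermediates and force cleanup before measuring
    del x, y, out, loss
    gc.collect()
    torch.cuda.empty_cache()

    cur = torch.cuda.memory_allocated(device)
    print(f"New allocations: {mb(cur - prev)} MB")
    prev = cur
```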

I’m not sure there is a memory leak; there’s insufficient data to determine this, because your training loop runs only once and then aborts due to high memory use.

When I reduce nfeatures = 15888 to nfeatures = 10000 I get:

```
Allocated Memory: 871.82177734375 MB
New allocations: 2289.31884765625 MB
New allocations: 0.0 MB
New allocations: 0.0 MB
New allocations: 0.0 MB
New allocations: 0.0 MB
New allocations: 0.0 MB
...
```

The first 2GB of new allocations should be expected due to loss.backward().

but I still accumulate 5GB of garbage after the first batch.

In your case it should also be only a one-time thing, but it’s hard to tell because your loop aborts after one iteration. The reason you don’t see it when you only call forward is that the .grad attributes are still None before your first backward call, I’d say.
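Here is a quick sketch of that point (a placeholder layer, not your model): .grad is None before the first backward call, and the one-time jump in allocated memory comes from creating those buffers on the first call.

```python
# .grad does not exist until the first backward call, so the first training
# iteration allocates it once; later iterations reuse the same buffer.
import torch
import torch.nn as nn

layer = nn.Linear(10000, 10000).cuda()                 # placeholder layer
print(layer.weight.grad)                               # None -> no grad memory yet
before = torch.cuda.memory_allocated()

out = layer(torch.randn(16, 10000, device="cuda"))
out.sum().backward()                                   # first backward creates .grad

print(layer.weight.grad.shape)                         # torch.Size([10000, 10000])
print((torch.cuda.memory_allocated() - before) / 1024 ** 2, "MB newly allocated")
```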


Thanks for the kind reply! You are right, it is not a leak per se, but rather a very high memory requirement for Adam. This does make sense, although it was not immediately obvious that Adam would fail for a 2GB graph! Adadelta, Adamax, and Adam required > 9GB and ran out of memory on an AWS P2 instance (K80 w/ 11GB RAM).

Using another optimizer, however, does work. SGD adds 2GB. RMSprop adds about 4GB. Curiously, Adagrad needs only an additional ~4GB as well (although nvidia-smi showed more). I’m surprised that Adagrad can run but Adam/Adamax/Adadelta fail, as I thought their memory requirements were similar in terms of per-parameter alpha. Might this suggest that the implementations for those three are memory inefficient?
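One way to sanity-check this would be to sum the optimizer’s state tensors after one step (a rough sketch with a small placeholder layer, not the notebook’s model): SGD without momentum keeps no extra buffers, Adagrad and RMSprop keep one buffer per parameter, and Adam keeps two.

```python
# Rough sketch: measure how much state each optimizer keeps after one step.
# The small Linear layer is a placeholder, not the notebook's model.
import torch
import torch.nn as nn

def state_mb(opt_cls, **kwargs):
    model = nn.Linear(4096, 4096).cuda()
    opt = opt_cls(model.parameters(), **kwargs)
    model(torch.randn(8, 4096, device="cuda")).sum().backward()
    opt.step()                                         # state buffers created here
    nbytes = sum(t.numel() * t.element_size()
                 for s in opt.state.values()
                 for t in s.values() if torch.is_tensor(t))
    return nbytes / 1024 ** 2

print("SGD    ", state_mb(torch.optim.SGD, lr=0.01))   # no extra buffers
print("Adagrad", state_mb(torch.optim.Adagrad))        # one buffer per parameter
print("RMSprop", state_mb(torch.optim.RMSprop))        # one buffer per parameter
print("Adam   ", state_mb(torch.optim.Adam))           # two buffers per parameter
```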

It could be true that the implementation of Adam is not very memory efficient. Regarding Adadelta vs. Adam, the former keeps only one vector of past gradients, whereas the latter has two parameters, beta1 and beta2, each multiplied with its own vector, so that maybe explains why it needs roughly twice as much memory.

Btw, if you really need/want to use Adam on a net where such a problem occurs, you can try the new checkpointing feature (https://pytorch.org/docs/master/checkpoint.html; an example can be found in the release notes at https://github.com/pytorch/pytorch/releases/tag/v0.4.0 if you scroll to the section “Neural Networks”). I haven’t used it myself yet, and it will make your code slower, but it would be a workaround for now instead of switching to another optimizer.
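For completeness, here is a rough sketch of what checkpoint_sequential usage looks like (the Sequential stack below is just a placeholder, not your model). Activations inside each segment are recomputed during backward instead of being stored, which trades compute for memory:

```python
# Sketch of gradient checkpointing with a placeholder Sequential model.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

blocks = nn.Sequential(
    nn.Linear(10000, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
).cuda()

# The input needs requires_grad=True so gradients flow through the
# checkpointed segments.
x = torch.randn(64, 10000, device="cuda", requires_grad=True)

# Run the model in 2 checkpointed segments; only segment boundaries are kept,
# everything in between is recomputed on the backward pass.
out = checkpoint_sequential(blocks, 2, x)
out.sum().backward()
```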