Context: a DenseNet-like model for speech recognition, JasperNet. I have a custom leaky_relu activation function that calls save_for_backward on tensors that are already kept for backward by other parts of the model (specifically, by my custom BatchNorm-derived modules). However, doing this increases max_memory_reserved
considerably, as if the autograd engine kept additional copies of those tensors, or did not decrement their refcounts after the activation's backward is done.
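For reference, the activation is an in-place autograd.Function along these lines (a simplified sketch, not the exact code from the repo):

```python
import torch
import torch.nn.functional as F

# Simplified sketch of the custom activation (not the exact convasr code):
# an in-place leaky_relu that saves its output, i.e. a tensor that the custom
# BatchNorm-derived modules are supposed to be keeping alive anyway.
class LeakyReLUInplaceFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, negative_slope=0.01):
        ctx.negative_slope = negative_slope
        ctx.mark_dirty(x)                    # x is modified in-place
        y = F.leaky_relu_(x, negative_slope)
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_output):
        y, = ctx.saved_tensors
        # leaky_relu with a positive slope preserves sign, so the output's
        # sign recovers the mask of the (overwritten) input
        grad_input = torch.where(y > 0, grad_output,
                                 grad_output * ctx.negative_slope)
        return grad_input, None
```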
How can I debug the memory usage of the autograd engine? Can I trace the refcounts of tensors saved for backward? Is it possible to dump the tensors kept in memory together with their refcounts?
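The closest I can get from Python is walking the garbage collector, along the lines of the sketch below, but that only sees tensors reachable from Python objects, and sys.getrefcount reports the Python-level refcount, not the references that autograd holds internally:

```python
import gc
import sys
import torch

# Sketch: dump live CUDA tensors with shapes, sizes and Python refcounts.
# This misses tensors referenced only from C++ (e.g. the autograd graph's
# saved-variable slots), which is exactly the part I'd like to inspect.
def dump_cuda_tensors():
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                print(tuple(obj.shape), obj.dtype,
                      obj.element_size() * obj.nelement(), 'bytes',
                      # -3 discounts getrefcount's own argument, the loop
                      # variable, and the reference held by gc's list
                      'py-refcount:', sys.getrefcount(obj) - 3)
        except Exception:  # some gc-tracked objects raise on attribute access
            pass
```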
I tried hard to come up with a small repro, but unfortunately I could not, so here is a big repro that uses my full codebase:
git clone https://github.com/vadimkantorov/convasr # unfortunately requires apex (though not used) and librosa
cd convasr
CUDA_VISIBLE_DEVICES=0 python3 benchmark.py --backward --model JasperNetBigInplace
# load+fwd 185.53 msec | bwd 1035.61 msec | cudamem 5645.53 mb
CUDA_VISIBLE_DEVICES=0 python3 benchmark.py --backward --model JasperNetBigInplaceBug
# load+fwd 186.24 msec | bwd 1029.35 msec | cudamem 3535.80 mb
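Roughly, the cudamem column is the peak reserved CUDA memory after backward, measured along these lines (a sketch; the `model` and `batch` names are placeholders for what benchmark.py actually builds):

```python
import torch

# Sketch of the cudamem measurement; `model` and `batch` are hypothetical
# stand-ins for the JasperNet model and input batch built by benchmark.py.
torch.cuda.reset_peak_memory_stats()
y = model(batch)
y.sum().backward()
torch.cuda.synchronize()
print('cudamem {:.2f} mb'.format(torch.cuda.max_memory_reserved() / 1e6))
```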
The buggy model does not save the residual branches for backward (and thus does not undo adding them in backward); the behavior is controlled by this line: https://github.com/vadimkantorov/convasr/blob/0d0141f98db650c39723e09fc1b0f183d2ddd9ea/models.py#L225
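Schematically, the difference between the two variants is the following (a simplified sketch, not the actual code behind that line):

```python
import torch

# Simplified sketch of the variant switch (not the exact convasr code).
# Both variants accumulate the residual branches into `x` in-place; the
# correct variant additionally saves them so that backward can undo the
# addition, while the *Bug variant skips both the saving and the undo.
class AddResidualsInplace(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, *residuals):
        ctx.mark_dirty(x)
        for r in residuals:
            x.add_(r)
        ctx.save_for_backward(*residuals)  # skipped in JasperNetBigInplaceBug
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # d(x + sum(residuals)) / d(input) == 1 for every input; the saved
        # residuals would be used here to restore the pre-addition buffer.
        return (grad_output,) + (grad_output,) * len(ctx.saved_tensors)
```

The residual tensors saved here should be the very same tensors that the custom BatchNorm modules keep for their own backward, which is why I would not expect the extra ~2 GB of reserved memory.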
@ngimel Maybe you have an idea of how to debug such memory increases? I'm not sure whether the engine creates extra copies or fails to decrement a refcount when it should.