Controlling the GPU memory cache

What you are seeing is likely how PyTorch manages GPU memory through its CUDA caching allocator. Even if the program never uses that region of memory again, memory already allocated by PyTorch is not returned to the driver; the allocator keeps it cached for reuse so it can avoid expensive cudaMalloc/cudaFree calls on every iteration.
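
A minimal sketch of that behavior (the tensor shape here is arbitrary, just large enough to be visible):

```python
import torch

x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32
print(torch.cuda.memory_allocated() // 2**20, "MiB in live tensors")
print(torch.cuda.memory_reserved() // 2**20, "MiB held by the allocator")

del x
# The tensor is gone, but the allocator keeps the block cached for reuse,
# so nvidia-smi still reports the memory as used by this process.
print(torch.cuda.memory_allocated() // 2**20, "MiB in live tensors")      # ~0
print(torch.cuda.memory_reserved() // 2**20, "MiB held by the allocator") # still ~1024
```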

Calling `torch.cuda.empty_cache()` releases the cached blocks back to the driver, which can help with memory fragmentation (sometimes an issue in its own right), but, yes, calling it every iteration does slow things down.
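
Continuing the sketch above, you can see the cache being handed back (again, the size is arbitrary):

```python
import torch

x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32
del x
print(torch.cuda.memory_reserved() // 2**20, "MiB cached")  # still ~1024

# Hand the unused cached blocks back to the driver. Memory held by live
# tensors is unaffected, and the next allocation pays for a fresh
# cudaMalloc, which is why calling this every iteration hurts throughput.
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved() // 2**20, "MiB cached")  # ~0
```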

Autograd needs to save tensors during the forward pass in order to compute gradients during the backward pass (though I'm not sure what you mean by "not used by subsequent processes"). So if you want to scale your model but are bottlenecked by memory usage, another option is activation checkpointing, which lets you save fewer activations during the forward pass in exchange for recomputing them during the backward pass; see the sketch below.
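
Here's a minimal sketch using `torch.utils.checkpoint` (the `Block`/`Model` modules and all sizes are just placeholders): only each block's input is saved during forward, and the activations inside the block are recomputed during backward.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.ReLU(),
            torch.nn.Linear(dim, dim), torch.nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class Model(torch.nn.Module):
    def __init__(self, dim=1024, depth=8):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # Checkpoint each block; use_reentrant=False is the
            # recommended mode in recent PyTorch versions.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = Model().cuda()
out = model(torch.randn(32, 1024, device="cuda", requires_grad=True))
out.sum().backward()
```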

Further reading:

https://pytorch.org/docs/stable/checkpoint.html