While measuring GPU memory usage at inference time, we observe some inconsistent behavior: larger inputs sometimes end up with much smaller GPU memory usage when queried through `nvidia-smi`.
We are wondering if the reason is that PyTorch is doing lazy garbage collection, i.e., the program only does garbage collection when there is not enough GPU memory left.
Also, during inference a program could be very sparing with GPU memory if it discarded intermediate results as soon as they are no longer needed. We are wondering whether PyTorch implements something like this?
If so, is the inference-time GPU memory usage reported in many research papers not that accurate?
Remember to use
with torch.no_grad():
    # your "inference" code here
otherwise your inference will build a huge autograd graph in the background (which will keep growing and growing, since you never call .backward() during “inference”).
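To illustrate the point above, here is a minimal sketch; the `Linear` model is just a stand-in for your own network, and the input shape is arbitrary:

```python
import torch

# Hypothetical toy model standing in for "your model".
model = torch.nn.Linear(512, 512)
model.eval()

x = torch.randn(8, 512)

# Without no_grad, each forward pass records operations for autograd;
# since .backward() is never called during inference, that bookkeeping
# (and the activations it keeps alive) is never released.
with torch.no_grad():
    y = model(x)

# The output carries no gradient history.
assert not y.requires_grad
```

Anything computed inside the `no_grad` block skips the autograd tape entirely, so intermediate activations can be freed as soon as the forward pass no longer needs them.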
Yes, we are using torch.no_grad().
We are just wondering whether reporting inference-time GPU memory usage is a good way to demonstrate a model’s memory efficiency, given the reasons stated above.
Oh I see, sorry if I misunderstood. I don’t know the exact mechanism, but I have read that PyTorch tries to use as much memory as it can for computational throughput, so there may be some caching involved. There is definitely some caching by cuDNN; I noticed it while benchmarking. You can try setting
torch.backends.cudnn.benchmark = True
and see whether it makes a difference in your comparison.
Anybody know how I can release the GPU memory during inference?