Inference-time GPU memory management and GC

While measuring GPU memory usage at inference time, we observed some inconsistent behavior: according to nvidia-smi, larger inputs sometimes end up with much lower GPU memory usage than smaller ones.

We are wondering whether this is because PyTorch does lazy garbage collection, i.e., the program only frees GPU memory when there is not enough left.

Also, at inference time a program could in principle be very sparing with GPU memory if it discarded intermediate results as soon as they were no longer needed. Does PyTorch implement something like this?

If so, does that mean the inference-time GPU memory usage reported in many research papers is not that accurate?

Remember to use

with torch.set_grad_enabled(False):
    # your "inference" code here

otherwise your inference code will build an autograd graph in the background and keep the intermediate activations alive for as long as the outputs are referenced (that memory is never freed by a .backward() call, since you don’t call it during “inference”).
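To make the difference concrete, here is a minimal sketch that runs on CPU. The tiny `torch.nn.Linear` model and the random input are placeholders chosen for illustration, not anything from the original question. It only shows that outputs computed under `torch.no_grad()` carry no autograd graph.

```python
import torch

# Hypothetical toy model and input, used only for illustration.
model = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)

# Default mode: the output is attached to an autograd graph.
out_tracked = model(x)

# Gradient tracking disabled: no graph is recorded for the output.
with torch.no_grad():
    out_untracked = model(x)

print(out_tracked.requires_grad, out_tracked.grad_fn is not None)
print(out_untracked.requires_grad, out_untracked.grad_fn is None)
```

`torch.set_grad_enabled(False)` used as a context manager has the same effect on the code inside the block.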


Yes, we are using

with torch.no_grad():

for inference.

We are just wondering whether, for the reasons stated above, reporting inference-time GPU memory usage is a poor way to demonstrate a model’s memory efficiency.

Oh I see, sorry for the misunderstanding. I don’t know the exact mechanisms, but I read somewhere that PyTorch holds on to GPU memory it has already allocated instead of returning it right away, for throughput reasons, so there is probably some caching involved. cuDNN definitely does some caching as well; I noticed that when benchmarking. You can try setting

torch.backends.cudnn.benchmark = True

and see if it makes some difference in your comparison.
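One way to check the caching explanation is to ask PyTorch itself how much memory live tensors occupy versus how much its allocator has reserved, since nvidia-smi reports roughly the reserved amount plus the CUDA context rather than what the tensors actually need. The sketch below is only an illustration: `report_cuda_memory` and `measure_inference` are hypothetical helper names, and the code falls back to CPU (returning `None` for the statistics) when no GPU is present.

```python
import torch

def report_cuda_memory(tag):
    """Print and return (allocated, reserved, peak allocated) in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    stats = (
        torch.cuda.memory_allocated() / mib,      # memory held by live tensors
        torch.cuda.memory_reserved() / mib,       # memory kept by the caching allocator
        torch.cuda.max_memory_allocated() / mib,  # peak tensor memory since the last reset
    )
    print(f"{tag}: allocated={stats[0]:.1f} reserved={stats[1]:.1f} peak={stats[2]:.1f} MiB")
    return stats

def measure_inference(model, batch):
    """Run one forward pass without autograd and report memory statistics afterwards."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        model, batch = model.cuda(), batch.cuda()
        torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        out = model(batch)
    if use_cuda:
        torch.cuda.synchronize()  # make sure the kernels have finished before reading stats
    return out, report_cuda_memory("after forward")

# Hypothetical toy model and batch, for illustration only.
out, stats = measure_inference(torch.nn.Linear(4, 2), torch.randn(3, 4))
```

For a comparison between models, the peak allocated value is probably the more meaningful number to report, because the reserved value depends on what the allocator happened to cache earlier in the process.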

Does anybody know how I can release the GPU memory during inference?
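A possible approach, sketched under the assumption that the memory in question is held by PyTorch's caching allocator: drop every Python reference to the tensors you no longer need, then call `torch.cuda.empty_cache()` so that unused cached blocks are handed back to the driver. The helper name `release_cached_gpu_memory` is hypothetical. Memory belonging to tensors that are still referenced is not released by this.

```python
import gc
import torch

def release_cached_gpu_memory():
    """Free unreferenced tensors and return the bytes still reserved by the allocator."""
    gc.collect()  # collect unreachable Python objects that may still hold tensors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return unused cached blocks to the driver
        return torch.cuda.memory_reserved()
    return 0  # nothing to release without a GPU

# Usage: delete references first, then release the cache.
result = torch.randn(8, 8)
del result
still_reserved = release_cached_gpu_memory()
print(still_reserved)
```

Calling this inside a tight inference loop is likely to slow things down, since the allocator then has to request memory from the driver again on the next iteration.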