Can we **not** reduce GPU memory consumption during inference?

I find that the memory consumed during inference (as reported by `nvidia-smi`) is usually smaller than during training. That is expected, but it can be a problem when sharing GPUs with others: if someone else's process grabs the freed memory while our code is running inference, the subsequent training run may hit an OOM error.

I have read that PyTorch uses a caching allocator for GPU memory: memory freed by tensors is not returned to the OS but kept in a cache, so that later GPU allocations can be served faster.
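To illustrate what I mean, here is a small sketch (assuming a CUDA device is available) that shows the difference between memory held by live tensors (`memory_allocated`) and memory the caching allocator has claimed from the driver (`memory_reserved`, which is roughly what `nvidia-smi` attributes to the process):

```python
import torch

def show_mem(tag: str) -> None:
    # memory_allocated: bytes currently held by live tensors
    # memory_reserved: bytes the caching allocator has claimed from the driver
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.1f} MiB")

if torch.cuda.is_available():
    x = torch.empty(256, 1024, 1024, device="cuda")  # ~1 GiB of fp32
    show_mem("after alloc")
    del x
    show_mem("after del")          # reserved stays high: the cache is kept
    torch.cuda.empty_cache()
    show_mem("after empty_cache")  # reserved drops: cache returned to the driver
```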

Thus my questions are:
(1) Does PyTorch clear its GPU memory cache when switching from training to inference?
(2) How can we keep the GPU memory usage during inference as large as during training, i.e., force PyTorch not to reduce its memory footprint?
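For question (2), one workaround I have been considering is simply holding a dummy tensor so the allocator keeps the memory claimed. The helper below (`reserve_gpu_memory` is a hypothetical name of my own, not a PyTorch API) is a minimal sketch of that idea:

```python
import torch

def reserve_gpu_memory(n_bytes: int):
    """Hypothetical helper: pin down GPU memory by holding a dummy byte tensor.

    As long as the returned tensor is kept alive, the caching allocator keeps
    the claimed memory, so other processes cannot take it.
    """
    if not torch.cuda.is_available():
        return None
    return torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

# Usage sketch:
# holder = reserve_gpu_memory(4 * 2**30)  # keep ~4 GiB claimed during inference
# ...run inference...
# del holder  # release the reservation before the next training run
```

I am not sure whether this interacts badly with the allocator's own caching, which is part of why I am asking.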