PyTorch GPU memory usage does not match the model's number of parameters

I’m running memory occupancy tests on an otherwise empty GPU, and when I move the following PyTorch module to the GPU (cuda:1), it occupies 843 MiB according to nvidia-smi:

I can’t make sense of it; the module is trivial.

This happens even with a freshly created script that creates a Linear layer and moves it to the GPU:

>>> from torch.nn import Linear
>>> Linear(1, 3).to("cuda:1")
Linear(in_features=1, out_features=3, bias=True)

Can anyone explain why such a simple operation results in such high GPU memory usage?

The first CUDA operation initializes the CUDA context, which holds the driver state, the loaded kernels, and so on, so this memory usage is expected. PyTorch binaries built with CUDA >= 11.7 enable lazy loading, which loads kernels only when they are first needed and therefore reduces the context size.
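To put the numbers in perspective, here is a small back-of-the-envelope sketch. It assumes the default float32 dtype (4 bytes per element); the helper name linear_param_bytes is made up for illustration and is not a PyTorch function.

```python
def linear_param_bytes(in_features, out_features, bias=True, bytes_per_element=4):
    """Bytes needed for the weight (and optional bias) of a linear layer.

    Assumes float32 parameters (4 bytes each) unless told otherwise.
    """
    n_weights = in_features * out_features
    n_bias = out_features if bias else 0
    return (n_weights + n_bias) * bytes_per_element


param_bytes = linear_param_bytes(1, 3)  # 3 weights + 3 biases
reported_bytes = 843 * 2**20            # 843 MiB as shown by nvidia-smi

print(param_bytes)                      # 24
print(f"parameters are {param_bytes / reported_bytes:.1e} of the reported usage")
```

The parameters themselves account for only 24 bytes, so practically all of the 843 MiB must come from something other than the module, which is consistent with the CUDA context explanation above.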

Okay, this was just a small test for a bigger problem I have. My model is 0.615 MB, yet GPU memory usage is 1689 MiB once I move it to the GPU. Does that seem plausible to you? How can I see what makes up those 1689 MiB on the device?

You can use torch.cuda.memory_summary() to check PyTorch’s memory usage and nvidia-smi to check which processes consume GPU memory. The CUDA context will not be shown in memory_summary().
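A minimal sketch of how these numbers could be compared side by side. The function names to_mib and report_gpu_memory, the toy Linear(1, 3) model, and the default device "cuda:0" are assumptions for illustration; adjust the device to match your setup (e.g. "cuda:1").

```python
def to_mib(n_bytes):
    """Convert a byte count to MiB, the unit nvidia-smi reports."""
    return n_bytes / 2**20


def report_gpu_memory(device="cuda:0"):
    """Print what PyTorch tracks versus what the device reports as used."""
    # Imported here so to_mib stays usable without PyTorch installed.
    import torch
    from torch import nn

    if not torch.cuda.is_available():
        print("No CUDA device available; nothing to report.")
        return

    dev = torch.device(device)
    model = nn.Linear(1, 3).to(dev)  # the first CUDA call creates the context

    # Bytes held by the model's own parameters and buffers.
    tensors = list(model.parameters()) + list(model.buffers())
    tensor_bytes = sum(t.numel() * t.element_size() for t in tensors)

    allocated = torch.cuda.memory_allocated(dev)   # live tensors
    reserved = torch.cuda.memory_reserved(dev)     # held by the caching allocator
    free, total = torch.cuda.mem_get_info(dev)     # device-wide, all processes

    print(f"model tensors:       {tensor_bytes} bytes")
    print(f"PyTorch allocated:   {to_mib(allocated):.3f} MiB")
    print(f"PyTorch reserved:    {to_mib(reserved):.3f} MiB")
    print(f"device used (total): {to_mib(total - free):.1f} MiB")
    print(torch.cuda.memory_summary(dev))


if __name__ == "__main__":
    report_gpu_memory()
```

On an otherwise idle GPU, the gap between the device-wide usage and the reserved figure is roughly the memory that PyTorch does not track, which includes the CUDA context. If other processes share the GPU, the device-wide figure includes their usage as well, so the per-process list in nvidia-smi is the better reference.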