I wanted to reduce the size of my PyTorch models since they consume a lot of GPU memory and I am not going to train them again.
First, I thought I could convert them to TensorRT engines, and then I was curious how to calculate the amount of GPU memory they use.
The size of a PyTorch model can be calculated either with torch.cuda.memory_allocated or by summing the sizes of model.parameters() and model.buffers().
I checked whether the two approaches gave the same values, and they did.
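Roughly like this (a minimal sketch; torchvision's resnet18 is just an arbitrary stand-in model, not one of my models):

import torch
import torchvision

model = torchvision.models.resnet18().cuda()

# sum the bytes of all parameters and buffers
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
print(f"parameters + buffers: {param_bytes + buffer_bytes} bytes")

# compare with what the caching allocator reports (can be slightly larger,
# since allocations are rounded up to block sizes)
print(f"torch.cuda.memory_allocated: {torch.cuda.memory_allocated()} bytes")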
But the size of a TensorRT engine or of other scripted modules on the GPU cannot be calculated with the above torch functions.
So I thought I could check the GPU memory usage with the GPUtil library.
However, the memory usage reported by GPUtil (which uses nvidia-smi) was very different.
For example, one model is 13 MiB but almost 2 GiB was allocated on the GPU. Another model is 171 MiB but also around 2 GiB was allocated on the GPU. I didn't put any other objects such as inputs on the GPU.
Even after deleting the model:

import GPUtil

del model
gpu = GPUtil.getGPUs()[0]
memoryUsed = gpu.memoryUsed  # used device memory as reported by nvidia-smi

the reported memory was still not 0, while torch.cuda.memory_allocated(0) showed 0.
How do you calculate the GPU memory that a PyTorch model uses?
Or how do you compare the GPU memory a PyTorch model uses with what its script-mode version uses?
And, if I understood correctly and used the right functions, why is the actual allocated memory so different from the real size of the torch tensors in bytes?
I knew it could differ because of defined page sizes, but I didn't expect the difference to be that large (a 2 GB difference).
PyTorch creates the CUDA context in the first CUDA operation, which loads the driver and kernels (native PyTorch kernels as well as those from the used libraries, etc.) and takes some memory overhead depending on the device.
PyTorch doesn't report this memory, which is why torch.cuda.memory_allocated() could return a 0 allocation.
You would thus need to use nvidia-smi (or any other “global” reporting tool) to check the overall GPU memory usage.
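For example, from inside Python you can compare the allocator's view with the global device view (a sketch, assuming a PyTorch version that provides torch.cuda.mem_get_info, which wraps cudaMemGetInfo):

import torch

x = torch.randn(1024, device="cuda")  # the first CUDA op creates the context

free, total = torch.cuda.mem_get_info()
print(f"allocated by PyTorch tensors : {torch.cuda.memory_allocated()} bytes")
print(f"reserved by caching allocator: {torch.cuda.memory_reserved()} bytes")
print(f"used on the whole device     : {total - free} bytes")  # roughly what nvidia-smi shows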
Is there a way to measure the peak GPU memory consumption between two points in time? For example, I am running a forward call on some network. Memory starts at value x, then during the forward call goes up to y, and then at the end goes back to x. I would like to discover the y value.
print(f"gpu used {torch.cuda.max_memory_allocated(device=None)} memory")
# gpu used 0 memory
x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
print(f"gpu used {torch.cuda.max_memory_allocated(device=None)} memory")
# gpu used 8192 memory
# should return same memory usage as nothing was deleted
torch.cuda.reset_peak_memory_stats(device=None)
print(f"gpu used {torch.cuda.max_memory_allocated(device=None)} memory")
# gpu used 8192 memory
# delete tensors and reduce peak memory usage
del y
torch.cuda.reset_peak_memory_stats(device=None)
print(f"gpu used {torch.cuda.max_memory_allocated(device=None)} memory")
# gpu used 4096 memory
del x
torch.cuda.reset_peak_memory_stats(device=None)
print(f"gpu used {torch.cuda.max_memory_allocated(device=None)} memory")
# gpu used 0 memory
I put it in a random place in my program, and it didn't show 0. It should, right? There is memory occupied on the GPU, but right after reset_peak_memory_stats I would expect it to show 0, yet it doesn't.
In case you misunderstood me: calling reset_peak_memory_stats will not always result in a reported memory usage of 0, since memory could still be allocated, making it the new peak, as seen in my code snippet.
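Applied to your forward-pass example, the same two calls can be wrapped around the region of interest. A sketch, where model and inp are hypothetical placeholders for your own module and input:

import torch

torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()    # the x value

out = model(inp)  # the region you want to profile

peak = torch.cuda.max_memory_allocated()  # the y value: peak allocation since the reset
print(f"started at {before} bytes, peaked at {peak} bytes during the forward call")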
You need to know the size of each parameter (e.g., 4 bytes for float32) and estimate the activation and optimizer states based on your model's architecture. Tools like TensorBoard or the built-in profiling tools in frameworks like PyTorch or TensorFlow can also help.
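A rough back-of-the-envelope sketch (my assumptions: float32 parameters, a plain Adam optimizer, which keeps two extra float32 states per parameter, activations left out, and nn.Linear as a stand-in model):

import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in model

param_count = sum(p.numel() for p in model.parameters())
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
adam_state_bytes = 2 * param_bytes  # exp_avg + exp_avg_sq kept by Adam

print(f"{param_count} parameters -> {param_bytes} bytes of weights")
print(f"estimated Adam optimizer state: {adam_state_bytes} bytes")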