Why does a TorchScript module not take less GPU memory than the PyTorch model?

I wanted to use less GPU memory and make inference faster by converting PyTorch models to TorchScript.

TorchScript can create serializable and optimizable models from PyTorch code, so I expected inference would be faster and the module would also be lighter.

However, the module's memory footprint did not decrease after tracing the original PyTorch model, although inference did become roughly 12% faster in my environment.

I checked the size as below.

I used two functions to check GPU memory usage (a small helper combining them is sketched after this list):

  • torch.cuda.memory_allocated => the current GPU memory occupied by tensors, as tracked by PyTorch.
  • GPUtil library's memoryUsed => the same global GPU memory usage that nvidia-smi reports.
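
For reference, here is a minimal sketch of how both counters can be read together (this assumes GPU index 0 and that the GPUtil package is installed; report_gpu_memory is just a hypothetical helper name):

import torch
import GPUtil

def report_gpu_memory(tag):
    # memory occupied by tensors PyTorch has allocated on GPU 0, in MB
    allocated_mb = torch.cuda.memory_allocated(0) / 1024**2
    # global usage of GPU 0, the same number nvidia-smi reports, in MB
    global_mb = GPUtil.getGPUs()[0].memoryUsed
    print(f"{tag}: torch allocated {allocated_mb:.1f} MB, nvidia-smi {global_mb:.1f} MB")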

I checked the GPU memory with the original PyTorch model loaded, then traced it and saved it with torch.jit.save():

# checked torch.cuda.memory_allocated(0)
# checked GPUtil.getGPUs()[0].memoryUsed  (GPUtil library uses nvidia-smi)
traced_module = torch.jit.trace(model, input)
torch.jit.save(traced_module, "scnet_traced.pt")

After finishing the above process, I verified that the GPU was empty (roughly as sketched below) and then ran the loading step that follows.
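
A sketch of how the GPU can be emptied between the two measurements; model and traced_module are the objects from the tracing snippet above:

import gc
import torch

# drop the Python references, then release cached blocks back to the driver
# so that nvidia-smi reflects the change as well
del model, traced_module
gc.collect()
torch.cuda.empty_cache()
# torch.cuda.memory_allocated(0) should now be (close to) 0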

scnet = torch.jit.load('scnet_traced.pt', map_location=deploy_device)
# checked torch.cuda.memory_allocated(0)
# checked GPUtil.getGPUs()[0].memoryUsed  (GPUtil library uses nvidia-smi)

Neither the memory that PyTorch reports nor the global usage that nvidia-smi reports had decreased.

Can scripting decrease the size of a module, or does scripting simply not reduce the GPU memory a module takes up?

I don’t want to change the algorithm or basic structure of the original model.
Do you recommend any other methods to use less GPU memory for deployment?

I’m not an expert, but why do you assume it should use less memory?
PyTorch (in Python) relies on CUDA in the same way; the only difference is that the traced model is static. Once you use torch.no_grad(), no autograd graphs are kept alive outside the scope of the functions you write, so memory consumption will be the same.
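
For example, plain eager inference under torch.no_grad() already keeps no autograd graph alive (a minimal sketch, reusing the model and input from the question):

import torch

with torch.no_grad():        # no autograd graph is built or kept around
    output = model(input)    # memory usage is essentially the same as for the traced module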

Besides, if the original code is written properly, there shouldn’t even be a speed improvement.

If you want to reduce memory consumption, you should use fp16. On modern GPUs that gives roughly a 2x speedup and about half the memory footprint.
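
A minimal sketch of fp16 inference (this assumes your model's ops are numerically safe in half precision; model and input are the objects from the question):

import torch

model_fp16 = model.half().eval()        # fp16 weights: roughly half the weight memory

with torch.no_grad():
    output = model_fp16(input.half())

# alternative: keep fp32 weights (no weight-memory savings) and let autocast
# choose the per-op precision
with torch.no_grad(), torch.cuda.amp.autocast():
    output = model(input)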

EDIT: Quantization can be even faster and lighter, although fp16 is very likely to match the performance of the original model.
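
For completeness, a sketch of dynamic int8 quantization; note that dynamic quantization as shown here targets CPU inference, so it shrinks the deployed model rather than GPU memory:

import torch

# quantize the weights of Linear layers to int8; activations are quantized on the fly
quantized = torch.quantization.quantize_dynamic(
    model.cpu().eval(),
    {torch.nn.Linear},      # layer types to quantize
    dtype=torch.qint8,
)

with torch.no_grad():
    output = quantized(input.cpu())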


You might not see memory savings right now, depending on which utilities you are using.
E.g. once the model is scripted, utilities such as AOTAutograd can try to cut down memory usage by not storing unneeded activations (and recomputing them instead), but these utilities are still in development and thus experimental.
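
The underlying idea (recompute activations instead of storing them) can already be applied by hand with torch.utils.checkpoint, which is not AOTAutograd but shows the same memory/compute trade-off; a small sketch with a made-up Block module:

import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
            torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
        )

    def forward(self, x):
        # activations inside self.net are not stored for backward;
        # they are recomputed during the backward pass instead
        return checkpoint(self.net, x)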

That’s not entirely true. By default PyTorch is “eager”, i.e. each line of Python code is executed as it is written. Scripting a model can fuse operations into a larger block and thus avoid expensive memory reads and writes. In the latest 1.12.0 release our nvFuser backend is enabled by default for CUDA workloads and is able to fuse pointwise operations etc.
We are working on a blog post and tutorial explaining these fusions in more detail, but you could also take a look at e.g. this topic, which shows a speedup of ~3.8x over eager execution for a custom normalization layer.
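
As a toy illustration of the kind of pointwise chain the fuser can merge (a sketch; gelu_bias is a made-up example function, and the warm-up runs let the JIT profile and fuse on CUDA):

import torch

@torch.jit.script
def gelu_bias(x, bias):
    # a chain of pointwise ops that can be fused into a single kernel,
    # avoiding intermediate reads/writes to global memory
    y = x + bias
    return y * 0.5 * (1.0 + torch.erf(y / 1.41421356237))

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")

for _ in range(3):    # warm-up iterations before timing
    out = gelu_bias(x, bias)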


I didn’t know about that.
Most of my attempts to use TorchScript to speed up the forward pass showed no improvement at all.

Are there other kinds of situations besides pointwise ops?

Yes, we are actively working to increase support for more operations (such as matmuls etc.) and will push these changes to the master/nightly branch once they are ready.
Btw., if you want to learn more about nvFuser, check out this GTC talk.
