Status of CUDA Quantization

Hi there,

I have been facing a frustrating issue over the past few months.

I am building a CUDA-accelerated neural style transfer plugin with LibTorch, but as it stands it takes up far too much VRAM because the model I'm using (VGG-19) is so large. The only viable path I can see to an adequate (3-4x) VRAM reduction is int8 quantization. However, int8 quantization on CUDA isn't currently implemented in PyTorch.
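For context, here is a back-of-the-envelope sketch of the weight memory involved, using torchvision's stock `vgg19` (randomly initialized, no download needed):

```python
import torch
import torchvision

# Rough estimate of VGG-19 weight memory in fp32 vs. int8.
model = torchvision.models.vgg19()
n_params = sum(p.numel() for p in model.parameters())

fp32_mb = n_params * 4 / 2**20  # 4 bytes per fp32 weight
int8_mb = n_params * 1 / 2**20  # 1 byte per int8 weight

print(f"params: {n_params / 1e6:.1f}M")
print(f"fp32 weights: ~{fp32_mb:.0f} MB, int8 weights: ~{int8_mb:.0f} MB")
```

Worth keeping in mind that this only counts weights; the activations saved for backpropagation come on top of it and scale with image resolution, so the end-to-end saving from weight quantization alone may be smaller than 4x.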

I tried Torch-TensorRT, and even contributed to getting it working on Windows, but it won't support backpropagation (a requirement for neural style transfer), and I'm not even sure it will substantially (if at all) reduce VRAM usage; I am planning to test this soon.
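For that test, a minimal sketch of how peak VRAM can be measured with PyTorch's built-in CUDA memory statistics (the model and input here are placeholders, not my actual plugin):

```python
import torch
import torchvision

# Measure peak VRAM for one forward/backward pass through VGG-19 features.
device = torch.device("cuda")
model = torchvision.models.vgg19().features.to(device).eval()

# Style transfer backprops to the image, so the input requires grad.
x = torch.randn(1, 3, 512, 512, device=device, requires_grad=True)

torch.cuda.reset_peak_memory_stats(device)
out = model(x)
out.sum().backward()  # placeholder loss

print(f"peak allocated: {torch.cuda.max_memory_allocated(device) / 2**20:.0f} MB")
```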

The only viable solution seems to be PyTorch/LibTorch supporting int8 quantization on CUDA. I've seen it mentioned on GitHub and on this forum for a few years now, but there doesn't seem to be any clear indication of its current status.

When can we expect this feature to be released? Will it be part of 1.12? Is there currently a fork of PyTorch that already implements it, at least with enough supported ops to run VGG-19?

Also, if you have any other suggestions that might be helpful for this problem, that would be greatly appreciated; I am open to almost any solution here.

Thank you!
Jonah

Just checking, have you tried NVIDIA AMP (automatic mixed precision)?
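AMP won't reach the 3-4x you're after (weights stay fp32; activations and most ops run in fp16), but it does support backpropagation and roughly halves activation memory, which may be enough depending on your image sizes. A minimal sketch of what it looks like for a style-transfer-style loop, where the image itself is optimized (the loss below is a placeholder, not your actual style/content losses):

```python
import torch
import torchvision

device = torch.device("cuda")
model = torchvision.models.vgg19().features.to(device).eval()

# Optimize the image itself, as in neural style transfer.
img = torch.randn(1, 3, 512, 512, device=device, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.01)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        features = model(img)
        loss = features.pow(2).mean()  # placeholder for style/content losses
    scaler.scale(loss).backward()      # scaled backward to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```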

We are working on PyTorch quantization → TensorRT as well as eager mode int8 CUDA kernels, but neither of those is releasing in v1.12 (there is no committed release date at this moment), and both are targeting inference first, without support for backpropagation.

For the unofficial status of eager mode CUDA int8 ops, you could check out test_quantized_op.py in pytorch/pytorch on GitHub, specifically the tests marked with @unittest.skipIf(not TEST_CUDNN, "cudnn is not enabled.")