Hi there,
I have been facing a frustrating issue over the past few months.
I am creating a CUDA-accelerated neural style transfer plugin in LibTorch, but as it stands it uses far too much VRAM because the model I'm using (VGG-19) is so large. The only viable path I can see to a meaningful (3-4x) reduction in VRAM usage is int8 quantization. However, PyTorch's quantization support currently doesn't cover CUDA.
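For reference, this is roughly what I mean: the existing eager-mode quantization API does produce int8 weights, but only on the CPU backend. A minimal sketch (using a toy two-layer model rather than the actual VGG-19, and dynamic quantization of Linear layers, which is the simplest supported path):

```python
import torch
import torch.nn as nn

# Toy stand-in for a network head; dynamic quantization currently
# covers nn.Linear (and LSTM), not the conv layers VGG-19 relies on.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))

# Eager-mode dynamic quantization: weights are stored as int8.
# This path is CPU-only -- there is no CUDA equivalent today.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)   # CPU tensor works; a .cuda() tensor would not
out = qmodel(x)
print(out.shape)          # torch.Size([1, 10])
print(qmodel[0].weight().dtype)  # torch.qint8
```

Something equivalent that runs on CUDA tensors (and ideally covers conv ops) is what I'm after.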
I tried Torch-TensorRT, and even contributed to getting it working on Windows, but it doesn't support back-propagation (a requirement for neural style transfer). I'm also not sure it would substantially reduce VRAM usage, if at all; I plan to test this soon.
The only viable solution seems to be PyTorch/LibTorch adding CUDA int8 quantization. I've seen it mentioned on GitHub and this forum for a few years, but there doesn't seem to be any clear indication of its current status.
When can we expect this feature to be released? Will it be a part of 1.12? Is there currently a fork of PyTorch that already has it implemented, at least with enough supported ops to run VGG-19?
Also, any other suggestions that might help with this problem would be greatly appreciated; I am open to almost any solution here.
Thank you!
Jonah