Is there a way to do quantization (mostly 8-bit) on GPUs in native PyTorch while avoiding TensorRT?
The docs seem to indicate to me that quantization for GPUs is possible only with TensorRT. Is that correct? If it's not available in main, is there maybe a PR to work with?
I am grateful for any hints or suggestions.
Hi @fabian_schutze , we are considering this for future work but we don’t currently have this in a usable form.
Yeah, we don’t have usable support for native quantized GPU ops. Here are some discussions on GitHub as well: Quantized Inference on GPU summary of resources · Issue #87395 · pytorch/pytorch · GitHub
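For anyone landing here later: a quick way to see the current state is that native quantized tensors can be created and used on CPU, but the quantized kernels are not implemented for CUDA. A minimal sketch (the scale/zero-point values here are arbitrary, just for illustration):

```python
import torch

# Native 8-bit per-tensor quantization works on CPU.
x = torch.randn(4, 4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

print(qx.dtype)          # torch.qint8
print(qx.is_quantized)   # True

# Round-tripping back to float recovers x up to quantization error
# (at most scale/2 per element for values within the qint8 range).
x_hat = qx.dequantize()
print((x_hat - x).abs().max() <= 0.05 + 1e-6)

# Attempting the same on CUDA is where native support is missing:
# qx.cuda() raises an error on builds without quantized CUDA kernels.
```

So CPU inference with the native quantization APIs works today; the GPU path is what the linked issue tracks.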
Thanks a lot for your replies, @Vasiliy_Kuznetsov and @jerryzh168 . They were both very informative.