Hello everyone,
First, I want to mention that I am a beginner in the field of quantization, so my question might seem basic.
I would like to run quantized DNN models on a GPU. However, as far as I understand from the PyTorch documentation, most quantization techniques are only supported on CPU, and GPU support for these features seems to be quite limited. (I also tried it myself and got an error saying that quantized_linear is not supported by the CUDA backend…)
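For reference, this is roughly what I tried (a minimal sketch; the model and shapes are just placeholders, and moving the quantized model to CUDA is the step that fails on my setup):

```python
import torch
import torch.nn as nn

# A tiny FP32 model; the exact architecture doesn't matter here.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Post-training dynamic quantization -- this runs fine on CPU.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(qmodel(x).shape)   # works on CPU

# Moving the quantized model and input to the GPU is where it breaks for me:
# qmodel.cuda()
# qmodel(x.cuda())       # raises an error about quantized ops on the CUDA backend
```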
What confuses me is that in repositories like QLoRA, quantization appears to be performed with torch. Is the quantization functionality actually provided by Hugging Face or the bitsandbytes library, or is torch itself doing this?
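To be concrete, this is the kind of call I mean (a sketch based on the usual transformers + bitsandbytes recipe; the model name is only an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization, as used in QLoRA-style fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # matmuls are computed in fp16 on the GPU
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # example model; any causal LM should work
    quantization_config=bnb_config,
    device_map="auto",
)
```

Here the weights end up quantized on the GPU, which is what made me wonder whether torch itself is doing the quantization or whether it all comes from bitsandbytes under the hood.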
Additionally, if I provide FP16 data as input to a standard FP32 kernel, will the computation proceed correctly?
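For that last question, this is the small experiment I have in mind (a sketch; I'm assuming an nn.Linear stands in for "a standard FP32 kernel"):

```python
import torch
import torch.nn as nn

linear = nn.Linear(16, 8)                       # parameters are FP32 by default
x_fp16 = torch.randn(4, 16, dtype=torch.float16)

# Mixing dtypes directly typically raises a dtype-mismatch RuntimeError:
# linear(x_fp16)

# Either cast the input up explicitly...
out_cpu = linear(x_fp16.float())

# ...or let autocast pick the compute dtype on the GPU:
if torch.cuda.is_available():
    linear_gpu = linear.cuda()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out_gpu = linear_gpu(x_fp16.cuda())
    print(out_gpu.dtype)   # float16 under autocast

print(out_cpu.dtype)       # float32 after the explicit cast
```

So my understanding is that the dtypes have to match (or be handled by autocast) rather than FP16 data silently flowing through an FP32 kernel, but I'd appreciate confirmation.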
Thank you in advance for your help!