Hello everyone,
First, I want to mention that I am a beginner in the field of quantization, so my question might seem basic.
I would like to run quantized DNN models on a GPU. However, as far as I understand from the PyTorch documentation, most quantization techniques are only supported on CPU, and GPU support for these features seems to be quite limited. (I also tried it myself and got an error saying that quantized_linear is not supported by the CUDA backend…)
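For reference, this is roughly what I tried (a minimal sketch; the model and shapes are just placeholders, and moving the quantized model to CUDA is the step that fails on my setup):

```python
import torch
import torch.nn as nn

# A tiny FP32 model; the exact architecture doesn't matter here.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Post-training dynamic quantization -- this runs fine on CPU.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(qmodel(x).shape)   # works on CPU

# Moving the quantized model and input to the GPU is where it breaks for me:
# qmodel.cuda()
# qmodel(x.cuda())       # raises an error about quantized ops on the CUDA backend
```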
What confuses me is that in repositories like QLoRA, quantization appears to be performed with torch. Is the quantization functionality actually provided by Hugging Face or the bitsandbytes library, or is torch itself doing this?
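To be concrete, this is the kind of call I mean (a sketch based on the usual transformers + bitsandbytes recipe; the model name is only an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization, as used in QLoRA-style fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # matmuls are computed in fp16 on the GPU
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # example model; any causal LM should work
    quantization_config=bnb_config,
    device_map="auto",
)
```

Here the weights end up quantized on the GPU, which is what made me wonder whether torch itself is doing the quantization or whether it all comes from bitsandbytes under the hood.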
Additionally, if I provide FP16 data as input to a standard FP32 kernel, will the computation proceed correctly?
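For that last question, this is the small experiment I have in mind (a sketch; I'm assuming an nn.Linear stands in for "a standard FP32 kernel"):

```python
import torch
import torch.nn as nn

linear = nn.Linear(16, 8)                       # parameters are FP32 by default
x_fp16 = torch.randn(4, 16, dtype=torch.float16)

# Mixing dtypes directly typically raises a dtype-mismatch RuntimeError:
# linear(x_fp16)

# Either cast the input up explicitly...
out_cpu = linear(x_fp16.float())

# ...or let autocast pick the compute dtype on the GPU:
if torch.cuda.is_available():
    linear_gpu = linear.cuda()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out_gpu = linear_gpu(x_fp16.cuda())
    print(out_gpu.dtype)   # float16 under autocast

print(out_cpu.dtype)       # float32 after the explicit cast
```

So my understanding is that the dtypes have to match (or be handled by autocast) rather than FP16 data silently flowing through an FP32 kernel, but I'd appreciate confirmation.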
Thank you in advance for your help!