Is there a way to perform inference on the QAT model using a GPU?

Is there a way to run inference on a quantized model produced by QAT (Quantization-Aware Training) on a GPU?
To speed up inference, I performed QAT on the base (GPU) model and converted it into a quantized model.
While the model size shrank to about a quarter of the original, inference with the quantized model on CPU is slower than with the base model on GPU.
I suspect this is because inference runs on the CPU, so is there a way to run the QAT-converted model on a GPU?
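For reference, the flow looks roughly like the standard eager-mode QAT recipe below (simplified sketch; the model and qconfig here are placeholders, not the real ones):

```python
import torch
import torch.ao.quantization as tq

class TinyNet(torch.nn.Module):
    # Placeholder model; the real model and qconfig differ.
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = torch.nn.Conv2d(3, 8, 3)
        self.relu = torch.nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = TinyNet().train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)

# ... fine-tune with fake-quant observers enabled ...

model.eval()
quantized = tq.convert(model.cpu(), inplace=False)

# The converted model uses int8 kernels from the fbgemm CPU backend,
# so inference only runs on CPU; moving it to CUDA is not supported.
out = quantized(torch.randn(1, 3, 32, 32))
```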

Not really. GPU quantization is its own thing: GitHub - pytorch-labs/ao (the torchao repository contains APIs and workflows for quantization and pruning of GPU models), which is currently in development. We don’t have QAT working there yet. Also, we only have dynamic quantization (and only for linear layers), so QAT usually isn’t necessary.

I’d recommend checking that out and seeing whether it serves your use case. QAT is just a tool to improve the accuracy of quantization, most often when doing static quantization; it isn’t usually an end in itself.

Depending on your model, GPU quantization should be a lot faster than running the quantized model on CPU or running the float model on GPU without quantization. See the torchao repo for examples of this being done for SAM, Llama, and SDXL.
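As a minimal sketch, GPU dynamic quantization with torchao looks roughly like this (the entry-point names follow a recent torchao release and may differ from the in-development API in the linked repo; check its README for the current ones):

```python
import torch
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

# Placeholder model: torchao's dynamic quantization targets nn.Linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).eval().cuda()

# Swap nn.Linear weights for int8, with dynamic quantization of activations.
quantize_(model, int8_dynamic_activation_int8_weight())

# torch.compile is generally needed to pick up the fused int8 GPU kernels.
model = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    out = model(torch.randn(16, 1024, device="cuda"))
```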