Does Dynamic Quantization support GPU?

Hello,
I’ve tried doing dynamic quantization on the XLNet model during inference, and I got this error message:

RuntimeError: Could not run 'quantized::linear_dynamic' with arguments from the 'CUDA' backend. 'quantized::linear_dynamic' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, Tracer, Autocast, Batched, VmapMode].

This leads me to believe dynamic quantization doesn’t support CUDA. If so, do you guys plan to add CUDA support for quantization, for both training and inference? I couldn’t find any related issues on GitHub. Thanks!

Yeah, it is not supported on CUDA; quantized::linear_dynamic is only supported on CPU. We do not have immediate plans to support CUDA, but we plan to publish a doc for custom backends, which will make the extension easier.
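
For anyone hitting the same error, here is a minimal sketch of keeping the dynamically quantized model and its inputs on CPU. It assumes a small hypothetical `TinyModel` as a stand-in for XLNet (the names `TinyModel`, `fc1`, and `fc2` are made up for illustration) and uses `torch.quantization.quantize_dynamic` on the `nn.Linear` layers only; the real model would need the same CPU placement, but this is not tested against XLNet itself.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in for the real model; any module with nn.Linear layers
# is handled the same way by dynamic quantization.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = TinyModel().eval()

# quantized::linear_dynamic only has a CPU kernel, so move the model to CPU
# before quantizing instead of leaving it on a CUDA device.
model_cpu = model.to("cpu")
quantized = torch.quantization.quantize_dynamic(
    model_cpu, {nn.Linear}, dtype=torch.qint8
)

# The inputs must live on CPU as well; passing a CUDA tensor would trigger
# the same backend error shown in the question.
x = torch.randn(2, 16).to("cpu")
with torch.no_grad():
    out = quantized(x)

print(out.shape, out.device)
```

If the rest of the pipeline runs on GPU, the tensors have to be moved back to CPU before they reach the quantized layers, which may cancel out part of the speedup.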


Hi, where can we get the doc?

We will think about posting one in OSS; please keep an eye on the GitHub issues page for that. We are also currently working on enabling a CUDA path through TensorRT, and had a prototype here: [not4land] Test PT Quant + TRT path by jerryzh168 · Pull Request #60589 · pytorch/pytorch · GitHub

I can share the doc with you early if you message me your email, but we may make some modifications before publishing it in OSS.