Hello,
I tried applying dynamic quantization to an XLNet model for inference, and I got this error message:
RuntimeError: Could not run 'quantized::linear_dynamic' with arguments from the 'CUDA' backend. 'quantized::linear_dynamic' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, Tracer, Autocast, Batched, VmapMode].
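For context, here is a minimal sketch of the kind of setup that triggers this, using the Hugging Face transformers xlnet-base-cased as a placeholder model (my actual inputs and checkpoint may differ):

```python
import torch
from transformers import XLNetModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased").eval()

# Dynamically quantize the Linear layers to int8
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Moving the quantized model and inputs to GPU is what triggers the error
quantized_model.to("cuda")
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = quantized_model(**inputs)  # RuntimeError: quantized::linear_dynamic ...
```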
This leads me to believe dynamic quantization doesn’t support CUDA. If so, do you plan to add CUDA support for quantization, for both training and inference? I couldn’t find any issues related to this on GitHub. Thanks!
Yeah, it is not supported on CUDA; quantized::linear_dynamic is only supported on CPU. We do not have immediate plans to support CUDA, but we plan to publish a doc for custom backends, which will make extending quantization to other backends easier.
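For now the workaround is to keep the quantized model and its inputs on CPU. A rough sketch (the model name and dummy input here are just examples):

```python
import torch
from transformers import XLNetModel

model = XLNetModel.from_pretrained("xlnet-base-cased").eval()

# quantize_dynamic swaps nn.Linear for dynamically quantized versions; the
# quantized::linear_dynamic op they dispatch to only has a CPU kernel.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Keep both the model and the inputs on CPU for inference.
input_ids = torch.randint(0, model.config.vocab_size, (1, 16))
with torch.no_grad():
    outputs = quantized_model(input_ids)
```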