workflow for the qat now is:
using the same precision in each fake_quant for EVERY LAYER.
fp32 → fake_quant → fp32
problem i meet:
1st
input data may be 8bit in most common cases.
when qat int4 model, first layer fake_quant “8bit data into 4bit” (or we call cut the data spread).
in this process we lost too much(precision drop happens in the input data) …
IF we can treat first layer with 8bit qconfig, and treat other layer with 4bit qconfig.
we can keep some more necessary input data.
2nd
is there any doc to use 2 or more qconfig in the same qat process.
3rd
ive notice Add MKLDNN quantization backend by Xia-Weiwen · Pull Request #67177 · pytorch/pytorch · GitHub mkldnn
but its a bit different, mkldnn just change internal compute logic,
i just wanna to add a new backend.
i find some reference here: Extending PyTorch Quantization to Custom Backends · pytorch/pytorch Wiki · GitHub
Is there any suggestion on develop a new backend, especially for qat.
Purpose for me:
- im working to develop a new int4 qat qconfig(or we say a new int4 backend) for a specific dla,
in my opinion using 4bit in all layers may cause precision drop, especially for the first layer.
Im try to deal with the problem in the first layer to keep as more information as possible to prevent precision drop. - also, im try to find some use case/demo on how to use hybrid quant schemem, for example using 8bit qconfig and fp16 qconfig in the same qat process. any user interface.
- im searching for some hybrid quant qat demo, do u have some?
- any suggestion on develop a new int4 qat qconfig (or we say a new int4 backend).