I recently used PyTorch quantization-aware training (QAT) to quantize my model. The quantized model retains good accuracy and uses per-channel scales.
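To show what I mean by per-channel scales, here is a minimal sketch with a toy module, assuming the eager-mode QAT API (`torch.ao.quantization`); my real model and training loop are different:

```python
import torch
import torch.nn as nn

# Toy model just to illustrate the per-channel quantization parameters.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
model.train()

model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
qat_model = torch.ao.quantization.prepare_qat(model)

# Stand-in for the actual QAT training loop: one forward pass so the
# observers record some statistics.
qat_model(torch.randn(1, 3, 8, 8))

qat_model.eval()
quantized = torch.ao.quantization.convert(qat_model)

w = quantized[0].weight()                 # quantized conv weight
print(w.qscheme())                        # per-channel scheme
print(w.q_per_channel_scales())           # one FP32 scale per output channel
print(w.q_per_channel_zero_points())      # one integer zero-point per channel
```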
However, our hardware colleagues told me that because the model carries floating-point scales and zero-points per channel, the hardware would still have to support floating point in order to implement it. They also argued that at each internal stage, the per-channel values would have to be dequantized to floating point and then re-quantized for the next layer.
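To make sure I understand their argument, this is the per-channel "dequantize then re-quantize" step I think they are describing, written as a sketch with made-up names and shapes (not my actual model):

```python
import torch

# Hypothetical int32 accumulator coming out of a conv layer.
acc_int32 = torch.randint(-2**20, 2**20, (8, 4, 4), dtype=torch.int32)
s_in   = 0.02                    # input activation scale of this layer
s_w    = torch.rand(8) * 0.01    # per-output-channel weight scales (FP32)
s_out  = 0.05                    # activation scale of the next layer's input
zp_out = 0                       # zero-point of the next layer's input

# Real value of the accumulator:      acc_int32 * s_in * s_w[c]
# Re-quantized int8 for next layer:   round(real / s_out) + zp_out
m = (s_in * s_w) / s_out         # one FP multiplier per channel
out = torch.round(acc_int32 * m[:, None, None]) + zp_out
out_int8 = out.clamp(-128, 127).to(torch.int8)
```

If I read their concern correctly, it is this per-channel multiplier `m` that they say forces the hardware to do floating-point arithmetic between layers.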
I'm not sure whether this argument is valid, or whether such a limitation really applies to this kind of quantization?