I’m trying to quantize BERT to 4 bits or mixed precision, and I don’t see any available method for doing quantization-aware training on BERT at a precision other than torch.qint8, which is the dtype used in the dynamic quantization tutorial.
I want to use both post-training quantization and dynamic quantization at lower than 8 bits.
Will I have to rewrite the layers in modeling_bert.py (transformers/modeling_bert.py) with fake quantization added? How can lower-than-8-bit precision and mixed precision be implemented on BERT?
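To make the question concrete, here is a minimal sketch of what I mean by "fake quantization added", done by swapping modules at runtime rather than editing modeling_bert.py. It assumes symmetric per-tensor quantization with a straight-through estimator. The names `fake_quantize`, `FakeQuantLinear`, `swap_linears` and `bits_for` are hypothetical helpers of mine, not part of PyTorch or transformers. It only simulates low precision in float32, so it says nothing about real 4-bit kernels, and a small toy model stands in for BERT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization (assumed scheme, not a library API)."""
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 7 for 4 bits, 127 for 8 bits
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses q, backward treats rounding as identity.
    return x + (q - x).detach()


class FakeQuantLinear(nn.Linear):
    """Hypothetical nn.Linear that fake-quantizes its weights and inputs."""

    def __init__(self, in_features, out_features, bias=True, w_bits=4, a_bits=8):
        super().__init__(in_features, out_features, bias=bias)
        self.w_bits, self.a_bits = w_bits, a_bits

    def forward(self, x):
        w = fake_quantize(self.weight, self.w_bits)
        x = fake_quantize(x, self.a_bits)
        return F.linear(x, w, self.bias)


def swap_linears(module: nn.Module, bits_for, prefix: str = "") -> None:
    """Recursively replace every nn.Linear with a FakeQuantLinear.

    bits_for(qualified_name) returns the weight bit width for that layer,
    which is where a mixed-precision policy would go.
    """
    for name, child in module.named_children():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(child, nn.Linear):
            new = FakeQuantLinear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                w_bits=bits_for(full_name),
            )
            new.load_state_dict(child.state_dict())  # keep the pretrained weights
            setattr(module, name, new)
        else:
            swap_linears(child, bits_for, full_name)


# Toy stand-in for a BERT encoder: first layer at 4 bits, last layer at 8 bits.
toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
swap_linears(toy, lambda name: 4 if name == "0" else 8)

out = toy(torch.randn(2, 16))
out.sum().backward()  # gradients reach the float weights through the STE
```

If this is the right direction, the same `swap_linears` call could presumably be applied to a loaded BERT model, with `bits_for` keyed on the layer names reported by `named_modules()`. Whether this is the recommended way, or whether PyTorch's own quantization tooling can be configured for lower than 8 bits, is what I'm asking.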