I’ve been playing around with the experimental quantization features introduced in v1.3 and trying to apply them to a Transformer model.
Dynamic quantization works quite well, giving a 1.5-2.5x speedup depending on the configuration.
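For reference, here is roughly what I’m doing for dynamic quantization (a minimal sketch; the toy model and dimensions below are placeholders, my actual model differs):

```python
import torch
import torch.nn as nn

# Stand-in for my real Transformer (hypothetical sizes).
model = nn.Transformer(d_model=256, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)
model.eval()

# Dynamic quantization: nn.Linear weights are converted to int8,
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```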
As for static quantization, it seems some parts of the model are not compatible (LayerNorm, and some of the div/matmul ops in the multi-head attention). I’m wondering what the best way to tackle this would be.
- Can we tell static quantization to skip certain submodules, e.g. via qconfig (see the sketch after this list)?
- Is there a roadmap for implementing the missing operators and modules (matmul, div, nn.LayerNorm, etc.)?
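For the first question, this is a minimal sketch of what I have in mind; the toy module and its submodule names are just placeholders, and I’m assuming that setting `qconfig = None` on a submodule excludes it from static quantization:

```python
import torch
import torch.nn as nn
import torch.quantization as tq

# Toy stand-in for the real Transformer (hypothetical; module names are illustrative).
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # fp32 -> int8 boundary
        self.fc = nn.Linear(64, 64)
        self.dequant = tq.DeQuantStub()  # int8 -> fp32 boundary
        self.norm = nn.LayerNorm(64)

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return self.norm(x)              # LayerNorm stays in fp32

model = Block().eval()

# Whole-model qconfig for static (post-training) quantization.
model.qconfig = tq.get_default_qconfig('fbgemm')

# Hoped-for behaviour: qconfig = None on the unsupported submodule
# excludes it, so only the Linear ends up quantized.
model.norm.qconfig = None

tq.prepare(model, inplace=True)
model(torch.randn(8, 64))                # calibration pass with representative data
tq.convert(model, inplace=True)
```

If per-submodule qconfig is indeed the intended mechanism for this, a pointer to the relevant docs would be much appreciated.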