Best way to quantize Transformer architecture

Hi there,

I’ve been playing around with the experimental quantization features introduced in v1.3, and trying to apply them to a Transformer model.

Dynamic quantization works quite well, giving a speedup of 1.5-2.5x depending on the configuration.
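For reference, a minimal sketch of the dynamic quantization call being described. The small feed-forward model here is a hypothetical stand-in (not the actual Transformer from this post), and the sizes are arbitrary; only `torch.quantization.quantize_dynamic` itself is the library API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the feed-forward part of a Transformer layer.
model = nn.Sequential(
    nn.Linear(64, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
).eval()

# Replace every nn.Linear with a dynamically quantized version:
# weights are stored as int8, activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(8, 64)
with torch.no_grad():
    out = quantized(x)  # input and output stay regular float tensors
```

No calibration step is needed here, which is why this path tends to work on models where static quantization hits unsupported operators.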

As for static quantization, it seems some parts of the model are not compatible (LayerNorm, some div/matmul in the multi-head attention setup). I’m wondering what would be the best way of tackling such a case.

  • Can we specify some submodules to ignore via qconfig, for instance?
  • Is there a roadmap to implement the missing operators and modules (matmul, div, nn.LayerNorm, etc.)?


Yes, you can set qconfig = None on the corresponding module to skip quantizing it. If you only want to do this for some instances, you can set it in the parent module’s __init__. Also make sure the QuantStub and DeQuantStub are placed correctly at the boundaries between the quantized and float parts.

As for broader operator support, we are working on improving this. Contributions are also welcome.


I am using the standard out-of-the-box nn.Transformer modules (this was a choice I made when 1.2 was released: I did not need BERT, so I opted for the standard modules), like this:

layer = nn.TransformerEncoderLayer(d_model=size,
                                   dim_feedforward=size * decoder_girth,
                                   )  # remaining arguments (e.g. nhead) were cut off in the original post
self.decoder = nn.TransformerEncoder(layer, decoder_layers)

Now I have encountered this error. It is a bit cryptic, but is my assumption correct that the erroneous function should just be replaced with an nn.Module version? Or am I missing something?

Has anyone else run into this sort of error? It will probably be fixed in 1.5.0, but I need to monkey-patch it somehow in the meantime.