nn.MultiheadAttention fails after quantization

Hello all,
I'm trying to quantize nn.TransformerEncoder, but I get errors during inference.
The problem is with nn.MultiheadAttention, which is basically a set of nn.Linear operations and should therefore work fine after quantization.
Minimal example:

import torch

mlth = torch.nn.MultiheadAttention(512, 8)
possible_input = torch.rand((10, 10, 512))
# dynamic quantization with the default settings (swaps supported submodules such as nn.Linear)
quantized = torch.quantization.quantize_dynamic(mlth)
quantized(possible_input, possible_input, possible_input)

It fails with:

/opt/miniconda/lib/python3.7/site-packages/torch/nn/functional.py in multi_head_attention_forward(query, key, value, embed_dim_to_check, num_heads, in_proj_weight, in_proj_bias, bias_k, bias_v, add_zero_attn, dropout_p, out_proj_weight, out_proj_bias, training, key_padding_mask, need_weights, attn_mask, use_separate_proj_weight, q_proj_weight, k_proj_weight, v_proj_weight, static_k, static_v)
   3946     assert list(attn_output.size()) == [bsz * num_heads, tgt_len, head_dim]
   3947     attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
-> 3948     attn_output = linear(attn_output, out_proj_weight, out_proj_bias)
   3949 
   3950     if need_weights:

/opt/miniconda/lib/python3.7/site-packages/torch/nn/functional.py in linear(input, weight, bias)
   1610         ret = torch.addmm(bias, input, weight.t())
   1611     else:
-> 1612         output = input.matmul(weight.t())
   1613         if bias is not None:
   1614             output += bias

AttributeError: 'function' object has no attribute 't'

That's because `.weight` is no longer a Parameter but a method (for the components of the quantized module).

You can check it like this:

mlth.out_proj.weight
Parameter containing:
tensor([[-0.0280,  0.0016,  0.0163,  ...,  0.0375,  0.0153, -0.0435],
        [-0.0168,  0.0310, -0.0211,  ..., -0.0258,  0.0043, -0.0094],
        [ 0.0412, -0.0078,  0.0262,  ...,  0.0328,  0.0439,  0.0066],
        ...,
        [-0.0278,  0.0337,  0.0189,  ..., -0.0402,  0.0193, -0.0163],
        [ 0.0034, -0.0364, -0.0418,  ..., -0.0248, -0.0375, -0.0236],
        [-0.0312,  0.0236,  0.0404,  ...,  0.0266,  0.0255,  0.0265]],
       requires_grad=True)

while

quantized.out_proj.weight
<bound method Linear.weight of DynamicQuantizedLinear(in_features=512, out_features=512, qscheme=torch.per_tensor_affine)>
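
If I understand it right, the tensor itself can still be retrieved from the quantized submodule by calling the method instead of reading an attribute (quick check, assuming out_proj was swapped to DynamicQuantizedLinear as above):

w = quantized.out_proj.weight()  # calling the bound method returns the (quantized) weight tensor
print(w.shape)                   # torch.Size([512, 512])

So only code that reads .weight as an attribute, like functional.multi_head_attention_forward, breaks.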

Can you please guide me here? Is this expected behavior? Should I report it in the PyTorch GitHub issues?
It looks like quantization breaks every module that uses .weight internally.

Thanks in advance

hi @skurzhanskyi, I am able to run your example without issues on the nightly. What version of PyTorch are you using? Can you check if using a more recent version / a nightly build fixes your issue?

Hi @Vasiliy_Kuznetsov
Thanks for the reply. Indeed, with the nightly version there's no error. At the same time, nn.MultiheadAttention doesn't get compressed, even though it's just a set of Linear operations :frowning:
Is there any information on adding quantization support for this layer (or, for instance, nn.Embedding)?
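
Here is roughly how I checked, with a small helper I wrote (model_size_mb is not a PyTorch function, just a sketch that serializes the state dict to measure its size):

import io
import torch

def model_size_mb(model):
    # serialize the state dict into an in-memory buffer and report its size in MB
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

mlth = torch.nn.MultiheadAttention(512, 8)
quantized = torch.quantization.quantize_dynamic(mlth)

print(model_size_mb(mlth), model_size_mb(quantized))  # sizes are practically identical
print(quantized)  # no DynamicQuantizedLinear submodules show up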

yes, nn.MultiheadAttention is not yet supported in eager mode quantization. There are folks working on adding support for both this and embedding quantization.

Good to hear. Is there any public information on when it will be released (at least approximately)?

hi @skurzhanskyi, one other thing you could try is https://pytorch.org/blog/pytorch-1.6-released/#graph-mode-quantization , which we just released today in v1.6. It might be easier to make MultiheadAttention work in graph mode.
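
For reference, a minimal sketch of what that could look like, following the graph mode dynamic quantization tutorial (untested for nn.MultiheadAttention; the tracing step and the choice of qconfig are assumptions and may need adjustments):

import torch
from torch.quantization import per_channel_dynamic_qconfig, quantize_dynamic_jit

mha = torch.nn.MultiheadAttention(512, 8).eval()
x = torch.rand(10, 10, 512)

# graph mode quantization operates on TorchScript models
ts_model = torch.jit.trace(mha, (x, x, x))
quantized = quantize_dynamic_jit(ts_model, {'': per_channel_dynamic_qconfig})
attn_output, attn_weights = quantized(x, x, x)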

As far as first-class quantization for nn.MultiheadAttention and nn.EmbeddingBag / nn.Embedding goes, we don't have a specific timeline we can share, but it should be on the order of months (not weeks or years). We have folks actively working on this.

@Vasiliy_Kuznetsov thanks a lot for your answer