Static quantization for a Transformer block: AttributeError: 'function' object has no attribute 'is_cuda'

I’m trying to apply static quantization to a model that uses an nn.TransformerEncoderLayer.

But when running the model, I get the following error:

File "/envs/transfo/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 556, in <genexpr>
    elif not all((x.is_cuda or 'cpu' in str(x.device)) for x in tensor_args):

AttributeError: 'function' object has no attribute 'is_cuda'

The model is very basic: embeddings followed by a TransformerEncoderLayer, followed by a linear layer. But I can’t make it work…

Here is a Colab notebook reproducing the issue: Google Colab

Here is the script reproducing the issue:

import torch
from torch import nn
from torch.ao.quantization import qconfig


class Quantformer(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, dropout_rate, max_seq_len):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim

        self.quant = torch.ao.quantization.QuantStub()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.pos_embedding = nn.Embedding(max_seq_len, embedding_dim)
        self.transformer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,
            nhead=8,
            dim_feedforward=hidden_dim,
            dropout=dropout_rate,
            batch_first=True,
        )
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(embedding_dim, vocab_size)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, src):
        seq_len = src.size(1)
        batch_size = src.size(0)
        pos_ids = torch.arange(seq_len, dtype=src.dtype, device=src.device).unsqueeze(0).repeat(batch_size, 1)

        embeds = self.dropout(self.embedding(src)) + self.pos_embedding(pos_ids)
        embeds = self.quant(embeds)
        mask = nn.Transformer.generate_square_subsequent_mask(embeds.size(1), device=embeds.device)
        out = self.transformer(embeds, src_mask=mask)
        lm_logits = self.dropout(self.fc(out))

        lm_logits = self.dequant(lm_logits)
        return lm_logits


device = torch.device("cpu")
sq_model = Quantformer(
    vocab_size=10000,
    embedding_dim=128,
    hidden_dim=512,
    dropout_rate=0,
    max_seq_len=10,
).to(device)
sq_model.eval()

# Attach qconfigs: default static qconfig globally, weight-only qconfig for the embeddings
sq_model.qconfig = torch.ao.quantization.get_default_qconfig("qnnpack")
sq_model.embedding.qconfig = qconfig.float_qparams_weight_only_qconfig
sq_model.pos_embedding.qconfig = qconfig.float_qparams_weight_only_qconfig

# Insert observers
sq_model_prepared = torch.ao.quantization.prepare(sq_model)

# Calibrate with a dummy batch
x = torch.randint(3, 10000, (1, 10))
sq_model_prepared(x)

# Convert to quantized modules
squant_model = torch.ao.quantization.convert(sq_model_prepared)

# This call raises: AttributeError: 'function' object has no attribute 'is_cuda'
yy = squant_model(x)

Does it work without quantization? This doesn’t look like a quantization issue.

@jerryzh168 It’s definitely a problem with the quantization: the model works fine without it.

I added a code block to the Google Colab notebook that runs the model without quantization, so you can try it and see that it works.


The problem is here: https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/transformer.py#L648-L662

When running in normal mode, all of these attributes are tensors. But after quantization, some of them become getter methods (see here for example: https://github.com/pytorch/pytorch/blob/main/torch/ao/nn/quantized/modules/linear.py#L234).
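To make it concrete, here is a minimal snippet (not part of the notebook, just an illustration) comparing a float nn.Linear with the quantized Linear module used after convert():

import torch
from torch import nn
import torch.ao.nn.quantized as nnq

float_lin = nn.Linear(4, 4)
quant_lin = nnq.Linear(4, 4)  # the module type used for quantized linear layers

print(isinstance(float_lin.weight, torch.Tensor))  # True  -> .is_cuda works
print(isinstance(quant_lin.weight, torch.Tensor))  # False -> weight is a method here
print(callable(quant_lin.weight))                  # True  -> call quant_lin.weight() to get the tensor

Since the fast-path check in TransformerEncoderLayer.forward gathers self.linear1.weight, self.linear2.weight, etc. directly into tensor_args, it breaks as soon as those submodules have been converted.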

Oh I see, yeah this is expected I think. Eager mode quantization does not expect people to call into linear_module.weight directly; it only works when people just use the forward function of the linear, e.g. self.linear(x), and users also need to place QuantStub/DeQuantStub properly.

You’ll probably need to rewrite it into a format that just calls self.linear(x) instead of querying the weight directly, I think.
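To illustrate the pattern with a toy sketch (made-up module names, not a drop-in fix for TransformerEncoderLayer): a module whose forward only calls its linear keeps working after convert, while one that reads the weight directly breaks in the same way:

import torch
from torch import nn
import torch.ao.quantization as tq

class CallsSubmodule(nn.Module):
    # Fine after convert(): the Linear is only used through its own forward.
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.dequant = tq.DeQuantStub()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        return self.dequant(self.linear(self.quant(x)))

class ReadsWeight(nn.Module):
    # Breaks after convert(): self.linear.weight is then a method, not a tensor.
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.dequant = tq.DeQuantStub()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        scale = self.linear.weight.abs().max()  # AttributeError once converted
        return self.dequant(self.linear(self.quant(x))) * scale

After the usual prepare -> calibrate -> convert steps, the first module still runs, while the second hits the same kind of AttributeError, which is essentially what the fast-path check inside TransformerEncoderLayer does.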

Or maybe in the future we plan to support this in our new flow, and you may get it working out of the box.
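For reference, the new export-based flow currently looks roughly like this (just a sketch; the exact graph-capture API has been changing between versions, and I’m not confirming here that TransformerEncoderLayer or the embedding layers are supported out of the box yet):

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

example_inputs = (torch.randint(3, 10000, (1, 10)),)

# Capture a graph of the float model (the capture API depends on the PyTorch version)
captured = torch.export.export_for_training(sq_model, example_inputs).module()

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # calibration
quantized = convert_pt2e(prepared)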

You’ll probably need to rewrite it into a format that just calls self.linear(x) instead of querying the weight

But I’m not: I’m just calling the Transformer layer from standard PyTorch.
Do you mean I need to monkey-patch the Transformer implementation myself?

eager mode quantization does not expect people to call into linear_module.weight directly

I see… Since Transformer is part of the PyTorch library, I expected it to be quite straightforward to quantize.
What are the alternatives to eager mode quantization? I tried FX graph mode quantization, but it didn’t work (I can’t trace the Transformer model).

Has anyone tried to quantize the standard Transformer implementation?

Has anyone tried to quantize the standard Transformer implementation?

There have been some experiments/research projects doing this, but nothing production-ready yet. Maybe @HDCharles and @Chillee can give more context here.
