Dear PyTorch community,
I recently used PyTorch’s dynamic quantization to quantize the weights of a fine-tuned roberta-large-mnli model loaded from Hugging Face, following the steps in the PyTorch tutorial “(beta) Dynamic Quantization on BERT”.
Specifically, the quantization was applied exclusively to nn.Linear layers, with a dtype of qint8.
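For reference, this is roughly how I applied it (a minimal sketch; the checkpoint name reflects my setup, and `quantize_dynamic` is the call from the tutorial):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load the fine-tuned checkpoint and switch to eval mode
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()

# Dynamically quantize only the nn.Linear layers, with qint8 weights
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```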
However, I have run into a problem: the output logits of the quantized model appear to depend on the batch size. Notably, the logits for a single example differ depending on whether it is processed as part of a batch or on its own.
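A minimal way to reproduce what I’m seeing (the sentence pairs are just placeholders; `quantized_model` is the model from the snippet above):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")

pair_a = ("A man is playing guitar.", "A person is making music.")
pair_b = ("The sky is blue.", "It is raining.")

# Run the first pair on its own
single = tokenizer(pair_a[0], pair_a[1], return_tensors="pt")
with torch.no_grad():
    logits_single = quantized_model(**single).logits

# Run the same pair inside a batch of two, padded to a common length
batch = tokenizer(
    [pair_a[0], pair_b[0]],
    [pair_a[1], pair_b[1]],
    padding=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits_batch = quantized_model(**batch).logits

# I would expect these two rows to match, but they differ noticeably
print(logits_single[0])
print(logits_batch[0])
```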
I should also note that, for inference, I load the base model from Hugging Face, apply dynamic quantization, and then load the state dict containing the quantized weights.
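Concretely, that loading step looks like this (continuing from the first snippet above; `quantized_state_dict.pt` stands in for my saved file):

```python
# Restore the previously saved quantized weights into the
# freshly quantized architecture from the first snippet
state_dict = torch.load("quantized_state_dict.pt")
quantized_model.load_state_dict(state_dict)
quantized_model.eval()
```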
Is this behavior expected, or am I overlooking a crucial aspect of dynamic quantization?