Dynamic Quantization produces inconsistent outputs

Dear PyTorch community,

I recently used PyTorch’s dynamic quantization technique to quantize the weights of a fine-tuned roberta-large-mnli model loaded from Hugging Face. I followed the steps outlined in PyTorch’s example notebook ((beta) Dynamic Quantization on BERT — PyTorch Tutorials 2.2.0+cu121 documentation).

Specifically, the quantization was applied exclusively to nn.Linear layers, with a dtype of qint8.

However, a challenge has emerged: the output logits of the quantized model seem to be affected by the batch size. Notably, the output for a single example differs depending on whether the example is part of a batch or processed individually.
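A minimal sketch of how the comparison could be reproduced, assuming a small toy model in place of roberta-large-mnli (the layer sizes and the scaled-up outlier row are hypothetical, chosen only for illustration). It quantizes only the nn.Linear layers to qint8 and runs the same example once inside a batch and once on its own:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the fine-tuned model (hypothetical sizes, not roberta-large-mnli).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3)).eval()

# Dynamic quantization applied exclusively to nn.Linear layers, dtype qint8.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(8, 16)
x[5] *= 50.0  # assumption: one outlier example that widens the batch's value range

with torch.no_grad():
    batched = qmodel(x)[0]     # example 0 processed as part of the batch
    single = qmodel(x[:1])[0]  # example 0 processed individually

# Largest absolute difference between the two sets of logits for example 0.
max_diff = (batched - single).abs().max().item()
print(f"max |batched - single| for example 0: {max_diff:.6f}")
```

The size of the difference depends on the data and on the quantized backend in use, so the printed number is only indicative.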

I would like to highlight that, for inference, I loaded the base model from Hugging Face, applied dynamic quantization, and then loaded the state dict with the quantized weights.

Is this behavior something expected, or am I overlooking a crucial aspect of quantization?

Thank you.

Yeah, that’s expected. Basically, dynamic quantization looks at the tensor you give it, calculates the min and max over all of its elements, and uses that range to quantize the activations. Since the whole batch is one tensor, the range depends on what else is in the batch. As the range gets larger, each individual value is quantized less accurately. If you have outliers, this can affect things massively.
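To make the effect concrete, here is a rough sketch in plain Python of min/max-based 8-bit quantization. It is a simplified affine scheme written for illustration, not PyTorch's actual kernel, and the function name and sample values are made up:

```python
def quantize_dequantize(values, num_bits=8):
    """Quantize with one scale derived from min/max of `values`, then dequantize."""
    lo = min(min(values), 0.0)  # range is extended to include zero
    hi = max(max(values), 0.0)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero_point = round(-lo / scale)
    restored = []
    for v in values:
        q = round(v / scale) + zero_point
        q = max(0, min(levels, q))  # clamp to the representable integer range
        restored.append((q - zero_point) * scale)
    return restored, scale

example = [0.11, -0.42, 0.73]

# The example quantized on its own.
alone, scale_alone = quantize_dequantize(example)

# The same example quantized together with batch-mates that include an outlier.
in_batch, scale_batch = quantize_dequantize(example + [0.2, 40.0, -0.3])

err_alone = max(abs(a - b) for a, b in zip(example, alone))
err_batch = max(abs(a - b) for a, b in zip(example, in_batch))
print("scale alone:", scale_alone, "max error:", err_alone)
print("scale in batch:", scale_batch, "max error:", err_batch)
```

The outlier stretches the range, so the step size (scale) grows and the same three values get rounded more coarsely when they share a tensor with it.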

You can get around this with static quantization to a point, but then you don’t handle non-static distributions as well.