Extremely bad LSTM Static Quantization performance compared to Dynamic

Hello,

I’m experimenting with different quantization techniques on my LSTM-based speech model. Dynamic quantization on the LSTM works great out-of-the-box with minimal degradation in performance:

model = torch.ao.quantization.quantize_dynamic(
    model,  # the original model
    {torch.nn.LSTM},  # a set of layers to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights

but I never managed to make static PTQ or QAT work. They always produce extremely bad results.

For my tests, I was only quantizing the LSTM part of my model. The part of the code in question is the following:

norm_mix = self.layer_norm(mix)       # layer norm in front of the LSTM
norm_mix = self.quant_norm(norm_mix)  # QuantStub
output, _ = self.rnn(norm_mix)        # LSTM
output = self.dequant_out(output)     # DeQuantStub
..................
model.rnn.qconfig = torch.ao.quantization.default_qconfig
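
For completeness, the rest of my static PTQ flow is the standard eager-mode prepare/calibrate/convert sequence, roughly like this (simplified sketch; calib_loader is a placeholder for my calibration data):

import torch

model.eval()
# the LSTM and the stubs around it need a qconfig so that prepare/convert
# pick them up; everything else stays in float
model.rnn.qconfig = torch.ao.quantization.default_qconfig
model.quant_norm.qconfig = torch.ao.quantization.default_qconfig
model.dequant_out.qconfig = torch.ao.quantization.default_qconfig

prepared = torch.ao.quantization.prepare(model)

# calibration: run some batches so the observers can record activation ranges
with torch.no_grad():
    for mix in calib_loader:
        prepared(mix)

quantized = torch.ao.quantization.convert(prepared)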

After calibration (I’ve tested with both small and large datasets), I noticed while debugging that the output of the LSTM layer is nearly identical across different batches. This is probably because the QuantStub that precedes it outputs a small, concentrated distribution of values, for example:

norm_mix.shape
torch.Size([7, 2249, 500])

torch.int_repr(norm_mix).unique(return_counts=True)
(tensor([ 10,  16,  21,  23,  25,  26,  28,  29,  30,  31,  32,  33,  34,  35,
         36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,
         50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
         64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
         78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
         92,  93,  94,  95,  96,  97,  99, 101, 102, 103, 107, 108, 112],
       dtype=torch.uint8), tensor([      1,       1,       1,       1,       2,       4,       3,       6,
              9,      10,      18,      21,      35,      43,      56,      91,
            148,    4762,  319083, 1906730, 4440142,  456188,  238252,  149636,
         101316,   70079,   48860,   35018,   25053,   18482,   13596,   10197,
           7699,    5689,    4353,    3216,    2544,    1864,    1598,    1198,
            940,     748,     591,     522,     398,     354,     301,     254,
            197,     196,     155,     122,     102,      92,      71,      60,
             64,      57,      31,      48,      27,      26,      22,      16,
             19,       9,      12,      11,      10,       9,       2,       1,
              3,       5,       3,       3,       2,       3,       2,       2,
              2,       1,       2]))

I’m aware that RNNs are better off with dynamic quantization, but is what I’m experiencing normal? I wanted to try per_channel quantization but it doesn’t seem to be supported for LSTMs and I’m not sure if it’s logical/possible to apply it to the QuantStub individually.
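
For reference, this is roughly the kind of per-channel qconfig I mean (just a sketch of the idea, not something I’ve gotten to work on the LSTM):

import torch
from torch.ao.quantization import QConfig, MinMaxObserver, PerChannelMinMaxObserver

# per-tensor activations, per-channel weights
per_channel_qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric),
)
model.rnn.qconfig = per_channel_qconfig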

Any help is very much appreciated. Also, if someone has additional resources on quantization, I would appreciate those too: the torch docs are great, but I feel like there is much more to the subject, since the API seems much richer than what is shown there.

Thanks.

Static and dynamic quantization primarily differ in how they quantize the input tensor.

Static quantization picks a single range of values to focus on, and it learns this range during calibration. If your model has inputs with values clustered in some region but with occasional outliers, say 95% of elements within [-1, 1] and 5% of elements around 100-1000, then static quantization works well because it chooses to focus on the range covering the 95% and ignores the 5% (the outliers just get treated the same as the top of the range, i.e. like 1 in this example).

But if one input has elements from -1000 to 1000 and another input has elements from -1 to 1, static quantization can only pick a single range, so it picks -1000 to 1000. Then, when it hits the -1 to 1 input, it performs poorly because all of those values get quantized to the same integer; it lacks the fidelity to distinguish -1 from -0.9.

Dynamic quantization is the opposite: it doesn’t use calibration, it just looks at the input at run time and says “ok, all values are between A and B, that’s the range I’ll use”. This means that if the range of your inputs is constantly changing, it will work great; but if you have a few outliers that massively affect the range, it won’t know to ignore them, so it will work poorly.
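
Here is a toy illustration of that difference (made-up numbers, just showing the mechanics with torch.quantize_per_tensor):

import torch

x = torch.linspace(-1, 1, 5)  # an input whose values all sit in [-1, 1]

# "static": a single scale chosen at calibration time, here sized to cover a
# worst-case input that ranged from -1000 to 1000
static_scale = 2000 / 255
q_static = torch.quantize_per_tensor(x, static_scale, 128, torch.quint8)
print(q_static.dequantize())  # every element collapses to 0, the structure is lost

# "dynamic": scale computed from this particular input at run time
dyn_scale = (x.max() - x.min()).item() / 255
q_dyn = torch.quantize_per_tensor(x, dyn_scale, 128, torch.quint8)
print(q_dyn.dequantize())  # roughly [-1, -0.5, 0, 0.5, 1], the structure survives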

LSTMs (and speech in particular) tend to work well with dynamic quantization because, in my experience, you often see large shifts in the distribution of the input data. If one input looks like
-10, -1, 0, 1, 10 and another looks like -100, -10, 0, 10, 100, those are the same input just scaled; static quantization has a hard time handling that, dynamic does not. In that case you would see the outputs of the quant stub looking the same, but the scale would differ. Your analysis above didn’t show the scales/zero points, so I can’t be sure, but that would be my guess.
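
A quick way to check is to print the quantization params next to the int values, e.g. (assuming norm_mix is the quantized tensor from your debug session):

# affine quantization parameters of the QuantStub output
print(norm_mix.q_scale(), norm_mix.q_zero_point())
print(torch.int_repr(norm_mix).unique(return_counts=True))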

Secondly, LSTMs have a lot of non-quantized ops in them. Static quantization is designed for sequences of quantized ops, like linear -> relu -> linear -> relu, all quantized. Dynamic quantization is designed for one-off quantized ops. The speedup advantage of static over dynamic is therefore fairly small for LSTMs.

QAT works best when static quantization works best, so if you don’t have static distributions, static + QAT probably isn’t going to work great either. You can do QAT + dynamic quantization, though it’s rather involved and not really supported.

Most of the information about quantization from our team can be found at Quantization — PyTorch 2.1 documentation (CPU focus),

though we’re also developing GPU-focused quantization in GitHub - pytorch-labs/ao: The torchao repository contains api's and workflows for quantization and pruning gpu models (also dynamic quantization).


Thank you for your input, your explanation was super clear and helpful!

The thing is, the input audio waves are normalized, and additionally I’m applying Layer Normalization before the LSTM. So by my logic, there shouldn’t be a large shift in the data distribution reaching the LSTM. Is that a valid analysis, or am I missing your point?

The quant stub has the following params: scale=0.15195775032043457 zero_point=42

I see!

I’m not sure; we were looking at Fourier-transformed signals and saw a lot of variation when I was working on an audio processing model.

Yeah, it’s possible that would help make things more static, but it’s not super clear. You’d have to analyze the actual data stream as if you were applying dynamic quantization to it.
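
Something like this sketch would do it (model.layer_norm and calib_loader are just placeholders for whatever actually feeds your LSTM):

import torch

ranges = []

def record_range(module, inputs, output):
    # record the per-batch min/max of whatever feeds the LSTM
    x = output.detach()
    ranges.append((x.min().item(), x.max().item()))

handle = model.layer_norm.register_forward_hook(record_range)

with torch.no_grad():
    for mix in calib_loader:
        model(mix)

handle.remove()

# if these (min, max) pairs move around a lot from batch to batch, no single
# static scale can serve all of them and dynamic quantization will win
for lo, hi in ranges:
    print(f"min={lo:.3f}  max={hi:.3f}  ->  dynamic scale ~ {(hi - lo) / 255:.5f}")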