Hello,
I’m experimenting with different quantization techniques on my LSTM-based speech model. Dynamic quantization on the LSTM works great out of the box, with minimal degradation in performance:
model = torch.ao.quantization.quantize_dynamic(
    model,              # the original model
    {torch.nn.LSTM},    # a set of layer types to dynamically quantize
    dtype=torch.qint8)  # the target dtype for quantized weights
but I never managed to get static PTQ or QAT to work; both consistently produce extremely poor results.
For my tests, I was only quantizing the LSTM part of my model. The relevant part of the code is the following:
norm_mix = self.layer_norm(mix)       # LayerNorm in float
norm_mix = self.quant_norm(norm_mix)  # QuantStub: float -> quint8
output, _ = self.rnn(norm_mix)        # the LSTM being quantized
output = self.dequant_out(output)     # DeQuantStub: quint8 -> float
..................
model.rnn.qconfig = torch.ao.quantization.default_qconfig
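
For completeness, after setting the qconfig I follow the usual eager-mode prepare → calibrate → convert steps. The sketch below is simplified; calibration_loader is a stand-in for my real calibration dataset:

model.eval()
torch.ao.quantization.prepare(model, inplace=True)   # insert observers

with torch.no_grad():                                # calibration pass
    for mix in calibration_loader:
        model(mix)

torch.ao.quantization.convert(model, inplace=True)   # swap in quantized modules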
After calibration (I’ve tested with both small and large calibration sets), I noticed while debugging that the output of the LSTM layer is nearly identical across different batches. This is probably because the QuantStub that precedes it outputs a small, highly concentrated distribution of values, for example:
norm_mix.shape
torch.Size([7, 2249, 500])
torch.int_repr(norm_mix).unique(return_counts=True)
(tensor([ 10, 16, 21, 23, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
92, 93, 94, 95, 96, 97, 99, 101, 102, 103, 107, 108, 112],
dtype=torch.uint8), tensor([ 1, 1, 1, 1, 2, 4, 3, 6,
9, 10, 18, 21, 35, 43, 56, 91,
148, 4762, 319083, 1906730, 4440142, 456188, 238252, 149636,
101316, 70079, 48860, 35018, 25053, 18482, 13596, 10197,
7699, 5689, 4353, 3216, 2544, 1864, 1598, 1198,
940, 748, 591, 522, 398, 354, 301, 254,
197, 196, 155, 122, 102, 92, 71, 60,
64, 57, 31, 48, 27, 26, 22, 16,
19, 9, 12, 11, 10, 9, 2, 1,
3, 5, 3, 3, 2, 3, 2, 2,
2, 1, 2]))
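
To put a number on that concentration: the three most frequent integer values (41, 42, and 43) cover roughly 86% of all 7 × 2249 × 500 ≈ 7.9M elements:

vals, counts = torch.int_repr(norm_mix).unique(return_counts=True)
print(counts.topk(3).values.sum().item() / counts.sum().item())  # ~0.86 in my run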
I’m aware that RNNs are generally better served by dynamic quantization, but is what I’m experiencing normal? I wanted to try per-channel quantization, but it doesn’t seem to be supported for LSTMs, and I’m not sure whether it’s logical/possible to apply it to the QuantStub individually.
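
Roughly what I tried looked like the following (the observer choices are my own guesses; as far as I understand, per-channel schemes only apply to the weight observer, so the activation/QuantStub side stays per-tensor regardless):

from torch.ao.quantization import QConfig
from torch.ao.quantization.observer import MinMaxObserver, PerChannelMinMaxObserver

per_channel_qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8),  # per-tensor activations
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8,
        qscheme=torch.per_channel_symmetric))                 # per-channel weights
model.rnn.qconfig = per_channel_qconfig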
Any help is very much appreciated. And if anyone has additional resources on quantization, I’d be grateful for those too: the torch docs are great, but the API seems much richer than what they show, so I suspect there’s a lot more to the subject.
Thanks.