Right way to insert QuantStub and DeQuantStub in eager mode quantization

I notice that many examples of eager mode quantization insert QuantStub and DeQuantStub only once, at the beginning and end of the whole network, like this tutorial on quantization aware training and the torchvision resnet model. What is the difference between inserting QuantStub/DeQuantStub only once for the whole model and inserting a pair of QuantStub/DeQuantStub for each submodule? Which way is better for eager mode quantization? Thank you for your help!

eager mode is a bit deprecated, please check out our new flow: Quantization — PyTorch main documentation

and new repo focusing on GPU and LLM quantization: GitHub - pytorch/ao: PyTorch native quantization and sparsity for training and inference

Thank you for pointing me to the new quantization flow. For some reason, I have to stick with an older version of pytorch (1.13). Eager mode would be a good temporary solution for my problem for now. Any suggestions on my question about QuantStub/DeQuantStub? Thank you!

OK, so I assume inserting qstub/dqstub into the whole model means you insert a QuantStub for the input and a DeQuantStub for the output, right? That means all the operators and modules in the model should be quantized for this to work. This is typically more performant.
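
To make that concrete, here is a minimal sketch of whole-model stub placement for eager mode static quantization (the module names and shapes are just illustrative, and I'm assuming the fbgemm backend; adapt to your own model):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class WholeModelQuant(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()       # quantize the float input once
        self.conv = nn.Conv2d(3, 16, 3)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()   # dequantize the output once

    def forward(self, x):
        x = self.quant(x)              # everything after this runs in int8 after convert()
        x = self.relu(self.conv(x))
        return self.dequant(x)

m = WholeModelQuant().eval()
m.qconfig = get_default_qconfig("fbgemm")   # x86 backend; use "qnnpack" on ARM
prepared = prepare(m)                        # insert observers
prepared(torch.randn(1, 3, 32, 32))          # calibration pass
quantized = convert(prepared)                # swap in quantized modules
```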

Inserting them for each submodule means a submodule can stay unquantized, e.g.
QuantStub → DeQuantStub → conv → QuantStub → DeQuantStub. This will be slower than the original model due to the extra quantize/dequantize ops, but it can be used to simulate the accuracy impact of quantization when no real quantized operators are implemented.
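
A rough sketch of that per-submodule pattern (again, names are illustrative and this assumes the fbgemm backend): the middle conv gets `qconfig = None`, stays in float, and ends up sandwiched between a DeQuantStub and a QuantStub.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class MixedQuant(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant1 = QuantStub()
        self.conv1 = nn.Conv2d(3, 8, 3)        # quantized
        self.dequant1 = DeQuantStub()
        self.conv_float = nn.Conv2d(8, 8, 3)   # stays float
        self.quant2 = QuantStub()
        self.conv2 = nn.Conv2d(8, 8, 3)        # quantized
        self.dequant2 = DeQuantStub()

    def forward(self, x):
        x = self.dequant1(self.conv1(self.quant1(x)))
        x = self.conv_float(x)                 # runs in fp32; extra q/dq cost around it
        return self.dequant2(self.conv2(self.quant2(x)))

m = MixedQuant().eval()
m.qconfig = get_default_qconfig("fbgemm")
m.conv_float.qconfig = None                    # opt this submodule out of quantization
prepared = prepare(m)
prepared(torch.randn(1, 3, 32, 32))            # calibration pass
quantized = convert(prepared)
```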

I assume inserting qstub/dqstub to the whole model means you insert quantstub for input and dequantstub for output right

Thank you for your explanation and sorry for my late reply! And yes, inserting qstub/dqstub into the whole model means inserting a quantstub for the input and a dequantstub for the output.

I have a follow-up question. When quantizing a model and its activations to int8, we often have to figure out two parameters: the scale and the zero point. These two parameters can be set differently for different layers (e.g. different scales and zero points for the output of each conv-bn-relu). In eager mode, are these two parameters computed where I insert QuantStub/DeQuantStub, or does pytorch decide automatically where different scales/zero points should be estimated (e.g. after each conv-bn-relu)? In other words, if I only insert a quantstub for the input and a dequantstub for the output, will all layers between input and output have their own scales and zero points estimated, or do I have to control where new scales/zero points are introduced using QuantStub/DeQuantStub? Thank you for your help!

does pytorch decide automatically where different scales/zero points should be estimated (e.g. after each conv-bn-relu)?

pytorch decides scale/zero_point based on the qconfig that's attached to the module. conv-bn-relu fusion is not handled automatically in eager mode; it's handled by manually calling the fuse_modules API: pytorch/torch/ao/quantization/fuse_modules.py at 77407b38a994c4a4de14b2253d81d2fed17cc36d · pytorch/pytorch · GitHub
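
For example, a hedged sketch of manual conv-bn-relu fusion before prepare (the module names "conv", "bn", "relu" are placeholders for your own model): the fused ConvBnReLU block then gets a single observer, i.e. one scale/zero_point at its output.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qconfig, prepare, convert, fuse_modules,
)

class ConvBnRelu(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = nn.Conv2d(3, 16, 3)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.bn(self.conv(self.quant(x)))))

m = ConvBnRelu().eval()
# Fusion must be requested explicitly in eager mode, before prepare().
m_fused = fuse_modules(m, [["conv", "bn", "relu"]])
m_fused.qconfig = get_default_qconfig("fbgemm")
prepared = prepare(m_fused)
prepared(torch.randn(1, 3, 32, 32))   # calibration pass
quantized = convert(prepared)
```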

all of these are done automatically in later versions of quantization

I’d suggest you just look at the code of prepare: pytorch/torch/ao/quantization/quantize.py at 77407b38a994c4a4de14b2253d81d2fed17cc36d · pytorch/pytorch · GitHub and convert: pytorch/torch/ao/quantization/quantize.py at 77407b38a994c4a4de14b2253d81d2fed17cc36d · pytorch/pytorch · GitHub; it’s relatively straightforward to understand what they’re doing. We don’t have a very good doc that walks through what happens during eager mode quantization, I think.
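
One quick way to see this for yourself is to run prepare/convert on a tiny model and print each module's quantization parameters afterwards: each quantized module ends up with its own scale/zero_point, even with only one QuantStub/DeQuantStub pair at the model boundary. A minimal sketch, assuming the fbgemm backend and illustrative module names:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant, self.dequant = QuantStub(), DeQuantStub()
        self.conv1 = nn.Conv2d(3, 8, 3)
        self.conv2 = nn.Conv2d(8, 8, 3)

    def forward(self, x):
        return self.dequant(self.conv2(self.conv1(self.quant(x))))

m = Tiny().eval()
m.qconfig = get_default_qconfig("fbgemm")
prepared = prepare(m)                  # observers inserted per module based on qconfig
prepared(torch.randn(1, 3, 32, 32))    # calibration pass
q = convert(prepared)

# Each converted module carries its own output scale/zero_point.
for name, mod in q.named_modules():
    if hasattr(mod, "scale") and hasattr(mod, "zero_point"):
        print(name, float(mod.scale), int(mod.zero_point))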
