TorchScript with dynamic quantization produces inconsistent model outputs in Python and Java

Hello, I’ve been experimenting with TorchScript and dynamic quantization, and I often run into the issue that the outputs of dynamically quantized models are not consistent between Python and Java.
To reproduce the issue, I created a fork of the PyTorch java-demo: GitHub - westphal-jan/java-demo.

To set up, you need to download libtorch and set its location in build.gradle (line 16).
Download Link: https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-1.13.1%2Bcpu.zip

I created a simple dummy model with one linear layer and exported it both unquantized and quantized here: create_dummy_models.py
(The code can be run using the dependencies defined in requirements.txt, but I also committed the dummy models.)
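
For reference, the export script is roughly along these lines (an illustrative sketch only; the actual code is in create_dummy_models.py in the fork, and the layer sizes and file names here are just placeholders):

import torch

class DummyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

model = DummyModel().eval()
example = torch.ones(1, 10)

# Export the unquantized model as TorchScript.
torch.jit.save(torch.jit.trace(model, example), "dummy.pt")

# Dynamically quantize the Linear layer and export that as well.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.jit.save(torch.jit.trace(quantized, example), "dummy_quantized.pt")

print(model(example).tolist())
print(quantized(example).tolist())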

Python:

Unquantized model:
[[-2.758167028427124, 2.0038578510284424, -4.114053726196289, -1.2928203344345093, 1.4940322637557983]]
Quantized model:
[[-2.747678756713867, 1.9912285804748535, -4.110795021057129, -1.2891944646835327, 1.4982664585113525]]

You can run the java code with ./gradlew run.
Java:

Unquantized model:
data: [-2.758167, 2.0038579, -4.1140537, -1.2928203, 1.4940323]
[W qlinear_dynamic.cpp:239] Warning: Currently, qnnpack incorrectly ignores reduce_range when it is set to true; this may change in a future release. (function apply_dynamic_impl)
Quantized model:
data: [-2.7473624, 1.9966378, -4.110954, -1.283469, 1.4918814]

As you can see, the output of the unquantized model is perfectly consistent, while the output of the dynamically quantized model differs slightly. It might seem insignificant, but with larger models like a transformer the differences become more obvious (usually already in the first decimal place). Am I misunderstanding something conceptually?

I thought that, since the code is ultimately executed by the same libtorch C++ backend and both examples run on the same architecture (CPU, x86_64), it should produce the same output even when using dynamic quantization (the activations are quantized on the fly, but that should still be deterministic).

Note: I made sure that Python and Java use the same Torch version, 1.13.1, which is the latest version published on Maven (mvnrepository → org.pytorch/pytorch_java_only).

I’m wondering if the quantized engine is set to the same backend in both cases. In Python you can print it with print(torch.backends.quantized.engine); in Java I’m not exactly sure how to do that, but maybe you can look for similar APIs.
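
Something like this on the Python side (just a sketch; I’m not sure what the Java-side equivalent is):

import torch

# List the engines this build supports and the one currently active.
print(torch.backends.quantized.supported_engines)
print(torch.backends.quantized.engine)

# You can also switch the engine before running the quantized model, e.g.:
# torch.backends.quantized.engine = "fbgemm"  # or "qnnpack"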

Very interesting idea. I checked and set the quantization engine in Python to qnnpack:
torch.backends.quantized.engine = "qnnpack"
In this case, I get the same output as in Java. Even the warning message appears, which was already hinting at qnnpack.

Unquantized model:
[[-2.758167028427124, 2.0038578510284424, -4.114053726196289, -1.2928203344345093, 1.4940322637557983]]
Quantized model:
<repo-location>/venv/lib/python3.8/site-packages/torch/nn/modules/module.py:1194: UserWarning: Currently, qnnpack incorrectly ignores reduce_range when it is set to true; this may change in a future release. (Triggered internally at ../aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp:239.)
  return forward_call(*input, **kwargs)
[[-2.7473623752593994, 1.9966378211975098, -4.1109538078308105, -1.2834689617156982, 1.4918813705444336]]

Therefore, it seems that Java uses qnnpack as the default quantization engine. Do you know why this is the case for Java even though it’s running on x86_64? Is it because of libtorch?

Or maybe you can help me with a different question. Why do these two backends produce different outputs at all?

Not exactly sure about this, but I think it is related to compilation flags, etc.

The reason the two backends produce different outputs is that in fbgemm, linear/conv are implemented with a special instruction that only works for 7-bit activations (https://github.com/pytorch/pytorch/blob/main/test/quantization/core/test_quantized_op.py#L44). If you have 8-bit activations, it will overflow, while qnnpack does not have this restriction.
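
To make the overflow concrete, here is my understanding in plain numbers (assuming the fbgemm kernel sums pairs of uint8 activation × int8 weight products into a signed 16-bit accumulator):

# Rough sketch of why fbgemm reduces activations to 7 bits: two adjacent
# uint8 * int8 products are accumulated into a signed 16-bit register.
INT16_MAX = 2 ** 15 - 1  # 32767

worst_case_8bit = 2 * (255 * 127)  # full 8-bit activations -> 64770, overflows int16
worst_case_7bit = 2 * (127 * 127)  # 7-bit activations (reduce_range) -> 32258, fits

print(worst_case_8bit > INT16_MAX)  # True
print(worst_case_7bit > INT16_MAX)  # False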

Thank you for the clarification. Is there a way to get the same behavior under both backends? I would prefer consistent outputs and would be willing to lose a little bit of accuracy. Otherwise it is a hassle to keep track of model outputs for both backends.

You could just set the activation for qnnpack to 7 bits to get them to match on that part; it’s unclear if they’ll match outputs exactly, but that would align them as much as possible. Generally you would accomplish this by using the fbgemm qconfigs with qnnpack for the ops you wish to align.

From my experiments so far, it is not possible to set the activation for qnnpack to 7 bits to match fbgemm using dynamic qconfigs. First, you would use the QConfig for the respective backend.
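
Roughly, I mean two QConfigs that differ only in reduce_range on the activation observer. This is an illustrative sketch; the observer type here is a placeholder and not necessarily the exact backend default:

import torch
from torch.ao.quantization import QConfig, MinMaxObserver, default_weight_observer

# fbgemm-style: activations reduced to 7 bits (reduce_range=True).
fbgemm_like_qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8, reduce_range=True),
    weight=default_weight_observer,
)

# qnnpack-style: full 8-bit activations (reduce_range=False).
qnnpack_like_qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8, reduce_range=False),
    weight=default_weight_observer,
)

print(fbgemm_like_qconfig)
print(qnnpack_like_qconfig)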

The only difference is whether reduce_range is set to true or false for the activations, which makes sense given what we discussed so far. However, what matters is how this parameter is used during inference by the respective backend (i.e., the backend that is active when the model is loaded). Unfortunately, qnnpack ignores reduce_range, which makes it impossible to match the behavior of fbgemm.

I see. I’m not sure what the status is for qnnpack, whether it’s in maintenance mode or it’s still possible to add new support. cc @kimishpatel @digantdesai