FBGEMM with PyTorch Mobile

Is it possible to run a model with the fbgemm qconfig on mobile, or is it x86-only? Simply plugging such a model into the demo app triggers a QNNPACK assert here: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp#L223
It seems that FBGEMM support was disabled by this commit, though the reason is unclear: https://github.com/pytorch/pytorch/commit/6fead9afd4cdc6306fb0e2180ca625160b59ea71
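As a quick sanity check, you can ask a given PyTorch build which quantization backends it was compiled with (an illustrative snippet; the exact list depends on the build -- x86 server builds typically include fbgemm, mobile builds ship qnnpack):

```python
import torch

# Engines compiled into the current PyTorch build, e.g. ['none', 'fbgemm', 'qnnpack']
print(torch.backends.quantized.supported_engines)

# The engine currently selected for running quantized ops
print(torch.backends.quantized.engine)
```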

I wasn’t able to get good results from a QNNPACK-compatible per-tensor quantization qconfig. Target metric value relative to the fp32 model:
get_default_qconfig('fbgemm') -> 99.8%
get_default_qconfig('qnnpack') -> 58.5%
default_qconfig -> 54.4%
Is there any way to reduce that gap without changing architecture?

FBGEMM is supported only on x86. You can get very good accuracy with qnnpack as well.
Please make sure that when you set:

qconfig = torch.quantization.get_default_qconfig('qnnpack')

You also do:

torch.backends.quantized.engine = 'qnnpack'

before running the model.
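Putting the two settings together, the eager-mode flow looks roughly like this (a minimal sketch using a toy model, not the poster's architecture):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    # Toy model for illustration only
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

# Both the qconfig AND the engine must be set to qnnpack.
torch.backends.quantized.engine = 'qnnpack'
model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')

# Fuse conv+relu, insert observers, calibrate, then convert.
model = torch.quantization.fuse_modules(model, [['conv', 'relu']])
model = torch.quantization.prepare(model)
model(torch.randn(1, 3, 32, 32))        # calibration pass (use real data)
qmodel = torch.quantization.convert(model)

out = qmodel(torch.randn(1, 3, 32, 32))
print(out.shape)
```

Calibration here uses random data purely to keep the sketch self-contained; in practice you would run representative inputs through the prepared model.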
The poorer accuracy numbers are likely due to a known issue where FBGEMM saturates for large weight/activation values:

Thanks. With the engine set, preparation, calibration, and conversion of the model work fine, but evaluation triggers errors like: Error in QNNPACK: failed to create convolution with 0.1966128 input scale, 1.698165 kernel scale, and 0.2075303 output scale: convolution scale 1.608829 is greater or equal to 1.0. The cause seems to be an SE block implemented via a 1x1 convolution that receives a 1x1 input. I probably should have used Linear anyway, but maybe this will be useful to someone.
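For reference, a sketch of that workaround: in an SE-style block, the input after global pooling is 1x1, so the 1x1 convolutions are mathematically equivalent to Linear layers, and using nn.Linear sidesteps QNNPACK's constraint that input_scale * kernel_scale / output_scale be less than 1.0 for convolutions. The block below is illustrative, not the poster's actual model:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Hypothetical squeeze-and-excitation block using Linear layers
    # in place of 1x1 convolutions after global average pooling.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(channels, channels // reduction)  # was Conv2d(c, c//r, 1)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(channels // reduction, channels)  # was Conv2d(c//r, c, 1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        s = self.pool(x).flatten(1)                 # N x C
        s = self.gate(self.fc2(self.relu(self.fc1(s))))
        return x * s.view(x.size(0), -1, 1, 1)      # channel-wise rescaling

se = SEBlock(16)
out = se(torch.randn(2, 16, 8, 8))
print(out.shape)
```

Note that in a fully quantized model the elementwise multiply would also need to go through torch.nn.quantized.FloatFunctional; this fp32 sketch only shows the Conv2d-to-Linear substitution.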

OK, I’ve managed to get good results from QNNPACK. Maybe torch.backends.quantized.engine should be mentioned somewhere on the quantization page?

Great that this worked! We will make sure to mention this on our quantization page. Thanks for the suggestion! cc @raghuramank100