Poor performance of dynamic quantization

Hi all,

I read about dynamic quantization and wanted to replicate it in our environment. I actually got pretty good results… when using a single thread. However, we have a multithreaded gRPC application, and I did not see any improvement when using multiple threads. In fact, with 10 threads the original (non-quantized) model actually had better throughput, although it was also using about 5x the CPU.
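For reference, a minimal sketch of the kind of benchmark I ran (the model here is a placeholder `nn.Linear`, not our real network, and the sizes and iteration counts are arbitrary; `torch.set_num_threads(1)` keeps intra-op parallelism from masking the Python-level threading effect):

```python
import threading
import time

import torch
import torch.nn as nn

torch.set_num_threads(1)  # isolate Python-level threading from intra-op parallelism

# Placeholder model: dynamic quantization targets nn.Linear (and nn.LSTM) modules.
float_model = nn.Linear(1024, 1024).eval()
quant_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

def wall_time(model, n_threads, iters=20):
    """Wall-clock time for n_threads threads, each running `iters` forward passes."""
    x = torch.randn(64, 1024)

    def worker():
        with torch.no_grad():
            for _ in range(iters):
                model(x)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

t1 = wall_time(quant_model, 1)
t4 = wall_time(quant_model, 4)
# If the quantized op releases the GIL and is thread-safe, t4 should be well
# under 4 * t1 on a multi-core machine; t4 close to 4 * t1 suggests serialization.
print(f"1 thread: {t1:.3f}s   4 threads: {t4:.3f}s   ratio: {t4 / t1:.2f}")
```

In my runs the quantized model's throughput barely moved as I added threads, while the float model's scaled much better.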

So I started to wonder if locking is to blame, specifically Python’s GIL. Profiling our code revealed that it spends the vast majority of its time in the linear_dynamic function, which is defined here: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp

I noticed that code in python_torch_functions.cpp makes an explicit call to release the GIL:

pybind11::gil_scoped_release no_gil;

But I don’t see the same call in qlinear_dynamic.cpp.

I tried to add this myself, but I am having difficulty compiling torch from source; it’s much more complicated than the documentation makes it out to be. I also don’t have enough information to determine whether the apply_dynamic_impl() method is thread-safe. Can any PyTorch engineers comment? There is potentially a massive performance bottleneck here.
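In case it helps anyone reproduce this, here is the rough probe I used to test the GIL hypothesis without recompiling torch (a sketch, not definitive: a single long C call that holds the GIL for its whole duration starves a pure-Python background thread, while a call that releases the GIL does not). The same harness can wrap a forward pass through the quantized model on a large input:

```python
import threading
import time

def python_progress_during(fn):
    """Count pure-Python iterations a background thread completes while
    fn() runs. A long C call that holds the GIL for its entire duration
    starves the counter; one that releases the GIL lets it run freely."""
    count = 0
    done = threading.Event()
    started = threading.Event()

    def spin():
        nonlocal count
        started.set()
        while not done.is_set():
            count += 1

    t = threading.Thread(target=spin)
    t.start()
    started.wait()
    fn()  # the call under test, e.g. a forward pass on a large input
    done.set()
    t.join()
    return count

# time.sleep is known to release the GIL, so the counter should run freely here.
probe = python_progress_during(lambda: time.sleep(0.05))
print(f"background iterations while sleeping: {probe}")
```

A near-zero count for a long-running op would point at the GIL; a large count would suggest the serialization is happening somewhere else (e.g. an internal lock).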

thanks,

Eugene