Purpose of the various quantized/quantizable/intrinsic modules

PyTorch has evolved to include many different quantization-related modules that serve different workflows.

Can anyone explain what all of them are and in which use cases/flows they are useful, preferably with some examples, as the official docs are too terse IMO? (For a quantization novice it's quite confusing.) Almost all of these nine namespaces (listed below) contain a version of LinearReLU or Linear, so I'm wondering how all these LinearReLUs differ and when each is useful. The most confusing parts for me are the duality of torch.ao.nn.quantizable vs. torch.ao.nn.quantized and of torch.ao.nn.intrinsic vs. torch.ao.nn.intrinsic.quantized. What is intrinsic supposed to mean? Which of these are meant for the user to use directly, and which are essentially internal? Are any of them deprecated?

The LinearReLU question matters in particular because it is the workhorse of the feed-forward MLPs found in transformers, and it is the natural first target when one starts quantizing them…

Thank you!

  1. torch.ao.nn.intrinsic
  2. torch.ao.nn.intrinsic.qat
  3. torch.ao.nn.intrinsic.quantized
  4. torch.ao.nn.intrinsic.quantized.dynamic
  5. torch.ao.nn.qat
  6. torch.ao.nn.qat.dynamic
  7. torch.ao.nn.quantized.modules
  8. torch.ao.nn.quantizable
  9. torch.ao.nn.quantized.dynamic

Hi Vadim,

Yes, these namespaces can certainly be confusing! I'll try to clarify what they mean at a high level:

  • Intrinsic means fused modules, such as conv + relu, linear + relu, conv + bn, etc. Some of these have special logic that changes the numerics (the ones with BN) and some don't
  • QAT means quantization-aware training
  • dynamic means dynamic quantization
  • quantized means the ops will take in quantized inputs and produce quantized outputs
  • quantizable means the ops will take in fp32 inputs and produce fp32 outputs, but these modules will be swapped for their quantized versions during convert. These exist only for special modules like LSTM that have C++ ATen implementations that are not easy to quantize in Python. (The sketch after this list shows how the swapping plays out.)
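
To make the module swapping concrete, here is a minimal eager-mode static quantization sketch; the toy model, layer sizes, and fbgemm qconfig are all just for illustration. Fusion turns Linear + ReLU into torch.ao.nn.intrinsic.LinearReLU (still fp32), and convert then swaps it for torch.ao.nn.intrinsic.quantized.LinearReLU:

```python
import torch
import torch.ao.nn.intrinsic as nni
import torch.ao.nn.intrinsic.quantized as nniq
from torch.ao.quantization import (
    DeQuantStub, QuantStub, convert, fuse_modules, get_default_qconfig, prepare,
)

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # fp32 -> quantized at the model boundary
        self.fc = torch.nn.Linear(20, 30)
        self.relu = torch.nn.ReLU()
        self.dequant = DeQuantStub()  # quantized -> fp32 at the model boundary

    def forward(self, x):
        return self.dequant(self.relu(self.fc(self.quant(x))))

m = MLP().eval()
m = fuse_modules(m, [["fc", "relu"]])   # fc is now nni.LinearReLU (still fp32)
assert isinstance(m.fc, nni.LinearReLU)

m.qconfig = get_default_qconfig("fbgemm")
prepare(m, inplace=True)                # insert observers
m(torch.randn(4, 20))                   # calibrate on sample data
convert(m, inplace=True)                # fc is now nniq.LinearReLU (int8 kernels)
assert isinstance(m.fc, nniq.LinearReLU)
```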

From the user’s perspective, you shouldn’t have to worry about the differences between them. The quantization workflow should take care of swapping between the modules for you automatically. Please let me know if I can help clarify anything else.

Best,
-Andrew


This seems not fully accurate: LinearReLU — PyTorch 2.0 documentation is in a quantized namespace but in fact accepts float32 inputs, as shown in its code example; the same goes for quantized.linear_dynamic: [feature request] quantized and low-level int8 operators (matmul, gemm etc) on CUDA + integrate LLM.int8 + integrate ZeroQuant? · Issue #69364 · pytorch/pytorch · GitHub
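
For the dynamic flavor at least, fp32 in/out is easy to verify; a minimal sketch (toy shapes, qconfig spec chosen just for illustration):

```python
import torch

# Dynamic quantization: weights are stored as int8, activations stay fp32.
m = torch.nn.Sequential(torch.nn.Linear(20, 30), torch.nn.ReLU())
qm = torch.ao.quantization.quantize_dynamic(m, {torch.nn.Linear}, dtype=torch.qint8)

print(type(qm[0]))       # the Linear from torch.ao.nn.quantized.dynamic
x = torch.randn(4, 20)   # plain fp32 input, no QuantStub needed
print(qm(x).dtype)       # torch.float32 output
```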

Regarding quantizable, there are only two modules: LSTM and MultiheadAttention. But it's clear that more modules get replaced during convert, right? Some QuantStub or FakeQuant modules?
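
As far as I can tell, the full set of swaps that convert performs can be inspected directly; a sketch, assuming the default static mapping accessor is exposed in your PyTorch version:

```python
from torch.ao.quantization import get_default_static_quant_module_mappings

# Print every float module class that convert() swaps by default, together
# with the quantized class it is replaced with.
for float_cls, quant_cls in sorted(
    get_default_static_quant_module_mappings().items(),
    key=lambda kv: kv[0].__name__,
):
    print(f"{float_cls.__module__}.{float_cls.__name__}"
          f" -> {quant_cls.__module__}.{quant_cls.__name__}")
```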

What would be useful is to group these namespaces somehow or reduce their number. Maybe some "fused logical" modules could be moved into torch.nn, especially those that serve as placeholders and don't do any quantization themselves? This already happens in domain libraries like torchvision: ConvNormActivation and similar high-level modules.

I think more explanation of the internal details is needed, since the quantization abstraction "leaks" and it's important for the end user to understand what exactly is going on. There are also a lot of vendor replacement modules: NVIDIA's pytorch_quantization toolkit and so on.

For now, to a novice, there are too many very similarly named namespaces and LinearReLU classes within them :slight_smile:

Maybe an MLP/transformer feed-forward block would make a good example/tutorial to showcase quantization and the usage of LinearReLU.
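
Something like this could be the seed of such a tutorial; a hypothetical feed-forward block with made-up dimensions, where only the Linear + ReLU pair fuses:

```python
import torch
from torch.ao.quantization import fuse_modules

# Hypothetical transformer feed-forward block; dimensions are illustrative.
ffn = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 512),  # no ReLU after it, so it stays a plain Linear
).eval()

# The first Linear + ReLU pair fuses into torch.ao.nn.intrinsic.LinearReLU.
ffn = fuse_modules(ffn, [["0", "1"]])
print(ffn)
```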