Hi
I want to understand how quantization parameters are stored in PyTorch.
Consider the following toy example:
import torch
import torch.nn as nn

torch.random.manual_seed(0)

# Toy model: two Linear layers and a ReLU, wrapped in quant/dequant stubs
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.a = nn.Linear(1, 8)
        self.act = nn.ReLU()
        self.b = nn.Linear(8, 2)
        self.quant = torch.ao.quantization.QuantStub()
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.a(x)
        x = self.act(x)
        x = self.b(x)
        x = self.dequant(x)
        return x

# Create the original (unprepared) model
m_orig = Model()
print('Original model', m_orig)

# Create the prepared model
m_orig.qconfig = torch.ao.quantization.get_default_qat_qconfig()
m = torch.ao.quantization.prepare(m_orig, inplace=False)
print('Prepared', m)

# Convert to a quantized model
qm = torch.ao.quantization.convert(m, inplace=False)
In this case, the weight_fake_quant keys are missing from the state dict:
...
(a): Linear(
  in_features=1, out_features=8, bias=True
  (activation_post_process): FusedMovingAvgObsFakeQuantize(
    fake_quant_enabled=tensor([1]), observer_enabled=tensor([1]), scale=tensor([1.]), zero_point=tensor([0], dtype=torch.int32), dtype=torch.quint8, quant_min=0, quant_max=127, qscheme=torch.per_tensor_affine, reduce_range=True
    (activation_post_process): MovingAverageMinMaxObserver(min_val=inf, max_val=-inf)
  )
)
...
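To double-check that this isn't just a printing artifact, I also listed the keys in the prepared model's state dict directly (the substring filter below is just my own quick check):

# List the fake-quant-related keys in the prepared model's state dict.
# With prepare(), no '*.weight_fake_quant.*' keys show up here.
for k in m.state_dict().keys():
    if 'fake_quant' in k or 'activation_post_process' in k:
        print(k)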
However, if I replace prepare() with prepare_qat(), these keys reappear:
...
(a): Linear(
  in_features=1, out_features=8, bias=True
  (weight_fake_quant): FusedMovingAvgObsFakeQuantize(
    fake_quant_enabled=tensor([1]), observer_enabled=tensor([1]), scale=tensor([1.]), zero_point=tensor([0], dtype=torch.int32), dtype=torch.qint8, quant_min=-128, quant_max=127, qscheme=torch.per_channel_symmetric, reduce_range=False
    (activation_post_process): MovingAveragePerChannelMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
  )
  (activation_post_process): FusedMovingAvgObsFakeQuantize(
    fake_quant_enabled=tensor([1]), observer_enabled=tensor([1]), scale=tensor([1.]), zero_point=tensor([0], dtype=torch.int32), dtype=torch.quint8, quant_min=0, quant_max=127, qscheme=torch.per_tensor_affine, reduce_range=True
    (activation_post_process): MovingAverageMinMaxObserver(min_val=inf, max_val=-inf)
  )
)
...
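For completeness, this is the variant that produces the output above. The only substantive change is the prepare call; since prepare_qat() expects a model in training mode, I set that explicitly:

# Same model and qconfig as before, but prepared for QAT.
# prepare_qat() expects the model to be in training mode.
m_orig.train()
m_qat = torch.ao.quantization.prepare_qat(m_orig, inplace=False)
print('Prepared (QAT)', m_qat)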
This behavior is strange to me. I would expect both prepare() and prepare_qat() to add weight quantization, but that isn't the case. I guess I'm still trying to understand the difference between these two functions. Why wouldn't prepare() also quantize the weights? Which one should I be using if I want to obtain quantized weights?
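Concretely, what I'm after is reading the int8 weights and their quantization parameters out of the converted model, something like the sketch below (I'm assuming here that the converted layer is a quantized Linear and that its weight ends up per-channel quantized, per the qscheme shown above):

# Hypothetical inspection of the converted model's quantized weights.
w = qm.a.weight()                     # quantized weight tensor
print(torch.int_repr(w))              # raw int8 values
print(w.q_per_channel_scales())       # per-channel scales
print(w.q_per_channel_zero_points())  # per-channel zero points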