I tried to use torch.autograd.grad() to calculate gradients for a quantized model, just as we usually do on full-precision models:
for idx, (inputs, targets) in enumerate(data_loader):
    with torch.enable_grad():
        inputs.requires_grad = True
        outputs = quantized_model(inputs)
        loss = criterion(outputs, targets)
        grads = torch.autograd.grad(loss, inputs)[0]
But I got a RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.
Do models quantized with PyTorch quantization currently not support backpropagation? Is there some method I can use to calculate gradients for PyTorch quantized models?
Quantized models currently run only during inference, so you can only call forward on them. If you are trying out quantization-aware training (see Quantization Recipe — PyTorch Tutorials 1.9.1+cu102 documentation), we do support back-propagation in that case during training.
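To illustrate the point above, here is a minimal sketch of eager-mode quantization-aware training in which gradients do flow. The SmallNet module and its shapes are made up for illustration; the torch.quantization calls match the 1.9-era API referenced in the tutorial link.

```python
import torch
import torch.nn as nn


class SmallNet(nn.Module):
    """Toy model wrapped in Quant/DeQuant stubs for eager-mode QAT."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(8, 2)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))


model = SmallNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
qat_model = torch.quantization.prepare_qat(model)

# Back-propagation works here because prepare_qat inserts FakeQuantize
# modules that operate on fp32 tensors instead of real quantized kernels.
x = torch.randn(4, 8, requires_grad=True)
loss = qat_model(x).sum()
grads = torch.autograd.grad(loss, x)[0]
```

After training, the model would be converted to a real quantized model with torch.quantization.convert, at which point forward-only inference applies again.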
Thank you for the reply. I know that quantization-aware training uses fake quantization during training, which simulates quantization with fp32. What is the difference between fake quantization and real quantization, especially when we do back-propagation on them?
Fake quantization simulates quantization but uses high-precision data types.
So, for example, imagine you were trying to quantize to integers.
Mathematically, a quantized linear op would be:

    X = round(X).to(int)
    weight = round(weight).to(int)
    out = X * weight

whereas a fake_quantized linear would be:

    X = round(X).to(fp32)
    weight = round(weight).to(fp32)
    out = X * weight
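A detail the pseudocode above glosses over is how back-propagation gets through round(), which has zero gradient almost everywhere. Fake quantization handles this with a straight-through estimator. Below is a minimal sketch (not PyTorch's actual FakeQuantize implementation, and the scale of 0.1 is arbitrary) showing the idea:

```python
import torch


class FakeQuantSTE(torch.autograd.Function):
    """Fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, scale):
        # Simulate int8 quantization, but keep the result in fp32.
        return torch.clamp(torch.round(x / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through: pretend round() (and clamp) were the identity,
        # so the incoming gradient passes through unchanged.
        return grad_out, None


x = torch.randn(3, requires_grad=True)
y = FakeQuantSTE.apply(x, 0.1).sum()
y.backward()
# x.grad exists and is all ones: gradients flow despite round()
```

This is why back-propagation works on fake-quantized models: the forward pass sees quantization error, while the backward pass behaves as if the op were differentiable.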
In practice, quantized weights are stored as quantized tensors in a packed format designed to make quantized operations run quickly, which makes them difficult to interact with directly.
Fake-quantized weights are stored as floats, so you can interact with them easily in order to do gradient updates.
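You can see the contrast directly: a real quantized tensor exposes its integer representation but cannot take part in autograd. A small sketch (scale and zero_point chosen arbitrarily):

```python
import torch

w = torch.randn(2, 2)
qw = torch.quantize_per_tensor(w, scale=0.1, zero_point=0, dtype=torch.qint8)

qw.int_repr()     # the underlying int8 values
qw.dequantize()   # back to an fp32 tensor for inspection

# Asking a quantized tensor to track gradients fails, since autograd
# only supports floating-point (and complex) dtypes.
try:
    qw.requires_grad_(True)
    grad_ok = True
except RuntimeError:
    grad_ok = False
```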