Dynamic Quantization not reducing model size

Hi

I am trying to use the quantize_dynamic function, but in my case the quantized model is the same size as the original model.

import os
import torch
from craft import CRAFT
trained_model='craft_mlt_25k.pth'

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')
    
from collections import OrderedDict
def copyStateDict(state_dict):
    if list(state_dict.keys())[0].startswith("module"):
        start_idx = 1
    else:
        start_idx = 0
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        name = ".".join(k.split(".")[start_idx:])
        new_state_dict[name] = v
    return new_state_dict

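net = CRAFT()     # initialize
# strip a leading "module." prefix (e.g. from nn.DataParallel checkpoints) and load the weights on CPU
net.load_state_dict(copyStateDict(torch.load(trained_model, map_location='cpu')))
net.eval()
print_size_of_model(net)
# with no qconfig_spec, quantize_dynamic only swaps Linear and RNN-type layers (LSTM, GRU, ...)
quantized = torch.quantization.quantize_dynamic(net, dtype=torch.qint8)
print_size_of_model(quantized)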

The original and quantized models are both 83.14 MB. Why doesn’t the model size change?
Any suggestions would be appreciated.

Similarly, when I tried to quantize the pretrained ResNet18 model from torchvision, the size only changed from 46 MB to 45 MB.


Hi,
Dynamic quantization only reduces the size of models that use Linear and LSTM modules. ResNet18 consists mostly of conv layers, which do not have dynamic quantization support yet. For your model, can you check whether it has Linear layers?
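One way to check this is to count how many parameters live in nn.Linear/nn.LSTM modules versus everything else, since only those are quantized by default. The sketch below uses a hypothetical toy model (two conv layers plus one small Linear head) and a helper, `param_share_by_type`, written just for illustration; the same helper can be pointed at your own model (e.g. `CRAFT()`) to estimate how much of it dynamic quantization can actually shrink.

```python
import torch
import torch.nn as nn


def param_share_by_type(model: nn.Module) -> dict:
    """Fraction of parameters owned by Linear/LSTM modules vs. all other modules."""
    totals = {"linear_or_lstm": 0, "other": 0}
    for module in model.modules():
        # recurse=False so each parameter is counted once, by the module that owns it
        n = sum(p.numel() for p in module.parameters(recurse=False))
        key = "linear_or_lstm" if isinstance(module, (nn.Linear, nn.LSTM)) else "other"
        totals[key] += n
    grand_total = sum(totals.values())
    return {k: v / grand_total for k, v in totals.items()}


# Hypothetical toy model: a conv feature extractor followed by a tiny classifier.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3), nn.ReLU(),
    nn.Conv2d(64, 64, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),
).eval()

share = param_share_by_type(model)
print(share)  # the Linear head holds under 2% of the parameters

# Only the Linear layer is replaced by a dynamically quantized module; the convs stay float32.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(qmodel)
```

For torchvision's ResNet18 the only nn.Linear is the final fc layer with 512 × 1000 weights, so storing those weights as int8 instead of float32 saves about 512,000 × 3 bytes ≈ 1.5 MB, which is consistent with the 46 MB → 45 MB change reported above.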


@raghuramank100
Thanks for your response.
But for VGG16, dynamic quantization reduces the model size from 553 MB to 182 MB, even though VGG16 consists mostly of convolution layers. Why such a drastic change then?

It’s probably because most of the weights are in the last few Linear layers?

Yes, for the case of VGG16, the last two fc layers contain the bulk of the weights.
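The VGG16 numbers can be roughly reproduced by hand. This is a back-of-the-envelope sketch, assuming torchvision's VGG16 layout (13 conv layers with 3×3 kernels, then fully connected layers of 25088→4096, 4096→4096 and 4096→1000), that only the Linear weights become 1-byte int8 while conv weights and all biases stay 4-byte float32, and ignoring the small overhead of scales, zero points and serialization.

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Parameter count of a k x k conv layer: weights plus bias."""
    return c_in * c_out * k * k + c_out


# VGG16 conv layers as (in_channels, out_channels)
conv_channels = [(3, 64), (64, 64), (64, 128), (128, 128),
                 (128, 256), (256, 256), (256, 256),
                 (256, 512), (512, 512), (512, 512),
                 (512, 512), (512, 512), (512, 512)]
# VGG16 fully connected layers as (in_features, out_features)
fc_shapes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(i, o) for i, o in conv_channels)
fc_weights = sum(i * o for i, o in fc_shapes)
fc_biases = sum(o for _, o in fc_shapes)

# float32 everywhere: 4 bytes per parameter
fp32_bytes = 4 * (conv_total + fc_weights + fc_biases)
# dynamic int8: Linear weights at 1 byte, everything else still 4 bytes
int8_bytes = 4 * conv_total + 1 * fc_weights + 4 * fc_biases

print(f"float32 model: {fp32_bytes / 1e6:.1f} MB")               # float32 model: 553.4 MB
print(f"dynamic int8 (Linear only): {int8_bytes / 1e6:.1f} MB")  # dynamic int8 (Linear only): 182.5 MB
```

About 89% of VGG16's parameters sit in the three fully connected layers, so quantizing only those layers accounts for almost the entire drop from 553 MB to 182 MB.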

Hi,
I am trying to quantize a BERT model, but its size is not decreasing. Can someone help me understand why?

Here is the code snippet:

import os

import torch
from transformers.modeling_bert import BertConfig, BertForPreTraining, load_tf_weights_in_bert, BertModel

tf_checkpoint_path = "./distlang/"
bert_config_file = "./config.json"
pytorch_dump_path = "./distlangpytorch/"

device = "cpu"

torch.backends.quantized.engine = 'qnnpack'

qconfig = torch.quantization.get_default_qconfig('fbgemm')

print(qconfig)
config = BertConfig.from_json_file(bert_config_file)
print("Building PyTorch model from configuration: {}".format(str(config)))
model = BertModel.from_pretrained("./distlangpytorch/")

model.to(device)

torch.quantization.prepare(model)

quantized_model = torch.quantization.convert(model)

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')

print_size_of_model(model)
print_size_of_model(quantized_model)

quantized_output_dir = "./quantized_model"
if not os.path.exists(quantized_output_dir):
    os.makedirs(quantized_output_dir)

Hi @Sagar_Gupta,

In this mode of quantization, the model has to be calibrated (run your model on representative data after prepare()) so that the observers capture the quantization parameters (scale and zero point).
These parameters are needed to quantize the model, i.e. the weights and activations.
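For reference, here is a minimal sketch of the eager-mode static workflow described above (qconfig, prepare, calibrate, convert), on a hypothetical two-layer model `TinyNet` with random tensors standing in for real calibration data. Note that `prepare()` only inserts observers into modules that have a `qconfig` attribute, so the qconfig has to be assigned to the model, and it should match the backend engine in use; picking the engine from what the local build supports is an assumption made here for portability.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical model; QuantStub/DeQuantStub mark where tensors enter and leave int8."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(16, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 4)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)


model = TinyNet().eval()

# 1. Use one engine consistently for both the backend and the qconfig.
supported = torch.backends.quantized.supported_engines
engine = "fbgemm" if "fbgemm" in supported else "qnnpack"
torch.backends.quantized.engine = engine
model.qconfig = torch.quantization.get_default_qconfig(engine)
prepared = torch.quantization.prepare(model)  # returns an observed copy

# 2. Calibrate: run representative inputs so the observers record value ranges.
with torch.no_grad():
    for _ in range(32):
        prepared(torch.randn(8, 16))  # stand-in for real calibration data

# 3. Convert: observed modules become int8 modules using the collected scale/zero point.
quantized = torch.quantization.convert(prepared)
print(quantized)
```

For a model whose weights are mostly in nn.Linear layers, such as BERT, `quantize_dynamic` (as used in the first post) is an alternative that needs no calibration step.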

Hi @BOLLOJU_ARAVIND, can you point me to a guide for doing this?

Does graph mode quantization suffer from this issue as well? I recently tried to quantize a JIT-saved model and saw no difference: the model size is nearly the same, but the forward pass has gotten worse (nearly two times slower).
What could be the underlying issue here?