Hi, I am using dynamic quantization on my model and trying to compute the size reduction. I assumed the number of parameters before and after quantization would be the same, with only the number of bits per element changing. However, when I check the model's state_dict or parameters, the quantized weights are not there, and they are not in the buffers either. I was trying to estimate the model size as numel * element_size, but the quantized parameters cannot be found. I saw in the pruning docs that the number of pruned params can be computed from the zeros in the _mask buffer, but I cannot find any docs on how quantized params are saved. Could anyone please help? Thanks.
Hi @SenJia, can you share the code you are using to quantize / save the model?
See this tutorial, which includes a method for measuring the size of the model.
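The usual trick (used in the quantization tutorials) is to serialize the state_dict to disk and compare file sizes, since that counts packed quantized weights that a `numel * element_size` loop over `model.parameters()` would miss. A minimal sketch, assuming a toy `nn.Sequential` model (the helper name `model_size_bytes` is mine, not a PyTorch API):

```python
import os
import tempfile

import torch
import torch.nn as nn


def model_size_bytes(model):
    # Serialize the state_dict to a temporary file and measure it on disk.
    # This captures packed quantized weights as well as regular tensors.
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
        torch.save(model.state_dict(), f.name)
        size = os.path.getsize(f.name)
    os.remove(f.name)
    return size


float_model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# Dynamically quantize all Linear layers to int8.
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

fp32_size = model_size_bytes(float_model)
int8_size = model_size_bytes(quantized_model)
print(f"fp32: {fp32_size / 1e6:.2f} MB, int8: {int8_size / 1e6:.2f} MB")
```

The int8 file should come out close to 4x smaller than the fp32 one, minus some fixed serialization overhead and the extra scale/zero-point values stored alongside the packed weights.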
In general, the quantized weight is not saved as a plain quantized tensor with X elements of Y bits each. Instead, it is stored as packed params, which bundle the weight together with other intermediate values needed by the quantized matmul kernel to speed up inference. That is why it does not show up in `model.parameters()` or the buffers the way a regular weight tensor would.
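To make that concrete, here is a small sketch of where the weight ends up after dynamic quantization. After `quantize_dynamic`, each `nn.Linear` is replaced by a dynamic quantized Linear module that holds no fp32 Parameters; the weight can still be pulled back out of the module via its `weight()` accessor (exact module internals vary by PyTorch version, so treat this as illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 4))
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

qlinear = qmodel[0]
print(type(qlinear))               # the dynamic quantized Linear replacement
print(list(qlinear.parameters()))  # empty: the weight is no longer a Parameter

# The quantized weight is still recoverable from the module:
w = qlinear.weight()
print(w.dtype)       # torch.qint8
print(w.int_repr())  # the raw int8 values
print(w.q_scale(), w.q_zero_point())  # quantization metadata kept alongside
```

So for a size estimate you can count `w.int_repr().numel()` int8 elements per quantized layer, but the on-disk comparison above is the more reliable measure since it includes the scales, zero points, and packing overhead too.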