Why is the model size reduced in Dynamic Quantization?

In the Quantization tutorials, it is mentioned that the model size can be reduced by using Dynamic Quantization (DQ).

After I merged the DQ code, I found that the model size is reduced (5 MB -> 2 MB), which is what I expected.
However, I am wondering why the size is reduced,
so I logged the model's state_dict. The log is below.
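
(For reference, a minimal sketch of this kind of conversion and logging; the model below is a small placeholder, not the actual network in question.)

import torch
import torch.nn as nn

# a small placeholder model standing in for the real fp32 model
model = nn.Sequential(nn.Linear(60, 120), nn.ReLU(), nn.Linear(120, 60))

# dynamic quantization: Linear weights are converted to qint8
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# log both state_dicts for comparison
for name, value in quantized_model.state_dict().items():
    print(name, value)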

The original model's state_dict():

				'model.layers.3.residual_group.blocks.5.mlp.fc1.weight', tensor([[-0.0133, -0.0458, -0.0438,  ..., -0.0109,  0.0203, -0.0292],
				[ 0.0185,  0.0241,  0.0071,  ...,  0.0204,  0.0048, -0.0240],
				[-0.0027, -0.0198, -0.0116,  ..., -0.0246, -0.0079, -0.0145],
				...,
				[-0.0086,  0.0161,  0.0068,  ...,  0.0200,  0.0013, -0.0164],
				[ 0.0080, -0.0006, -0.0074,  ...,  0.0420, -0.0109,  0.0062],
				[-0.0169,  0.0129,  0.0252,  ..., -0.0208, -0.0016, -0.0064]])), ('model.layers.3.residual_group.blocks.5.mlp.fc1.bias', tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])), 

This is the DQ model's state_dict():

				('model.layers.3.residual_group.blocks.5.mlp.fc1.scale', tensor(1.)), 
				
				('model.layers.3.residual_group.blocks.5.mlp.fc1.zero_point', tensor(0)), 
				
				('model.layers.3.residual_group.blocks.5.mlp.fc1._packed_params.dtype', torch.qint8), 
				
				('model.layers.3.residual_group.blocks.5.mlp.fc1._packed_params._packed_params', (tensor([[-0.0131, -0.0459, -0.0435,  ..., -0.0107,  0.0203, -0.0292],
				[ 0.0185,  0.0238,  0.0072,  ...,  0.0203,  0.0048, -0.0238],
				[-0.0024, -0.0197, -0.0119,  ..., -0.0244, -0.0078, -0.0143],
				...,
				[-0.0083,  0.0161,  0.0066,  ...,  0.0197,  0.0012, -0.0161],
				[ 0.0078, -0.0006, -0.0072,  ...,  0.0417, -0.0107,  0.0060],
				[-0.0167,  0.0131,  0.0250,  ..., -0.0209, -0.0018, -0.0066]],
			   size=(120, 60), dtype=torch.qint8,
			   quantization_scheme=torch.per_tensor_affine, scale=0.0005962323630228639,
			   zero_point=0), tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
			   requires_grad=True))), 

I think the DQ model's weight/bias looks like fp32, not int8.
For this reason, I am wondering how the model size can be reduced while the weights still appear to be in fp32 format.
Could you tell me why this happens?

Thank you for your help.

The answer is in your question.

fp32 is a floating-point number represented with 32 bits.
qint8 is a quantized integer represented with 8 bits.

Strictly by the bit widths, the expected model size would be about 1.25 MB (5 MB / 4), not 2 MB.
However, the quantized model also stores additional information such as the quantization scheme (scale and zero point), and not every tensor inside the model can be converted to qint8.
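
If you want to check this yourself, here is a minimal sketch (the small placeholder model below stands in for your own fp32 model):

import os
import torch
import torch.nn as nn

# placeholder fp32 model and its dynamically quantized counterpart
fp32_model = nn.Sequential(nn.Linear(60, 120), nn.ReLU(), nn.Linear(120, 60))
dq_model = torch.quantization.quantize_dynamic(fp32_model, {nn.Linear}, dtype=torch.qint8)

def print_model_size(model, label):
    # serialize the state_dict to disk and report the file size
    torch.save(model.state_dict(), "tmp.pt")
    print(f"{label}: {os.path.getsize('tmp.pt') / 1e6:.3f} MB")
    os.remove("tmp.pt")

print_model_size(fp32_model, "fp32")
print_model_size(dq_model, "dynamic quantized (qint8)")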

For more info, please read the docs: Quantization - PyTorch 1.10.1 documentation

According to the code you wrote,

the data type of the model before quantization is float32,
and the data type of the model after quantization is qint8.

float32 is 32 bits and qint8 is 8 bits, which is why the size shrinks.

If you want to reduce the model size while keeping float32, methods such as pruning and distillation would be options; a rough pruning sketch follows below.
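
(Illustrative only, using torch.nn.utils.prune on a placeholder layer; note that pruning alone keeps the dense fp32 tensor, so the on-disk size benefit only appears after a sparse/compressed storage step.)

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(60, 120)  # placeholder layer

# zero out the 50% smallest-magnitude weights (L1 unstructured pruning)
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # make the pruning permanent

print((layer.weight == 0).float().mean())  # fraction of zeroed weights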

@thecho7 , @seungtaek94
Thank you for your reply. :slight_smile:

First of all, thank you for your detailed reply.
I think my question was a little unclear.

For example, if there is an fp32 weight tensor like the one below:

torch.float32
tensor([-1.0000,  0.3520,  1.3210,  2.0000])

the data can be quantized as follows (I confirmed this format against the layer/data representation of the Dynamic Quantization model provided by Torch):

torch.quint8
tensor([-1.0000,  0.4000,  1.3000,  2.0000], size=(4,), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.1, zero_point=10)

I understand that the model size can be reduced because the data type of the weight is converted from fp32 to quint8, which takes less space than fp32.
The quantized weight values also look like they contain a little quantization noise from the Quant/DeQuant (fp32 -> int8 -> fp32) round trip.

However, I was wondering why it is expressed as a decimal number (float) instead of an integer (int) when logging.

To summarize the questions:

  1. When saving a model in Dynamic Quantization, does PyTorch Quant lib save it as int and convert it to fp32 when loading? (If yes, why?)
  2. Does Dynamic Quantization not use Quantized_operators optimized for int operations?

Thank you.

  1. I looked at the PyTorch code, and it just seems to be the tensor printing rule:
    pytorch/_tensor_str.py at master · pytorch/pytorch · GitHub

  2. Yes, it does use them.

However, I was wondering why it is expressed as a decimal number (float) instead of an integer (int) when logging.

For the same reason you don't output a bunch of 1s and 0s when you print an fp32 number. All these numbers are just 1s and 0s in your computer; it's what they represent that's important. You never print out an fp32 number itself, you print out the decimal representation of those 1s and 0s. The 'non-integer' values that are being output for qint8 are the same: those are the actual values the tensor represents. In order to do efficient computations with these values, it utilizes some aspects of int8 data storage/ops, but that's only really important if you are trying to mess with things at a really low level. For most purposes it's better to just think of qint8 as similar to fp16, i.e. not limited to integers, takes up less space, and has lower fidelity compared to fp32.
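
A small sketch of this, reusing the scale/zero_point from the example above (the integer storage is visible through int_repr()):

import torch

x = torch.tensor([-1.0, 0.352, 1.321, 2.0])

# quantize with the same scale/zero_point as in the example above
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

print(q)               # prints the represented values: -1.0, 0.4, 1.3, 2.0
print(q.int_repr())    # the underlying uint8 storage: 0, 14, 23, 30
print(q.dequantize())  # an ordinary fp32 tensor with the represented values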

  1. When saving a model in Dynamic Quantization, does PyTorch Quant lib save it as int and convert it to fp32 when loading? (If yes, why?)

They are stored as qtensors (qint8), which, as mentioned before, is different from int8. https://github.com/pytorch/pytorch/blob/3e43c478a8832cec063aa566583a05f87d7dc3b0/torch/nn/quantized/modules/linear.py#L230 This is the deserialization code, which takes the weight and bias and then packs them into the format that is necessary for the efficient computation of Linear.
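
A small sketch of inspecting the unpacked weight after conversion (the layer is a placeholder; as far as I know the dynamic quantized Linear exposes the qint8 weight through its weight() method, but treat that as an assumption):

import torch
import torch.nn as nn

dq_model = torch.quantization.quantize_dynamic(
    nn.Sequential(nn.Linear(60, 120)), {nn.Linear}, dtype=torch.qint8
)

qlinear = dq_model[0]
w = qlinear.weight()         # qint8 tensor unpacked from _packed_params
print(w.dtype)               # torch.qint8
print(w.int_repr()[:2, :5])  # the raw int8 storage behind the printed floats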

  1. Does Dynamic Quantization not use Quantized_operators optimized for int operations?

Yes, it does generally use int8 operations (in part) when performing quantized operations. Here is a reasonable explanation of how a qint8 Linear can be broken down into other ops, including int8 ops, to speed things up: Quantization for Neural Networks - Lei Mao's Log Book (the first part explains quantization at a reasonable level; the part I'm talking about, though, is the Quantized Matrix Multiplication section).
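
A rough numeric sketch of the idea (not PyTorch's actual kernel): the fp32 matmul is approximated by a matmul over the integer representations, then rescaled by the product of the two scales.

import torch

x = torch.randn(4, 8)   # activations
w = torch.randn(8, 3)   # weights

# symmetric per-tensor quantization (zero_point = 0 for simplicity)
s_x = x.abs().max() / 127
s_w = w.abs().max() / 127
q_x = torch.round(x / s_x)  # integer-valued; real kernels store these as int8
q_w = torch.round(w / s_w)

# integer matmul (accumulated in int32 in real kernels), then rescaled to fp32
y_approx = (q_x @ q_w) * (s_x * s_w)

print((y_approx - x @ w).abs().max())  # small quantization error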
