Why is the model size reduced in Dynamic Quantization?

In the Quantization tutorials, it is mentioned that the model size can be reduced by using Dynamic Quantization (DQ).

After I merged the DQ code, I found that the model size is reduced (5 MB -> 2 MB), which is what I expected.
However, I am wondering why the size is reduced,
so I logged the model's state_dict. The log is below.
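
(For reference, a minimal sketch of this kind of conversion and logging; the model below is a small placeholder, not the actual network in question.)

import torch
import torch.nn as nn

# a small placeholder model standing in for the real fp32 model
model = nn.Sequential(nn.Linear(60, 120), nn.ReLU(), nn.Linear(120, 60))

# dynamic quantization: Linear weights are converted to qint8
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# log both state_dicts for comparison
for name, value in quantized_model.state_dict().items():
    print(name, value)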

The original model's state_dict():

				'model.layers.3.residual_group.blocks.5.mlp.fc1.weight', tensor([[-0.0133, -0.0458, -0.0438,  ..., -0.0109,  0.0203, -0.0292],
				[ 0.0185,  0.0241,  0.0071,  ...,  0.0204,  0.0048, -0.0240],
				[-0.0027, -0.0198, -0.0116,  ..., -0.0246, -0.0079, -0.0145],
				...,
				[-0.0086,  0.0161,  0.0068,  ...,  0.0200,  0.0013, -0.0164],
				[ 0.0080, -0.0006, -0.0074,  ...,  0.0420, -0.0109,  0.0062],
				[-0.0169,  0.0129,  0.0252,  ..., -0.0208, -0.0016, -0.0064]])), ('model.layers.3.residual_group.blocks.5.mlp.fc1.bias', tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])), 

This is the DQ model's state_dict():

				('model.layers.3.residual_group.blocks.5.mlp.fc1.scale', tensor(1.)), 
				
				('model.layers.3.residual_group.blocks.5.mlp.fc1.zero_point', tensor(0)), 
				
				('model.layers.3.residual_group.blocks.5.mlp.fc1._packed_params.dtype', torch.qint8), 
				
				('model.layers.3.residual_group.blocks.5.mlp.fc1._packed_params._packed_params', (tensor([[-0.0131, -0.0459, -0.0435,  ..., -0.0107,  0.0203, -0.0292],
				[ 0.0185,  0.0238,  0.0072,  ...,  0.0203,  0.0048, -0.0238],
				[-0.0024, -0.0197, -0.0119,  ..., -0.0244, -0.0078, -0.0143],
				...,
				[-0.0083,  0.0161,  0.0066,  ...,  0.0197,  0.0012, -0.0161],
				[ 0.0078, -0.0006, -0.0072,  ...,  0.0417, -0.0107,  0.0060],
				[-0.0167,  0.0131,  0.0250,  ..., -0.0209, -0.0018, -0.0066]],
			   size=(120, 60), dtype=torch.qint8,
			   quantization_scheme=torch.per_tensor_affine, scale=0.0005962323630228639,
			   zero_point=0), tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
				0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
			   requires_grad=True))), 

I think the DQ model's weight/bias looks like fp32, not int8.
For this reason, I am wondering how the model size can be reduced while the weights still appear to be in fp32 format.
Could you tell me why this happens?

Thank you for your help.

The answer is in your question.

fp32 is a floating-point number represented with 32 bits.
qint8 is a quantized integer represented with 8 bits.

Strictly by the bit widths, the expected model size would be about 1.25 MB (5 MB / 4), not 2 MB.
However, the quantized model also stores additional information such as the quantization scheme (scale and zero point), and not every tensor inside the model can be converted to qint8.
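
If you want to check this yourself, here is a minimal sketch (the small placeholder model below stands in for your own fp32 model):

import os
import torch
import torch.nn as nn

# placeholder fp32 model and its dynamically quantized counterpart
fp32_model = nn.Sequential(nn.Linear(60, 120), nn.ReLU(), nn.Linear(120, 60))
dq_model = torch.quantization.quantize_dynamic(fp32_model, {nn.Linear}, dtype=torch.qint8)

def print_model_size(model, label):
    # serialize the state_dict to disk and report the file size
    torch.save(model.state_dict(), "tmp.pt")
    print(f"{label}: {os.path.getsize('tmp.pt') / 1e6:.3f} MB")
    os.remove("tmp.pt")

print_model_size(fp32_model, "fp32")
print_model_size(dq_model, "dynamic quantized (qint8)")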

For more info, please read the docs: Quantization - PyTorch 1.10.1 documentation

According to the code you wrote,

the data type of the model before quantization is float32,
and the data type of the model after quantization is qint8.

float32 is 32 bits and qint8 is 8 bits, which is why the size shrinks.

If you want to reduce the model size while keeping float32, methods such as pruning and distillation would be options; a rough pruning sketch follows below.
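
(Illustrative only, using torch.nn.utils.prune on a placeholder layer; note that pruning alone keeps the dense fp32 tensor, so the on-disk size benefit only appears after a sparse/compressed storage step.)

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(60, 120)  # placeholder layer

# zero out the 50% smallest-magnitude weights (L1 unstructured pruning)
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # make the pruning permanent

print((layer.weight == 0).float().mean())  # fraction of zeroed weights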

@thecho7 , @seungtaek94
Thank you for your reply. :slight_smile:

First of all, thank you for your detailed reply.
I think my question was a little unclear.

For example, if there is an fp32 weight tensor like the one below:

torch.float32
tensor([-1.0000,  0.3520,  1.3210,  2.0000])

the data can be quantized as follows (I confirmed this format against the layer/data representation of the Dynamic Quantization model provided by Torch):

torch.quint8
tensor([-1.0000,  0.4000,  1.3000,  2.0000], size=(4,), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.1, zero_point=10)

I understand that the model size can be reduced because the data type of the weight is converted from fp32 to quint8, which takes less space than fp32.
The quantized weight values also look like they contain a little quantization noise from the Quant/DeQuant (fp32 -> int8 -> fp32) round trip.

However, I was wondering why it is expressed as a decimal number (float) instead of an integer (int) when logging.

To summarize the questions:

  1. When saving a model in Dynamic Quantization, does PyTorch Quant lib save it as int and convert it to fp32 when loading? (If yes, why?)
  2. Does Dynamic Quantization not use Quantized_operators optimized for int operations?

Thank you.

  1. I looked at the PyTorch code, and it just seems to be the tensor printing rule:
    pytorch/_tensor_str.py at master · pytorch/pytorch · GitHub

  2. Yes, it does use them.

However, I was wondering why it is expressed as a decimal number (float) instead of an integer (int) when logging.

For the same reason you don't output a bunch of 1s and 0s when you print an fp32 number. All these numbers are just 1s and 0s in your computer; it's what they represent that's important. You never print out an fp32 number itself, you print out the decimal representation of those 1s and 0s. The 'non-integer' values that are being output for qint8 are the same: those are the actual values the tensor represents. In order to do efficient computations with these values, it utilizes some aspects of int8 data storage/ops, but that's only really important if you are trying to mess with things at a really low level. For most purposes it's better to just think of qint8 as similar to fp16, i.e. not limited to integers, takes up less space, and has lower fidelity compared to fp32.
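
A small sketch of this, reusing the scale/zero_point from the example above (the integer storage is visible through int_repr()):

import torch

x = torch.tensor([-1.0, 0.352, 1.321, 2.0])

# quantize with the same scale/zero_point as in the example above
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

print(q)               # prints the represented values: -1.0, 0.4, 1.3, 2.0
print(q.int_repr())    # the underlying uint8 storage: 0, 14, 23, 30
print(q.dequantize())  # an ordinary fp32 tensor with the represented values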

  1. When saving a model in Dynamic Quantization, does PyTorch Quant lib save it as int and convert it to fp32 when loading? (If yes, why?)

They are stored as qtensors (qint8), which, as mentioned before, is different from int8. https://github.com/pytorch/pytorch/blob/3e43c478a8832cec063aa566583a05f87d7dc3b0/torch/nn/quantized/modules/linear.py#L230 This is the deserialization code, which takes the weight and bias and then packs them into the format that is necessary for the efficient computation of Linear.
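
A small sketch of inspecting the unpacked weight after conversion (the layer is a placeholder; as far as I know the dynamic quantized Linear exposes the qint8 weight through its weight() method, but treat that as an assumption):

import torch
import torch.nn as nn

dq_model = torch.quantization.quantize_dynamic(
    nn.Sequential(nn.Linear(60, 120)), {nn.Linear}, dtype=torch.qint8
)

qlinear = dq_model[0]
w = qlinear.weight()         # qint8 tensor unpacked from _packed_params
print(w.dtype)               # torch.qint8
print(w.int_repr()[:2, :5])  # the raw int8 storage behind the printed floats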

  1. Does Dynamic Quantization not use Quantized_operators optimized for int operations?

Yes, it does generally use int8 operations (in part) when performing quantized operations. Here is a reasonable explanation of how a qint8 Linear can be broken down into other ops, including int8 ops, to speed things up: Quantization for Neural Networks - Lei Mao's Log Book (the first part explains quantization at a reasonable level; the part I'm talking about, though, is the Quantized Matrix Multiplication section).
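
A rough numeric sketch of the idea (not PyTorch's actual kernel): the fp32 matmul is approximated by a matmul over the integer representations, then rescaled by the product of the two scales.

import torch

x = torch.randn(4, 8)   # activations
w = torch.randn(8, 3)   # weights

# symmetric per-tensor quantization (zero_point = 0 for simplicity)
s_x = x.abs().max() / 127
s_w = w.abs().max() / 127
q_x = torch.round(x / s_x)  # integer-valued; real kernels store these as int8
q_w = torch.round(w / s_w)

# integer matmul (accumulated in int32 in real kernels), then rescaled to fp32
y_approx = (q_x @ q_w) * (s_x * s_w)

print((y_approx - x @ w).abs().max())  # small quantization error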
