The quantization tutorials mention that the model size can be reduced by using Dynamic Quantization (DQ).
After I merged the DQ code, I found that the model size was indeed reduced (5MB -> 2MB), which is what I expected.
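For reference, this is roughly how I applied DQ and measured the size (a minimal sketch; MyModel and the size check stand in for my actual code):

import os
import torch

model_fp32 = MyModel()   # placeholder for my actual fp32 model
model_fp32.eval()

# dynamic quantization: only the nn.Linear weights are converted to int8 here
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

def print_size(model, label):
    # save the state_dict to disk and report the file size
    torch.save(model.state_dict(), "tmp.pt")
    print(label, os.path.getsize("tmp.pt") / 1e6, "MB")
    os.remove("tmp.pt")

print_size(model_fp32, "fp32:")   # ~5MB in my case
print_size(model_int8, "int8:")   # ~2MB in my case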
However, I am wondering why the model size is reduced,
so I logged the model's state_dict(), and the output is below.
The original model's state_dict():
'model.layers.3.residual_group.blocks.5.mlp.fc1.weight', tensor([[-0.0133, -0.0458, -0.0438, ..., -0.0109, 0.0203, -0.0292],
[ 0.0185, 0.0241, 0.0071, ..., 0.0204, 0.0048, -0.0240],
[-0.0027, -0.0198, -0.0116, ..., -0.0246, -0.0079, -0.0145],
...,
[-0.0086, 0.0161, 0.0068, ..., 0.0200, 0.0013, -0.0164],
[ 0.0080, -0.0006, -0.0074, ..., 0.0420, -0.0109, 0.0062],
[-0.0169, 0.0129, 0.0252, ..., -0.0208, -0.0016, -0.0064]])), ('model.layers.3.residual_group.blocks.5.mlp.fc1.bias', tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])),
This is the DQ model's state_dict():
('model.layers.3.residual_group.blocks.5.mlp.fc1.scale', tensor(1.)),
('model.layers.3.residual_group.blocks.5.mlp.fc1.zero_point', tensor(0)),
('model.layers.3.residual_group.blocks.5.mlp.fc1._packed_params.dtype', torch.qint8),
('model.layers.3.residual_group.blocks.5.mlp.fc1._packed_params._packed_params', (tensor([[-0.0131, -0.0459, -0.0435, ..., -0.0107, 0.0203, -0.0292],
[ 0.0185, 0.0238, 0.0072, ..., 0.0203, 0.0048, -0.0238],
[-0.0024, -0.0197, -0.0119, ..., -0.0244, -0.0078, -0.0143],
...,
[-0.0083, 0.0161, 0.0066, ..., 0.0197, 0.0012, -0.0161],
[ 0.0078, -0.0006, -0.0072, ..., 0.0417, -0.0107, 0.0060],
[-0.0167, 0.0131, 0.0250, ..., -0.0209, -0.0018, -0.0066]],
size=(120, 60), dtype=torch.qint8,
quantization_scheme=torch.per_tensor_affine, scale=0.0005962323630228639,
zero_point=0), tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
requires_grad=True))),
I think the DQ model's weight/bias look like fp32, not int8.
For this reason, I wonder how the model size can be reduced when the weights still appear to be in fp32 format.
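In case it helps, this is roughly how I would inspect the stored weight of the quantized fc1 layer (a sketch; the attribute path below is my assumption about how to reach the module named model.layers.3.residual_group.blocks.5.mlp.fc1 in the log above):

# fc1_q is the dynamically quantized nn.Linear shown in the state_dict above
fc1_q = model_int8.model.layers[3].residual_group.blocks[5].mlp.fc1

w_q = fc1_q.weight()        # quantized weight tensor of the dynamic Linear
print(w_q.dtype)            # torch.qint8
print(w_q.int_repr())       # the raw int8 values that are actually stored
print(w_q.dequantize())     # the fp32-looking values, like in the log above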
Could you tell me why this happens?
Thank you for your help.