Compress model size with dynamic quantization to improve saving/loading time


I have a large network (200M parameters), and saving/loading it takes a few minutes.

I’m curious whether we can use dynamic quantization to shrink the model before saving it.

I tried quantizing the linear layers, but they then have to stay on the CPU, because dynamic quantization does not support linear layers on the GPU — which blocks GPU training.
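For concreteness, here is a minimal sketch of the quantize-before-saving step described above (the toy model and layer sizes are placeholders, not the actual 200M-parameter network):

```python
import io
import torch
import torch.nn as nn

# Toy model standing in for the large network (sizes are placeholders).
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))

# Dynamically quantize the Linear layers to int8 weights (CPU-only in PyTorch).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(m):
    # Serialize the state dict in memory to measure checkpoint size.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.tell()

# The int8 checkpoint should be roughly 4x smaller than the fp32 one.
print(serialized_size(model), serialized_size(quantized))
```

The size reduction comes from storing weights as int8 instead of fp32, but the resulting quantized `Linear` modules only run on CPU.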

Is there a way to de-quantize the loaded model for training and quantize it only for saving, so that the checkpoint is smaller and saving/loading is faster?