Compress model size with dynamic quantization to improve saving/loading time


I have a large network (200M parameters), and saving/loading it takes a few minutes.

I’m curious whether we can use dynamic quantization to shrink the model before saving it.

I tried quantizing the linear layers, but they then have to stay on the CPU, because dynamic quantization does not support linear layers on the GPU — which blocks GPU training.
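For concreteness, here is a minimal sketch of the quantize-before-saving step described above (the toy model and layer sizes are placeholders, not the actual 200M-parameter network):

```python
import io
import torch
import torch.nn as nn

# Toy model standing in for the large network (sizes are placeholders).
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))

# Dynamically quantize the Linear layers to int8 weights (CPU-only in PyTorch).
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(m):
    # Serialize the state dict in memory to measure checkpoint size.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.tell()

# The int8 checkpoint should be roughly 4x smaller than the fp32 one.
print(serialized_size(model), serialized_size(quantized))
```

The size reduction comes from storing weights as int8 instead of fp32, but the resulting quantized `Linear` modules only run on CPU.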

Is there a way to de-quantize the loaded model for training and quantize it only for saving, so that the checkpoint is smaller and saving/loading is faster?