I have quantized a model from 32-bit float to int8. I want to measure the quantization performance, such as latency. Is there any way to run inference with the model using 8-bit fixed-point arithmetic?
Hi @0Chen , have you tried the autograd profiler (example: PyTorch Profiler — PyTorch Tutorials 1.8.1+cu102 documentation)?
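For reference, a minimal profiling sketch along those lines (the model and input here are arbitrary placeholders, and this assumes torch >= 1.8.1 where `torch.profiler` is available):

```python
import torch

# Placeholder model and input; substitute your quantized model here
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
x = torch.randn(8, 64)

# Profile CPU time and memory for one inference pass
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    profile_memory=True,
) as prof:
    with torch.no_grad():
        model(x)

# Print the most expensive ops by total CPU time
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```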
That can measure time and memory usage. However, I want to test low-bit compute, such as 8-bit or 6-bit. Can PyTorch allocate a tensor with an 8-bit data type?
@0Chen I am facing the same issue! Does anyone know how to perform inference after quantizing the model? When I try to get the output, I get this error:
Error(s) in loading state_dict for Module: ('Copying from quantized Tensor to non-quantized Tensor is not allowed, please use dequantize to get a float Tensor from a quantized Tensor',)
If you have a quantized model, then it is already doing low-bit computation, depending on what format it was quantized to. Otherwise, I'm not sure I understand what you are trying to do, since your quantized model should already be doing what you want.
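To answer the tensor-allocation question directly: PyTorch does expose quantized tensors whose underlying storage is int8. A small sketch (the scale and zero point here are arbitrary example values):

```python
import torch

x = torch.randn(4)

# Quantize float32 -> int8 with a chosen scale/zero_point
# (0.1 and 0 are arbitrary; in practice they come from calibration)
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

print(q.dtype)         # torch.qint8
print(q.int_repr())    # the underlying int8 storage, 1 byte per element
print(q.dequantize())  # back to float32
```

Note there is no built-in 6-bit dtype; sub-8-bit compute would require simulating it (e.g. fake quantization) rather than native storage.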
I don’t think this is related to OP’s issue, but it looks like you are trying to load a saved quantized model into a non-quantized model. The state_dict, to my knowledge, is just a record of the parameters of the different modules. Since quantized modules have different parameters than non-quantized ones, you can’t load one into the other. I believe that when you create the model, before loading the state dict, you have to fuse/prepare/convert it, at which point you should be able to load the state dict per How do I save and load quantization model
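A rough sketch of that prepare/convert-then-load flow, assuming eager-mode static quantization with the fbgemm backend (the model `M` and calibration input are made-up placeholders):

```python
import torch
import torch.quantization as tq

class M(torch.nn.Module):
    """Toy float model with quant/dequant stubs for static quantization."""
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.fc = torch.nn.Linear(4, 4)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

def make_quantized():
    m = M().eval()
    m.qconfig = tq.get_default_qconfig("fbgemm")  # x86 backend assumption
    tq.prepare(m, inplace=True)
    m(torch.randn(2, 4))          # calibration pass to collect stats
    tq.convert(m, inplace=True)   # swap in quantized modules
    return m

# Save a quantized model's state_dict...
sd = make_quantized().state_dict()

# ...and load it into a model that went through the SAME
# prepare/convert steps, so the parameter layouts match.
target = make_quantized()
target.load_state_dict(sd)
out = target(torch.randn(2, 4))
```

Loading `sd` into a plain float `M()` instead would fail with the copy error above, because the quantized `state_dict` holds packed int8 parameters.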
If this is not helpful, can you make a separate post in the #quantization category so as not to hijack this thread? I am the on-call for quantization, so I respond to any new posts there.