During static quantization of my model, I encounter the following error -
RuntimeError: Didn't find kernel to dispatch to for operator 'aten::_cat'. Tried to look up kernel for dispatch key 'QuantizedCPUTensorId'. Registered dispatch keys are: [CPUTensorId, VariableTensorId]
I have fused and quantized the model, as well as the input image, but it throws an error on the concat: y = torch.cat([sources[0], sources[1]], dim=1)
Any suggestions would be appreciated.
Full code here -
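For the error itself: eager-mode static quantization has no quantized kernel for torch.cat, so the concat needs to go through torch.nn.quantized.FloatFunctional, which observes its own output scale and zero_point during calibration. A minimal sketch (the module and attribute names here are illustrative, not from the original code):

import torch
import torch.nn as nn

class CatBlock(nn.Module):
    def __init__(self):
        super().__init__()
        # FloatFunctional carries the observer that cat needs in order to
        # produce a quantized output after convert()
        self.cat_op = nn.quantized.FloatFunctional()

    def forward(self, sources):
        # replaces y = torch.cat([sources[0], sources[1]], dim=1)
        return self.cat_op.cat([sources[0], sources[1]], dim=1)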
@dskhudia
Thank you very much, I changed score_link = y[0,:,:,1].cpu().data.numpy() to score_link = y[0,:,:,1].int_repr().cpu().data.numpy() as per your suggestion, but the prediction is very bad.
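Note that int_repr() returns the raw integer values without applying the scale and zero_point, so if the downstream postprocessing expects float scores, dequantizing first is usually what's needed (assuming y is still a quantized tensor at that point):

score_link = y.dequantize()[0, :, :, 1].cpu().numpy()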
You may want to try some quantization accuracy improvement techniques, such as:
per-channel quantization for weights
quantization aware training (a minimal sketch follows the norm snippet below)
measuring torch.norm between the float model and the quantized model to see where it's off the most
and for the norm you can use something like the following:
SQNR = []
for i in range(len(ref_output)):
    # signal-to-quantization-noise ratio (dB) per output tensor
    SQNR.append(20 * torch.log10(torch.norm(ref_output[i][0]) / torch.norm(ref_output[i][0] - qtz_output[i][0])).numpy())
print('SQNR (dB)', SQNR)
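Higher SQNR means the quantized output is closer to the float reference, so a sharp drop at one point in the network tells you where quantization hurts most.
For the quantization aware training suggestion above, a minimal eager-mode sketch (the model, data loader, optimizer, and loss are assumptions, not from this thread):

import torch

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)
# fine-tune for a few epochs with fake-quantization inserted
for images, targets in train_loader:  # hypothetical loader
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
# convert to a real int8 model after fine-tuning
model.eval()
quantized_model = torch.quantization.convert(model)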
Float16 quantized operators do not exist for static quantization. Since current CPUs do not natively support float16 compute, converting to float16 doesn't provide much performance benefit for compute-bound cases.
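If float16 is still of interest, dynamic quantization does support it for nn.Linear; this mainly shrinks weight storage, since the weights are stored in float16 while activations stay float (a one-liner, assuming the usual imports):

import torch
import torch.nn as nn

model_fp16 = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.float16)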
ref_output is from the float model. You might want to check the norm at a few different places in the network to see where we are deviating too much from the floating point results.
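One way to do that, as a rough sketch (the hook bookkeeping here is an assumption, not from the thread): register forward hooks on both models and compare per-module activations. Module names match between the fused float model and the converted model, since convert() swaps modules in place:

import torch

def capture_outputs(model, store):
    # save every named module's output tensor into store under its name
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                store[name] = output.dequantize() if output.is_quantized else output.detach()
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

float_acts, quant_acts = {}, {}
capture_outputs(fused_model, float_acts)
capture_outputs(quantized, quant_acts)
fused_model(img)   # img: a representative preprocessed input
quantized(img)
for name in float_acts:
    if name in quant_acts and float_acts[name].shape == quant_acts[name].shape:
        rel = torch.norm(float_acts[name] - quant_acts[name]) / torch.norm(float_acts[name])
        print(name, float(rel))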
In PyTorch there's a way to compare the module-level quantization error, which could help to debug and narrow down the issue. I'm working on an example and will post it here later.
@Raghav_Gurbaxani, have you tried using the histogram observer for activations? In most cases this can improve the accuracy of the quantized model. You can do:
model.qconfig = torch.quantization.QConfig(
    activation=torch.quantization.default_histogram_observer,
    weight=torch.quantization.default_per_channel_weight_observer)
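After setting this qconfig, run the usual prepare → calibrate → convert flow again. The histogram observer is slower during calibration than the default min/max observer, but it searches for better clipping ranges and usually recovers some accuracy.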
Have you checked the accuracy of fused_model? By checking the accuracy of fused_model before converting to the int8 model, we can tell whether the issue is in the preprocessing or in the quantized model.
If fused_model has good accuracy, the next step is to check the quantization error of the weights. Could you try the following code:
import numpy as np

def l2_error(ref_tensor, new_tensor):
    """Compute the l2 error between two tensors.

    Args:
        ref_tensor (numpy array): Reference tensor.
        new_tensor (numpy array): New tensor to compare with.

    Returns:
        abs_error: l2 error
        relative_error: relative l2 error
    """
    assert (
        ref_tensor.shape == new_tensor.shape
    ), "The shape between the two tensors is different"
    diff = new_tensor - ref_tensor
    abs_error = np.linalg.norm(diff)
    ref_norm = np.linalg.norm(ref_tensor)
    if ref_norm == 0:
        if np.allclose(ref_tensor, new_tensor):
            relative_error = 0
        else:
            relative_error = np.inf
    else:
        relative_error = abs_error / ref_norm
    return abs_error, relative_error
float_model_dbg = fused_model
qmodel_dbg = quantized
for key in float_model_dbg.state_dict().keys():
    float_w = float_model_dbg.state_dict()[key]
    qkey = key
    # Get rid of the extra hierarchy of the fused Conv in the float model:
    # e.g. 'conv.0.weight' in the fused model becomes 'conv.weight' in the
    # quantized model.
    if key.endswith('.weight'):
        qkey = key[:-9] + key[-7:]
    if qkey in qmodel_dbg.state_dict():
        q_w = qmodel_dbg.state_dict()[qkey]
        if q_w.dtype == torch.float:
            abs_error, relative_error = l2_error(float_w.numpy(), q_w.detach().numpy())
        else:
            abs_error, relative_error = l2_error(float_w.numpy(), q_w.dequantize().numpy())
        print(key, ', abs error = ', abs_error, ', relative error = ', relative_error)
It should print out the quantization error for each Conv weight.
Looks like the first Conv, basenet.slice1.3.0.weight, has the largest error. Could you try skipping the quantization of that Conv and keeping it as a float module? We have previously seen that the first Conv of some CV models is sensitive to quantization, and skipping it gives better accuracy.
@hx89 actually it seems like all of these have pretty high relative errors -
[basenet.slice1.7.0.weight, basenet.slice1.10.0.weight, basenet.slice2.14.0.weight, basenet.slice2.17.0.weight, basenet.slice3.20.0.weight, basenet.slice3.24.0.weight, basenet.slice3.27.0.weight, basenet.slice4.30.0.weight, basenet.slice4.34.0.weight]
Keeping a few layers as float while converting the rest to int8 seems like a good idea, although I am not sure how to pass the partial model to torch.quantization.convert() and then combine the partially quantized model and the unquantized layers for inference on the image.
It's actually simpler. To skip the first conv, for example, there are two steps:
Step 1: Move the quant stub after the first conv in the forward function of the module.
For example, in the original quantizable module the quant stub is at the beginning, before conv1:
class QuantizableNet(nn.Module):
    def __init__(self):
        ...
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv1(x)
        x = self.maxpool(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x
To skip the quantization of conv1 we can move self.quant() after conv1:
class QuantizableNet(nn.Module):
    def __init__(self):
        ...
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.conv1(x)   # runs in float; quantization starts after this
        x = self.quant(x)
        x = self.maxpool(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x
Step 2: Then we need to set the qconfig of conv1 to None after prepare(). This way PyTorch knows we want to keep conv1 as a float module and won't swap it with the quantized module:
model = QuantizableNet()
...
model = torch.quantization.prepare(model)
model.conv1.qconfig = None
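After that, calibrate and convert as usual; convert() will leave conv1 as a float module. A sketch of the remaining steps (calibration_loader is a placeholder name):

model.eval()
with torch.no_grad():
    for images in calibration_loader:  # representative inputs
        model(images)
quantized_model = torch.quantization.convert(model)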