After a neural network is quantized, how do I run inference on the GPU?

# Instantiate the model and prepare it for quantization
quantized_model = GACNFuseNet()
quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
quantized_model_prepared = torch.quantization.prepare(quantized_model, inplace=True)
quantized_model_int8 = torch.quantization.convert(quantized_model_prepared, inplace=True)
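One step the snippet above skips is calibration: between prepare() and convert(), the prepared model has to see representative inputs so the observers can record activation ranges (the "must run observer before calling calculate_qparams" warning in the log below comes from skipping this). A minimal sketch of the full eager-mode flow, using a toy module in place of GACNFuseNet (TinyNet and the random data are placeholders, not part of the original code):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    # Placeholder model; QuantStub/DeQuantStub mark the region to quantize
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(1, 4, 3, padding=1)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.conv(self.quant(x)))

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
prepared = torch.quantization.prepare(model)

# Calibration: run representative data through the prepared model
# so the observers can compute scales and zero points
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(1, 1, 32, 32))

quantized = torch.quantization.convert(prepared)

# The converted model runs on the CPU (fbgemm backend); inputs stay on CPU
with torch.no_grad():
    out = quantized(torch.randn(1, 1, 32, 32))
print(out.shape)  # torch.Size([1, 4, 32, 32])
```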



# Move the model to the GPU
quantized_model_int8 = quantized_model_int8.to("cuda")
# Two input images
img1_tensor = img1_tensor.to("cuda")
img2_tensor = img2_tensor.to("cuda")
t1 = time.time()
# Inference
with torch.no_grad():
    output = quantized_model_int8(img1_tensor, img2_tensor)
t2 = time.time()
with torch.no_grad():
    output_2 = model(img1_tensor, img2_tensor)
t3 = time.time()
print(f'Done. ({(1E3 * (t3 - t2)):.1f}ms) Inference.')
print(f'Done. ({(1E3 * (t2 - t1)):.1f}ms) Inference.')

How can I run inference on the GPU? I have the quantized model 'gacn_quant.pth', and I move the model and the inputs to CUDA with .to("cuda"), but running inference crashes:
/home/ubuntu/anaconda3/envs/zzy-quant/bin/python3.8 /home/ubuntu/Data1/zzy/GACN/
/home/ubuntu/anaconda3/envs/zzy-quant/lib/python3.8/site-packages/torch/ao/quantization/ UserWarning: Please use quant_min and quant_max to specify the range for observers. reduce_range will be deprecated in a future release of PyTorch.
/home/ubuntu/anaconda3/envs/zzy-quant/lib/python3.8/site-packages/torch/ao/quantization/ UserWarning: must run observer before calling calculate_qparams. Returning default scale and zero point
/home/ubuntu/anaconda3/envs/zzy-quant/lib/python3.8/site-packages/torch/ UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of
img1.shape: (520, 520, 3)
img1.shape: torch.Size([1, 1, 520, 520])

Process finished with exit code 139

Segmentation fault (core dumped)
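The segfault is consistent with the int8 kernels simply not existing on CUDA: eager-mode quantized ops are implemented only for the CPU backends, so a converted model and its inputs must stay on the CPU. You can check which quantization engines your build actually supports (a quick sketch; the exact list depends on your PyTorch build):

```python
import torch

# Eager-mode quantized kernels exist only for CPU backends;
# 'cuda' will never appear in this list
engines = torch.backends.quantized.supported_engines
print(engines)  # e.g. something like ['none', 'fbgemm', ...] on x86 Linux
```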
I also tried converting the model to ONNX and quantizing it, but it does not run correctly on CUDA: the inference time is the same as on the CPU, and also the same as the unquantized model. Why?

onnx_model_path = "mymodel.quant_static.onnx"
# onnx_model_path = "mymodel.onnx"
DEVICE_NAME = 'cuda' if torch.cuda.is_available() else 'cpu'
# Get the model's input and output names
session = onnxruntime.InferenceSession(onnx_model_path,
                                       providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider',
                                                  'CPUExecutionProvider'])
# session = create_session(onnx_model_path, "cuda")
input_names = [inp.name for inp in session.get_inputs()]
output_names = [out.name for out in session.get_outputs()]
print("Model input names:", input_names)
print("Model output names:", output_names)
# Build the input dict; keys must match the model's input names
inputs = {
    input_names[0]: img1_tensor.numpy(),
    input_names[1]: img2_tensor.numpy()
}

t1 = time.time()
outputs = session.run(output_names, inputs)
print(outputs)
t3 = time.time()
print(f'Done. ({(1E3 * (t3 - t1)):.1f}ms) Inference.')

Unfortunately, the flow you are using does not have good GPU support; it is mainly for server CPUs (fbgemm) and mobile CPUs (qnnpack/xnnpack). What kind of quantization are you planning to do? We have a new repo that might serve GPU quantization better: GitHub - pytorch/ao: Create and integrate custom data types, layouts and kernels with up to 2x speedups with 65% less VRAM for inference and support for training