RuntimeError in torch.quantization.convert after QAT on GPU

Hello! I get a runtime error when I try to convert my model after QAT on GPU:

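import torch
from copy import deepcopy

# get_model and QuantizationWrapper are my own helpers (not shown here)
backbone = get_model().cuda()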
# QAT - step 1 - fuse layers for numerical stability
torch.quantization.fuse_modules(backbone, modules_to_fuse=[["conv1", "bn1"],
                                                           ["fc5", "bn5"]],
                                inplace=True)
# QAT - step 2 - add quantization observers into network
backbone = QuantizationWrapper(model_fp32=backbone)
# QAT - step 3 - select quantization config
quantization_config = torch.quantization.get_default_qat_qconfig("qnnpack") 
backbone.qconfig = quantization_config
# QAT - step 4 - prepare model to QAT
torch.quantization.prepare_qat(backbone, inplace=True)

...  # here is my training cycle and it works well according to metric probes

# After training, I try to convert the model to its quantized form
quantized_backbone = deepcopy(backbone).eval().cpu()  # neither .cpu() nor .cuda() works
torch.quantization.convert(quantized_backbone, inplace=True)  # Error


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument observer_on in method wrapper_CUDA___fused_moving_avg_obs_fq_helper)

If I move the backbone to the CPU at the beginning of the script, the error disappears, but training slows down dramatically :frowning: So I want to keep training on the GPU and convert only afterwards. Please help me :pleading_face:

Python 3.10.12, torch 2.1.1+cu121, Ubuntu 22.04

I ran some tests with different versions of torch and found that:

  • 2.0.0: :white_check_mark:

  • 2.1.1: :x:

  • 2.2.0: :x:


:white_check_mark: - no errors

:x: - RuntimeError

How did you get the initial model? Is this an exported model (a model after torch.export)? Can you print quantized_backbone before convert?
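
In case it helps with gathering that information, here is a minimal sketch of a device check that could be run right before convert. The helper names tensors_by_device and report_devices are made up for this example; they rely only on the standard named_parameters() and named_buffers() methods of nn.Module. It only sees registered parameters and buffers, so a tensor stored as a plain attribute would not show up in the report.

```python
from collections import defaultdict


def tensors_by_device(model):
    """Group the names of a module's parameters and buffers by device."""
    groups = defaultdict(list)
    for name, param in model.named_parameters():
        groups[str(param.device)].append("param:" + name)
    for name, buf in model.named_buffers():
        groups[str(buf.device)].append("buffer:" + name)
    return dict(groups)


def report_devices(model):
    """Print one line per device; more than one line means mixed devices."""
    groups = tensors_by_device(model)
    for device, names in sorted(groups.items()):
        # Show only the first few names to keep the output readable.
        print(f"{device}: {len(names)} tensors, e.g. {names[:5]}")
    return groups
```

Calling print(quantized_backbone) followed by report_devices(quantized_backbone) just before torch.quantization.convert should show whether any parameters or buffers are still on cuda:0 after .cpu() (or on cpu after .cuda()).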