Hello! I’ve run into a runtime error when, after QAT on GPU, I try to convert my model:
...
backbone = get_model().cuda()
# QAT - step 1 - fuse layers for numerical stability
torch.quantization.fuse_modules(
    backbone,
    modules_to_fuse=[["conv1", "bn1"], ["fc5", "bn5"]],
    inplace=True,
)
# QAT - step 2 - add quantization observers into network
backbone = QuantizationWrapper(model_fp32=backbone)
# QAT - step 3 - select quantization config
quantization_config = torch.quantization.get_default_qat_qconfig("qnnpack")
backbone.qconfig = quantization_config
# QAT - step 4 - prepare model to QAT
torch.quantization.prepare_qat(backbone, inplace=True)
backbone.train()
... # here is my training cycle and it works well according to metric probes
# So, after training, I try to convert to quantized state
quantized_backbone = deepcopy(backbone).eval().cpu()  # neither .cpu() nor .cuda() helps
torch.quantization.convert(quantized_backbone, inplace=True) # Error
...
Error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument observer_on in method wrapper_CUDA___fused_moving_avg_obs_fq_helper)
If I move the backbone to the CPU at the beginning of the script, the error disappears, but training slows down dramatically. So I want to keep training on the GPU and convert only afterwards. Please help me!
Python 3.10.12, torch 2.1.1+cu121, Ubuntu 22.04
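For completeness, here is a minimal self-contained sketch of the same flow. The toy model, layer names, and shapes are made-up stand-ins for my real backbone and `QuantizationWrapper` (I use plain `QuantStub`/`DeQuantStub` here); the QAT steps themselves mirror my script. On a CPU-only machine the convert step goes through; the RuntimeError only shows up for me after the observers have run on CUDA:

```python
import torch
import torch.nn as nn
from copy import deepcopy

# Hypothetical toy stand-in for my real backbone + wrapper.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3)
        self.bn1 = nn.BatchNorm2d(8)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.bn1(self.conv1(self.quant(x))))

backbone = Toy().eval()  # eager-mode fuse_modules folds conv+bn in eval mode
torch.quantization.fuse_modules(backbone, [["conv1", "bn1"]], inplace=True)
backbone.qconfig = torch.quantization.get_default_qat_qconfig("qnnpack")
backbone.train()  # prepare_qat expects a model in training mode
torch.quantization.prepare_qat(backbone, inplace=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = backbone.to(device)
# Stand-in for the training loop: one forward pass to update the observers.
backbone(torch.randn(2, 3, 16, 16, device=device))

quantized_backbone = deepcopy(backbone).eval().cpu()
# On my setup this line raises the RuntimeError when training ran on CUDA.
torch.quantization.convert(quantized_backbone, inplace=True)
print(type(quantized_backbone.conv1))
```

On a CPU-only box the print shows a quantized `Conv2d` module, which is exactly the state I want to reach after GPU training.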