Export to ONNX failure with mismatched devices

Hi, I am trying to export a module to ONNX and encountered the following error:

```
from torch.onnx.symbolic_helper import _constant_folding_opset_versions
        if do_constant_folding and _export_onnx_opset_version in _constant_folding_opset_versions:
            params_dict = torch._C._jit_pass_onnx_constant_fold(graph, params_dict,
>                                                               _export_onnx_opset_version)
E           RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```

This issue happens with PyTorch v1.11.
I explicitly checked that, on the Python side, all of the tensors are on the GPU.
But when I traced into constant_fold.cpp, the failure was triggered for an onnx::Mul node whose two input values are on different devices, CPU and GPU.

Can someone help with this?

Thanks in advance!

Cross-posting from Slack:

Could you also check whether tracing the model already fails (e.g. via torch.jit.trace)?
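A minimal sketch of that check (again with `model` and `dummy_input` as hypothetical stand-ins): if this raises, the problem is already in tracing rather than in the ONNX constant-folding pass.

```python
import torch

# Hypothetical stand-ins for the real module and example input;
# for the reported issue these would live on cuda:0.
model = torch.nn.Linear(4, 2).eval()
dummy_input = torch.randn(1, 4)

# If tracing itself fails, the device mismatch exists before
# torch._C._jit_pass_onnx_constant_fold is ever reached.
traced = torch.jit.trace(model, dummy_input)
print(type(traced))
```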