Thanks for updating the code!
Based on the new code, exporting the model on the CPU and then loading and running it on the GPU works fine; the only catch is that the loaded module has to be moved to the GPU before calling it:
# Save the exported program on the CPU and load it back
torch.export.save(exported_program, "test.pt")
m = torch.export.load("test.pt")
print(m)
# Calling the loaded module with CUDA inputs while its parameters are still on the CPU fails
m.module()(*[e.cuda() for e in example_args])
# RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
# Moving the module to the GPU first makes it work
m.module().cuda()
out = m.module()(*[e.cuda() for e in example_args])
assert torch.allclose(out.cpu(), out_ref)