Convert PyTorch model to ONNX format (inference results not the same)

I have tested with torch.ones(1, 3, 224, 224) as input and the model works, but with my own inputs and the same preprocessing used in training, the ONNX model produces different outputs from the PyTorch model.
Code to convert the model to ONNX format:

import torch
import onnx
from onnx_tf.backend import prepare

# model, PATH and device are defined earlier in the script
model.load_state_dict(torch.load(PATH, map_location=torch.device('cpu')))
print("Model is loaded")
model.eval()

# Export model to ONNX format
x = torch.randn(1, 3, 224, 224).to(device)
torch.onnx.export(model,
                  x,
                  "vgg16.onnx",
                  opset_version=10,
                  do_constant_folding=True,
                  export_params=True,
                  input_names=["input"],
                  output_names=["output"],
                  verbose=True,
                  dynamic_axes={'input': {0: 'batch_size'},    # variable length axes
                                'output': {0: 'batch_size'}})
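As a sanity check, the exported file can also be validated with the onnx package that is imported above. This is only a minimal sketch of a structural check; it does not compare numerical outputs:

import onnx

# Load the exported graph and run the structural checker.
# This catches malformed graphs, but will not detect output mismatches.
onnx_model = onnx.load("vgg16.onnx")
onnx.checker.check_model(onnx_model)
print(onnx.helper.printable_graph(onnx_model.graph))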

Code for inference with ONNX Runtime:

import numpy as np
import onnxruntime as ort
import matplotlib.pyplot as plt

ort_session = ort.InferenceSession("/content/vgg16.onnx")

# test_dataset and torch_out (the PyTorch model's output for the same image)
# are defined earlier in the notebook
x = test_dataset.load_img(1).transpose(2, 0, 1)   # HWC -> CHW
plt.imshow(test_dataset.load_img(1))

def to_numpy(tensor):
    if tensor.requires_grad:
        return tensor.detach().cpu().numpy()
    return tensor.cpu().numpy()

outputs = ort_session.run(
    None,
    {"input": x[None].astype("float32")},
)

# compare ONNX Runtime and PyTorch results
np.testing.assert_allclose(to_numpy(torch_out),
                           outputs[0],
                           rtol=1e-03,
                           atol=1e-05)

print("Exported model has been tested with ONNXRuntime, and the result looks good!")

AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-05

Mismatched elements: 36 / 36 (100%)
Max absolute difference: 0.17844993
Max relative difference: 0.8394638
x: array([[0.171307, 0.180779, 0.179579, 0.225714, 0.232095, 0.220075,
0.443109, 0.470671, 0.488748, 0.538834, 0.530197, 0.539141,
0.038368, 0.028497, 0.096283, 0.401647, 0.279558, 0.50373 ,…
y: array([[0.338318, 0.345975, 0.340239, 0.349426, 0.356006, 0.352419,
0.478905, 0.489058, 0.498031, 0.514408, 0.505635, 0.498025,
0.199641, 0.17751 , 0.274733, 0.458645, 0.396497, 0.490221,…

Can you try running model.eval() before running inference of the PyTorch model?


I have applied model.eval() before running inference of the PyTorch model, but I am still getting this output difference.

I am using the versions below:
Torch version: 1.10.0+cu111
ONNX version: 1.11.0

Are you tracing your model, and if so, do you have data-dependent control flow in the forward method, which would be traced into a static execution?
Would scripting the model instead help?
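A minimal sketch of what scripting before export could look like, assuming the same model, dummy input x, and export arguments as in your snippet:

import torch

# Script the model so that data-dependent control flow in forward()
# is preserved, instead of being frozen by tracing.
scripted_model = torch.jit.script(model)

torch.onnx.export(scripted_model,
                  x,
                  "vgg16_scripted.onnx",
                  opset_version=10,
                  input_names=["input"],
                  output_names=["output"],
                  dynamic_axes={'input': {0: 'batch_size'},
                                'output': {0: 'batch_size'}})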


I am not tracing my model. I am using a pretrained VGG16, fine-tuning it, and saving the model to a .pth file, which I then export to .onnx.

My ONNX graph looks like this:
[image: exported ONNX graph]

torch.onnx.export would trace the model as described in the docs:

Exports a model into ONNX format. If model is not a torch.jit.ScriptModule nor a torch.jit.ScriptFunction, this runs model once in order to convert it to a TorchScript graph to be exported (the equivalent of torch.jit.trace()). Thus this has the same limited support for dynamic control flow as torch.jit.trace().

In any case, I cannot reproduce the issue and get the same results up to the expected difference due to the limited floating point precision:

import numpy as np
import torch
import onnxruntime as ort
from torchvision import models

# setup
model = models.vgg16().eval()
x = torch.randn(1, 3, 224, 224)

# PyTorch reference output
out = model(x)

# export to ONNX
torch.onnx.export(
    model,
    x,
    'vgg.onnx',
    input_names=["input"],
    output_names=["output"],
    verbose=True,
    dynamic_axes={'input': {0: 'batch_size'},
                  'output': {0: 'batch_size'}}
)

# ONNX reference output
ort_session = ort.InferenceSession("vgg.onnx")
outputs = ort_session.run(
    None,
    {"input": x.numpy()},
)

# compare ONNX Runtime and PyTorch results
print(np.max(np.abs(out.detach().numpy() - outputs[0])))
# > 8.6426735e-07

Thanks for the help. I managed to resolve the issue. It turned out to be the preprocessing: when I use an image from the PyTorch data loader to evaluate the ONNX model, the outputs match, but the custom loader I built to load data before it is passed to the torch data loader was somehow causing the difference.
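For anyone hitting the same thing, this is roughly the check that exposed it for me. The my_custom_loader helper and the normalization values are only illustrative stand-ins for my actual pipeline:

import numpy as np
from PIL import Image
from torchvision import transforms

# Reference preprocessing: the torchvision transforms used during training
# (ImageNet mean/std shown here only as an example).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("sample.jpg").convert("RGB")
reference = preprocess(img).unsqueeze(0).numpy()   # shape (1, 3, 224, 224)

# my_custom_loader is a hypothetical stand-in for the custom loading code
# that was feeding the ONNX session.
custom = my_custom_loader("sample.jpg")

# If this fails, the ONNX vs. PyTorch comparison fails for the same reason.
np.testing.assert_allclose(reference, custom, rtol=1e-3, atol=1e-5)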