Inference result differs between PyTorch and ONNX models

Problem

Hi,

I converted a PyTorch model to an ONNX model.
However, the outputs of the two models differ, as shown below.
[Image: inference_result]

Inference environment

PyTorch

・Python 3.7.11
・PyTorch 1.6.0
・torchvision 0.7.0
・CUDA Toolkit 10.1
・NumPy 1.21.5
・Pillow 8.4.0

ONNX

・onnxruntime-win-x64-gpu-1.4.0
・Visual Studio 2017
・Cuda compilation tools, release 10.1, V10.1.243

Model architecture

[Image: multi_inout_network — detailed information about our model]

Script for converting the model from PyTorch to ONNX

import torch
import torch.onnx as onnx
import torchvision.models as models
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self, pretrained):
        # Run the initializer of the superclass (nn.Module)
        super().__init__()
        
        self.pretrained = pretrained

        # Replace the backbone's final fully connected layer with a 128-d head
        self.pretrained.fc = nn.Linear(2048, 128)
        self.fc1 = nn.Linear(256, 128)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(256, 128)
        self.fc_y1 = nn.Linear(128*2, 64)
        self.fc_y1_1 = nn.Linear(64, 32)
        self.fc_y1_2 = nn.Linear(32, 16)
        self.fc_y1_3 = nn.Linear(16, 32)
        self.fc_y1_4 = nn.Linear(32, 16)
        self.fc_y1_5 = nn.Linear(16, 8)
        self.fc_y1_6 = nn.Linear(8, 2)
        
        self.fc_y2 = nn.Linear(128*2, 64)
        self.fc_y2_1 = nn.Linear(64, 32)
        self.fc_y2_2 = nn.Linear(32, 16)
        self.fc_y2_3 = nn.Linear(16, 32)
        self.fc_y2_4 = nn.Linear(32, 16)
        self.fc_y2_5 = nn.Linear(16, 8)
        self.fc_y2_6 = nn.Linear(8, 2)
        
        self.bn_256 = nn.BatchNorm1d(num_features = 256)
        self.bn_128 = nn.BatchNorm1d(num_features = 128)
        self.bn_64 = nn.BatchNorm1d(num_features = 64)
        self.bn_32 = nn.BatchNorm1d(num_features = 32)
        self.bn_16 = nn.BatchNorm1d(num_features = 16)
        self.bn_8 = nn.BatchNorm1d(num_features = 8)

    def forward(self, x0, x1, x2, x3):  # compute the two outputs from the four inputs
        x0 = self.pretrained(x0)
        x1 = F.relu(self.bn_128(self.fc1(x1)))
        x2 = F.relu(self.bn_128(self.fc2(x2)))
        x3 = F.relu(self.bn_128(self.fc3(x3)))
        
        x123 = x1 + x2 + x3           # sum the three histogram branches
        x = torch.cat((x0, x123), 1)  # concatenate with the image features
        
        # First output head
        y0 = F.relu(self.bn_64(self.fc_y1(x)))
        y0 = F.relu(self.bn_32(self.fc_y1_1(y0)))
        y0 = F.relu(self.bn_16(self.fc_y1_2(y0)))
        y0 = F.relu(self.bn_32(self.fc_y1_3(y0)))
        y0 = F.relu(self.bn_16(self.fc_y1_4(y0)))
        y0 = F.relu(self.bn_8(self.fc_y1_5(y0)))
        y0 = self.fc_y1_6(y0)
        
        # Second output head
        y1 = F.relu(self.bn_64(self.fc_y2(x)))
        y1 = F.relu(self.bn_32(self.fc_y2_1(y1)))
        y1 = F.relu(self.bn_16(self.fc_y2_2(y1)))
        y1 = F.relu(self.bn_32(self.fc_y2_3(y1)))
        y1 = F.relu(self.bn_16(self.fc_y2_4(y1)))
        y1 = F.relu(self.bn_8(self.fc_y2_5(y1)))
        y1 = self.fc_y2_6(y1)
        
        return y0, y1


# ResNet-50 backbone pretrained on ImageNet
resnet = models.resnet50(pretrained=True)

model = Net(resnet)
model.load_state_dict(torch.load('path to Pytorch model'))
# eval() disables dropout and freezes BatchNorm running statistics,
# which is required for the exported graph to match inference behavior
model.eval()


# Dummy inputs that define the input shapes for tracing
# (requires_grad is not needed for export)
input_image = torch.randn(1, 3, 224, 224)
input_hist_R = torch.randn(1, 256)
input_hist_G = torch.randn(1, 256)
input_hist_B = torch.randn(1, 256)

input_tuple = (input_image, input_hist_R, input_hist_G, input_hist_B)
# Names assigned to the graph inputs and outputs
input_names = ["input.1", "input.154", "input.156", "input.158"]
output_names = ["593", "612"]

onnx.export(model, input_tuple, 'multi_inout.onnx',
            export_params=True,
            opset_version=12,
            do_constant_folding=True,
            input_names=input_names,
            output_names=output_names)
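
For completeness, a minimal sketch of how the two outputs can be compared numerically at the end of the same script (this assumes onnxruntime is installed in the Python environment; the input/output names are the ones passed to the export above):

import numpy as np
import onnxruntime as ort

# PyTorch outputs on the same dummy inputs
with torch.no_grad():
    torch_y0, torch_y1 = model(input_image, input_hist_R, input_hist_G, input_hist_B)

# ONNX Runtime outputs from the exported file
session = ort.InferenceSession('multi_inout.onnx')
onnx_inputs = {
    "input.1": input_image.numpy(),
    "input.154": input_hist_R.numpy(),
    "input.156": input_hist_G.numpy(),
    "input.158": input_hist_B.numpy(),
}
onnx_y0, onnx_y1 = session.run(["593", "612"], onnx_inputs)

# Small float32 discrepancies (on the order of 1e-5) are expected;
# anything larger points to a real conversion problem
np.testing.assert_allclose(torch_y0.numpy(), onnx_y0, rtol=1e-3, atol=1e-5)
np.testing.assert_allclose(torch_y1.numpy(), onnx_y1, rtol=1e-3, atol=1e-5)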

Environment for converting the model

・Python 3.7.11
・PyTorch 1.6.0
・torchvision 0.7.0
・CUDA Toolkit 10.1
・NumPy 1.21.5
・Pillow 8.4.0

Is there something wrong with the way I converted the model?
If necessary, I can share inference scripts and models.

Please help…

Which opset are you using? See this issue for more detail: ONNX returning different results than same PyTorch model · Issue #2831 · onnx/onnx · GitHub

Hi, @marksaroufim

Thank you for replying.
I used opset 12.
When I tried opset 11, the inference result didn't change.
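
For reference, a minimal sketch of the opset 11 export; only the opset_version argument changes (the output filename here is just illustrative):

onnx.export(model, input_tuple, 'multi_inout_opset11.onnx',
            export_params=True,
            opset_version=11,
            do_constant_folding=True,
            input_names=input_names,
            output_names=output_names)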

I have read several issues about differing inference results between PyTorch and ONNX, including the one you sent, but haven't found a resolution.

Is this an ONNX bug?
If so, will it be resolved in the future?

It's not a bug but expected behavior. I haven't personally found a good explanation of why it happens, but opening an issue on the ONNX GitHub may be a good first step.

Hi, @marksaroufim

I see.
I'll post an issue on the ONNX GitHub.

I opened the issue: Inference result is different between Pytorch and ONNX model · Issue #4087 · onnx/onnx · GitHub.
However, I was told that it would be best to have the model conversion handled here.
So, can someone here help me…?

I can share the models and inference scripts via email, if necessary.