Difference in inference between PyTorch model and exported ONNX model

Hi,

I’ve got a problem with running inference on an ONNX model exported from a PyTorch model with torch.onnx.export. I am running my code in a Docker container on multiple machines. The workflow is as follows:
I create a dummy PyTorch model with two hidden layers, export it to ONNX with a dummy input shape, then run inference on both models on the same data and compare the results. On almost every machine the results are exactly identical (comparing them with a plain “==” works perfectly), but on one machine there are differences of up to 1e-05 atol. I’ve tried running torch.onnx.verification.find_mismatch to compare them, but honestly I don’t understand how this function works - it accepts a torch model and an input shape, but where exactly am I supposed to provide the exported ONNX model? Also, do you have any idea why the differences appear only on some machines? The library versions and the opset version are the same on every machine.

Thanks in advance.

Small numerical mismatches are expected on different setups, since different algorithms can be used internally. These algorithms may use a different order of operations, which results in these small errors due to the limited floating point precision.
Here is a small example showing this for two approaches to calculating the sum of a tensor:

import torch

x = torch.randn(100, 100)
# Sum all elements in one call vs. reducing one dimension at a time;
# both are mathematically equal but accumulate in a different order.
s1 = x.sum()
s2 = x.sum(0).sum(0)
print((s1 - s2).abs())
# tensor(2.2888e-05)
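If you just want to verify that the PyTorch and ONNX Runtime outputs agree, comparing with a tolerance instead of a plain == avoids flagging these expected mismatches. A minimal sketch, using made-up tensors in place of the real outputs:

import torch

# Stand-ins for the PyTorch and onnxruntime outputs; the second one
# simulates a tiny numerical mismatch of the order seen above.
torch_out = torch.randn(1, 8)
ort_out = torch_out + 1e-6 * torch.randn(1, 8)

print(torch.equal(torch_out, ort_out))                # False: exact match fails
print(torch.allclose(torch_out, ort_out, atol=1e-5))  # True: within tolerance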