[BUG] Error in ShareBuffers Optimization

Wilbur · August 7, 2021, 2:14am

Error Description

I encountered a model that produces different outputs when run in onnxruntime and in Glow.

The output of onnxruntime is

[-7.4040246e-01 -1.0558732e+00 -5.1735401e-01  5.7158279e+00
 -6.1898971e-01  4.8943150e-01 -5.3290629e-01 -7.5109071e-01
 -1.0089742e+00 -9.7962457e-01 -2.0205033e-01 -5.6859505e-01
 -8.3097053e-01 -6.0609245e-01 -9.2153203e-01 -8.8401639e-01
 -8.7050164e-01 -9.2254686e-01  6.3326435e+00 -5.2634275e-01
 -2.1332520e-01 -3.9393574e-02 -8.1680071e-01 -7.4010009e-01
 -9.0249908e-01 -9.0031087e-01 -8.4391761e-01 -8.9972514e-01
  5.8119040e+00 -4.5583522e-01  5.3716774e+00 -7.6865733e-01
 -3.2050097e-01 -6.3881731e-01 -7.8147006e-01 -8.8674474e-01
 -8.7157512e-01 -7.5937271e-01  1.0185689e-03 -3.4555185e-01]

whereas the output of Glow is

[-0.742133, -1.054049, -0.511033, 5.701681, -0.617301, 0.488157, -0.531513, -0.749289, -1.006998, -0.977479, -0.201742, -0.567398, -0.830031, -0.606550, -0.920508, -0.882669, -0.869515, -0.921491, 6.325564, -0.525662, -0.213413, -0.039500, -0.816155, -0.739523, -0.901777, -0.899409, -0.843245, -0.898985, 5.806691, -0.454686, 5.361451, -0.770803, -0.321736, -0.639595, -0.781732, -0.886845, -0.871701, -0.757481, 0.010961, -0.342509]
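To make the mismatch easier to see, here is a small sketch that compares the two vectors above element by element and reports the largest absolute difference. It uses only the values already posted; the variable names are just for illustration.

```python
# Outputs copied from the post above (onnxruntime first, then Glow).
ort_out = [
    -7.4040246e-01, -1.0558732e+00, -5.1735401e-01, 5.7158279e+00,
    -6.1898971e-01, 4.8943150e-01, -5.3290629e-01, -7.5109071e-01,
    -1.0089742e+00, -9.7962457e-01, -2.0205033e-01, -5.6859505e-01,
    -8.3097053e-01, -6.0609245e-01, -9.2153203e-01, -8.8401639e-01,
    -8.7050164e-01, -9.2254686e-01, 6.3326435e+00, -5.2634275e-01,
    -2.1332520e-01, -3.9393574e-02, -8.1680071e-01, -7.4010009e-01,
    -9.0249908e-01, -9.0031087e-01, -8.4391761e-01, -8.9972514e-01,
    5.8119040e+00, -4.5583522e-01, 5.3716774e+00, -7.6865733e-01,
    -3.2050097e-01, -6.3881731e-01, -7.8147006e-01, -8.8674474e-01,
    -8.7157512e-01, -7.5937271e-01, 1.0185689e-03, -3.4555185e-01,
]
glow_out = [
    -0.742133, -1.054049, -0.511033, 5.701681, -0.617301, 0.488157,
    -0.531513, -0.749289, -1.006998, -0.977479, -0.201742, -0.567398,
    -0.830031, -0.606550, -0.920508, -0.882669, -0.869515, -0.921491,
    6.325564, -0.525662, -0.213413, -0.039500, -0.816155, -0.739523,
    -0.901777, -0.899409, -0.843245, -0.898985, 5.806691, -0.454686,
    5.361451, -0.770803, -0.321736, -0.639595, -0.781732, -0.886845,
    -0.871701, -0.757481, 0.010961, -0.342509,
]

# Absolute difference per element, then the position of the largest one.
diffs = [abs(a - b) for a, b in zip(ort_out, glow_out)]
worst_index = max(range(len(diffs)), key=diffs.__getitem__)
worst_diff = diffs[worst_index]

print(f"largest |ort - glow| = {worst_diff:.6f} at index {worst_index}")
```

The largest gap is at index 3 (5.7158279 vs 5.701681), and element 38 differs by roughly a factor of ten (1.0185689e-03 vs 0.010961).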

Cause Analysis

According to the model definition, Node Mul_1024 should take the output of node Sub_1023 as input.

If you use the -dump-ir-after-all-passes option to view the IR generated by Glow, you will see the following after the ShareBuffers optimization pass:

...
34 %Sub_1023__1 = elementsub @out %Sub_1023__1_res, @in %Sub_1023__1_res, @in %A1227_transposed
...
98 %Sub_2268__1 = elementsub @out %Sub_1023__1_res, @in %Sub_1023__1_res, @in %A1227_transposed
...
...
...
179 %Sub_1023__1_res__3 = tensorview @in %Sub_1023__1_res { Ty: float<4 x 64 x 1 x 1>, Offsets:[0, 0, 0, 0]} // Users: @in 180
180 %Mul_1024 = elementmul @out %Mul_1024_res, @in %Sub_1023__1_res__3, @in %Add_3784_res
...

Here, node Sub_2268 (line 98) writes its output to %Sub_1023__1_res, the buffer that is supposed to hold the output of node Sub_1023 (line 34). As a result, at line 180 the input to node Mul_1024 is actually the output of node Sub_2268, which causes the wrong output values.
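The hazard can be sketched with a toy interpreter. This is not Glow code and does not model the real ShareBuffers pass; the buffer names, instruction tuples, and input values are all made up for illustration. It only shows why a later read sees the wrong value once two results are assigned to the same buffer.

```python
# Toy model of the buffer-sharing hazard. NOT Glow's implementation;
# buffer names, instruction format, and values are hypothetical.

def run(program, buffers):
    """Execute (op, out, lhs, rhs) instructions in order over named buffers."""
    ops = {"sub": lambda a, b: a - b, "mul": lambda a, b: a * b}
    for op, out, lhs, rhs in program:
        buffers[out] = ops[op](buffers[lhs], buffers[rhs])
    return buffers

def initial_buffers():
    # Made-up scalar inputs standing in for tensors.
    return {"x": 10.0, "y": 3.0, "a": 1.0, "b": 2.0}

# Intended allocation: each Sub result lives in its own buffer.
separate = [
    ("sub", "sub_1023_res", "x", "a"),             # Sub_1023
    ("sub", "sub_2268_res", "y", "a"),             # Sub_2268
    ("mul", "mul_1024_res", "sub_1023_res", "b"),  # Mul_1024 reads Sub_1023
]

# Faulty allocation: Sub_2268 is redirected into Sub_1023's buffer
# while Mul_1024 still needs the old contents.
shared = [
    ("sub", "sub_1023_res", "x", "a"),
    ("sub", "sub_1023_res", "y", "a"),             # clobbers Sub_1023's result
    ("mul", "mul_1024_res", "sub_1023_res", "b"),
]

expected = run(separate, initial_buffers())["mul_1024_res"]
actual = run(shared, initial_buffers())["mul_1024_res"]
print(expected, actual)  # 18.0 4.0
```

In this toy setup, sharing would only be safe if the last read of sub_1023_res came before Sub_2268's write, which is not the case here.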

How To Reproduce

You can download the package (rep.zip, hosted on Google Drive). After unzipping it, follow the steps in README.md to reproduce the results.

System Information:

Glow version: built from GitHub source commit 07a82bd9fe97dfd2e8ea0f4742dce5ce86177c2b

onnxruntime version: 1.7.0

onnx version: 1.9.0

Operating system: Ubuntu 18.04 LTS

CPU: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz 16 cores

BTW, we also reported a similar issue, which also seems to be due to IR translation, at [BUG] Error in IR Parsing.