Linear behaves differently with 2D and 3D input

Hi guys,
I ran the following code on torch 1.13.1, but got unequal results. Why does Linear behave differently with 2D and 3D input?

import torch

a = torch.randn(3, 4, 5)
l1 = torch.nn.Linear(5, 7, bias=True)

# Compare applying Linear to the full 3D tensor (then slicing) vs. applying it to the slice directly
torch.equal(l1(a)[0, 0, :], l1(a[0, 0, :]))            # False
torch.equal(l1(a)[0:2, 0:4, :], l1(a[0:2, 0:4, :]))    # True
torch.equal(l1(a)[0:2, 0:3, :], l1(a[0:2, 0:3, :]))    # False

Hi @XinJiade,

The difference is most likely due to floating point precision, which for torch.float32 is around 1e-7. Although the operations are mathematically identical, the actual sequences of floating point operations can differ, so the results are not bitwise identical.

import torch

a = torch.randn(3, 4, 5)
l1 = torch.nn.Linear(5, 7, bias=True)

# allclose compares within a tolerance instead of requiring bitwise equality
torch.allclose(l1(a)[0, 0, :], l1(a[0, 0, :]))           # True
torch.allclose(l1(a)[0:2, 0:4, :], l1(a[0:2, 0:4, :]))   # True
torch.allclose(l1(a)[0:2, 0:3, :], l1(a[0:2, 0:3, :]))   # True

You can print out the difference between the terms and see that some elements differ by about 1e-8. When checking whether two floating point tensors match, use torch.allclose rather than torch.equal.
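For example, a minimal sketch of that check (re-creating a and l1 as above; the variable names are only illustrative) prints the maximum absolute difference and then compares the tensors with torch.allclose:

import torch

a = torch.randn(3, 4, 5)
l1 = torch.nn.Linear(5, 7, bias=True)

out_3d = l1(a)[0, 0, :]   # apply Linear to the full 3D tensor, then slice
out_1d = l1(a[0, 0, :])   # apply Linear to the 1D slice directly

# The maximum absolute difference is typically on the order of 1e-8 for float32
print((out_3d - out_1d).abs().max())

# allclose uses tolerances (rtol=1e-5, atol=1e-8 by default) instead of bitwise equality
print(torch.allclose(out_3d, out_1d))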


Thank you for your response.
In my situation, these differences will be accumulated as the layers go deeper. I wonder why the order of operations differs between the 2D and 3D Linear paths. Is there any documentation about the operation order of 2D and 3D Linear?

The bulk of machine learning is a minimization problem, so small discrepancies in rounding can act as a form of regularization (albeit a minor one) and can help prevent overfitting. For this reason, it's better to use float16 or bfloat16.


This is expected, and neither result is “more correct” in such a case; both should show a similarly small error when compared against a wider dtype. If these expected numerical errors, caused by the limited floating point precision, are causing issues for your use case, you might want to use e.g. float64.
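For example, a minimal sketch of that comparison (re-creating a and l1 as in the snippets above) builds a float64 copy of the layer as a higher-precision reference; both float32 results should sit a comparably small distance from it:

import copy
import torch

a = torch.randn(3, 4, 5)
l1 = torch.nn.Linear(5, 7, bias=True)

# Build a float64 copy of the same layer to use as a higher-precision reference
l1_fp64 = copy.deepcopy(l1).double()
ref = l1_fp64(a.double())[0, 0, :]

# Both float32 results deviate from the float64 reference by a similarly small amount
err_3d = (l1(a)[0, 0, :].double() - ref).abs().max()
err_1d = (l1(a[0, 0, :]).double() - ref).abs().max()
print(err_3d, err_1d)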
