The difference is most likely due to floating point precision, which for torch.float32 is around 1e-7. Although the operations are mathematically identical, the underlying floating point operations can differ (e.g. in the order of accumulation), and so will the results.
You can print out the difference between the terms and see that they have elements which differ by about 1e-8. When checking whether two tensors are approximately equal, use torch.allclose rather than torch.equal.
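As a minimal sketch (the layer size and input shapes here are made up, not taken from your code), this compares the same nn.Linear applied to a 3D input and to its flattened 2D view:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(16, 16)

# 3D (batched) input and its flattened 2D view
x3d = torch.randn(4, 8, 16)
x2d = x3d.reshape(-1, 16)

out3d = linear(x3d).reshape(-1, 16)
out2d = linear(x2d)

# Mathematically identical, but the float32 results can differ in the last bits
print(torch.equal(out2d, out3d))                 # may be False
print(torch.allclose(out2d, out3d, atol=1e-6))   # True within float32 tolerance
print((out2d - out3d).abs().max())               # on the order of 1e-8 to 1e-7
```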
Thank you for your response.
In my situation, these differences will accumulate as the layers go deeper. I wonder why the operation orders of 2D and 3D linear are different? Are there any documents about the operation orders of 2D and 3D linear?
The bulk of machine learning is a minimization problem, so small discrepancies in rounding can act as a (minor) form of regularization and help prevent overfitting. For this reason, it's better to use float16 or bfloat16.
This is expected, and neither result is “more correct” in such a case; both should show a similar error when compared against a wider dtype. If these expected numerical errors, caused by the limited floating point precision, are causing issues for your use case, you might want to use e.g. float64.
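Continuing the hypothetical sketch from above (the shapes and layer size are still assumptions), you can check this by comparing both float32 paths against a float64 reference and seeing that their errors are of a similar magnitude:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(16, 16)
linear64 = copy.deepcopy(linear).double()  # higher-precision reference copy

x3d = torch.randn(4, 8, 16)
x2d = x3d.reshape(-1, 16)

ref = linear64(x2d.double())  # float64 reference output

err_2d = (linear(x2d) - ref).abs().max()
err_3d = (linear(x3d).reshape(-1, 16) - ref).abs().max()

# Both float32 paths show a comparable error vs. the float64 reference,
# so neither result is "more correct" than the other
print(err_2d, err_3d)
```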