Hi, is nn.Linear expected to produce that large a relative difference when applied to a 3d tensor, as opposed to looping over the first dimension? Interestingly, the looped version matches numpy matrix multiplication exactly, as shown below.

```
import torch
import numpy as np
fc = torch.nn.Linear(256, 128)
inp = torch.rand(3, 10, 256)
out1 = fc(inp).detach().numpy()
out2 = []
for i in range(3):
    out2.append(fc(inp[i]))
out2 = torch.stack(out2).detach().numpy()
w = fc.weight.detach().numpy()
b = fc.bias.detach().numpy()
out3 = inp.numpy() @ w.T + b
# passes this line
np.testing.assert_allclose(out3, out2)
# fails here
np.testing.assert_allclose(out3, out1)
```

```
Traceback (most recent call last):
  File "tmp.py", line 47, in <module>
    np.testing.assert_allclose(out3, out1)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatch: 78.8%
Max absolute difference: 5.9604645e-07
Max relative difference: 0.04850746
x: array([[[-3.184696e-01, 4.671749e-01, -3.306221e-01, ...,
-3.613108e-01, 3.210519e-01, -4.924317e-01],
[-5.997717e-06, 7.380165e-02, 6.725912e-02, ...,...
y: array([[[-3.184695e-01, 4.671748e-01, -3.306221e-01, ...,
-3.613108e-01, 3.210520e-01, -4.924318e-01],
[-5.986542e-06, 7.380170e-02, 6.725915e-02, ...,...
```
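
For reference, the mismatch magnitude (max absolute difference ~6e-7) looks consistent with float32 rounding: the batched path and the per-slice path can accumulate the dot products in a different order, and floating-point addition is not associative, so the low-order bits can differ. A minimal pure-Python sketch of that underlying effect (plain doubles here, not the float32 used above):

```
# Floating-point addition is not associative: summing the same terms
# in a different order can change the low-order bits of the result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # one accumulation order
right = a + (b + c)  # another accumulation order
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

So a bit-exact comparison between the two paths is too strict; comparing with a small absolute tolerance (e.g. `atol=1e-6`, matching the observed max absolute difference) would be the usual check.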