Unexpected nn.Linear behaviour

Hi, is nn.Linear expected to show such a large relative difference when applied to a 3D tensor, as opposed to looping over the first dimension? Interestingly, the looped result matches NumPy matrix multiplication exactly, as shown below.

import torch
import numpy as np

fc = torch.nn.Linear(256, 128)

inp = torch.rand(3, 10, 256)

out1 = fc(inp).detach().numpy()

out2 = []
for i in range(3):
    out2.append(fc(inp[i]))
out2 = torch.stack(out2).detach().numpy()

w = fc.weight.detach().numpy()
b = fc.bias.detach().numpy()
out3 = inp.numpy() @ w.T + b 

# passes this line
np.testing.assert_allclose(out3, out2)
# fails here
np.testing.assert_allclose(out3, out1)
Traceback (most recent call last):
  File "tmp.py", line 47, in <module>
    np.testing.assert_allclose(out3, out1)

Not equal to tolerance rtol=1e-07, atol=0

Mismatch: 78.8%
Max absolute difference: 5.9604645e-07
Max relative difference: 0.04850746
 x: array([[[-3.184696e-01,  4.671749e-01, -3.306221e-01, ...,
         -3.613108e-01,  3.210519e-01, -4.924317e-01],
        [-5.997717e-06,  7.380165e-02,  6.725912e-02, ...,...
 y: array([[[-3.184695e-01,  4.671748e-01, -3.306221e-01, ...,
         -3.613108e-01,  3.210520e-01, -4.924318e-01],
        [-5.986542e-06,  7.380170e-02,  6.725915e-02, ...,...
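For reference, the comparison does pass once the tolerances account for float32 rounding (the default is rtol=1e-07 with atol=0, which is stricter than float32's machine epsilon of about 1.19e-07). A minimal, self-contained sketch, with an atol value chosen as an assumption with some margin over the observed max absolute difference:

```python
import numpy as np
import torch

fc = torch.nn.Linear(256, 128)
inp = torch.rand(3, 10, 256)

# Batched forward pass vs. NumPy reference
out1 = fc(inp).detach().numpy()
out3 = inp.numpy() @ fc.weight.detach().numpy().T + fc.bias.detach().numpy()

# Absolute differences are only a few float32 ULPs, so compare with an
# absolute tolerance instead of the default rtol=1e-07, atol=0
np.testing.assert_allclose(out3, out1, rtol=1e-4, atol=1e-5)
```

The large max relative difference (0.0485) comes from outputs that are themselves near zero (e.g. -5.99e-06), where a tiny absolute error inflates the relative one; an absolute tolerance handles those elements.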

The difference is most likely caused by the limited precision of FP32: its machine epsilon is about 1.19e-07, and your max absolute difference of 5.96e-07 is only a few ULPs, which is expected when the batched and looped paths use different kernels and reduction orders.
If you use DoubleTensors, the difference should be smaller:

fc = torch.nn.Linear(256, 128).double()
inp = torch.rand(3, 10, 256).double()
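Repeating the NumPy comparison from the question in double precision can be sketched as follows (the atol value is an assumption; float64 leaves the differences at the 1e-15 level, so a much tighter tolerance passes):

```python
import numpy as np
import torch

fc = torch.nn.Linear(256, 128).double()
inp = torch.rand(3, 10, 256).double()

out1 = fc(inp).detach().numpy()

# NumPy reference, now entirely in float64
w = fc.weight.detach().numpy()
b = fc.bias.detach().numpy()
out3 = inp.numpy() @ w.T + b

# With float64 the mismatch shrinks to a few double-precision ULPs
np.testing.assert_allclose(out3, out1, rtol=1e-7, atol=1e-12)
```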