Difference between using nn.Linear and manually setting up parameters and implementing forward?

I started learning PyTorch and am looking for some help understanding the basics. I implemented two classes, A and B (shown below), and expected them to do the same thing.

However, this is not the case. When training A, the losses start in the three-digit range, while B (which uses nn.Linear), with the same data and fit loop, starts with five-digit losses. Both then move in the right direction, but I am still wondering what the difference is.

FWIW, I went through the forums and found the topic of initialization, and in C I tried to imitate what I found in Linear's __init__() and reset_parameters() as well as I could. But it did not change the initial losses reported.

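import math

import torch
import torch.nn.functional as F
from torch import nn
from torch.nn import init

class A(nn.Module):
    def __init__(self):
        super().__init__()  # must run before any nn.Parameter is assigned
        # Uniform in [0, 1), scaled by 1/sqrt(number of inputs)
        self.weights = nn.Parameter(torch.rand(224*224*3, 1) / math.sqrt(224*224*3))
        self.bias = nn.Parameter(torch.zeros(1))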
    def forward(self, xb):
        return xb.view(xb.size(0), -1) @ self.weights + self.bias

class B(nn.Module):
    def __init__(self):
        super().__init__()  # must run before any submodule is assigned
        self.lin = nn.Linear(224*224*3, 1)
    def forward(self, xb):
        return self.lin(xb.view(xb.size(0), -1))

class C(nn.Module):
    def __init__(self):
        super().__init__()  # must run before any nn.Parameter is assigned
        # Attempt to imitate nn.Linear's __init__() and reset_parameters()
        self.weights = nn.Parameter(torch.empty(224*224*3, 1))
        init.kaiming_uniform_(self.weights, a=math.sqrt(224*224*3))
        bound = 1 / math.sqrt(224*224*3)
        self.bias = nn.Parameter(torch.empty(1))
        init.uniform_(self.bias, -bound, bound)

    def forward(self, xb):
        return xb.view(xb.size(0), -1) @ self.weights + self.bias

# FWIW, here is the fit loop as well.
epochs = 12
lr = 1e-8
model = B()  # or A()

for epoch in range(epochs):
    for n, (xb, yb) in enumerate(train_dl):  # n counts batches for logging
        yb_ = model(xb)
        loss = F.mse_loss(yb_, yb)
        if n % 20 == 0: print(f'loss: {loss.item():05.2f}')
        loss.backward()  # populate p.grad before the manual update
        with torch.no_grad():
            for p in model.parameters():
                p -= p.grad * lr
                p.grad.zero_()  # gradients accumulate otherwise
    print(epoch, loss.item(), math.sqrt(loss.item()))

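I think you are not using the same std for initializing your weights. PyTorch's nn.Linear uses Kaiming uniform init, scaled by the fan-in (the input size, since Linear stores its weight as (out_features, in_features)), albeit with a somewhat funny gain (a=math.sqrt(5)). That works out to weights drawn uniformly from [-1/sqrt(fan_in), 1/sqrt(fan_in)], whereas torch.rand in A only produces non-negative values.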
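
To make the comparison concrete, here is a rough sketch of how one could check the weight ranges side by side. It assumes a recent PyTorch version in which nn.Linear.reset_parameters() calls kaiming_uniform_ with a=math.sqrt(5). The names w_manual, w_a and w_c are only illustrative; w_a and w_c repeat the init lines from A and C. Note that kaiming_uniform_ reads fan_in from the second dimension of a 2D tensor, so on an (in, 1) tensor as in C it sees fan_in = 1.

```python
import math

import torch
from torch import nn
from torch.nn import init

torch.manual_seed(0)

in_features = 224 * 224 * 3
bound = 1 / math.sqrt(in_features)  # expected limit for nn.Linear's weights

# Reference: the built-in layer, weight shape (out_features, in_features).
lin = nn.Linear(in_features, 1)

# Manual init in the same (out, in) layout, so fan_in == in_features.
w_manual = torch.empty(1, in_features)
init.kaiming_uniform_(w_manual, a=math.sqrt(5))

# Class A's init: torch.rand samples from [0, 1), so no negative weights.
w_a = torch.rand(in_features, 1) / math.sqrt(in_features)

# Class C's init: on an (in, 1) tensor fan_in is read as 1, and a differs.
w_c = torch.empty(in_features, 1)
init.kaiming_uniform_(w_c, a=math.sqrt(in_features))

candidates = [
    ("nn.Linear", lin.weight.detach()),
    ("manual", w_manual),
    ("A", w_a),
    ("C", w_c),
]
for name, w in candidates:
    print(f"{name:>9}: min={w.min().item():+.5f} max={w.max().item():+.5f}")
print(f"1/sqrt(fan_in) = {bound:.5f}")
```

If the printed range of the manual tensor matches nn.Linear while A and C differ, the initialization is a likely cause of the different starting losses. With the (out, in) layout, forward would need xb @ w.t() + bias (or F.linear) instead of xb @ w + bias.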

Best regards