Why does the model fail to converge without manual weight initialization?

hi all,

I am trying to build the simplest possible network, a single linear layer, to fit a linear regression, just to help myself better understand how PyTorch works. However, I have run into a strange issue while training the model.

In my model's __init__() method, I have to add a manual initialization step (shown below) to get the model to converge quickly to my regression function. (The weight values 2 and 3 are arbitrary; I could put any values here and the model would still converge.)

self.layer1.weight = torch.nn.Parameter(torch.Tensor([2, 3]))
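
For reference, here is what this assignment does to the layer's weight (a minimal inspection sketch, using the same input_dim=2, output_dim=1 as in the full script below):

import torch

layer = torch.nn.Linear(2, 1, bias=False)
print(layer.weight.shape)    # default weight, shape [1, 2]

# the manual assignment from my __init__
layer.weight = torch.nn.Parameter(torch.Tensor([2, 3]))
print(layer.weight.shape)    # now shape [2]

x = torch.randn(100, 2)
print(layer(x).shape)        # shape of a forward pass with the replaced weight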

Without this line, the model never converges; the training loss just oscillates randomly in the range of hundreds of thousands. With it, the loss quickly decreases to near 1.
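
For scale, a rough back-of-the-envelope (my own sketch, not part of the script): the inputs are scaled by 100 and the true weights are 5 and 3, so if the weights start near zero the prediction error is roughly 5 * x1 + 3 * x2, and the expected squared error lands in the hundreds of thousands:

# E[(5*x1 + 3*x2)^2] with x1, x2 ~ N(0, 100^2), assuming the weights start near zero
print((5**2 + 3**2) * 100**2)    # 340000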

My first thought was that the default initial weights are too small unless I initialize them to values far from zero. But then I changed the initial values and found that convergence works as long as this line is present, regardless of the exact values I set. Could someone explain what is going on behind the scenes here? Thanks.
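
For reference, the default initialization of a fresh layer can be inspected like this (quick sketch; as far as I understand, PyTorch initializes Linear weights uniformly in roughly ±1/sqrt(in_features), so about ±0.7 for in_features=2):

import torch

layer = torch.nn.Linear(2, 1, bias=False)
print(layer.weight)    # default weights, all within about (-0.71, 0.71)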

My entire script:

import torch
import numpy as np

class Net(torch.nn.Module):
    
    def __init__(self, input_dim, output_dim):
        super(Net, self).__init__()
        self.layer1 = torch.nn.Linear(input_dim, output_dim, bias=False)
        self.layer1.weight = torch.nn.Parameter(torch.Tensor([2, 3]))
        
    def forward(self, x):
        x = self.layer1(x)
        x.squeeze()  # note: squeeze() is not in-place, so x is unchanged here
        return x
        
# generate data using the linear regression setup y = 5 * x1 + 3 * x2

sample_size = 10000
input_dim = 2
output_dim = 1
epoch = 30
bs = 100

data = np.random.randn(sample_size, 3)
data[:, :2] = data[:, :2] * 100
# add a normal noise term
data[:, 2] = 5 * data[:, 0] + 3 * data[:, 1] + np.random.randn(sample_size)
data = torch.Tensor(data)
train_x = data[:, :input_dim]
train_y = data[:, input_dim]

net = Net(input_dim, output_dim)
net.zero_grad()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.RMSprop(net.parameters(), lr=.01)

for i in range(epoch):
    
    batch = 0
    while batch * bs < train_x.shape[0]:
        
        batch_x = train_x[batch * bs : (batch + 1) * bs, :]
        batch_y = train_y[batch * bs : (batch + 1) * bs]
        
        pred_y = net(batch_x)
        loss = criterion(pred_y, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if batch % 100 == 0:
            #print(f"{i} {batch} {loss}")
            print(net.layer1.weight)
        batch += 1