Loss not minimizing and parameters are not being updated

I am a beginner in pytorch and I am trying to build a custom neural network, but every time the training loop runs, the loss doesn’t decrease. I checked and the parameters are not being updated. Can somebody please help me?

#softmax function
def softmax(y):
    return torch.exp(y)/(torch.sum(torch.exp(y)))

#Neural Network Class
class NeuralNetwork(nn.Module):
    def __init__(self,input_size, output_size, hidden_01_size = 128,hidden_02_size = 64 ):
        super(NeuralNetwork,self).__init__()
        
        #layer 0->1
        self.w01 = nn.Parameter(torch.rand(input_size,hidden_01_size), requires_grad = True)

        self.b01 = nn.Parameter(torch.rand(1,hidden_01_size), requires_grad = True)

        
        #layer 1->2
        self.w12 = nn.Parameter(torch.rand(hidden_01_size,hidden_02_size), requires_grad = True)

        self.b12 = nn.Parameter(torch.rand(1,hidden_02_size), requires_grad = True)

        
        #layer 2->3
        self.w23 = nn.Parameter(torch.rand(hidden_02_size, output_size), requires_grad = True)

        self.b23 = nn.Parameter(torch.rand(1,output_size), requires_grad = True)

        #self.params = nn.ParameterList([self.w01, self.b01, self.w12,self.b12, self.w23, self.b23])
        
    
    def forward(self, x):
        input_size = x.shape[0]*x.shape[1]
        input = torch.tensor(x)
        input = torch.reshape(input,(-1,input_size))
        
        #layer 0->1
        hidden_01 = F.sigmoid(torch.matmul(input,self.w01)+self.b01)
        
        #layer 1->2
        hidden_12 = F.sigmoid(torch.matmul(hidden_01,self.w12)+self.b12)
        
        #layer 2->3
        output = softmax(F.sigmoid(torch.matmul(hidden_12,self.w23)+self.b23))
        return output
    
    def predict(self, X):
        pred = self.forward(X)
        return pred.argmax()
    def debug_w(self):
        print(f"bias weight: {self.b01}\n")
    
        
#training loop
def train(X, Y, epochs = 100, learning_rate = 0.05):
    optimizer = SGD(net.parameters(), lr = learning_rate)
    for epoch in range(epochs):
        total_loss = 0
        #net.debug_w()
        for i in range(X.shape[0]):
            x = X[i]
            y = Y[i]
            y_ohot = np.zeros((1,6))
            y_ohot[0,y]=1
            target = torch.tensor(y_ohot)
            output = net.forward(x)
            loss = ((output-target)**2).sum()
            
            optimizer.zero_grad()
            list(net.parameters())
            loss.backward()
            optimizer.step()
            total_loss+= float(loss)
        if(total_loss<0.00001):
            print(f"In iteration {epoch} broke\n")
            break
            
        if(epoch%2==0):
            print(f"Iteration is {epoch+1} : total loss is {total_loss}\n")
            #net.debug_w()

Before getting to loss, there are a few things that could be improved:

  1. If you are performing a classification task (predicting label probabilities), I would use cross-entropy loss. In pytorch, the built-in nn.CrossEntropyLoss module takes the raw output of the model, before any softmax is applied. So if you use the pytorch module (and I think you should), you shouldn’t run the final layer through softmax or sigmoid. To be honest, I can’t think of a good reason to apply sigmoid and then softmax to any layer in the first place: sigmoid squeezes a single raw value to lie between 0 and 1, while softmax squeezes a set of raw values to each lie between 0 and 1 and sum to 1. Using both together still gives outputs between 0 and 1 that sum to 1, but you have unknowingly added extra, non-useful constraints. It also looks like you implemented a custom MSE loss inside the training loop. There is a pytorch nn.MSELoss module, which I recommend for regression problems; since this looks like a classification problem, I would use nn.CrossEntropyLoss (see the short sketch after this list).
  2. You should use the pytorch nn.Linear module for linear layers in the model. You can find many examples of the pytorch Linear module being used in neural networks. It automatically initializes both weights and biases.
  3. It looks like you are performing stochastic gradient descent (gradient descent steps taken on part of the training data instead of on all training data at once). This is a good method when your data is too large to process in one step, but right now you are taking a gradient step on a single data point at a time. In your training loop, for each epoch you should randomly split the data into batches and take one gradient descent step per batch rather than per data point. I would read up on epochs, batches, and stochastic gradient descent.
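
For reference, here is a minimal sketch of how nn.CrossEntropyLoss is used (the shapes and numbers are made up for illustration): it takes raw logits and integer class labels, and applies the softmax internally, so no softmax, sigmoid, or one-hot encoding is needed on your side.

import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

logits = torch.randn(4, 6)              # raw model outputs: batch of 4 samples, 6 classes
targets = torch.tensor([0, 2, 5, 1])    # integer class labels, no one-hot encoding

loss = criterion(logits, targets)

# the same computation done by hand: log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss, manual))     # True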

Other miscellaneous notes:

  • When tracking the loss, it might make more sense to store the loss of each step in a list: define total_loss = [] before training and call total_loss.append(loss.item()) at each step. That way you can see how the loss changes from step to step; it should go down. If you only keep a running sum, the total can only increase, which hides that trend (see the sketch after these notes).
  • A learning rate of .05 might be pretty high, depending on the application. You could try lowering this to maybe 1e-3.
  • You should check out the torch.flatten function for converting a 2D image tensor into a 1D vector tensor.
  • It is also helpful to track accuracy alongside loss for classification problems.
  • You do not need to call net.forward(X). When using a pytorch module, you can simply call net(X) to get a forward pass.
  • You don’t need one-hot encoding for cross-entropy loss. nn.CrossEntropyLoss takes a matrix of raw predicted values (it converts them to probabilities internally) and a vector of true class indices (as integers).
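
As a small illustration of the loss-list, torch.flatten, and accuracy points above (the model and tensors here are random stand-ins, not your data):

import torch
import torch.nn as nn

net = nn.Linear(48*48, 10)               # placeholder model for one batch
criterion = nn.CrossEntropyLoss()
x = torch.rand(32, 48, 48)               # a batch of 32 "images"
y = torch.randint(0, 10, (32,))          # integer class labels

x = torch.flatten(x, start_dim=1)        # (32, 48, 48) -> (32, 2304)

losses = []                              # one entry per gradient step
output = net(x)                          # raw logits, shape (batch, num_classes)
loss = criterion(output, y)
losses.append(loss.item())               # track each step instead of summing a float

preds = output.argmax(dim=1)             # predicted class per sample
accuracy = (preds == y).float().mean().item()
print(f"loss {losses[-1]:.4f}, accuracy {accuracy:.3f}")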

Making these changes alone could resolve the loss issue. Here is how the model and training loop could look:

import numpy as np
import torch
import torch.nn as nn

#Neural Network Class
class NeuralNetwork(nn.Module):
  def __init__(self, input_size, output_size, hidden_01_size=128, hidden_02_size=64):
    super(NeuralNetwork, self).__init__()
    
    self.layer1 = nn.Linear(input_size, hidden_01_size)
    self.layer2 = nn.Linear(hidden_01_size, hidden_02_size)
    self.layer3 = nn.Linear(hidden_02_size, output_size)
  
  def forward(self, x):
    x = torch.flatten(x, start_dim=1)
    x = torch.sigmoid(self.layer1(x))
    x = torch.sigmoid(self.layer2(x))
    out = self.layer3(x)

    return out

  def predict(self, x):
    pred = self.forward(x)
    return pred.argmax(1)

net = NeuralNetwork(48**2, 10)             # example sizes: 48x48 inputs flattened to 2304, 10 classes
print(net(torch.rand(5, 48, 48)).shape)    # sanity check on a random batch: expect torch.Size([5, 10])

#training loop
def train(X, Y, epochs = 100, learning_rate = .05, batch_size=32):
  optimizer = torch.optim.SGD(net.parameters(), lr = learning_rate)
  criterion = nn.CrossEntropyLoss()
  idx = np.array(range(X.shape[0]))
  num_batches = np.ceil(idx.shape[0] / batch_size).astype(int)

  for epoch in range(epochs):
    total_loss = []
    
    np.random.shuffle(idx)

    for i in range(num_batches):
      batch_idx = idx[i*batch_size : (i+1)*batch_size]   # draw from the shuffled indices
      x = X[batch_idx]
      y = Y[batch_idx]
      output = net(x)
      loss = criterion(output, y)
      
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      total_loss.append(loss.item())
          
    if(epoch%2==0):
      print(f"Epoch {epoch+1}: mean loss is {np.mean(total_loss):.4f}\n")