My basic model doesn't learn, can someone help me?

The code does some basic backprop, then calculates the average loss for the first and second half of training. The loss stays about the same… I think I messed up the backpropagation somehow?

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = "gpu" if torch.cuda.is_available() else "cpu"
print(f"device: {device}")

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        # layers:
        self.L1 = nn.Linear(1, 4)
        self.L2 = nn.Linear(4, 8)
        self.L3 = nn.Linear(8, 10)
        self.L4 = nn.Linear(10, 10)
        self.L5 = nn.Linear(10, 1)

    def forward(self, x):
        x = torch.sigmoid(self.L1(x))
        x = torch.sigmoid(self.L2(x))
        x = torch.sigmoid(self.L3(x))
        x = torch.sigmoid(self.L4(x))
        x = self.L5(x)
        x = F.softmax(x, dim=0)
        return x


firstHalf = 0
secondHalf = 0
net = Network().to(device)
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.MSELoss()
epochs = 500

while epochs > 0:
    input = torch.rand(1, device=device)

    # determine target according to the input: 1 for >0.5, 0 for <0.5
    if input[0] < 0.5:
        target = torch.zeros(1, device=device)  # tensor([0.])
    else:
        target = torch.ones(1, device=device)  # tensor([1.])

    optimizer.zero_grad()  # zero the gradient buffers
    out = net(input)
    loss = criterion(out, target)
    loss.backward()
    optimizer.step()  # Does the update

    # sum the loss for each half:
    if epochs < 251:
        firstHalf += loss
    else:
        secondHalf += loss

    epochs -= 1

print(f"first half: {firstHalf/250:.4f}\n"
      f"second half: {secondHalf/250:.4f}")

The problem is the softmax layer at the end of your model. First of all, I don’t think a softmax layer makes any sense here, but besides that, you are calculating the softmax of a single value, which always evaluates to 1:

softmax(x)_i = exp(x_i) / sum_j exp(x_j)
# if x has only one element, the sum has a single term:
softmax(x)_1 = exp(x_1) / exp(x_1) = 1

Therefore your model can’t do anything other than predict 1. If you add a print statement for your output, you will see that it is always exactly 1. Try removing the softmax layer; that should help.
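
You can check this directly (the input values here are arbitrary):

import torch
import torch.nn.functional as F

# softmax over a single element is always 1, no matter what the value is
print(F.softmax(torch.tensor([2.7]), dim=0))   # tensor([1.])
print(F.softmax(torch.tensor([-5.0]), dim=0))  # tensor([1.])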

Moreover, there are a few other issues with your code:

  • you should change if epochs < 251 to if epochs > 250, so that firstHalf actually accumulates the first 250 iterations (epochs counts down from 500)
  • consider using a for loop instead of a while loop
  • try to calculate input and target as follows (quick check below):
input = torch.rand(size=(64, 1)) (this gives you a mini batch of 64 samples)
target = (input >= 0.5).float()
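
As a quick check of what that produces (a small batch here, the values are random of course):

import torch

input = torch.rand(size=(4, 1))
target = (input >= 0.5).float()
print(input.squeeze())   # e.g. tensor([0.1229, 0.8106, 0.4692, 0.6617])
print(target.squeeze())  # e.g. tensor([0., 1., 0., 1.])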

A cleaner version of your code could look like this:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt

device = "gpu" if torch.cuda.is_available() else "cpu"
print(f"device: {device}")

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.L1 = nn.Linear(1, 4)
        self.L2 = nn.Linear(4, 8)
        self.L3 = nn.Linear(8, 10)
        self.L4 = nn.Linear(10, 10)
        self.L5 = nn.Linear(10, 1)

    def forward(self, x):
        x = torch.sigmoid(self.L1(x))
        x = torch.sigmoid(self.L2(x))
        x = torch.sigmoid(self.L3(x))
        x = torch.sigmoid(self.L4(x))
        x = self.L5(x)
        return x

net = Network().to(device)
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.MSELoss()
epochs = 500

loss_hist = []
net.train()
for epoch in range(1, epochs+1):
    input = torch.rand(size=(64, 1), device=device)
    target = (input >= 0.5).float()
    optimizer.zero_grad()
    out = net(input)
    loss = criterion(out, target)
    loss.backward()
    optimizer.step()
    loss_hist.append(loss.item())
plt.plot(loss_hist)
plt.show()
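
If you want a rough sanity check afterwards, something like this (just a sketch; thresholding the raw output at 0.5 is my own assumption, since the model has no final activation) prints the accuracy on fresh random samples:

net.eval()
with torch.no_grad():
    x = torch.rand(size=(1000, 1), device=device)
    y = (x >= 0.5).float()
    pred = (net(x) >= 0.5).float()  # threshold the raw output at 0.5
    acc = (pred == y).float().mean().item()
print(f"accuracy: {acc:.3f}")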

Thanks, it works! Just one question (if you don’t mind):
what does net.train() do? I thought we’re training it in the for epoch loop?

net.train() puts all layers of the network into training mode, as opposed to net.eval(), which puts them into evaluation mode. The reason is that some layers behave differently during training and evaluation, for example batch normalization or dropout layers. It is important to note that net.train() does not train your model; it just prepares the layers for training.
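
For example, a dropout layer gives different outputs in the two modes (a small, made-up example):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: randomly zeroes elements and rescales the rest by 1/(1-p)
print(drop(x))  # e.g. tensor([2., 0., 2., 2., 0., 0., 2., 2.])

drop.eval()     # evaluation mode: dropout does nothing
print(drop(x))  # tensor([1., 1., 1., 1., 1., 1., 1., 1.])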

I don’t think that net.train() is even necessary here, since you are only using linear and sigmoid layers, which (as far as I know) behave exactly the same way during training and evaluation. However, I think it is a good habit to always add it to your code; otherwise you might forget it one day when it is necessary :slight_smile: