Loss remains the same during training

I’m facing a weird problem using the latest PyTorch version (1.12.1) in a JupyterLab environment.

I have this model definition:

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.fc_1 = nn.Linear(1, 100)
        self.fc_2 = nn.Linear(100, 1)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.fc_1(x)
        x = self.relu(x)
        x = self.fc_2(x)
        return self.relu(x)

net = MyModel()

The objective:

mse = torch.nn.MSELoss()

And the optimizer:

optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)

I’m training the model using the following simple training loop:

def train(epochs, x, y):
    for epoch in range(epochs):
        # forward pass
        pred = net(x)
        # compute the loss
        loss = mse(pred, y)
        # backward pass
        loss.backward()
        # optimization step
        optimizer.step()
        # zero out the gradients to avoid gradient accumulation
        optimizer.zero_grad()

        print(f"Epoch: {epoch}\t Loss: {loss}")

train(5, x, y)

The really weird issue is that the first time I run the process, the model does not learn anything and the loss stays the same. However, if I re-run the cells with the model definition and instantiation, the loss function, and the optimizer, everything works.

Any clues?

Hi Dimitris!

Could you post a short, fully-self-contained, runnable script that reproduces
your issue together with the output you get?

(It seems like you’ve already posted most of what would be in such a script.)

This looks odd here. What are x and y? In range (epochs, x, y), x and
y would typically be integers, while in net (x) and mse (pred, y), x and
y would typically be pytorch tensors. I don’t see how this code could run.

Best.

K. Frank

@KFrank, you’re right. This was a copy-paste error; I edited the example. x and y are two PyTorch tensors, and they have now been moved to the train function’s signature. I obtain them like this:

x = torch.linspace(-2, 2, steps=20)[:, None]
x = x.float()
y = add_noise(f(x), .3, 1.5)
y = y.float()

where f is

def f(x):
    return 3*x**2 + 2*x + 1

and add_noise is

import numpy as np

def noise(x, scale):
    return np.random.normal(scale=scale, size=x.shape)

def add_noise(x, mult, add):
    return x * (1 + noise(x, mult)) + noise(x, add)

This sounds like flaky training where convergence fails randomly. Did you try rerunning the code with different seeds to check the convergence rate?
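
Something along these lines (a rough sketch that reuses MyModel, f, add_noise, mse, and train from the posts above; the seed count and epoch count are just placeholders) would let you estimate that rate:

import numpy as np
import torch

for seed in range(10):
    torch.manual_seed(seed)
    np.random.seed(seed)  # add_noise draws from numpy's RNG

    # fresh weights and optimizer state for every seed
    net = MyModel()
    optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)

    x = torch.linspace(-2, 2, steps=20)[:, None].float()
    y = add_noise(f(x), .3, 1.5).float()

    train(100, x, y)
    print(f"seed {seed}: final loss {mse(net(x), y).item():.4f}")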

Also, based on your post it seems the target could be negative if the noise is large enough, while the last relu would clip the model’s output at zero, so you might want to remove it.
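
For example (just a sketch of your model with the last relu dropped, everything else unchanged):

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc_1 = nn.Linear(1, 100)
        self.fc_2 = nn.Linear(100, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc_1(x)
        x = self.relu(x)
        # no relu on the output: the noisy regression target can be negative,
        # and a final relu also zeroes the gradient whenever its input is
        # negative, which can leave the loss stuck from the very first step
        return self.fc_2(x)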