Loss always converges to around 1

I have no idea why this is happening…

I built this model a couple of days ago and it worked with good training and predictions.

However, I opened the file today because I needed to add some things to the code (unrelated to the model), and now when I train it the loss stays at around 1 no matter how long I train for: 10, 100, 1000, or 10,000 epochs.

I literally have not changed anything.

Anyway here is the relevant code:

class FundedDateNN(nn.Module):
    
    def __init__(self, input_size, hidden_size, output_size=1):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.init_weights()
    
    def init_weights(self):
        initrange = 0.5
        self.fc1.weight.data.uniform_(-initrange, initrange)
        self.fc1.bias.data.zero_()
        self.fc2.weight.data.uniform_(-initrange, initrange)
        self.fc2.bias.data.zero_()
        
    def forward(self, x):
        x = self.fc1(x)
        return self.fc2(x)
    
    def predict(self,x):
        return self.forward(x)

hidden_size = 20

model = FundedDateNN(input_size, hidden_size) 
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
def train(model, dataloader, num_epochs):

    model.train()
    losses = list()
    ts = time.time()
    for epoch in range(num_epochs):
        epoch_losses = list()
        for idx, (X, y) in enumerate(dataloader):

            optimizer.zero_grad()
            out = model(X)
            loss = criterion(out.squeeze(), y)
            epoch_losses.append(loss.item())

            loss.backward()
            optimizer.step()
        losses.append(np.mean(epoch_losses))
        print('Epoch: {}, loss: {}'.format(epoch, np.mean(epoch_losses)))
    te = time.time()
    fig, ax = plt.subplots()
    ax.plot(range(num_epochs), losses)
    plt.show()
    mins = int((te-ts) / 60)
    secs = int((te-ts) % 60)
    print('Training completed in {} minutes, {} seconds.'.format(mins, secs))
    return losses, model

n_epochs = 100
losses, model = train(model, trainloader, n_epochs)

I would really, really appreciate some help with this.

Your model can be seen as a single linear layer, since you are not using any activation function between fc1 and fc2, so you might want to add one.
Also, make sure that out.squeeze() and y have the same shape, as otherwise unwanted broadcasting could be applied and you should get a warning about this behavior.
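
As an illustration only (the choice of Tanh and the assert are just examples, not the required fix), the change could look like this:

import torch
import torch.nn as nn

class FundedDateNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size=1):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.act = nn.Tanh()   # non-linearity between the two linear layers
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

# shape sanity check before computing the loss
# out = model(X)
# assert out.squeeze().shape == y.shape, (out.squeeze().shape, y.shape)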

@ptrblck

Thanks for your response!

Yes, I understand that there is no activation between the two layers. I have tried a couple: Tanh, ReLU and Sigmoid, all of which yield a similar output.

And yes, I have that call there to ensure they are the same shape.

Could it be a problem with the data itself?

Could be the case, e.g. if the ranges are large and you don’t normalize the data.
I assume that you don’t expect mismatches in the data, which could break the training.

As said before, your model is effectively quite “small” (a single linear layer), which might not be sufficient to fit the data.
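
For example, standardizing each feature column could look roughly like this (just a sketch; the tensor here is a placeholder for the real feature data):

import torch

dataset_x = torch.randn(100, 8)                  # placeholder for the real feature tensor
mean = dataset_x.mean(dim=0, keepdim=True)       # per-column mean
std = dataset_x.std(dim=0, keepdim=True)         # per-column std
dataset_x = (dataset_x - mean) / (std + 1e-8)    # z-score; small eps avoids division by zero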

@ptrblck

After some further experimentation, I found that it doesn’t actually converge to 1. More specifically, it just learns very very slowly.

When I start the training, the loss is around 0.96, and it drops only about 0.08 over 30,000 epochs.

I have been adjusting the learning rate as well as other hyperparameters. What steps can I take to improve how quickly the model learns?

I am Z scoring the data before training.

Yeah, the model is quite small; I have since added a Tanh between the two linear layers. How might I improve the model architecture?

I’m trying to solve a regression problem using a combination of continuous and categorical features.

Thank you so much for your help.

You could search for similar use cases and check if some architectures were successfully used before.
If you cannot find such a model, you could try to run experiments and check which change in the architecture is beneficial for the use case. There are also meta-learning approaches, which could try to find a suitable architecture, but I’m unsure if the current methods would suit your use case.
Anyway, courses such as FastAI also provide some best practices for creating models.
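
For example, one pattern that is commonly used for mixed continuous/categorical inputs (e.g. in the FastAI tabular examples) is to pass each categorical column through an nn.Embedding and concatenate the result with the continuous features. The sketch below is purely illustrative; all names and sizes are made up:

import torch
import torch.nn as nn

class MixedFeatureNet(nn.Module):
    def __init__(self, num_continuous, cardinalities, emb_dim=8, hidden_size=64):
        super().__init__()
        # one embedding table per categorical column
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, emb_dim) for card in cardinalities]
        )
        in_features = num_continuous + emb_dim * len(cardinalities)
        self.mlp = nn.Sequential(
            nn.Linear(in_features, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, x_cont, x_cat):
        # x_cont: [batch, num_continuous] float, x_cat: [batch, num_categorical] int64
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat([x_cont] + embs, dim=1)
        return self.mlp(x)

model = MixedFeatureNet(num_continuous=5, cardinalities=[10, 4])
out = model(torch.randn(32, 5), torch.randint(0, 4, (32, 2)))  # -> [32, 1]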

@ptrblck
After doing some research, I have made some changes to the model.

I have updated my model architecture as follows. Its task is now to perform binary classification.

class StatusNN(nn.Module):
    
    def __init__(self, input_size, hidden_size, num_classes=1):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 512)
        self.fc2 = nn.Linear(512, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)      # [batch_size, 512]
        x = self.relu(x)     # [batch_size, 512]
        x = self.fc2(x)      # [batch_size, hidden_size]
        x = self.relu(x)     # [batch_size, hidden_size]
        x = self.fc3(x)      # [batch_size, 1]
        x = self.sigmoid(x)  # [batch_size, 1]
        return x
    
    def predict(self,x):
        out = self.forward(x).detach().numpy()
        return np.round(out)

I initialize the model, loss function and optimizer:

USE_CUDA = False #torch.cuda.is_available()
device = 'cuda' if USE_CUDA else 'cpu'

input_size = dataset_x.shape[1]
hidden_size = 512
lr = 1e-3

model = StatusNN(input_size, hidden_size).to(device)
criterion = nn.BCELoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.6)

And this is my training loop:

def train(model, dataloader, num_epochs):

    model.train()
    losses = list()
    ts = time.time()
    for epoch in range(num_epochs):
        epoch_losses = list()
        for idx, (X, y) in enumerate(dataloader):
            
            X, y = X.to(device), y.to(device)

            out = model(X)
            loss = criterion(out, y.unsqueeze(1))
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_losses.append(loss.item())
            
        losses.append(np.mean(epoch_losses))
        if epoch % 2 == 0:
            print('Epoch: {}, Loss: {}'.format(epoch, np.mean(epoch_losses)))
    te = time.time()
    
    fig, ax = plt.subplots()
    ax.plot(range(num_epochs), losses)
    plt.show()
    
    mins = int((te-ts) / 60)
    secs = int((te-ts) % 60)
    print('Training completed in {} minutes, {} seconds.'.format(mins, secs))
    
    return losses, model

However, even after all of this, the loss still shows the same plateauing effect: it does not converge towards 0, but levels off at some arbitrary value.

Could the issue lie in my training loop?

I have tried overfitting the model on the first batch, with the same result. I have also tried balancing my dataset using SMOTE.

Thank you so much for your help so far, I really appreciate it.

Your model seems to work for random data, as I’m able to perfectly overfit a random input and target:

class StatusNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes=1):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 512)
        self.fc2 = nn.Linear(512, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)      # [batch_size, 512] 
        x = self.relu(x)     # [batch_size, 512]
        x = self.fc2(x)      # [batch_size, hidden_size]
        x = self.relu(x)     # [batch_size, hidden_size]
        x = self.fc3(x)      # [batch_size, 1]
        return x

model = StatusNN(10, 512)
data = torch.randn(64, 10)
target = torch.randint(0, 2, (64, 1)).float()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    print('Epoch {}, loss {}'.format(epoch, loss.item()))

model.eval()
output = model(data)
pred = torch.sigmoid(output) > 0.5
print((pred == target).float().sum() / target.size(0))
> tensor(1.)

Note that I replaced sigmoid + nn.BCELoss with raw logits + nn.BCEWithLogitsLoss for better numerical stability.
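
For reference, a tiny (made-up) check of that replacement: both formulations compute the same value, but nn.BCEWithLogitsLoss applies the log-sigmoid internally via the log-sum-exp trick, which is numerically safer than an explicit sigmoid followed by nn.BCELoss:

import torch
import torch.nn as nn

logits = torch.randn(8, 1)                        # raw model outputs
target = torch.randint(0, 2, (8, 1)).float()

loss_with_logits = nn.BCEWithLogitsLoss()(logits, target)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), target)
print(loss_with_logits.item(), loss_from_probs.item())  # (near-)identical values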