Loss goes from 0.00xx to xxxx.xx

I’m working on a 3D landmark detection model.
Training works very well for the first 2 epochs, but in the middle of epoch 3 the loss starts to increase dramatically and the model’s results become much worse than before.
What could be the reason for this?

import torch
from time import time

# `device` is assumed to be defined at module level, e.g. torch.device("cuda").

def train(train_loader, model, optimizer, criterion, controller):
    t = time()
    train_loss = controller.current_train_loss
    n_batches = len(train_loader)
    for i, data in enumerate(train_loader):
        images = data['image']
        if images.shape[1] == 0:
            # Nothing to train on in this batch: report progress and skip it.
            print(f"\rTraining finished batches {i+1}/{n_batches}   {int((i+1)/n_batches*100)}%", end='')
            continue
        # Flatten the keypoints to (batch, n_values) so they match the model output.
        targets = data['keypoints']
        targets = targets.view(targets.size(0), -1)
        images = images.float().to(device)
        targets = targets.float().to(device)

        # Forward pass, backward pass, and parameter update.
        optimizer.zero_grad()
        output = model(images)  # use model(images)[0] for the 2d model
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()

        # Running mean of the batch losses seen so far in this epoch.
        train_loss += (loss.item() - train_loss) / (i + 1)

        delay = time() - t
        print(f"\rTraining finished batches {i+1}/{n_batches}   {int((i+1)/n_batches*100)}%   delay {int(delay)}s   time left {int(delay * (n_batches - (i+1)))}s   loss {loss.item()}")
        if i % 10 == 0:
            controller.update_batch_info(i, model, optimizer, train_loss)
        t = time()
    return train_loss


I am not a specialist but the usual reasons I have seen are:

  • learning rate too high
  • loss function is non-smooth for low values

This is most likely the second one here if it behaved properly at the beginning of the training.
You want to make sure that small loss values don’t make anything numerically unstable.
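
One way to find out where things go wrong is to check the loss and the gradient norm on every batch and stop as soon as either becomes non-finite. The sketch below is only an illustration: `grad_global_norm` and `checked_backward` are hypothetical helper names, and the toy `nn.Linear` model stands in for the real network. In the training loop above, a call like this would sit where the backward pass happens.

```python
import math
import torch
from torch import nn

def grad_global_norm(model: nn.Module) -> float:
    """Return the L2 norm over all parameter gradients (0.0 if there are none)."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return math.sqrt(total)

def checked_backward(loss: torch.Tensor, model: nn.Module) -> float:
    """Backpropagate, failing loudly if the loss or the gradients are non-finite."""
    if not torch.isfinite(loss).all():
        raise FloatingPointError(f"non-finite loss: {loss.item()}")
    loss.backward()
    norm = grad_global_norm(model)
    if not math.isfinite(norm):
        raise FloatingPointError(f"non-finite gradient norm: {norm}")
    return norm

# Toy usage: a small linear model stands in for the real CNN.
model = nn.Linear(4, 2)
x = torch.randn(8, 4)
y = torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), y)
norm = checked_backward(loss, model)
print(f"loss {loss.item():.4f}  grad norm {norm:.4f}")
```

Logging the returned norm next to the loss makes it easier to see whether the gradients grow shortly before the loss explodes.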

Could you tell me more about what “loss function is non-smooth for low values” means,
and how to deal with it so I can continue my training?
I’m using MSE loss.

If the loss itself is MSE, it should be fine.
Do you have special layers in your net?

The way to continue training is just to make sure you don’t have these instabilities.
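
To illustrate what “non-smooth for low values” can look like, here is a small sketch comparing plain MSE with its square root (RMSE) for a single prediction. RMSE is used purely as a contrast and is not something from this thread; `grad_at` is a hypothetical helper name. For MSE the gradient shrinks together with the error, while for the square-root variant it stays at the same magnitude however small the error gets and is not defined at exactly zero error (autograd returns a non-finite value there).

```python
import torch

def grad_at(residual: float, use_sqrt: bool) -> float:
    """Gradient of the loss w.r.t. one prediction whose error is `residual`."""
    pred = torch.tensor([residual], requires_grad=True)
    target = torch.zeros(1)
    loss = torch.nn.functional.mse_loss(pred, target)
    if use_sqrt:
        # d/dx sqrt(x) = 1 / (2 * sqrt(x)), which is unbounded as x -> 0.
        loss = torch.sqrt(loss)
    loss.backward()
    return pred.grad.item()

for r in (1e-1, 1e-3, 0.0):
    print(f"residual {r:g}: MSE grad {grad_at(r, False):g}, "
          f"RMSE grad {grad_at(r, True):g}")
```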

No special layers, only a deep CNN.
It worked well for the first 3 epochs, and this happens 47% of the way through epoch 4.

I reduced the learning rate from 0.001 to 0.0001, changed eps from 1e-8 to 1e-7, and restarted training at epoch 4.
It’s working fine so far.
If it starts happening again I will follow up with you.
Thanks for following up.
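
For anyone wondering why those two changes might help: the thread does not name the optimizer, but 0.001 and 1e-8 match the defaults of `torch.optim.Adam`, so the sketch below assumes Adam. It re-implements the textbook Adam update for a single parameter with a constant gradient (the helper name `adam_step_size` is made up for this example) and compares the step size under both settings. When gradients are about as small as eps, eps dominates the denominator, so a larger eps damps the step in addition to the smaller learning rate.

```python
import math

def adam_step_size(grad, lr, eps, steps=1000, beta1=0.9, beta2=0.999):
    """Size of the Adam update after `steps` iterations with a constant gradient."""
    m = v = 0.0
    update = 0.0
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias correction
        v_hat = v / (1 - beta2 ** t)
        update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return abs(update)

tiny_grad = 1e-9  # hypothetical gradient magnitude late in training
before = adam_step_size(tiny_grad, lr=1e-3, eps=1e-8)
after = adam_step_size(tiny_grad, lr=1e-4, eps=1e-7)
print(f"step before: {before:.3e}, after: {after:.3e}, ratio: {before / after:.1f}")
```

For gradients much larger than eps the step is roughly equal to the learning rate, so in that regime only the learning-rate change matters.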