Loss goes from 0.00xx to xxxx.xx

I’m working on a 3D landmark detection model.
The training process works very well for the first 2 epochs, but in the middle of epoch 3 the loss starts to increase dramatically and the model’s results get much worse than before.
What could be the reason for this?

from time import time

import torch

# `device` and `controller` are assumed to be defined elsewhere,
# e.g. device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train(train_loader, model, optimizer, criterion, controller):
    model.train()
    t = time()
    train_loss = controller.current_train_loss
    for i, data in enumerate(train_loader):
        images = data['image']
        # Skip empty batches but still report progress
        if images.shape[1] == 0:
            print(f"\rTraining finished batches {i+1}/{len(train_loader)}   {int(((i+1)/len(train_loader))*100)}%", end='')
            continue
        targets = data['keypoints']
        # Flatten keypoints to (batch_size, num_keypoints * coords)
        targets = targets.view(targets.size(0), -1)

        images = images.float()
        targets = targets.float()

        images = images.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()

        output = model(images)  # [0] for 2d model
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()

        # Running mean of the batch losses
        train_loss += 1 / (i + 1) * (loss.item() - train_loss)

        del images
        del targets
        torch.cuda.empty_cache()

        delay = time() - t

        print(f"\rTraining finished batches {i+1}/{len(train_loader)}   {int(((i+1)/len(train_loader))*100)}%   delay {int(delay)}s   time left {int(delay * (len(train_loader) - (i+1)))}s   loss {loss.item()}", end='\n')
        if i % 10 == 0:
            controller.update_batch_info(i, model, optimizer, train_loss)
        t = time()

    return train_loss

Hi,

I am not a specialist but the usual reasons I have seen are:

  • learning rate too high
  • loss function is non-smooth for low values

This is most likely the second one here if it behaved properly at the beginning of the training.
You want to make sure that small loss values don’t make anything numerically unstable.
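
If you want something concrete to try against the first cause, a common guard is to lower the learning rate and clip the gradient norm before the optimizer step. This is only a rough sketch, not your training code: it assumes your existing model, criterion, train_loader and device, and the max_norm value is just illustrative.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lower than the usual 1e-3

for data in train_loader:
    images = data['image'].float().to(device)
    targets = data['keypoints'].view(data['keypoints'].size(0), -1).float().to(device)

    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    # Cap the gradient norm so one bad batch cannot blow up the weights
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()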

Could you tell me more about “loss function is non-smooth for low values”?
How do I deal with it so I can continue my training?
I’m using MSE loss.

If the loss itself is MSE, it should be fine.
Do you have special layers in your net?

The way to continue training is just to make sure you don’t have these instabilities.
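
If it comes back, one way to catch it early is to check each batch loss before stepping. This is only my own sketch (the helper name and the threshold are made up, nothing built into PyTorch):

import math

def loss_is_unstable(loss_value, running_loss, factor=100.0):
    # True if the batch loss is NaN/inf or far above the running average.
    # `factor` is an arbitrary threshold; tune it for your loss scale.
    if not math.isfinite(loss_value):
        return True
    return running_loss > 0 and loss_value > factor * running_loss

# Inside the training loop, after loss = criterion(output, targets):
#     if loss_is_unstable(loss.item(), train_loss):
#         optimizer.zero_grad()  # drop this batch instead of stepping
#         continue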

No special layers, only a deep CNN network.
It worked well for the first 3 epochs, and this happens after 47% of epoch 4.

I reduced the learning rate from 0.001 to 0.0001 and eps from 1e-8 to 1e-7, and restarted training at epoch 4.
It’s working fine so far.
If it starts doing it again I will follow up with you.
Thanks for following up.
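
For anyone landing on this thread later, that change looks roughly like the snippet below. I’m assuming the optimizer is Adam (eps is an Adam argument), and the checkpoint path and keys are made up for the example:

import torch

checkpoint = torch.load('checkpoint_epoch_3.pth')   # hypothetical path
model.load_state_dict(checkpoint['model_state'])    # `model` is the existing CNN

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,   # was 1e-3
    eps=1e-7,  # was the Adam default of 1e-8
)

# If you also restore the optimizer state, note that optimizer.load_state_dict()
# brings back the old lr/eps, so override them in optimizer.param_groups afterwards.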