Loss increases during training after checkpoint save

Something weird is happening that may be connected to my other post.
During training I save the current model as a checkpoint in case the code fails, but when training continues after a save, the loss value jumps up.

Here is an example log:

2020-01-31 12:22:00,765 [MainThread  ] [INFO ]  Epoch    92/  400, train_loss:  0.75990, test_loss:  0.77233, accu: 0.985551055, background accuracy: 0.994, frame accuracy: 0.983, feet accuracy: 0.958, defect accuracy: 0.181
2020-01-31 12:22:31,782 [MainThread  ] [INFO ]  Epoch    93/  400, train_loss:  0.75946, test_loss:  0.76164, accu: 0.985618570, background accuracy: 0.993, frame accuracy: 0.985, feet accuracy: 0.958, defect accuracy: 0.965
2020-01-31 12:23:02,886 [MainThread  ] [INFO ]  Epoch    94/  400, train_loss:  0.75892, test_loss:  0.76675, accu: 0.986002443, background accuracy: 0.994, frame accuracy: 0.983, feet accuracy: 0.966, defect accuracy: 0.531
2020-01-31 12:23:33,823 [MainThread  ] [INFO ]  Epoch    95/  400, train_loss:  0.75963, test_loss:  0.76336, accu: 0.984352495, background accuracy: 0.993, frame accuracy: 0.985, feet accuracy: 0.956, defect accuracy: 0.959
2020-01-31 12:24:04,805 [MainThread  ] [INFO ]  Epoch    96/  400, train_loss:  0.75858, test_loss:  0.76287, accu: 0.984805170, background accuracy: 0.992, frame accuracy: 0.985, feet accuracy: 0.965, defect accuracy: 0.952
2020-01-31 12:24:35,937 [MainThread  ] [INFO ]  Epoch    97/  400, train_loss:  0.75837, test_loss:  0.77181, accu: 0.986309156, background accuracy: 0.994, frame accuracy: 0.981, feet accuracy: 0.967, defect accuracy: 0.023
2020-01-31 12:25:06,901 [MainThread  ] [INFO ]  Epoch    98/  400, train_loss:  0.75835, test_loss:  0.76104, accu: 0.985957433, background accuracy: 0.994, frame accuracy: 0.984, feet accuracy: 0.956, defect accuracy: 0.965
2020-01-31 12:25:37,838 [MainThread  ] [INFO ]  Epoch    99/  400, train_loss:  0.75793, test_loss:  0.76226, accu: 0.986363169, background accuracy: 0.993, frame accuracy: 0.984, feet accuracy: 0.963, defect accuracy: 0.907
2020-01-31 12:26:08,867 [MainThread  ] [INFO ]  Epoch   100/  400, train_loss:  0.75770, test_loss:  0.76353, accu: 0.986570216, background accuracy: 0.994, frame accuracy: 0.983, feet accuracy: 0.962, defect accuracy: 0.786
2020-01-31 12:26:09,404 [MainThread  ] [INFO ]  Epoch model saved: ./drive/My Drive/nnsegmentation-unet/output//checkpoint/checkpoint-epoch_100_time_20200131_122608
2020-01-31 12:26:40,630 [MainThread  ] [INFO ]  Epoch   101/  400, train_loss:  1.18036, test_loss:  1.17363, accu: 0.739311986, background accuracy: 1.000, frame accuracy: 0.000, feet accuracy: 0.000, defect accuracy: 0.000
2020-01-31 12:27:11,798 [MainThread  ] [INFO ]  Epoch   102/  400, train_loss:  0.78324, test_loss:  1.17076, accu: 0.739137088, background accuracy: 1.000, frame accuracy: 0.000, feet accuracy: 0.000, defect accuracy: 0.000
2020-01-31 12:27:42,808 [MainThread  ] [INFO ]  Epoch   103/  400, train_loss:  0.76286, test_loss:  0.77436, accu: 0.984926698, background accuracy: 0.995, frame accuracy: 0.972, feet accuracy: 0.959, defect accuracy: 0.105
2020-01-31 12:28:13,911 [MainThread  ] [INFO ]  Epoch   104/  400, train_loss:  0.75993, test_loss:  0.76171, accu: 0.985947145, background accuracy: 0.993, frame accuracy: 0.984, feet accuracy: 0.962, defect accuracy: 0.962
2020-01-31 12:28:45,007 [MainThread  ] [INFO ]  Epoch   105/  400, train_loss:  0.76013, test_loss:  0.76190, accu: 0.985745885, background accuracy: 0.993, frame accuracy: 0.985, feet accuracy: 0.958, defect accuracy: 0.940

This is my saving function:

import torch

def save_checkpoint(epoch, network, optimizer, scheduler, loss, out_file):
    # Bundle everything needed to resume training into a single file.
    save_dict = {
        'epoch': epoch,
        'model': network,                          # full model object
        'model_state_dict': network.state_dict(),  # learned weights
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
        'loss': loss
    }
    torch.save(save_dict, out_file)
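
For reference, a checkpoint saved this way can be restored roughly like this (just a sketch; it assumes the network, optimizer, and scheduler have already been constructed the same way, and load_checkpoint / checkpoint_path are illustrative names, not part of my code):

import torch

def load_checkpoint(checkpoint_path, network, optimizer, scheduler):
    # Restore the state dicts that save_checkpoint above stores.
    checkpoint = torch.load(checkpoint_path)
    network.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    scheduler.load_state_dict(checkpoint['scheduler'])
    return checkpoint['epoch'], checkpoint['loss']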

And this is my training loop, which calls the saving function:

for epoch in range(1, num_epochs + 1):
    running_loss = 0.0
    for batch_index, sample in enumerate(data_loader):
        optimizer.zero_grad()
        outputs = network(sample["image"].cuda())
        loss = criterion(outputs, sample["mask"].float().cuda(), sample["weights"].float().cuda())
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    # Average training loss over the epoch, then evaluate on the validation set.
    running_loss = running_loss / len(data_loader)
    test_loss, accuracy, conf = perform_test_network(valid_dataset, network, criterion, batch_size=1)
    scheduler.step(test_loss)

    LOGGER.info(
        f'Epoch {epoch: 5d}/{num_epochs: 5d}, '
        f'train_loss: {running_loss:8.5f}, '
        f'test_loss: {test_loss:8.5f}, accu: {accuracy:2.9f}, '
        f'background accuracy: {conf[0, 0]:2.3f}, '
        f'frame accuracy: {conf[1, 1]:2.3f}, '
        f'feet accuracy: {conf[2, 2]:2.3f}, '
        f'defect accuracy: {conf[3, 3]:2.3f}')

    # Save a checkpoint every 100 epochs and at the end of training.
    if epoch % 100 == 0 or epoch == num_epochs:
        out_file = os.path.join(dirs['checkpoint'], f'checkpoint-epoch_{epoch}_time_{nice_time()}')
        save_checkpoint(epoch, network, optimizer, scheduler, loss, out_file)
        LOGGER.info(f"Epoch model saved: {out_file}")
        save_checkpoint(epoch, network, optimizer, scheduler, loss, last_checkpoint_path)
        perform_test_and_visualize(network, train_dataset, dirs, epoch, batch_size=2)

When are you calling the save_checkpoint method?
Are you setting the model to .eval() and back to .train()?

Yes, thank you. It looks like in the perform_test_and_visualize function I was setting the model to eval() and forgot to set it back to train().
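
For anyone running into the same issue: the usual pattern is to switch to eval mode only around the evaluation/visualization pass and restore train mode right after. A minimal sketch of such a helper (evaluate is an illustrative name; the criterion call matches the one in my training loop):

import torch

def evaluate(network, data_loader, criterion):
    network.eval()  # disable dropout, use running BatchNorm statistics
    total_loss = 0.0
    with torch.no_grad():
        for sample in data_loader:
            outputs = network(sample["image"].cuda())
            loss = criterion(outputs, sample["mask"].float().cuda(), sample["weights"].float().cuda())
            total_loss += loss.item()
    network.train()  # restore training mode before the next training epoch
    return total_loss / len(data_loader)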