This weird thing is happening, which may be connected to my other post.
During training I'm saving the current model as a checkpoint in case the code fails. But when training continues after saving, the loss value jumps up.
Here is an example log (note the jump right after the checkpoint at epoch 100):
2020-01-31 12:22:00,765 [MainThread ] [INFO ] Epoch 92/ 400, train_loss: 0.75990, test_loss: 0.77233, accu: 0.985551055, background accuracy: 0.994, frame accuracy: 0.983, feet accuracy: 0.958, defect accuracy: 0.181
2020-01-31 12:22:31,782 [MainThread ] [INFO ] Epoch 93/ 400, train_loss: 0.75946, test_loss: 0.76164, accu: 0.985618570, background accuracy: 0.993, frame accuracy: 0.985, feet accuracy: 0.958, defect accuracy: 0.965
2020-01-31 12:23:02,886 [MainThread ] [INFO ] Epoch 94/ 400, train_loss: 0.75892, test_loss: 0.76675, accu: 0.986002443, background accuracy: 0.994, frame accuracy: 0.983, feet accuracy: 0.966, defect accuracy: 0.531
2020-01-31 12:23:33,823 [MainThread ] [INFO ] Epoch 95/ 400, train_loss: 0.75963, test_loss: 0.76336, accu: 0.984352495, background accuracy: 0.993, frame accuracy: 0.985, feet accuracy: 0.956, defect accuracy: 0.959
2020-01-31 12:24:04,805 [MainThread ] [INFO ] Epoch 96/ 400, train_loss: 0.75858, test_loss: 0.76287, accu: 0.984805170, background accuracy: 0.992, frame accuracy: 0.985, feet accuracy: 0.965, defect accuracy: 0.952
2020-01-31 12:24:35,937 [MainThread ] [INFO ] Epoch 97/ 400, train_loss: 0.75837, test_loss: 0.77181, accu: 0.986309156, background accuracy: 0.994, frame accuracy: 0.981, feet accuracy: 0.967, defect accuracy: 0.023
2020-01-31 12:25:06,901 [MainThread ] [INFO ] Epoch 98/ 400, train_loss: 0.75835, test_loss: 0.76104, accu: 0.985957433, background accuracy: 0.994, frame accuracy: 0.984, feet accuracy: 0.956, defect accuracy: 0.965
2020-01-31 12:25:37,838 [MainThread ] [INFO ] Epoch 99/ 400, train_loss: 0.75793, test_loss: 0.76226, accu: 0.986363169, background accuracy: 0.993, frame accuracy: 0.984, feet accuracy: 0.963, defect accuracy: 0.907
2020-01-31 12:26:08,867 [MainThread ] [INFO ] Epoch 100/ 400, train_loss: 0.75770, test_loss: 0.76353, accu: 0.986570216, background accuracy: 0.994, frame accuracy: 0.983, feet accuracy: 0.962, defect accuracy: 0.786
2020-01-31 12:26:09,404 [MainThread ] [INFO ] Epoch model saved: ./drive/My Drive/nnsegmentation-unet/output//checkpoint/checkpoint-epoch_100_time_20200131_122608
2020-01-31 12:26:40,630 [MainThread ] [INFO ] Epoch 101/ 400, train_loss: 1.18036, test_loss: 1.17363, accu: 0.739311986, background accuracy: 1.000, frame accuracy: 0.000, feet accuracy: 0.000, defect accuracy: 0.000
2020-01-31 12:27:11,798 [MainThread ] [INFO ] Epoch 102/ 400, train_loss: 0.78324, test_loss: 1.17076, accu: 0.739137088, background accuracy: 1.000, frame accuracy: 0.000, feet accuracy: 0.000, defect accuracy: 0.000
2020-01-31 12:27:42,808 [MainThread ] [INFO ] Epoch 103/ 400, train_loss: 0.76286, test_loss: 0.77436, accu: 0.984926698, background accuracy: 0.995, frame accuracy: 0.972, feet accuracy: 0.959, defect accuracy: 0.105
2020-01-31 12:28:13,911 [MainThread ] [INFO ] Epoch 104/ 400, train_loss: 0.75993, test_loss: 0.76171, accu: 0.985947145, background accuracy: 0.993, frame accuracy: 0.984, feet accuracy: 0.962, defect accuracy: 0.962
2020-01-31 12:28:45,007 [MainThread ] [INFO ] Epoch 105/ 400, train_loss: 0.76013, test_loss: 0.76190, accu: 0.985745885, background accuracy: 0.993, frame accuracy: 0.985, feet accuracy: 0.958, defect accuracy: 0.940
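To narrow things down, here is a minimal sanity check (with a stand-in `nn.Linear` instead of my real network; purely illustrative) verifying that `torch.save` by itself does not perturb the parameters it serializes:

```python
# Stand-in experiment: does torch.save mutate the model it serializes?
# (Tiny nn.Linear used here instead of the real segmentation network.)
import os
import tempfile
import torch
import torch.nn as nn

net = nn.Linear(4, 2)

# Snapshot the parameters before saving.
before = {k: v.clone() for k, v in net.state_dict().items()}

path = os.path.join(tempfile.mkdtemp(), 'ckpt.pt')
torch.save({'model_state_dict': net.state_dict()}, path)

# Compare parameters after saving: they should be bitwise identical.
unchanged = all(torch.equal(before[k], v) for k, v in net.state_dict().items())
print(unchanged)  # True: saving alone does not touch the weights
```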
These are my saving function and training loop:
def save_checkpoint(epoch, network, optimizer, scheduler, loss, out_file):
    save_dict = {
        'epoch': epoch,
        'model': network,
        'model_state_dict': network.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
        'loss': loss,
    }
    torch.save(save_dict, out_file)
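For reference, the matching load side would look roughly like this. This is a sketch, not my actual code: `load_checkpoint` is an assumed name, and the tiny model and optimizer are stand-ins for the real ones (the scheduler is omitted to keep the demo self-contained):

```python
# Round-trip sketch: save a checkpoint for a tiny model, perturb the weights,
# then restore them from disk. All names here are illustrative stand-ins.
import os
import tempfile
import torch
import torch.nn as nn
import torch.optim as optim

def save_checkpoint(epoch, network, optimizer, loss, out_file):
    # Same idea as the function above, minus the scheduler.
    torch.save({
        'epoch': epoch,
        'model_state_dict': network.state_dict(),
        'optimizer': optimizer.state_dict(),
        'loss': loss,
    }, out_file)

def load_checkpoint(network, optimizer, in_file):
    # Restore weights and optimizer state in place; return bookkeeping values.
    ckpt = torch.load(in_file)
    network.load_state_dict(ckpt['model_state_dict'])
    optimizer.load_state_dict(ckpt['optimizer'])
    return ckpt['epoch'], ckpt['loss']

net = nn.Linear(4, 2)
opt = optim.SGD(net.parameters(), lr=0.1)
path = os.path.join(tempfile.mkdtemp(), 'ckpt.pt')
save_checkpoint(3, net, opt, 0.5, path)

with torch.no_grad():
    net.weight.add_(1.0)  # perturb, then restore from disk
epoch, loss = load_checkpoint(net, opt, path)
saved_weight = torch.load(path)['model_state_dict']['weight']
print(epoch, torch.equal(net.weight, saved_weight))  # 3 True
```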
for epoch in range(1, num_epochs + 1):
    running_loss = 0.0
    for batch_index, sample in enumerate(data_loader):
        optimizer.zero_grad()
        outputs = network(sample["image"].cuda())
        loss = criterion(outputs, sample["mask"].float().cuda(), sample["weights"].float().cuda())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    running_loss = running_loss / len(data_loader)
    test_loss, accuracy, conf = perform_test_network(valid_dataset, network, criterion, batch_size=1)
    scheduler.step(test_loss)
    LOGGER.info(
        f'Epoch {epoch: 5d}/{num_epochs: 5d}, '
        f'train_loss: {running_loss:8.5f}, '
        f'test_loss: {test_loss:8.5f}, accu: {accuracy:2.9f}, '
        f'background accuracy: {conf[0, 0]:2.3f}, '
        f'frame accuracy: {conf[1, 1]:2.3f}, '
        f'feet accuracy: {conf[2, 2]:2.3f}, '
        f'defect accuracy: {conf[3, 3]:2.3f}')
    if epoch % 100 == 0 or epoch == num_epochs:
        out_file = os.path.join(dirs['checkpoint'], f'checkpoint-epoch_{epoch}_time_{nice_time()}')
        save_checkpoint(epoch, network, optimizer, scheduler, loss, out_file)
        LOGGER.info(f"Epoch model saved: {out_file}")
    save_checkpoint(epoch, network, optimizer, scheduler, loss, last_checkpoint_path)
    perform_test_and_visualize(network, train_dataset, dirs, epoch, batch_size=2)