Loss gets stuck at one epoch

Hi, I have a problem running my code. My training program sometimes gets stuck at some epoch and does not proceed anymore. How can I find where the problem is so that I can fix it? The tricky part is that it only happens sometimes, not always. Any help is really appreciated.

Thank you in advance.

Are you monitoring the loss function? My first guess would be that the loss is becoming NaN.
What do you mean by stuck, though: does the training stop, or something else?
Please post your code and output screenshots so we can understand the issue.
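If you want to catch a blown-up loss explicitly rather than only watching the printed numbers, you could add a check right after the loss is computed. A minimal sketch, assuming a standard PyTorch loop (the variable names here are just placeholders for whatever your own code uses):

```python
import torch

# inside the training loop, right after the loss is computed;
# criterion, output, target, epoch, batch_idx are placeholders
loss = criterion(output, target)

# stop immediately if the loss has become NaN or Inf
if not torch.isfinite(loss):
    raise RuntimeError(
        f"Non-finite loss {loss.item()} at epoch {epoch}, batch {batch_idx}"
    )

loss.backward()
```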

Hi, if by monitoring the loss function you mean printing it, then yes, that is exactly what I am doing. By getting stuck at some epoch I mean that no loss is printed anymore, even though the code still seems to be running. I ran several experiments and noticed that it does not happen at any particular epoch. For example, in the following experiment the loss got stuck at epoch 7 and never got past that point, until I had to restart the kernel:

Epoch: 0 Train_Loss: 2.3234, Val_Loss: 1.8912
Epoch: 1 Train_Loss: 1.2648, Val_Loss: 1.7110
Epoch: 2 Train_Loss: 1.1671, Val_Loss: 1.6037
Epoch: 3 Train_Loss: 1.1116, Val_Loss: 1.5056
Epoch: 4 Train_Loss: 1.2287, Val_Loss: 1.6071
Epoch: 5 Train_Loss: 1.3418, Val_Loss: 1.6923
Epoch: 6 Train_Loss: 1.2899, Val_Loss: 1.7839
Epoch: 7 Train_Loss: 1.4315, Val_Loss: 1.7068

In another experiment it happened at the last epoch, i.e. epoch 39 of a 40-epoch run. Here you can see the results:

Epoch: 13 Train_Loss: 1.4405, Val_Loss: 1.8422
Epoch: 14 Train_Loss: 1.5998, Val_Loss: 2.1091
Epoch: 15 Train_Loss: 1.8484, Val_Loss: 2.0408
Epoch: 16 Train_Loss: 1.7372, Val_Loss: 2.0536
Epoch: 17 Train_Loss: 1.6792, Val_Loss: 2.3075
Epoch: 18 Train_Loss: 1.6514, Val_Loss: 1.9093
Epoch: 19 Train_Loss: 1.7062, Val_Loss: 1.8253
Epoch: 20 Train_Loss: 1.8381, Val_Loss: 2.0517
Epoch: 21 Train_Loss: 1.9149, Val_Loss: 2.0542
Epoch: 22 Train_Loss: 2.0586, Val_Loss: 2.4727
Epoch: 23 Train_Loss: 2.0700, Val_Loss: 2.3520
Epoch: 24 Train_Loss: 2.1251, Val_Loss: 2.1675
Epoch: 25 Train_Loss: 2.2762, Val_Loss: 2.1867
Epoch: 26 Train_Loss: 2.2843, Val_Loss: 2.5865
Epoch: 27 Train_Loss: 2.3018, Val_Loss: 2.3153
Epoch: 28 Train_Loss: 2.3720, Val_Loss: 2.4467
Epoch: 29 Train_Loss: 2.2131, Val_Loss: 2.2533
Epoch: 30 Train_Loss: 2.2144, Val_Loss: 2.6644
Epoch: 31 Train_Loss: 2.2345, Val_Loss: 2.3036
Epoch: 32 Train_Loss: 2.0849, Val_Loss: 2.2778
Epoch: 33 Train_Loss: 2.1355, Val_Loss: 2.9235
Epoch: 34 Train_Loss: 2.4909, Val_Loss: 2.5539
Epoch: 35 Train_Loss: 2.5277, Val_Loss: 2.3989
Epoch: 36 Train_Loss: 2.6098, Val_Loss: 2.6083
Epoch: 37 Train_Loss: 2.6496, Val_Loss: 2.5968
Epoch: 38 Train_Loss: 2.5559, Val_Loss: 2.6308

The code is too large to post here. In short, it is an unsupervised learning setup and I am working with 3D medical images, so the batch size is as small as 2. My colleague told me that with such a small batch size I should expect some fluctuation in the loss, but not the training actually getting stuck.
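For what it's worth, the epoch loop is structured roughly like this. This is a heavily simplified sketch, not the real code; all the names (model, train_loader, compute_loss, etc.) are placeholders:

```python
import torch

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for batch in train_loader:               # batch size is only 2
        optimizer.zero_grad()
        loss = compute_loss(model, batch)    # unsupervised loss on 3D volumes
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_loader:
            val_loss += compute_loss(model, batch).item()

    # this is the print that stops appearing when training gets stuck
    print(f"Epoch: {epoch} "
          f"Train_Loss: {train_loss / len(train_loader):.4f}, "
          f"Val_Loss: {val_loss / len(val_loader):.4f}")
```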