Loss gets stuck at one epoch

Hi, I have a problem running my code. My training program sometimes gets stuck at some epoch and does not proceed anymore. How can I find where the problem is so that I can fix it? The tricky part is that it only happens sometimes, not always. Any help is really appreciated.

Thank you in advance.

Are you monitoring the loss function? My first guess would be that the loss is becoming NaN.
What do you mean by stuck, though: does the training stop, or something else?
Please post your code and output screenshots so we can understand the issue.
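If you want to catch a blown-up loss explicitly rather than only watching the printed numbers, you could add a check right after the loss is computed. A minimal sketch, assuming a standard PyTorch loop (the variable names here are just placeholders for whatever your own code uses):

```python
import torch

# inside the training loop, right after the loss is computed;
# criterion, output, target, epoch, batch_idx are placeholders
loss = criterion(output, target)

# stop immediately if the loss has become NaN or Inf
if not torch.isfinite(loss):
    raise RuntimeError(
        f"Non-finite loss {loss.item()} at epoch {epoch}, batch {batch_idx}"
    )

loss.backward()
```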

Hi, if by monitoring the loss function you mean printing it, then yes, that is exactly what I am doing. By getting stuck at some epoch I mean that no loss is printed anymore, even though the code still seems to be running. I ran several experiments and noticed that it does not happen at any particular epoch. For example, in the following experiment the loss got stuck at epoch 7 and never got past that point, until I had to restart the kernel:

Epoch: 0 Train_Loss: 2.3234, Val_Loss: 1.8912
Epoch: 1 Train_Loss: 1.2648, Val_Loss: 1.7110
Epoch: 2 Train_Loss: 1.1671, Val_Loss: 1.6037
Epoch: 3 Train_Loss: 1.1116, Val_Loss: 1.5056
Epoch: 4 Train_Loss: 1.2287, Val_Loss: 1.6071
Epoch: 5 Train_Loss: 1.3418, Val_Loss: 1.6923
Epoch: 6 Train_Loss: 1.2899, Val_Loss: 1.7839
Epoch: 7 Train_Loss: 1.4315, Val_Loss: 1.7068

In another experiment it happened at the last epoch, i.e. epoch 39 of a 40-epoch run. Here you can see the results:

Epoch: 13 Train_Loss: 1.4405, Val_Loss: 1.8422
Epoch: 14 Train_Loss: 1.5998, Val_Loss: 2.1091
Epoch: 15 Train_Loss: 1.8484, Val_Loss: 2.0408
Epoch: 16 Train_Loss: 1.7372, Val_Loss: 2.0536
Epoch: 17 Train_Loss: 1.6792, Val_Loss: 2.3075
Epoch: 18 Train_Loss: 1.6514, Val_Loss: 1.9093
Epoch: 19 Train_Loss: 1.7062, Val_Loss: 1.8253
Epoch: 20 Train_Loss: 1.8381, Val_Loss: 2.0517
Epoch: 21 Train_Loss: 1.9149, Val_Loss: 2.0542
Epoch: 22 Train_Loss: 2.0586, Val_Loss: 2.4727
Epoch: 23 Train_Loss: 2.0700, Val_Loss: 2.3520
Epoch: 24 Train_Loss: 2.1251, Val_Loss: 2.1675
Epoch: 25 Train_Loss: 2.2762, Val_Loss: 2.1867
Epoch: 26 Train_Loss: 2.2843, Val_Loss: 2.5865
Epoch: 27 Train_Loss: 2.3018, Val_Loss: 2.3153
Epoch: 28 Train_Loss: 2.3720, Val_Loss: 2.4467
Epoch: 29 Train_Loss: 2.2131, Val_Loss: 2.2533
Epoch: 30 Train_Loss: 2.2144, Val_Loss: 2.6644
Epoch: 31 Train_Loss: 2.2345, Val_Loss: 2.3036
Epoch: 32 Train_Loss: 2.0849, Val_Loss: 2.2778
Epoch: 33 Train_Loss: 2.1355, Val_Loss: 2.9235
Epoch: 34 Train_Loss: 2.4909, Val_Loss: 2.5539
Epoch: 35 Train_Loss: 2.5277, Val_Loss: 2.3989
Epoch: 36 Train_Loss: 2.6098, Val_Loss: 2.6083
Epoch: 37 Train_Loss: 2.6496, Val_Loss: 2.5968
Epoch: 38 Train_Loss: 2.5559, Val_Loss: 2.6308

The code is too large to post here. In short, it is an unsupervised learning setup and I am working with 3D medical images, so the batch size is as small as 2. My colleague told me that with such a small batch size I should expect some fluctuation in the loss, but not the training actually getting stuck.
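For what it's worth, the epoch loop is structured roughly like this. This is a heavily simplified sketch, not the real code; all the names (model, train_loader, compute_loss, etc.) are placeholders:

```python
import torch

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for batch in train_loader:               # batch size is only 2
        optimizer.zero_grad()
        loss = compute_loss(model, batch)    # unsupervised loss on 3D volumes
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_loader:
            val_loss += compute_loss(model, batch).item()

    # this is the print that stops appearing when training gets stuck
    print(f"Epoch: {epoch} "
          f"Train_Loss: {train_loss / len(train_loader):.4f}, "
          f"Val_Loss: {val_loss / len(val_loader):.4f}")
```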