Hi,
I recently stumbled across a problem: when I train a network, store a checkpoint, and then load it again to resume training, I get a huge spike in the loss. So I investigated a bit to figure out what is going on. After debugging, I found that the network and the optimizer are loading correctly.
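For context, the save/resume pattern I'm using is essentially the standard state_dict round trip. Here is a minimal sketch (the function names and the tiny Linear model are just illustrative, not my actual training code):

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(model, optimizer, path):
    # Save both model and optimizer state so resuming is exact
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)


def load_checkpoint(model, optimizer, path):
    # Restore both state dicts before continuing training
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])


if __name__ == "__main__":
    model = nn.Linear(2, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pt")
        save_checkpoint(model, opt, path)

        # A fresh model/optimizer pair, as after a restart
        resumed = nn.Linear(2, 1)
        resumed_opt = torch.optim.SGD(resumed.parameters(), lr=0.1)
        load_checkpoint(resumed, resumed_opt, path)

        same = all(torch.equal(a, b) for a, b in
                   zip(model.state_dict().values(),
                       resumed.state_dict().values()))
        print(same)
```

With this pattern the restored parameters compare equal to the saved ones, which is why I'm fairly confident the checkpointing itself is not the problem.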
The problem seems to be connected somehow to the custom loss function, but I cannot figure out what the issue is; it looks fine to me, and training actually works with it.
I put together a minimal example that reproduces the problem and uploaded it as a gist: https://gist.github.com/koelscha/172505d7fe3c17b3db84282e6bb5caeb
If I set use_weights=False in the loss function, everything works as expected. If I set use_weights=True, the weird behavior described above occurs, and it happens even when the weights are all 1.
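To make the flag concrete: schematically, the loss is of the following shape (weighted_mse and its argument names are placeholders here, not the exact code from the gist). With weights of all 1, the weighted branch should be mathematically identical to the unweighted one, which is what makes the spike so confusing:

```python
import torch


def weighted_mse(pred, target, weights=None, use_weights=False):
    # Per-element squared error
    se = (pred - target) ** 2
    if use_weights and weights is not None:
        # Elementwise weighting; with weights == 1 this is a no-op
        se = se * weights
    return se.mean()


if __name__ == "__main__":
    pred = torch.tensor([1.0, 2.0, 3.0])
    target = torch.tensor([0.0, 0.0, 0.0])
    ones = torch.ones_like(pred)
    print(torch.allclose(
        weighted_mse(pred, target, ones, use_weights=True),
        weighted_mse(pred, target, use_weights=False)))
```

So mathematically the two code paths agree for unit weights; the difference after resuming must come from somewhere else.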
I am using PyTorch 1.3.0, an NVIDIA RTX 2080 Ti, and CUDA 10.1.
Thanks in advance for your help,
koelscha