Network behaves differently after storing and loading

Hi,
I stumbled across a problem recently: I started training a network, stored a checkpoint, and after loading it again to resume training I got a huge spike in the loss. So I investigated a bit to figure out what is going on. After some debugging I found that the network and optimizer state are actually loading correctly.
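For reference, the checkpointing itself is just the standard `state_dict` pattern, roughly like this (the model, optimizer and filename here are placeholders, not the exact code from the gist):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # placeholder model
optimizer = torch.optim.Adam(model.parameters())
epoch = 5

# saving a checkpoint
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "epoch": epoch,
}, "ckpt.pth")

# loading / resuming
ckpt = torch.load("ckpt.pth")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_epoch = ckpt["epoch"] + 1
```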

The problem seems to be connected to the custom loss function, but I cannot figure out what exactly is wrong, since it looks fine to me and training actually works with it.

I created a minimal example that shows the problem and put it in a gist: https://gist.github.com/koelscha/172505d7fe3c17b3db84282e6bb5caeb

If I set use_weights=False in the loss function, everything works as expected. If I set use_weights=True, the weird behavior from above happens, and it even happens when all the weights are 1 :open_mouth: (see the sketch below)
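To give a rough idea of what I mean by use_weights without opening the gist, the pattern looks something like the following sketch (the class name, the MSE-style error and the argument names are just illustrative here; the actual loss is in the gist):

```python
import torch
import torch.nn as nn

class WeightedLoss(nn.Module):
    """Illustrative weighted MSE-style loss with a use_weights switch."""
    def __init__(self, use_weights=True):
        super().__init__()
        self.use_weights = use_weights

    def forward(self, pred, target, weights=None):
        err = (pred - target) ** 2
        if self.use_weights and weights is not None:
            err = err * weights          # element-wise weighting
        return err.mean()

criterion = WeightedLoss(use_weights=True)
loss = criterion(torch.randn(8, 4), torch.randn(8, 4), weights=torch.ones(8, 4))
```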

I am using PyTorch 1.3.0 with CUDA 10.1 on an NVIDIA RTX 2080 Ti.

Thanks for your help in advance,
koelscha

I think I found out what the problem was. However, it's incredible that this could happen without even a warning being printed.
It seems PyTorch has a problem when multiplying two slices of tensors with different dtypes. If I call .contiguous() on the slices before multiplying, everything works fine. Minimal example: https://gist.github.com/koelscha/4dc9255a83e39b36b1cf20afa3c6fa74
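A sketch of the pattern I mean (the shapes, dtypes and slicing here are just illustrative, the actual minimal example is in the gist):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(4, 8, dtype=torch.float64, device=device)
b = torch.randn(4, 8, dtype=torch.float32, device=device)

# strided slices -> non-contiguous views with different dtypes
x = a[:, ::2]
y = b[:, ::2]

bad  = x * y                              # this combination misbehaved for me on 1.3.0
good = x.contiguous() * y.contiguous()    # workaround: make both operands contiguous first
```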

I would at least expect a warning that this combination causes problems, since the same operation works fine in numpy.
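For comparison, the same kind of mixed-dtype multiplication on numpy views simply promotes the dtype and gives the correct result:

```python
import numpy as np

a = np.random.randn(4, 8).astype(np.float64)
b = np.random.randn(4, 8).astype(np.float32)

x = a[:, ::2]     # non-contiguous view, float64
y = b[:, ::2]     # non-contiguous view, float32

out = x * y       # promoted to float64, values are correct
print(out.dtype)  # float64
```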