I’m training a deep CNN-based model on a fraction of the data, which should fit into memory. Every time, after a few steps, the whole process is terminated with “Killed 9” on my local machine, and this happens during loss.backward().
Can someone give me an idea of what the problem might be, or how to debug backward()?
If no one you know killed the process, it’s possible that the kernel terminated it due to memory constraints (signal 9 is SIGKILL, which is what the Linux out-of-memory killer sends). You could try watching the memory consumption with something like ps u and checking whether memory usage actually grows to unmanageable amounts.
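If watching ps u by hand is awkward, the same check can be done from inside the script. Below is a minimal sketch using only the Python standard library (Unix only). The names peak_rss_mb, run_with_memory_log and fake_step are made up for illustration; fake_step just stands in for one real training step (forward pass, loss.backward(), optimizer step). Note that it reports the peak resident memory so far, not the current value, so look for a number that keeps climbing step after step.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_with_memory_log(step_fn, num_steps, log_every=1):
    """Call step_fn(step) repeatedly and log peak memory after logged steps."""
    history = []
    for step in range(num_steps):
        step_fn(step)
        if step % log_every == 0:
            mb = peak_rss_mb()
            history.append((step, mb))
            print(f"step {step}: peak RSS {mb:.1f} MiB")
    return history


# Hypothetical stand-in for a training step: it keeps a reference to a new
# buffer on every call, the way a leak in a real loop would.
kept = []


def fake_step(step):
    kept.append(bytes(range(256)) * 4096)  # roughly 1 MiB per step


history = run_with_memory_log(fake_step, num_steps=5)
```

In a real script you would replace fake_step with your own step function. If the logged value grows steadily until the process dies, the kill is almost certainly memory-related, and the next thing to look for is whatever is being kept alive across iterations.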