Backward() killed after a few steps

I’m training a deep CNN-based model on a fraction of the data, which should fit into memory. Every time, after a few steps, the whole process is terminated with “Killed 9” on my local machine, and this happens during loss.backward().

Can someone give me an idea of what the problem is? Or how to debug backward()?

Thanks!


Maybe you can post your code and provide more information so that people can help you.

If no one you know killed the process, it’s possible that the kernel terminated it due to memory constraints. You could try watching the memory consumption with something like ps u and verifying if the memory is actually increasing to unmanageable amounts.
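As an alternative to watching ps from another terminal, the process can report its own memory use at each step. The following is only a minimal sketch using the standard-library resource module on a Unix-like system; the helper name peak_rss_mb and the placeholder loop are assumptions for illustration, not part of the original code. Note that it reports the peak resident set size, so a number that keeps climbing from step to step suggests memory is growing.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Hypothetical training loop: the real forward pass, loss.backward()
# and optimizer step would replace the placeholder comment below.
for step in range(3):
    # ... forward pass, loss.backward(), optimizer.step() ...
    print(f"step {step}: peak RSS = {peak_rss_mb():.1f} MiB")
```

If the printed value rises steadily until the process is killed, that points to memory growth rather than a single oversized allocation.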


Adding on to what @richard said, signal 9 is SIGKILL and is likely caused by OOM.
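To confirm that the process really died from SIGKILL, one option is to launch it from a small wrapper and inspect the return code. This is a rough sketch assuming a POSIX system, where Python's subprocess module reports death by signal as a negative return code; describe_exit is a hypothetical helper, and the self-killing child below only stands in for the real training script.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Return a readable description of a subprocess return code."""
    if returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        sig = signal.Signals(-returncode)
        return f"terminated by {sig.name} (signal {sig.value})"
    return f"exited with status {returncode}"


# Stand-in for the training script: a child that sends SIGKILL to itself.
child = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(describe_exit(child.returncode))
```

A SIGKILL result by itself does not prove an out-of-memory kill, since any process with permission can send that signal, but together with growing memory use it makes OOM the most likely explanation.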