Backward() killed after a few steps

evelynz · November 25, 2017, 8:17pm

I’m training a deep CNN-based model on a fraction of the data, which should fit into memory. Every time, after a few steps, the whole process is terminated with “Killed 9” on my local machine. This happens during loss.backward().

Can someone give me an idea of what the problem is? Or how to debug backward()?

Thanks!

jdhao · November 27, 2017, 3:04pm

Maybe you can post your code and provide more information so people can help you.

richard · November 27, 2017, 6:33pm

If no one you know killed the process, it’s possible that the kernel terminated it due to memory constraints. You could try watching the memory consumption with something like ps u and verifying whether it is actually growing to unmanageable amounts.
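If watching ps u from a second terminal is awkward, the same check can be done from inside the training script. The following is only a rough sketch using Python's standard resource module (Unix only); the helper names peak_rss_mb and log_memory are made up for illustration and are not part of PyTorch.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(step, every=10):
    """Print peak memory every `every` steps (hypothetical helper).

    Returns the logged value in MB, or None when nothing was logged.
    """
    if step % every == 0:
        mb = peak_rss_mb()
        print(f"step {step}: peak RSS {mb:.1f} MB")
        return mb
    return None
```

Calling log_memory(step) right after loss.backward() in the training loop gives one number per logging interval. Peak RSS never decreases, so what matters is whether it keeps climbing from step to step instead of levelling off after the first few iterations.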

SimonW · November 27, 2017, 6:38pm

Adding on to what @richard said, signal 9 is SIGKILL, and here it was most likely sent by the kernel because the machine ran out of memory (OOM).
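To see how a SIGKILL shows up from the outside, here is a small sketch for POSIX systems. It assumes the training script would be launched from a wrapper process; the sleeping child below is only a stand-in for that script, and the kill is sent by hand rather than by the kernel.

```python
import signal
import subprocess
import sys

# Start a stand-in child process that just sleeps, then kill it with
# the same signal the kernel uses when it runs out of memory.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.send_signal(signal.SIGKILL)
returncode = child.wait()

# On POSIX, a negative return code -N means the child was terminated by signal N.
killed_by = signal.Signals(-returncode).name if returncode < 0 else None
print(int(signal.SIGKILL), returncode, killed_by)
```

A wrapper that sees a return code of -9 knows the child was killed by a signal rather than exiting on its own. It cannot tell who sent the signal, so memory usage still has to be checked separately.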