I am training a net. When I train it on the CPU it runs fine, other than being slow.
When I train on the GPU (which I have used extensively with PyTorch otherwise), the training quits abruptly.
I get the cryptic message: “double free or corruption (!prev)”
Using breakpoints, I have narrowed the issue down to backward().
As I increase the batch size it gets worse. At batch size 1 it may run for almost 1K iterations before crashing.
At batch size 4 it usually happens within the first 5 batches, though every so often it makes it through 30 iterations before failing.
Anyone deal with something like this before?
Looks like it has to do with the memory footprint in some way. If I decrease the network size or the input size, everything is fine, even though in general the GPU is nowhere close to maxing out when looking at MSI Afterburner.
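For a rough sense of why depth and sequence length matter even when reported utilization looks low: autograd keeps each layer's activations around for backward(), so the saved-tensor footprint scales with layers × channels × sequence length. A back-of-the-envelope sketch (all sizes are hypothetical, not the actual model):

```python
# Rough estimate of the activations autograd saves for backward()
# in a stack of 1d convs. Hypothetical sizes: batch of 4, 512
# channels, 2000 time-steps, 10 layers, float32 (4 bytes/element).
batch, channels, seq_len, layers, bytes_per_elem = 4, 512, 2000, 10, 4

saved_bytes = batch * channels * seq_len * bytes_per_elem * layers
print(saved_bytes / 2**20)  # 156.25 MiB of saved activations
```

The point is only that the backward-pass footprint grows multiplicatively in depth and sequence length, which matches the observation that shrinking either the network or the input makes the crash go away.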
Do you know how to use gdb to get the stack trace of where this happens?
Otherwise, if you have a small code sample that reproduces the error, that would be very helpful.
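As a lower-friction alternative to attaching gdb (`gdb --args python train.py`, then `run` and `bt` after the abort), Python's built-in faulthandler can print the Python-level stack when the process dies; a minimal sketch:

```python
# faulthandler dumps the Python stack of all threads on fatal signals
# (SIGSEGV, SIGABRT, etc.). A glibc "double free or corruption" abort
# raises SIGABRT, so this localizes the crash to a Python line such as
# loss.backward() without needing a native debugger.
import faulthandler

faulthandler.enable()  # install the fatal-signal handlers

# ... run the training loop as usual after this point ...
```

This only gives the Python frames, not the native C/CUDA frames; for those, gdb is still the right tool.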
I’m using PyCharm in debug mode, which I believe uses gdb somewhere in there.
I finally got things to work. I found that very long sequences (2000+ time-steps) caused the issue. It may have been due to my loader; I didn’t inspect it in depth since I found a solution.
It doesn’t reproduce on simple small models, and I can’t really share my production architecture/training procedures.
If you stack a bunch of 1d convs so that you get a model in the tens of millions of parameters and try to run it on a small batch of a very long sequence, you may get similar errors.
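To make “tens of millions of parameters” concrete, here is a hypothetical stack (the channel counts, kernel size, and depth are made up for illustration, not the poster's model):

```python
# A Conv1d layer has out_ch * in_ch * kernel weights plus out_ch biases.
def conv1d_params(in_ch, out_ch, kernel):
    return out_ch * in_ch * kernel + out_ch

# Hypothetical stack: 20 layers of 512 -> 512 channels, kernel size 5.
total = sum(conv1d_params(512, 512, 5) for _ in range(20))
print(total)  # 26224640 -- about 26M parameters
```

Pair a stack like this with a sequence of 2000+ time-steps and a small batch, and you are in the regime described above.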
Thanks for getting back to me.