"double free or corruption (!prev)" when using gpu. fine when using cpu

Dan_Erez · December 25, 2019, 3:49pm

I am training a net. When I train it on the CPU it runs fine, other than being slow.
When I train on the GPU (which I have otherwise used extensively with PyTorch), the training quits abruptly.

I get the cryptic message: “double free or corruption (!prev)”
Using breakpoints, I have traced the failure to the backward() call.

It gets worse as I increase the batch size. At a batch size of 1 it may run for almost 1K iterations before crashing.
At a batch size of 4 it usually happens within the first 5 batches; every so often it makes it through 30 iterations before failing.

Anyone deal with something like this before?

Dan_Erez · December 25, 2019, 4:26pm

Looks like it has to do with the memory footprint in some way. If I decrease the network size or the input size, everything is fine, even though the GPU is nowhere close to maxing out when I look at nvidia-smi.

albanD · December 27, 2019, 9:10am

Do you know how to use gdb to get the stack trace of where this happens?
Otherwise, if you have a small code sample that reproduces the error, that would be very helpful.
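
For anyone who has not used gdb before, a lighter first step is Python's standard-library `faulthandler` module. The sketch below assumes the crash arrives as a fatal signal such as SIGABRT, which is how glibc normally reports a double free. It only prints the Python-level traceback, not the native C/C++ frames, so it complements a gdb backtrace rather than replacing it. The function name `enable_crash_tracebacks` is made up for illustration.

```python
import faulthandler
import sys


def enable_crash_tracebacks():
    """Ask Python to dump tracebacks when the process dies on a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL. On such a signal it writes the Python traceback of every thread
    to stderr before the process terminates.
    """
    faulthandler.enable(file=sys.stderr, all_threads=True)
    return faulthandler.is_enabled()


# Call this once at the top of the training script, before the training loop.
enable_crash_tracebacks()
```

The same effect is available without touching the script by running it with `python -X faulthandler`.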

Dan_Erez · January 1, 2020, 4:14pm

I’m using PyCharm in debug mode, which I believe uses gdb somewhere in there.
I finally got things to work. I found that having very long sequences (2000+ time steps) caused the issue. It may have been due to my loader. I didn’t inspect it in depth since I found a solution.
The error doesn’t reproduce on simple, small models, and I can’t really share my production architecture or training procedures.

If you stack a bunch of 1D convs so that you get a model in the tens of millions of parameters and try to run it on a small batch of a very long sequence, you may get similar errors.
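
A rough sketch of that kind of setup, for anyone who wants to try reproducing it. This is not the production model: the channel count, kernel size, depth, sequence length, and the names `conv1d_param_count`, `stack_param_count`, and `run_repro` are all assumptions chosen only to land in the tens of millions of parameters with a long input, and there is no guarantee it triggers the same crash.

```python
def conv1d_param_count(in_channels, out_channels, kernel_size):
    """Weights plus biases of one nn.Conv1d layer (groups=1, bias=True)."""
    return out_channels * in_channels * kernel_size + out_channels


def stack_param_count(channels, kernel_size, depth, in_channels=1):
    """Parameter count of `depth` stacked Conv1d layers of equal width."""
    first = conv1d_param_count(in_channels, channels, kernel_size)
    rest = (depth - 1) * conv1d_param_count(channels, channels, kernel_size)
    return first + rest


def run_repro(channels=512, kernel_size=9, depth=8,
              batch_size=4, seq_len=4000, steps=50):
    """Run forward/backward on random long sequences (all sizes hypothetical)."""
    # Imported here so the parameter arithmetic above works without PyTorch.
    import torch
    from torch import nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Build the stack: Conv1d + ReLU repeated `depth` times, length preserved.
    layers, in_ch = [], 1
    for _ in range(depth):
        layers.append(nn.Conv1d(in_ch, channels, kernel_size,
                                padding=kernel_size // 2))
        layers.append(nn.ReLU())
        in_ch = channels
    model = nn.Sequential(*layers).to(device)
    print("parameters:", sum(p.numel() for p in model.parameters()))

    for step in range(steps):
        x = torch.randn(batch_size, 1, seq_len, device=device)
        loss = model(x).mean()
        model.zero_grad()
        loss.backward()  # the call where the crash was observed
        print(step, loss.item())
```

With the defaults, `stack_param_count(512, 9, 8)` gives about 16.5 million parameters. Calling `run_repro()` with a larger `seq_len` or `batch_size` is the obvious knob to turn if nothing fails at first.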

Thanks for getting back to me.