Malloc, and corrupted size errors

Felix_Lessange · February 13, 2019, 10:40am

Hi.
I’m using PyTorch in a multi-GPU setting, along with some multiprocessing for loading and preprocessing, and I quite often run into these errors, which I can’t manage to solve.

malloc.c:4023: _int_malloc: Assertion (unsigned long) (size) >= (unsigned long) (nb)
corrupted size vs. prev_size

These appear very randomly too. I thought this was related to https://github.com/pytorch/pytorch/issues/2507, but their solution didn’t work for me:

sudo apt-get install libtcmalloc-minimal4 
export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4"
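One thing worth ruling out with the workaround above is whether the preload actually took effect in the training process. Below is a minimal sketch, assuming a Linux system where /proc/self/maps is available. The helper name `preloaded_allocators` and the `KNOWN_ALLOCATORS` list are made up for illustration and are not part of PyTorch.

```python
from pathlib import Path

# Substrings to look for in mapped library names (an assumption, extend as needed).
KNOWN_ALLOCATORS = ("tcmalloc", "jemalloc")


def preloaded_allocators(maps_text):
    """Return sorted paths of mapped libraries whose file name mentions a known allocator."""
    found = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no path
        path = fields[5].strip()
        file_name = path.rsplit("/", 1)[-1]
        if any(name in file_name for name in KNOWN_ALLOCATORS):
            found.add(path)
    return sorted(found)


if __name__ == "__main__":
    maps_file = Path("/proc/self/maps")  # Linux only
    if maps_file.exists():
        libs = preloaded_allocators(maps_file.read_text())
        print(libs if libs else "no alternative allocator mapped")
```

If the list comes back empty when this is run inside the training script, the LD_PRELOAD export probably did not reach that process (for example, it was set in a different shell).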

It seems that this error pops up really often when working with GroupNormalization. I can’t even get through a whole epoch, actually. I thought it might be due to the multiprocess nature of my pipeline (I use torch.multiprocessing.Array to communicate batches between processes), but the fact that the frequency depends on the architecture used is pretty weird to me.
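For reference, the batch-passing pattern described above can be sketched roughly as follows. This is not the actual pipeline: it uses the standard library `multiprocessing.Array` (which `torch.multiprocessing` re-exports), and `BATCH_SIZE`, `write_batch`, `producer` and `run_demo` are hypothetical names. The explicit size check and lock are only there to help rule out the shared buffer as the source of the problem.

```python
import multiprocessing as mp

BATCH_SIZE = 32  # hypothetical number of floats in one flattened batch


def write_batch(shared, values):
    """Copy a flat batch into the shared buffer, refusing any size mismatch."""
    if len(values) != len(shared):
        raise ValueError(
            f"batch has {len(values)} elements, buffer holds {len(shared)}"
        )
    with shared.get_lock():  # keep readers from seeing a half-written batch
        shared[:] = values


def producer(shared, ready):
    """Worker process: fill the buffer once, then signal the consumer."""
    write_batch(shared, [float(i) for i in range(BATCH_SIZE)])
    ready.set()


def run_demo():
    shared = mp.Array("f", BATCH_SIZE)  # lock=True by default
    ready = mp.Event()
    worker = mp.Process(target=producer, args=(shared, ready))
    worker.start()
    ready.wait(timeout=10)
    worker.join()
    with shared.get_lock():
        return shared[:]  # copy out as a plain list


if __name__ == "__main__":
    batch = run_demo()
    print(len(batch), batch[:3])
```

Writes done from Python like this are bounds-checked, so they should not be able to corrupt the heap by themselves. Messages such as "corrupted size vs. prev_size" come from glibc detecting damaged heap metadata, which usually points at native code writing out of bounds rather than at the Python side.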

Have you got any idea what might cause this?

python 3.7
torch 1.0.0

saransh_karira · August 11, 2020, 5:41am

Hey @Felix_Lessange ! Did you resolve it? If yes, then how??

Felix_Lessange · December 23, 2020, 9:47pm

Hey. Sorry for the late reply. I never saw this error in 2020. I’m guessing some silent driver / PyTorch / kernel update fixed it?