I’m using PyTorch in a multi-GPU setting, along with some multiprocessing for loading and preprocessing, and I quite often hit the following errors, which I haven’t managed to solve:
malloc.c:4023: _int_malloc: Assertion (unsigned long) (size) >= (unsigned long) (nb)
corrupted size vs. prev_size
These appear quite randomly, too. I thought this was related to https://github.com/pytorch/pytorch/issues/2507, but the solution suggested there didn’t work for me:
sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4"
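One thing worth checking is whether the preload actually reached the Python process. The sketch below is only a rough diagnostic, assuming a Linux system where /proc/self/maps is readable; the helper name preloaded_tcmalloc is hypothetical and not part of PyTorch or tcmalloc.

```python
import os


def preloaded_tcmalloc(maps_path="/proc/self/maps"):
    """Return the sorted paths of mapped files whose name contains 'tcmalloc'."""
    found = set()
    try:
        with open(maps_path) as f:
            for line in f:
                # Fields: address, perms, offset, dev, inode, [pathname]
                parts = line.split(None, 5)
                if len(parts) == 6 and "tcmalloc" in parts[5]:
                    found.add(parts[5].strip())
    except OSError:
        # /proc is not available (for example on a non-Linux system)
        pass
    return sorted(found)


if __name__ == "__main__":
    print("LD_PRELOAD =", os.environ.get("LD_PRELOAD"))
    print("tcmalloc mappings:", preloaded_tcmalloc())
```

If the list is empty inside the training process or one of its workers, the export probably did not apply to the shell that launched it.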
The error seems to show up especially often when working with GroupNormalization; with it, I can’t even get through a whole epoch. I thought it might be due to the multiprocess nature of my pipeline (I use torch.multiprocessing.Array to communicate batches between processes), but the fact that the frequency depends on the architecture used seems pretty weird to me.
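To help rule the batch hand-off in or out, here is a minimal sketch of a guarded write into a shared array. It is not the actual pipeline: it assumes flat batches of doubles, uses the standard-library multiprocessing.Array (which torch.multiprocessing re-exports), and the helper names write_batch and read_batch are hypothetical.

```python
import multiprocessing as mp  # torch.multiprocessing re-exports the same Array


def write_batch(shared, batch):
    """Copy `batch` (a flat sequence of floats) into `shared` under its lock."""
    n = len(batch)
    if n > len(shared):
        # Fail loudly in Python instead of attempting an oversized write.
        raise ValueError(f"batch of {n} values does not fit in buffer of {len(shared)}")
    with shared.get_lock():
        shared[:n] = batch
    return n


def read_batch(shared, n):
    """Return the first `n` values of `shared` as a list, under its lock."""
    with shared.get_lock():
        return list(shared[:n])


if __name__ == "__main__":
    buf = mp.Array("d", 8)  # 8 doubles, created with the default lock
    count = write_batch(buf, [1.0, 2.0, 3.0])
    print(read_batch(buf, count))
```

If the crashes persist with every write checked and locked like this, the shared buffer is a less likely culprit than the architecture-dependent code path.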
Do you have any idea what might cause this?