Interrupted by signal 11: SIGSEGV

I’m running a relatively big model on my M1 MacBook and I’m getting this error after a varying number of iterations:

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

I would share code, but it’s all a bit too complicated and entangled to give a clear, minimal example. I think it should be fine though…

You could use gdb via:

gdb --args python script.py

then run the script inside gdb and print a backtrace after the crash to isolate the issue further. A minimal code snippet would indeed be helpful, but I also understand that creating one can be a huge amount of work.
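If attaching gdb is inconvenient, Python's built-in faulthandler module can at least dump the Python-level stack when the process segfaults. A minimal sketch (the placement at the top of your training script is the only assumption here):

```python
import faulthandler

# Dump the Python traceback of all threads to stderr if the process
# receives a fatal signal such as SIGSEGV, SIGFPE, SIGABRT, SIGBUS,
# or SIGILL, before the process dies.
faulthandler.enable()

assert faulthandler.is_enabled()

# ... rest of the training script ...
```

The same effect can be had without touching the code by setting the PYTHONFAULTHANDLER environment variable or running python -X faulthandler script.py.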

I believe I narrowed down the problem to a code snippet that replaces one Sequential with another Sequential. I have a list of sequentials called sequentials that I initialized in a Module. Periodically, I do sequentials[i] = initialize_new_sequential(). It seems memory isn’t being freed up properly.
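Roughly, the pattern looks like this (layer sizes and names are simplified stand-ins, not my real model):

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, num_blocks=3, dim=8):
        super().__init__()
        # Register the blocks in a ModuleList so their parameters
        # are tracked by the parent module.
        self.sequentials = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            for _ in range(num_blocks)
        )

    def forward(self, x):
        for block in self.sequentials:
            x = block(x)
        return x

def initialize_new_sequential(dim=8):
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

model = Model()
# Periodically replace one block; ModuleList.__setitem__ re-registers
# the new module under the same index.
model.sequentials[1] = initialize_new_sequential()
out = model(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 8])
```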

What does the backtrace show? Could you post the output here, please?

Sorry, I managed to get rid of the error, but now I get memory errors in the middle of training:

RuntimeError: CUDA out of memory. Tried to allocate 250.00 MiB (GPU 0; 10.92 GiB total capacity; 9.19 GiB already allocated; 45.31 MiB free; 10.10 GiB reserved in total by PyTorch)

Memory usage is constant throughout training, so this shouldn't happen. However, I do create new sequentials in the middle of training and replace old ones, as in list_of_sequentials[index] = new_sequential.

No matter how much I reduce the memory requirements, I still run into this memory leak.

The “list” is a ModuleList, so perhaps the sequentials that are being replaced were registered and some information about them persists?

I don’t know as I’ve never tried this approach. Could you post a minimal, executable code snippet to reproduce the memory leak?
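One thing worth checking in the meantime: ModuleList.__setitem__ does drop the old submodule from the module's registry, so the replaced Sequential itself can be garbage collected. The leak appears when something else still references its parameters. A common culprit (an assumption here, since we haven't seen your code) is an optimizer created over model.parameters() before the swap: its param_groups keep the old parameters alive, and the newly created blocks are never optimized. A sketch with made-up sizes:

```python
import gc
import weakref

import torch
import torch.nn as nn

def make_block(dim=4):
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

model = nn.Module()
model.sequentials = nn.ModuleList(make_block() for _ in range(2))

# Case 1: no other references. The replaced block's parameters
# are released after the swap.
old_param = weakref.ref(next(model.sequentials[0].parameters()))
model.sequentials[0] = make_block()
gc.collect()
assert old_param() is None  # old parameters were freed

# Case 2: an optimizer built before the swap still holds the old
# parameters in its param_groups, so they are never freed (and the
# new block's parameters are never updated).
opt = torch.optim.SGD(model.parameters(), lr=0.1)
held_param = weakref.ref(next(model.sequentials[1].parameters()))
model.sequentials[1] = make_block()
gc.collect()
assert held_param() is not None  # kept alive by the optimizer
```

If that matches your setup, recreating the optimizer (or updating its param_groups) after each replacement should let the old parameters be released.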