Fatal Python error: deallocating None

I’m getting jobs randomly aborted when training an LSTM on 2 GPUs, with the following short, cryptic error:

Fatal Python error: deallocating None

Thread 0x00007f260ec2d700 (most recent call first):

Thread 0x00007f260f7c5700 (most recent call first):

Thread 0x00007f260ffc6700 (most recent call first):

Current thread 0x00007f266b197740 (most recent call first):
 File "model.py", line 188 in <module>
Aborted (core dumped)

Line 188 is the call to the training function. Some jobs finish correctly; others fail as above.
Linux AWS 1044 (Ubuntu 16.04.3)
AWS 8-GPU instance (Tesla K80)
Python 3.6
PyTorch 0.3.0
CUDA 9.0.176
cuDNN 7.0.3

I’m trying to retrieve the core dump to see if there are any clues there. This happens while doing a grid search on the hidden size, and it seems unrelated to the parameter value.

I’m not entirely sure how to reproduce it; I’ll report back if I’m able to figure something out. A Google search shows scattered instances of this error that seem to point to malloc/free issues. Way over my head.

A gdb backtrace could help as well. You could do:

gdb python
catch throw
run test.py

where test.py is your Python script. When gdb catches the exception, typing backtrace will print the full backtrace.

Great suggestion, thanks for the detailed steps. This is the result:

Fatal Python error: deallocating None

Thread 0x00007fff9da36700 (most recent call first):

Thread 0x00007fff9eb78700 (most recent call first):

Thread 0x00007fff9e377700 (most recent call first):

Current thread 0x00007ffff7fd8740 (most recent call first):
  File "model.py", line 188 in <module>

Thread 1 "python" received signal SIGABRT, Aborted.
0x00007ffff760a428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace
#0  0x00007ffff760a428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff760c02a in __GI_abort () at abort.c:89
#2  0x00005555555b1e4f in Py_FatalError ()
#3  0x0000555555613817 in buffered_flush.cold ()
#4  0x00005555556603aa in _PyCFunction_FastCallDict ()
#5  0x000055555566077f in _PyObject_FastCallDict ()
#6  0x00005555556eeb6f in _PyObject_CallMethodId_SizeT ()
#7  0x00005555556603aa in _PyCFunction_FastCallDict ()
#8  0x000055555566077f in _PyObject_FastCallDict ()
#9  0x00005555556f0faf in _PyObject_CallMethodId ()
#10 0x0000555555759a98 in flush_std_files ()
#11 0x00005555555b1e49 in Py_FatalError ()
#12 0x0000555555642048 in dict_dealloc ()
#13 0x00005555556f3e3e in subtype_dealloc ()
#14 0x0000555555641820 in _PyTrash_thread_destroy_chain ()
#15 0x00005555556ecfeb in fast_function ()
#16 0x00005555556f2f95 in call_function ()
#17 0x000055555571462a in _PyEval_EvalFrameDefault ()
#18 0x00005555556ed8d9 in PyEval_EvalCodeEx ()
#19 0x00005555556ee67c in PyEval_EvalCode ()
#20 0x0000555555768ce4 in run_mod ()
#21 0x00005555557690e1 in PyRun_FileExFlags ()
#22 0x00005555557692e4 in PyRun_SimpleFileExFlags ()
#23 0x000055555576cdaf in Py_Main ()
#24 0x00005555556338be in main ()
(gdb) 

Any thoughts? I’ll keep digging on my end. This is now happening on every pair of GPUs on which I’m running a separate job (not just randomly anymore).

Hi, have you figured out this problem? I’m running into the same situation.

Thx

Nope, not directly. I haven’t worked on this for a few months, so the versions I was using back then are probably obsolete by now. I hacked my way around it by saving the model and other relevant parameters after every successful epoch, and then restarting training from the last checkpoint. Ugly, but it works. Sorry you’re facing this; it’s a nasty, uninformative bug :confused:
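
In case it helps, that checkpointing workaround is only a few lines. This is a minimal sketch, not my original script; the file name, the dummy model/optimizer, and the 100-epoch loop are placeholders you’d swap for your own:

import torch
import torch.nn as nn

CKPT_PATH = 'checkpoint.pt'  # hypothetical path

def save_checkpoint(epoch, model, optimizer, path=CKPT_PATH):
    # Persist everything needed to resume after a crashed job.
    torch.save({'epoch': epoch,
                'model_state': model.state_dict(),
                'optimizer_state': optimizer.state_dict()}, path)

def load_checkpoint(model, optimizer, path=CKPT_PATH):
    # Return the epoch to resume from; 0 if no checkpoint exists yet.
    try:
        ckpt = torch.load(path)
    except FileNotFoundError:
        return 0
    model.load_state_dict(ckpt['model_state'])
    optimizer.load_state_dict(ckpt['optimizer_state'])
    return ckpt['epoch'] + 1

# Dummy model/optimizer so the sketch runs on its own; use your real ones instead.
model = nn.LSTM(64, 128)
optimizer = torch.optim.Adam(model.parameters())

start_epoch = load_checkpoint(model, optimizer)
for epoch in range(start_epoch, 100):
    # ... run one full epoch of training here ...
    save_checkpoint(epoch, model, optimizer)  # checkpoint after every successful epoch

When a job dies, you just relaunch it and it picks up from the last completed epoch instead of starting over.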

Thanks for your reply. Would you mind telling me your data volume and GPU parallelism strategy (DataParallel or DistributedDataParallel)?

best

I used nn.DataParallel, but it depends on what you need.
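
For reference, the wrapping itself is just a couple of lines. A minimal sketch (TinyLSTM and the sizes are made up for illustration, and it assumes a machine with at least two GPUs and a recent PyTorch):

import torch
import torch.nn as nn

class TinyLSTM(nn.Module):
    # Toy stand-in for the real model; any nn.Module works the same way.
    def __init__(self, input_size=128, hidden_size=256, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, seq_len, hidden)
        return self.head(out[:, -1])   # prediction from the last time step

model = TinyLSTM()
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch along dim 0 across the listed GPUs
    # and gathers the per-GPU outputs back on device 0.
    model = nn.DataParallel(model, device_ids=[0, 1])
model = model.cuda()

x = torch.randn(32, 20, 128).cuda()   # (batch, seq_len, features)
y = model(x)                          # y: (32, 10), computed across both GPUs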