Segmentation Fault with PyTorch Lightning

Hi all, hope you’re well. I’m running a script with PyTorch Lightning and keep getting the segmentation fault shown below. I have no idea what’s causing it or how to address it. I enabled faulthandler to get a better sense of where the crash happens, and its output is pasted below. I’d appreciate any help getting this to work.

Fatal Python error: Segmentation fault

Current thread 0x00007f08d3c82740 (most recent call first):
File “”, line 228 in _call_with_frames_removed
File “”, line 1173 in create_module
File “”, line 565 in module_from_spec
File “”, line 666 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “”, line 228 in _call_with_frames_removed
File “”, line 1058 in _handle_fromlist
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/google/protobuf/descriptor.py”, line 47 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “”, line 228 in _call_with_frames_removed
File “”, line 1058 in _handle_fromlist
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/tensorflow/core/framework/function_pb2.py”, line 7 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “”, line 228 in _call_with_frames_removed
File “”, line 1058 in _handle_fromlist
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/tensorflow/python/eager/context.py”, line 32 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “”, line 228 in _call_with_frames_removed
File “”, line 1058 in _handle_fromlist
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/tensorflow/python/__init__.py”, line 41 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “”, line 228 in _call_with_frames_removed
File “”, line 972 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/tensorflow/__init__.py”, line 41 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/huggingface_hub/keras_mixin.py”, line 19 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/huggingface_hub/__init__.py”, line 37 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/transformers/file_utils.py”, line 51 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/transformers/dependency_versions_check.py”, line 36 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “”, line 228 in _call_with_frames_removed
File “”, line 1058 in _handle_fromlist
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/transformers/__init__.py”, line 43 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/torchmetrics/functional/text/bert.py”, line 28 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/torchmetrics/functional/__init__.py”, line 68 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “”, line 228 in _call_with_frames_removed
File “”, line 1058 in _handle_fromlist
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/torchmetrics/__init__.py”, line 14 in
File “”, line 228 in _call_with_frames_removed
File “”, line 850 in exec_module
File “”, line 680 in _load_unlocked
File “”, line 986 in _find_and_load_unlocked
File “”, line 1007 in _find_and_load
File “/wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/pytorch_lightning/utilities/types.py”, line 25 in

test.sh: line 6: 9413 Segmentation fault python end_to_end_attention.py
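
For anyone who wants to produce a dump like the one above, here is a minimal sketch of how faulthandler is typically enabled. It assumes the handler is installed at the very top of the script, before the heavy imports; the commented-out import is only a placeholder for whatever the script actually loads.

```python
import faulthandler
import sys

# Install the handler first, so a crash during a later import is still caught.
# On a fatal signal such as SIGSEGV it writes the Python-level traceback of
# every thread to stderr.
faulthandler.enable(file=sys.stderr, all_threads=True)

# Heavy imports go below this point (placeholder, adjust to your script):
# import pytorch_lightning as pl
```

The same effect is available without touching the script by running it as python -X faulthandler end_to_end_attention.py.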

A good first step is to run the script under gdb and get a backtrace of the segfault: start it with gdb -ex run --args python3 foo.py, and when it reports “segfault”, type bt and capture the output.
This will at least show you which bits of C++ are involved.
Quite likely the culprit is not Lightning itself (which I think is pure Python) but some auxiliary library that Lightning loads and that is incompatible with your setup, or several such libraries that are incompatible with each other.
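
As a complement to gdb, one way to narrow down which library is involved is to import each suspect in a fresh interpreter and check how that process exits. The sketch below is only an illustration: the SUSPECTS list is taken from the module paths in the faulthandler dump, and the helper names try_import and describe are made up for this example. It assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import subprocess
import sys

# Modules that appear in the faulthandler dump, innermost first (assumption:
# adjust this list to whatever your own traceback shows).
SUSPECTS = [
    "google.protobuf.descriptor",
    "tensorflow",
    "huggingface_hub",
    "transformers",
    "torchmetrics",
    "pytorch_lightning",
]


def try_import(module: str, timeout: float = 300.0) -> int:
    """Import `module` in a separate interpreter and return its exit code."""
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode


def describe(code: int) -> str:
    """Turn a subprocess return code into a short human-readable status."""
    if code == 0:
        return "ok"
    if code < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return f"killed by signal {-code}"
    return f"failed with exit code {code}"


def main() -> None:
    for name in SUSPECTS:
        print(f"{name}: {describe(try_import(name))}")


if __name__ == "__main__":
    main()
```

The first module in the list that reports being killed by a signal is the innermost one that crashes on its own, which is a reasonable place to start looking.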

Best regards

Thomas

Thanks for the reply! I did what you suggested and the output is below. It’s somewhat beyond me, but do you maybe see what the issue is?

Program received signal SIGSEGV, Segmentation fault.
0x00007ffefda24695 in google::protobuf::python::AddEnumValues(_typeobject*, google::protobuf::EnumDescriptor const*) [clone .isra.0] [clone .constprop.0] ()
from /wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/google/protobuf/pyext/_message.cpython-39-x86_64-linux-gnu.so
Missing separate debuginfos, use: debuginfo-install glibc-2.17-325.el7_9.x86_64 nvidia-driver-branch-470-cuda-libs-470.82.01-1.el7.x86_64
(gdb) bt
#0 0x00007ffefda24695 in google::protobuf::python::AddEnumValues(_typeobject*, google::protobuf::EnumDescriptor const*) [clone .isra.0] [clone .constprop.0] ()
from /wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/google/protobuf/pyext/_message.cpython-39-x86_64-linux-gnu.so
#1 0x00007ffefda247f2 in google::protobuf::python::InitDescriptor() ()
from /wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/google/protobuf/pyext/_message.cpython-39-x86_64-linux-gnu.so
#2 0x00007ffefda34942 in google::protobuf::python::InitProto2MessageModule(_object*) ()
from /wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/google/protobuf/pyext/_message.cpython-39-x86_64-linux-gnu.so
#3 0x00007ffefda3c985 in PyInit__message ()
from /wynton/protected/home/ichs/dmandair/anaconda3/envs/ONC_EXP/lib/python3.9/site-packages/google/protobuf/pyext/_message.cpython-39-x86_64-linux-gnu.so
#4 0x00005555557b95d7 in _PyImport_LoadDynamicModuleWithSpec (fp=0x0, spec=0x7ffefda68610)
at /tmp/build/80754af9/python-split_1631797238431/work/Python/importdl.c:164
#5 _imp_create_dynamic_impl.isra.21 (file=, spec=0x7ffefda68610) at /tmp/build/80754af9/python-split_1631797238431/work/Python/import.c:2300
#6 _imp_create_dynamic () at /tmp/build/80754af9/python-split_1631797238431/work/Python/clinic/import.c.h:330
#7 0x00005555556aada1 in cfunction_vectorcall_FASTCALL () at /tmp/build/80754af9/python-split_1631797238431/work/Objects/methodobject.c:430
#8 0x000055555568c8f8 in PyVectorcall_Call (kwargs=0x1, tuple=0x7ffefda680d0, callable=0x7ffff7f7bb80)
at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:231
#9 _PyObject_Call () at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:266
#10 0x0000555555723b8e in PyObject_Call (kwargs=0x7ffefda65e00, args=0x7ffefda680d0, callable=0x7ffff7f7bb80)
at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:293
#11 do_call_core (kwdict=0x7ffefda65e00, callargs=0x7ffefda680d0, func=0x7ffff7f7bb80, tstate=)
at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5095
#12 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3580
#13 0x00005555556d68e2 in _PyEval_EvalFrame () at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
#14 _PyEval_EvalCode () at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4327
#15 0x00005555556d7527 in _PyFunction_Vectorcall () at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:396
#16 0x000055555564dfdc in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=, args=0x7ffefdf40fd0, callable=0x7ffff7f7f3a0, tstate=)
at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:118
#17 PyObject_Vectorcall () at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:127
#18 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x555555914b30)
at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5075
#19 _PyEval_EvalFrameDefault (tstate=, f=0x7ffefdf40e40, throwflag=)
at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3487
#20 0x00005555556d7753 in _PyEval_EvalFrame () at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
#21 function_code_fastcall (globals=, nargs=, args=, co=, tstate=0x555555914b30)
at /tmp/build/80754af9/python-split_1631797238431/work/Objects/call.c:330

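Seems like “something with protobuf” to me: the frames at the top of the backtrace, from PyInit__message up to AddEnumValues, are all inside protobuf’s compiled _message extension, so the crash happens while that extension is being imported.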

This is probably obvious to most people stumbling upon this in the future, but I’m going to add it anyway:
pytorch-lightning seems to be incompatible with protobuf v3.19.*. Everything seems to work with 3.18.1.
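
To check whether an environment has one of the protobuf versions mentioned above, a small sketch along these lines may help. It relies only on the standard library's importlib.metadata (Python 3.8+). The helper names installed_version and is_reported_bad are invented for this example, and the "3.19.*" rule simply encodes the observation from this thread rather than any official compatibility statement.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def is_reported_bad(version: str) -> bool:
    """True for protobuf 3.19.*, the range reported as crashing in this thread."""
    return version.split(".")[:2] == ["3", "19"]


if __name__ == "__main__":
    version = installed_version("protobuf")
    if version is None:
        print("protobuf is not installed")
    elif is_reported_bad(version):
        print(f"protobuf {version} is in the 3.19.* range reported as crashing")
    else:
        print(f"protobuf {version} is outside the reported 3.19.* range")
```

If the check flags the installed version, pinning the version reported as working here (pip install "protobuf==3.18.1") is one thing to try.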