PyTorch generates core dump files during training

Hi all,
We are running in a Docker container with CentOS 7, CUDA 11.7, and an RTX 3090 GPU. Our environment is:
python3.8
torch1.13.1+cu117
librosa==0.10.1

We are training a model with DistributedDataParallel.

Training runs, but from time to time a core dump file is generated locally. The training process keeps going after each core dump is written. We do not know why this happens. Has anybody run into this problem?

Here is the gdb output for one of the core files:

warning: core file may not match specified executable file.
[New LWP 46065]
[New LWP 45916]
[New LWP 46067]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Missing separate debuginfo for /opt/msxf/miniconda3/envs/py38/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/5f/4fb88af97be3ecacc71363136bb015b2a07119.debug
Missing separate debuginfo for /usr/lib64/libcuda.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/76/f6131fcbec6e7249b0eb49f89bb1de3c816f71.debug
Missing separate debuginfo for /opt/msxf/miniconda3/envs/py38/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libgfortran-040039e1.so.5.0.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/5b/be74eb6855e0a2c043c0bec2f484bf3e9f14c0.debug
Missing separate debuginfo for /opt/msxf/miniconda3/envs/py38/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libquadmath-96973f99.so.0.0.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/54/9b4c82347785459571c79239872ad31509dcf4.debug
Core was generated by `/opt/msxf/miniconda3/envs/py38/bin/python -c from multiprocessing.spawn import'.
Program terminated with signal 6, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
warning: File "/usr/lib64/libstdc++.so.6.0.25-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
add-auto-load-safe-path /usr/lib64/libstdc++.so.6.0.25-gdb.py
line to your configuration file "/home/Data/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/home/Data/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fc998f6a8e4 in __GI_abort () at abort.c:79
#2 0x00007fc978b85af3 in ?? () from /usr/lib64/libstdc++.so.6
#3 0x00007fc978b8bc76 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x00007fc978b8bcb1 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5 0x00007fc978b8b6d4 in __gxx_personality_v0 () from /usr/lib64/libstdc++.so.6
#6 0x00007fc978e8d6a8 in _Unwind_ForcedUnwind_Phase2 (exc=exc@entry=0x7fc8a2403d70, context=context@entry=0x7fc8a24017d0,
frames_p=frames_p@entry=0x7fc8a24016d8) at ../../../libgcc/unwind.inc:182
#7 0x00007fc978e8dc9c in _Unwind_ForcedUnwind (exc=0x7fc8a2403d70, stop=stop@entry=0x7fc999312760 <unwind_stop>,
stop_argument=) at ../../../libgcc/unwind.inc:217
#8 0x00007fc9993128d0 in __GI___pthread_unwind (buf=) at unwind.c:121
#9 0x00007fc99930ae05 in __do_cancel () at pthreadP.h:301
#10 __pthread_exit (value=) at pthread_exit.c:28
#11 0x000055b4ba089909 in PyThread_exit_thread ()
at /tmp/build/80754af9/python_1599203911753/work/Python/thread_pthread.h:357
#12 0x000055b4b9f1cd3a in exit_thread_if_finalizing (runtime=0x55b4ba2143a0 <_PyRuntime>, tstate=0x55b4c176c3d0)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:246
#13 PyEval_RestoreThread (tstate=0x55b4c176c3d0) at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:400
#14 0x00007fc97696f0d5 in THPStorage_shareFd(_object*, _object*) ()
from /opt/msxf/miniconda3/envs/py38/lib/python3.8/site-packages/torch/lib/libtorch_python.so
#15 0x000055b4ba010e6a in method_vectorcall_NOARGS ()
at /tmp/build/80754af9/python_1599203911753/work/Objects/descrobject.c:393
#16 0x000055b4b9fa175e in _PyObject_Vectorcall (kwnames=0x0, nargsf=, args=0x7fc91c1683d0,
callable=0x7fc997f806d0) at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#17 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x55b4c176c3d0)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:4963
#18 _PyEval_EvalFrameDefault (f=, throwflag=)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:3486
#19 0x000055b4ba02c86b in function_code_fastcall (globals=, nargs=1, args=, co=)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:283
#20 _PyFunction_Vectorcall.localalias.355 () at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:410
#21 0x000055b4b9f1f2d6 in _PyObject_Vectorcall (kwnames=0x0, nargsf=1, args=0x7fc8a2401c10, callable=0x7fc8a6d2a280)
at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#22 _PyObject_FastCall () at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:147
#23 object_vacall (base=0x0, callable=0x7fc8a6d2a280, vargs=0x7fc8a2401c70)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:1186
#24 0x000055b4b9fdee1e in PyObject_CallFunctionObjArgs (callable=)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:1259
#25 0x00007fc99804cdd2 in _Pickle_FastCall (func=, obj=0x7fc91c17a9c0)
at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:362
#26 0x00007fc998044629 in save () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4409
#27 0x00007fc998042eec in store_tuple_elements (len=, t=, self=)
at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:2760
#28 save_tuple (obj=0x7fc91c4dfdc0, self=0x7fc91c1b7dc0) at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:2811
#29 save () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4339
#30 0x00007fc99804629b in save_reduce () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4194
#31 0x00007fc998043698 in save () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4472
#32 0x00007fc998042eec in store_tuple_elements (len=, t=, self=)
at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:2760
#33 save_tuple (obj=0x7fc91c16d840, self=0x7fc91c1b7dc0) at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:2811
#34 save () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4339
#35 0x00007fc99804629b in save_reduce () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4194
#36 0x00007fc998043698 in save () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4472
#37 0x00007fc99803c09c in store_tuple_elements (len=, t=, self=)
at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:2760
#38 save_tuple (obj=0x7fc91c42b6a0, self=0x7fc91c1b7dc0) at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:2838
#39 save (self=0x7fc91c1b7dc0, obj=0x7fc91c42b6a0, pers_save=0) at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4339
#40 0x00007fc998042eec in store_tuple_elements (len=, t=, self=)
at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:2760
#41 save_tuple (obj=0x7fc91c16da80, self=0x7fc91c1b7dc0) at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:2811
#42 save () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4339
#43 0x00007fc9980453fd in dump () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4519
#44 0x00007fc9980457b2 in _pickle_Pickler_dump () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:4590
#45 0x000055b4ba0111cd in method_vectorcall_O () at /tmp/build/80754af9/python_1599203911753/work/Objects/descrobject.c:416
#46 0x000055b4b9fa175e in _PyObject_Vectorcall (kwnames=0x0, nargsf=, args=0x7fc91c161fd0,
callable=0x7fc9980e3ae0) at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#47 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x55b4c176c3d0)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:4963
#48 _PyEval_EvalFrameDefault (f=, throwflag=)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:3486
#49 0x000055b4ba02ba92 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:4298
#50 0x000055b4ba02cd20 in _PyFunction_Vectorcall (kwnames=, nargsf=, stack=0x7fc81c001198,
func=0x7fc997fc1c10) at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:435
#51 _PyObject_Vectorcall (kwnames=, nargsf=, args=0x7fc81c001198, callable=0x7fc997fc1c10)
at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#52 method_vectorcall () at /tmp/build/80754af9/python_1599203911753/work/Objects/classobject.c:60
#53 0x000055b4b9fa177f in _PyObject_Vectorcall (kwnames=0x0, nargsf=, args=0x7fc81c0011a0,
callable=0x7fc8235f0440) at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#54 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x55b4c176c3d0)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:4963
#55 _PyEval_EvalFrameDefault (f=, throwflag=)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:3469
#56 0x000055b4ba02c86b in function_code_fastcall (globals=, nargs=8, args=, co=)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:283
#57 _PyFunction_Vectorcall.localalias.355 () at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:410
#58 0x000055b4b9fde041 in PyVectorcall_Call () at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:199
#59 0x000055b4b9fde1ae in PyObject_Call () at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:227
#60 0x000055b4ba06399b in do_call_core (kwdict=0x7fc91c1728c0, callargs=0x7fc91c427270, func=0x7fc8259ece50,
tstate=) at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:5010
#61 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:3559
#62 0x000055b4ba02c86b in function_code_fastcall (globals=, nargs=1, args=, co=)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:283
#63 _PyFunction_Vectorcall.localalias.355 () at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:410
#64 0x000055b4b9fa175e in _PyObject_Vectorcall (kwnames=0x0, nargsf=, args=0x7fc8a5963bb8,
callable=0x7fc9980f6c10) at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#65 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x55b4c176c3d0)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:4963
#66 _PyEval_EvalFrameDefault (f=, throwflag=)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:3486
#67 0x000055b4ba02c86b in function_code_fastcall (globals=, nargs=1, args=, co=)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:283
#68 _PyFunction_Vectorcall.localalias.355 () at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:410
#69 0x000055b4b9fa175e in _PyObject_Vectorcall (kwnames=0x0, nargsf=, args=0x7fc91c161c38,
callable=0x7fc9980f6ee0) at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#70 call_function (kwnames=0x0, oparg=, pp_stack=, tstate=0x55b4c176c3d0)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:4963
#71 _PyEval_EvalFrameDefault (f=, throwflag=)
at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:3486
#72 0x000055b4ba02c86b in function_code_fastcall (globals=, nargs=1, args=, co=)
at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:283
#73 _PyFunction_Vectorcall.localalias.355 () at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:410
#74 0x000055b4ba02cee7 in _PyObject_Vectorcall (kwnames=0x0, nargsf=1, args=0x7fc8a2402e08, callable=0x7fc9980f6ca0)
at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#75 method_vectorcall () at /tmp/build/80754af9/python_1599203911753/work/Objects/classobject.c:67
#76 0x000055b4b9fde041 in PyVectorcall_Call () at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:199
#77 0x000055b4b9fde1ae in PyObject_Call () at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:227
#78 0x000055b4ba0d68be in t_bootstrap () at /tmp/build/80754af9/python_1599203911753/work/Modules/_threadmodule.c:1002
#79 0x000055b4ba083708 in pythread_wrapper (arg=)
at /tmp/build/80754af9/python_1599203911753/work/Python/thread_pthread.h:232
#80 0x00007fc999309be0 in start_thread (arg=) at pthread_create.c:486
#81 0x00007fc99903cf5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
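
The backtrace above only shows C-level frames, so it does not tell us which Python code the aborting thread was running. One thing we could try is Python's standard faulthandler module, which writes the Python traceback of all threads when the process receives a fatal signal such as SIGABRT (signal 6 here). Below is a minimal sketch of that idea, not something we have verified against this crash. The helper name enable_abort_traceback and the log directory are hypothetical, and since the core was generated by a multiprocessing.spawn child, the helper would have to be called inside each spawned process. Whether faulthandler can still produce a useful traceback this late in the life of the process is also an assumption.

```python
import faulthandler
import os
import tempfile

# Keep the log file object alive: faulthandler requires the file to stay open
# for as long as the handler is enabled.
_fault_log = None


def enable_abort_traceback(log_dir):
    """Enable faulthandler for the current process and return the log path.

    Hypothetical helper: writes to one log file per process ID so that
    output from different spawned processes does not get mixed together.
    """
    global _fault_log
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"faulthandler-{os.getpid()}.log")
    _fault_log = open(path, "w")
    # all_threads=True asks for the traceback of every thread, not just the
    # one that received the signal.
    faulthandler.enable(file=_fault_log, all_threads=True)
    return path


if __name__ == "__main__":
    # The directory below is only an example location.
    log_path = enable_abort_traceback(os.path.join(tempfile.gettempdir(), "fault_logs"))
    print("faulthandler enabled:", faulthandler.is_enabled(), "log:", log_path)
```

Because each log file name contains the process ID, it could be matched against the PID in the core file name, if the system's core pattern includes one.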