Segfault during torch.save

I'm getting a segmentation fault during a call to torch.save that writes a tensor to a BytesIO() object.

[3]:Fatal Python error: Segmentation fault
[3]:
[3]:Current thread 0x00007fbfdad57740 (most recent call first):
[3]:  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 866 in _save
[3]:  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 632 in save
[3]:  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/filesystem.py", line 262 in _write_item

core dump:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fbfdadf29fc in pthread_kill () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fbfdadf29fc in pthread_kill () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fbfdad9e476 in raise () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  <signal handler called>
#3  0x00007fbfc58f96b4 in crc32_16bytes(void const*, unsigned long, unsigned int) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#4  0x00007fbfc58f9b21 in mz_crc32 () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#5  0x00007fbfc58eb76e in mz_zip_writer_add_mem_ex_v2 () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#6  0x00007fbfc58f59f7 in caffe2::serialize::PyTorchStreamWriter::writeRecord(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*, unsigned long, bool) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#7  0x00007fbfce496a3f in pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(caffe2::serialize::PyTorchStreamWriter&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::Storage, unsigned long)#202}, void, caffe2::serialize::PyTorchStreamWriter&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::Storage, unsigned long, pybind11::name, pybind11::is_method, pybind11::sibling>(torch::jit::initJITBindings(_object*)::{lambda(caffe2::serialize::PyTorchStreamWriter&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::Storage, unsigned long)#202}&&, void (*)(caffe2::serialize::PyTorchStreamWriter&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::Storage, unsigned long), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#8  0x00007fbfce0539c7 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#9  0x0000557a21a8310e in ?? ()
#10 0x0000557a21a79a7b in _PyObject_MakeTpCall ()
#11 0x0000557a21a91acb in ?? ()
#12 0x0000557a21a71cfa in _PyEval_EvalFrameDefault ()
#13 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#14 0x0000557a21a6c26d in _PyEval_EvalFrameDefault ()
#15 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#16 0x0000557a21a71cfa in _PyEval_EvalFrameDefault ()
#17 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#18 0x0000557a21a6c26d in _PyEval_EvalFrameDefault ()
#19 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#20 0x0000557a21a6e5d7 in _PyEval_EvalFrameDefault ()
#21 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#22 0x0000557a21a6e5d7 in _PyEval_EvalFrameDefault ()
#23 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#24 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#25 0x0000557a21a917f1 in ?? ()
#26 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#27 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#28 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#29 0x0000557a21a78c14 in _PyObject_FastCallDictTstate ()
#30 0x0000557a21a8da64 in ?? ()
#31 0x0000557a21a79a1c in _PyObject_MakeTpCall ()
#32 0x0000557a21a72096 in _PyEval_EvalFrameDefault ()
#33 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#34 0x0000557a21a71cfa in _PyEval_EvalFrameDefault ()
#35 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#36 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#37 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#38 0x0000557a21a6e5d7 in _PyEval_EvalFrameDefault ()
#39 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#40 0x0000557a21a6e5d7 in _PyEval_EvalFrameDefault ()
#41 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#42 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#43 0x0000557a21a917f1 in ?? ()
#44 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#45 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#46 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#47 0x0000557a21a78c14 in _PyObject_FastCallDictTstate ()
#48 0x0000557a21a8da64 in ?? ()
#49 0x0000557a21a79a1c in _PyObject_MakeTpCall ()
#50 0x0000557a21a72096 in _PyEval_EvalFrameDefault ()
#51 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#52 0x0000557a21a71cfa in _PyEval_EvalFrameDefault ()
#53 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#54 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#55 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#56 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#57 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#58 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#59 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#60 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#61 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#62 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#63 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#64 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#65 0x0000557a21a917f1 in ?? ()
#66 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#67 0x0000557a21a917f1 in ?? ()
#68 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#69 0x0000557a21a917f1 in ?? ()
#70 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#71 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#72 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#73 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#74 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#75 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#76 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#77 0x0000557a21a9193e in ?? ()
#78 0x0000557a21a6e5d7 in _PyEval_EvalFrameDefault ()
#79 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#80 0x0000557a21a71cfa in _PyEval_EvalFrameDefault ()
#81 0x0000557a21a917f1 in ?? ()
#82 0x0000557a21a71cfa in _PyEval_EvalFrameDefault ()
#83 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#84 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#85 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#86 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#87 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#88 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#89 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#90 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#91 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#92 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#93 0x0000557a21a917f1 in ?? ()
#94 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#95 0x0000557a21a9193e in ?? ()
#96 0x0000557a21a6e5d7 in _PyEval_EvalFrameDefault ()
#97 0x0000557a21a917f1 in ?? ()
#98 0x0000557a21a92492 in PyObject_Call ()
#99 0x0000557a21a6e5d7 in _PyEval_EvalFrameDefault ()
#100 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#101 0x0000557a21a71cfa in _PyEval_EvalFrameDefault ()
#102 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#103 0x0000557a21a6c45c in _PyEval_EvalFrameDefault ()
#104 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#105 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#106 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#107 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#108 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#109 0x0000557a21a6e5d7 in _PyEval_EvalFrameDefault ()
#110 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#111 0x0000557a21a6c26d in _PyEval_EvalFrameDefault ()
#112 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#113 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#114 0x0000557a21a917f1 in ?? ()
#115 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#116 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#117 0x0000557a21a6c26d in _PyEval_EvalFrameDefault ()
#118 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#119 0x0000557a21a6c26d in _PyEval_EvalFrameDefault ()
#120 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#121 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#122 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#123 0x0000557a21a6d53c in _PyEval_EvalFrameDefault ()
#124 0x0000557a21a839fc in _PyFunction_Vectorcall ()
#125 0x0000557a21a6c26d in _PyEval_EvalFrameDefault ()
#126 0x0000557a21a689c6 in ?? ()
#127 0x0000557a21b5e256 in PyEval_EvalCode ()
#128 0x0000557a21b89108 in ?? ()
#129 0x0000557a21b829cb in ?? ()
#130 0x0000557a21b88e55 in ?? ()
#131 0x0000557a21b88338 in _PyRun_SimpleFileObject ()
#132 0x0000557a21b87f83 in _PyRun_AnyFileObject ()
#133 0x0000557a21b7aa5e in Py_RunMain ()
#134 0x0000557a21b5102d in Py_BytesMain ()
#135 0x00007fbfdad85d90 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#136 0x00007fbfdad85e40 in __libc_start_main () from /usr/lib/x86_64-linux-gnu/libc.so.6

At a high level, the training job is launched with elastic launch. Each process then starts child processes using the fork multiprocessing context, and each child attempts to write items to an io.BytesIO.

For the majority of processes (>90%) the write succeeds, but in the remainder a segfault occurs and the process can never be joined.

ctx = mp.get_context("fork")
p_list.append(
    ctx.Process(
        target=write_preloaded_data,
        args=(write_bucket, local_results_queue, count_queue, s3_client ...),
    )
)
for p in p_list:
    p.start()
...
p.join()


def write_preloaded_data(...):
    with BytesIO() as stream:
        for write_item, data in bytes_data:
            local_results.append(_write_item(stream, data, write_item, storage_key))
        for write_item, tensor in tensor_data:
            assert tensor.is_cpu
            # The segfault occurs in the call below.
            local_results.append(_write_item(stream, tensor, write_item, storage_key))

_write_item comes from torch/distributed/checkpoint/filesystem.py in the pytorch/pytorch GitHub repository at v2.4.0.

Any guidance or tips on how to further debug would be appreciated.

pip show torch
Name: torch
Version: 2.4.0a0+3bcc3cddb5.nv24.7

Resolved. The segfault was due to the tensor being garbage collected before the write finished.
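
The exact change isn't shown in this thread, so the following is only a sketch of one way to guard against that kind of lifetime problem, not necessarily the fix that was applied. It assumes it is acceptable to pay for an extra copy: the tensor is cloned into storage owned by the writing process, and that copy stays referenced until torch.save returns. The helper name save_with_owned_copy is hypothetical.

```python
import io

import torch


def save_with_owned_copy(tensor: torch.Tensor) -> bytes:
    """Serialize a private CPU copy of `tensor` and return the raw bytes.

    Hypothetical helper for illustration; not part of torch.
    """
    # clone() allocates new storage owned by this process, so the memory being
    # read during the write does not depend on the original tensor's lifetime.
    owned = tensor.detach().cpu().clone()
    with io.BytesIO() as stream:
        torch.save(owned, stream)
        # `owned` is still referenced here, i.e. until after torch.save returns.
        return stream.getvalue()


if __name__ == "__main__":
    original = torch.arange(6, dtype=torch.float32).reshape(2, 3)
    payload = save_with_owned_copy(original)
    restored = torch.load(io.BytesIO(payload))
    print(torch.equal(original, restored))
```

The clone doubles peak memory for each tensor while it is being written. If that is too expensive, the alternative is to make sure whatever owns the original tensor keeps a reference to it until the writer process has been joined.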

How did you narrow it down and how did you fix it?