Script(s) produce Segmentation Fault (Core Dumped) randomly

I have been getting random segmentation faults for a while now. They happen across all my PyTorch scripts and at unpredictable times: sometimes within a few minutes of starting, sometimes only after a couple of hours. I have not been able to reproduce the crash deliberately. Here’s some relevant system information:

  • Intel i9-7920X
  • 4x NVIDIA Titan Xp (CUDA 12.6, driver version 560.28.03)
  • torch 2.4.0+cu124
  • Ubuntu 22.04.4 LTS

gdb stacktrace:

Thread 1 "pt_main_thread" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00000000004e0800 in _Py_Dealloc (op=<optimized out>) at /usr/local/src/conda/python-3.8.19/Objects/object.c:2215
#2  _Py_DECREF (op=<optimized out>, lineno=430, filename=0x5f56e0 "/croot/python-split_1710964293155/work/Objects/frameobject.c")
    at /usr/local/src/conda/python-3.8.19/Include/object.h:478
#3  frame_dealloc (f=0x19b2f510) at /usr/local/src/conda/python-3.8.19/Objects/frameobject.c:430
#4  0x00000000004d6f77 in _Py_Dealloc (op=0x19b2f510) at /usr/local/src/conda/python-3.8.19/Objects/object.c:2215
#5  _Py_DECREF (op=0x19b2f510, lineno=4314, filename=0x5f8e50 "/croot/python-split_1710964293155/work/Python/ceval.c") at /usr/local/src/conda/python-3.8.19/Include/object.h:478
#6  _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, 
    kwargs=0x7fffffffa790, kwcount=<optimized out>, kwstep=1, defs=0x0, defcount=<optimized out>, kwdefs=0x0, closure=0x0, name=0x7fffa75498b0, qualname=0x7fffa754c530)
    at /usr/local/src/conda/python-3.8.19/Python/ceval.c:4314
#7  0x00000000004e80cc in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffffffa780, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.8.19/Objects/call.c:436
#8  0x00000000004f4ff4 in _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>)

Let me know if I should provide anything else.