Segmentation fault (SIGSEGV) on ROCm 5.3.0, Python 3.10

I want to get PyTorch working for InvokeAI, but after digging in I found that PyTorch itself crashes with a SIGSEGV.

Within the InvokeAI stack, the segfault occurs when a model is moved to the GPU with model.to, similar to what I get with the minimal example below.

Unfortunately, with InvokeAI’s requirements, downgrading Python isn’t possible.

As the pip freeze output below shows, I’ve also tried the PyTorch nightlies. I’m now checking whether I can fix this myself by compiling and installing PyTorch from source. Is there a flag I’m missing? Is anyone else seeing this?

$ cat test_pytorch.py
import torch
assert torch.cuda.is_available() and torch.version.hip
print(f"Running with device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
$
$
$ python test_pytorch.py
/home/user/invokeai/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:521: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Running with device: Radeon RX 580 Series
[1]    1490668 segmentation fault (core dumped)  python test_pytorch.py

Note that torch finds and identifies my GPU, but then segfaults on the first tensor allocation on the device.
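
Because the crash kills the whole interpreter, it can help to run the failing allocation in a child process and inspect how it exited. The following is a minimal sketch, not part of the original report: `run_isolated` and `GPU_SNIPPET` are hypothetical names, and the snippet assumes torch is importable from the same interpreter.

```python
import signal
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 120.0) -> str:
    """Run a Python snippet in a child interpreter and describe how it exited."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with code {proc.returncode}"


# Same allocation as in test_pytorch.py (assumes torch is installed in this environment).
GPU_SNIPPET = "import torch; torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')"
```

Calling `print(run_isolated(GPU_SNIPPET))` from the same virtual environment would report the signal name instead of taking the parent process down with it, which makes it easier to compare wheels or environment settings in a loop.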

Additional info:

$ python --version
Python 3.10.0
$ pip freeze | grep torch
clip-anytorch==2.5.0
pytorch-lightning==1.9.0
torch==2.0.0.dev20230207+rocm5.3
torch-fidelity==0.3.0
torchaudio==2.0.0.dev20230208+rocm5.3
torchdiffeq==0.2.3
torchmetrics==0.11.1
torchsde==0.2.5
torchvision==0.15.0.dev20230208+rocm5.3
$ cat .env # which I source
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export AMDGPU_TARGETS="gfx1030"
export ROCM_VERSION=5.4.3
export ROCM_HOME=/opt/rocm-${ROCM_VERSION}
export LD_LIBRARY_PATH=${ROCM_HOME}/lib
export CUDA_VISIBLE_DEVICES=0
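
The pip freeze output lists +rocm5.3 wheels, while .env points ROCM_HOME at a 5.4.3 install. Whether that difference contributes to the crash is not established here. The sketch below only makes the comparison explicit; the helper names `wheel_rocm_version` and `same_major_minor` are hypothetical, and the two version strings are copied from the output above.

```python
import re
from typing import Optional


def wheel_rocm_version(torch_version: str) -> Optional[str]:
    """Extract the ROCm tag from a wheel version such as '2.0.0.dev20230207+rocm5.3'."""
    match = re.search(r"\+rocm(\d+(?:\.\d+)*)", torch_version)
    return match.group(1) if match else None


def same_major_minor(a: str, b: str) -> bool:
    """Compare only the first two version components, e.g. '5.3' against '5.4.3'."""
    return a.split(".")[:2] == b.split(".")[:2]


wheel = wheel_rocm_version("2.0.0.dev20230207+rocm5.3")  # torch entry from pip freeze
env = "5.4.3"                                             # ROCM_VERSION from .env
print(wheel, env, same_major_minor(wheel, env))
```

In a live session the first argument could come from `torch.__version__` and the second from `os.environ["ROCM_VERSION"]` instead of the hard-coded strings.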

This is the gdb stack trace:

GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Reading symbols from python3...
(gdb) r ./test_pytorch.py
Starting program: /home/user/invokeai/.venv/bin/python3 ./test_pytorch.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffe1d7ff640 (LWP 1497576)]
[New Thread 0x7ffe1affe640 (LWP 1497577)]
[New Thread 0x7ffe187fd640 (LWP 1497578)]
[New Thread 0x7ffe17ffc640 (LWP 1497579)]
[New Thread 0x7ffe137fb640 (LWP 1497580)]
[New Thread 0x7ffe10ffa640 (LWP 1497581)]
[New Thread 0x7ffe107f9640 (LWP 1497582)]
[New Thread 0x7ffe09a7e640 (LWP 1497620)]
[New Thread 0x7ffe0927d640 (LWP 1497621)]
[Thread 0x7ffe0927d640 (LWP 1497621) exited]
/home/user/invokeai/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:521: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fff5b0df266 in hsaKmtDestroyQueue () from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
(gdb) where
#0  0x00007fff5b0df266 in hsaKmtDestroyQueue ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#1  0x00007fff5b0df750 in hsaKmtCreateQueue ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#2  0x00007fff5b034b4c in rocr::AMD::AqlQueue::AqlQueue(rocr::AMD::GpuAgent*, unsigned long, unsigned int, rocr::AMD::ScratchCache::ScratchInfo&, void (*)(hsa_status_t, hsa_queue_s*, void*), void*, bool) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#3  0x00007fff5b028036 in rocr::AMD::GpuAgent::QueueCreate(unsigned long, unsigned int, void (*)(hsa_status_t, hsa_queue_s*, void*), void*, unsigned int, unsigned int, rocr::core::Queue**) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#4  0x00007fff5b02922f in rocr::AMD::GpuAgent::CreateInterceptibleQueue(void (*)(hsa_status_t, hsa_queue_s*, void*), void*) () from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#5  0x00007fff5b029277 in std::_Function_handler<rocr::core::Queue* (), rocr::AMD::GpuAgent::InitDma()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#6  0x00007fff5b02821e in rocr::AMD::GpuAgent::QueueCreate(unsigned long, unsigned int, void (*)(hsa_status_t, hsa_queue_s*, void*), void*, unsigned int, unsigned int, rocr::core::Queue**) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#7  0x00007fff5b03fede in rocr::HSA::hsa_queue_create(hsa_agent_s, unsigned int, unsigned int, void (*)(hsa_status_t, hsa_queue_s*, void*), void*, unsigned int, unsigned int, hsa_queue_s**) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#8  0x00007fff886154da in roctracer::hsa_support::hsa_queue_create_callback(hsa_agent_s, unsigned int, unsigned int, void (*)(hsa_status_t, hsa_queue_s*, void*), void*, unsigned int, unsigned int, hsa_queue_s**) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libroctracer64.so
#9  0x00007fffafeb3fe7 in roc::Device::acquireQueue(unsigned int, bool, std::vector<unsigned int, std::allocator<unsigned int> > const&, amd::CommandQueue::Priority) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#10 0x00007fffafec9ef9 in roc::VirtualGPU::create() ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#11 0x00007fffafeafef3 in roc::Device::createVirtualDevice(amd::CommandQueue*) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#12 0x00007fffafe9f280 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority, std::vector<unsigned int, std::allocator<unsigned int> > const&) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#13 0x00007fffafdfd481 in hip::Stream::Create() ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#14 0x00007fffafdfd790 in hip::Stream::asHostQueue(bool) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#15 0x00007fffafc9ae9e in hip::Device::NullStream(bool) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#16 0x00007fffafd6b680 in hipMemcpyWithStream ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#17 0x00007fffb2498df3 in at::native::copy_kernel_cuda(at::TensorIterator&, bool) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#18 0x00007fffdda0eec1 in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007fffdda104d1 in at::native::copy_(at::Tensor&, at::Tensor const&, bool) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007fffde5edea6 in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007fffddcf99d8 in at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007fffde91c35a in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007fffde19bafd in at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) () from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007fffde776305 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>), &at::(anonymous namespace)::_to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007fffde2156e2 in at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007fffddcf1819 in at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) () from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007fffdeab1982 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd_device_to>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#28 0x00007fffde374c12 in at::_ops::to_device::call(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#29 0x00007ffff5b20adb in torch::utils::(anonymous namespace)::internal_new_from_data(c10::TensorOptions, c10::ScalarType, c10::optional<c10::Device>, _object*, bool, bool, bool, bool) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so
#30 0x00007ffff5b256c3 in torch::utils::tensor_ctor(c10::DispatchKey, c10::ScalarType, torch::PythonArgs&) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so
#31 0x00007ffff57b41f7 in torch::autograd::THPVariable_tensor(_object*, _object*, _object*) ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so
#32 0x00005555557364eb in cfunction_call (func=0x7ffff71aa250, args=<optimized out>, kwargs=<optimized out>)
    at Objects/methodobject.c:543
#33 0x000055555567d6b6 in _PyObject_MakeTpCall (tstate=0x555555934c00, callable=callable@entry=0x7ffff71aa250, 
    args=args@entry=0x7ffff7621ba8, nargs=<optimized out>, keywords=<optimized out>, keywords@entry=0x7ffff75a8e00)
    at Objects/call.c:215
#34 0x00005555556ea059 in _PyObject_VectorcallTstate (kwnames=0x7ffff75a8e00, nargsf=<optimized out>, 
    args=<optimized out>, callable=0x7ffff71aa250, tstate=<optimized out>) at ./Include/cpython/abstract.h:112
#35 _PyObject_VectorcallTstate (kwnames=0x7ffff75a8e00, nargsf=<optimized out>, args=0x7ffff7621ba8, 
    callable=0x7ffff71aa250, tstate=<optimized out>) at ./Include/cpython/abstract.h:99
#36 PyObject_Vectorcall (kwnames=0x7ffff75a8e00, nargsf=<optimized out>, args=0x7ffff7621ba8, 
    callable=0x7ffff71aa250) at ./Include/cpython/abstract.h:123
#37 call_function (kwnames=0x7ffff75a8e00, oparg=<optimized out>, pp_stack=<synthetic pointer>, 
    trace_info=0x7fffffffca60, tstate=<optimized out>) at Python/ceval.c:5888
#38 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:4239
#39 0x00005555556e2bf4 in _PyEval_EvalFrame (throwflag=0, f=0x7ffff7621a40, tstate=0x555555934c00)
    at ./Include/internal/pycore_ceval.h:46
#40 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=0, kwnames=0x0) at Python/ceval.c:5073
#41 0x000055555577a01f in PyEval_EvalCode (co=co@entry=0x7ffff7537470, globals=globals@entry=0x7ffff753a640, 
    locals=locals@entry=0x7ffff753a640) at Python/ceval.c:1134
#42 0x000055555578ebb9 in run_eval_code_obj (tstate=0x555555934c00, co=0x7ffff7537470, globals=0x7ffff753a640, 
    locals=0x7ffff753a640) at Python/pythonrun.c:1289
#43 0x000055555578eb44 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ffff753a640, 
    locals=0x7ffff753a640, flags=<optimized out>, arena=<optimized out>) at Python/pythonrun.c:1310
#44 0x0000555555609cc3 in pyrun_file (fp=fp@entry=0x5555559397f0, filename=filename@entry=0x7ffff7405470, 
    start=start@entry=257, globals=globals@entry=0x7ffff753a640, locals=locals@entry=0x7ffff753a640, 
    closeit=closeit@entry=1, flags=0x7fffffffcdc8) at Python/pythonrun.c:1206
#45 0x0000555555609aa5 in _PyRun_SimpleFileObject (fp=fp@entry=0x5555559397f0, 
    filename=filename@entry=0x7ffff7405470, closeit=closeit@entry=1, flags=flags@entry=0x7fffffffcdc8)
    at Python/pythonrun.c:455
#46 0x0000555555609d6b in _PyRun_AnyFileObject (fp=fp@entry=0x5555559397f0, filename=filename@entry=0x7ffff7405470, 
    closeit=closeit@entry=1, flags=flags@entry=0x7fffffffcdc8) at Python/pythonrun.c:89
#47 0x0000555555752e68 in pymain_run_file_obj (skip_source_first_line=<optimized out>, filename=0x7ffff7405470, 
    program_name=0x7ffff74054d0) at Modules/main.c:353
#48 pymain_run_file (config=0x5555559190d0) at Modules/main.c:372
#49 pymain_run_python (exitcode=0x7fffffffcdc0) at Modules/main.c:587
#50 Py_RunMain () at Modules/main.c:666
#51 0x000055555575291d in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
#52 0x00007ffff7c96d90 in __libc_start_call_main (main=main@entry=0x555555679150 <main>, argc=argc@entry=2, 
    argv=argv@entry=0x7fffffffcff8) at ../sysdeps/nptl/libc_start_call_main.h:58
#53 0x00007ffff7c96e40 in __libc_start_main_impl (main=0x555555679150 <main>, argc=2, argv=0x7fffffffcff8, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffcfe8)
    at ../csu/libc-start.c:392
#54 0x000055555575281e in _start ()
(gdb) q
A debugging session is active.

	Inferior 1 [process 1497490] will be killed.

Quit anyway? (y or n) y
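
A backtrace this long is easier to read when summarised by shared library. The following is a rough sketch that counts the "from <path>.so" entries in gdb output; `libraries_in_backtrace` is a hypothetical helper, and the sample is three frames copied from the trace above.

```python
import re
from collections import Counter

# Matches the "from /path/to/libfoo.so" part that gdb prints for each frame.
FRAME_LIB = re.compile(r"from (\S+\.so[\w.]*)")


def libraries_in_backtrace(text: str) -> Counter:
    """Count how many frames in a gdb backtrace come from each shared library."""
    return Counter(path.rsplit("/", 1)[-1] for path in FRAME_LIB.findall(text))


sample = """\
#0  0x00007fff5b0df266 in hsaKmtDestroyQueue ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#1  0x00007fff5b0df750 in hsaKmtCreateQueue ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libhsa-runtime64.so
#16 0x00007fffafd6b680 in hipMemcpyWithStream ()
   from /home/user/invokeai/.venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
"""
print(libraries_in_backtrace(sample).most_common())
```

On the full trace this would show that the innermost frames come from the libhsa-runtime64.so and libamdhip64.so copies inside the torch/lib directory of the virtual environment.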