I was able to identify the cause of the IndexError using gdb, so I’m sharing the steps here for reference.
1. Run the script directly without mpirun
OMP_NUM_THREADS=1 python3 test_fsdp2_mpi.py --backend mpi
This produced the same IndexError: map::at as when using mpirun.
2. Launch the script under gdb
OMP_NUM_THREADS=1 gdb --args python3 test_fsdp2_mpi.py --backend mpi
gdb launched correctly, but it did not stop on the IndexError thrown from the C++ code. The exception likely did not propagate back to the Python main thread in a way that gdb catches by default, and it may have been masked or wrapped by the Python runtime or the pybind11 interface.
3. Set a catchpoint for all C++ exceptions
(gdb) catch throw
Catchpoint 1 (throw)
This instructs gdb to break whenever any C++ exception is thrown, regardless of whether it is propagated to Python.
As a result, the catchpoint triggered when the exception was thrown:
Thread 4 "python3" hit Catchpoint 1 (exception thrown), 0x00007f308cd914a1 in __cxa_throw()
from /lib/x86_64-linux-gnu/libstdc++.so.6
Then, running bt (backtrace) gave the following:
(gdb) bt
#0 __cxa_throw
#1 std::__throw_out_of_range
#2 std::map::at (c10d::ReduceOp::AVG)
#3 c10d::ProcessGroupMPI::_reduce_scatter_base(...)
#4 c10d::ProcessGroupMPI::runLoop()
...
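Reading the backtrace, the exception appears to come from a std::map::at lookup keyed on c10d::ReduceOp::AVG inside ProcessGroupMPI::_reduce_scatter_base, which would be consistent with pybind11's documented translation of std::out_of_range into Python's IndexError. The following is only a rough Python analogy of that failure mode, not the actual PyTorch code: the enum, the table contents, and lookup_mpi_op are hypothetical stand-ins, and the assumption that the table simply has no AVG entry is inferred from the backtrace.

```python
from enum import Enum


class ReduceOp(Enum):
    """Hypothetical stand-in for c10d::ReduceOp (values are arbitrary)."""
    SUM = 0
    AVG = 1
    PRODUCT = 2
    MIN = 3
    MAX = 4


# Hypothetical lookup table standing in for a C++ std::map.
# Assumption: there is no entry for AVG, as the backtrace suggests.
MPI_OP = {
    ReduceOp.SUM: "MPI_SUM",
    ReduceOp.PRODUCT: "MPI_PROD",
    ReduceOp.MIN: "MPI_MIN",
    ReduceOp.MAX: "MPI_MAX",
}


def lookup_mpi_op(op):
    """Mimic std::map::at: a missing key raises instead of returning a default."""
    try:
        return MPI_OP[op]
    except KeyError:
        # std::map::at throws std::out_of_range, which surfaces in Python
        # as IndexError; the message mirrors the one seen in the report.
        raise IndexError("map::at") from None


if __name__ == "__main__":
    print(lookup_mpi_op(ReduceOp.SUM))
    try:
        lookup_mpi_op(ReduceOp.AVG)
    except IndexError as exc:
        print(f"IndexError: {exc}")
```

If this reading is right, any collective that reaches the MPI backend with ReduceOp.AVG would fail the same way, regardless of tensor contents.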
Note: I also tried using import faulthandler; faulthandler.disable() and import signal; signal.signal(signal.SIGSEGV, signal.SIG_DFL), but neither had any effect on catching the exception.
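For anyone who wants to repeat the gdb steps above non-interactively, here is a minimal sketch that assembles an equivalent gdb command line using the standard -batch, -ex, and --args options. The helper name build_gdb_command is hypothetical. One caveat: in batch mode gdb stops at the first exception thrown, which may not be the one of interest if other C++ exceptions are thrown earlier during startup.

```python
import shlex


def build_gdb_command(script, script_args=()):
    """Build a gdb invocation that breaks on the first C++ throw and prints a backtrace.

    Hypothetical helper: it only assembles the argument list, it does not run gdb.
    """
    return [
        "gdb", "-batch",          # exit after the -ex commands have run
        "-ex", "catch throw",     # break whenever a C++ exception is thrown
        "-ex", "run",             # start the program
        "-ex", "bt",              # print the backtrace at the catchpoint
        "--args", "python3", script, *script_args,
    ]


cmd = build_gdb_command("test_fsdp2_mpi.py", ["--backend", "mpi"])

# Print a shell-ready line, with the same environment variable as in step 2.
print("OMP_NUM_THREADS=1 " + shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run with OMP_NUM_THREADS set in the environment.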