@mrshenli How can I catch, in the caller, the exceptions thrown by rpc.rpc_async, rpc.rpc_sync and rpc.remote under the following conditions? Assume a timeout is set globally (or per call):
- during execution, the target process crashes and exits, also closing down all rpc execution threads.
- during execution, connection to the target process is closed
- during execution, the timeout limit is reached
- during execution, an exception is raised in the executed function
Based on my experiments, my partial answer is:
- Not known?
- A RuntimeError, something like “peer reset”
- An uncatchable std::runtime_error, something like:
terminate called after throwing an instance of 'std::runtime_error'
what(): RPC ran for more than 5000 milliseconds and timed out.
- The exception thrown by the function: not the original exception, but one wrapped in a UDF exception and re-raised on the caller side.
The third one troubles me the most because std::runtime_error will cause an ugly Fatal Python Error:
Fatal Python error: Aborted
Thread 0x00007f916abab700 (most recent call first):
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/world.py", line 63 in _rpc_call_remote_method
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/torch/distributed/rpc/internal.py", line 153 in _run_function
Thread 0x00007f91693a8700 (most recent call first):
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/world.py", line 75 in _rpc_get_remote_paired_value
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/torch/distributed/rpc/internal.py", line 153 in _run_function
Thread 0x00007f9163fff700 (most recent call first):
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/world.py", line 75 in _rpc_get_remote_paired_value
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/torch/distributed/rpc/internal.py", line 153 in _run_function
Thread 0x00007f91527fc700 (most recent call first):
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/torch/distributed/rpc/api.py", line 554 in rpc_sync
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/torch/distributed/rpc/api.py", line 77 in wrapper
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/world.py", line 756 in _rpc_paired_class_call
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/world.py", line 597 in rpc_paired_class_sync
File "/home/Administrator/iffi/Projects/machin/test/parallel/distributed/test_world.py", line 97 in main
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/world.py", line 46 in _exec_role
File "/usr/lib/python3.5/threading.py", line 862 in run
File "/home/Administrator/iffi/Projects/machin/machin/parallel/thread.py", line 47 in run
File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap
Thread 0x00007f9152ffd700 (most recent call first):
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/election.py", line 423 in _task_timeout
File "/usr/lib/python3.5/threading.py", line 862 in run
File "/home/Administrator/iffi/Projects/machin/machin/parallel/thread.py", line 47 in run
File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap
Thread 0x00007f91537fe700 (most recent call first):
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/election.py", line 435 in _task_keep_alive
File "/usr/lib/python3.5/threading.py", line 862 in run
File "/home/Administrator/iffi/Projects/machin/machin/parallel/thread.py", line 47 in run
File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap
Thread 0x00007f9153fff700 (most recent call first):
File "/usr/lib/python3.5/threading.py", line 297 in wait
File "/usr/lib/python3.5/queue.py", line 173 in get
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/election.py", line 491 in _task_handle
File "/usr/lib/python3.5/threading.py", line 862 in run
File "/home/Administrator/iffi/Projects/machin/machin/parallel/thread.py", line 47 in run
File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap
Thread 0x00007f9160ff9700 (most recent call first):
File "/usr/lib/python3.5/threading.py", line 293 in wait
File "/home/Administrator/iffi/Projects/machin/machin/parallel/event.py", line 66 in wait
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/role_dispatcher.py", line 234 in _task_dispatch
File "/usr/lib/python3.5/threading.py", line 862 in run
File "/home/Administrator/iffi/Projects/machin/machin/parallel/thread.py", line 47 in run
File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap
Thread 0x00007f91617fa700 (most recent call first):
File "/usr/lib/python3.5/threading.py", line 293 in wait
File "/home/Administrator/iffi/Projects/machin/machin/parallel/event.py", line 66 in wait
File "/home/Administrator/iffi/Projects/machin/machin/parallel/distributed/world.py", line 302 in _task_run_dispatched_roles
File "/usr/lib/python3.5/threading.py", line 862 in run
File "/home/Administrator/iffi/Projects/machin/machin/parallel/thread.py", line 47 in run
File "/usr/lib/python3.5/threading.py", line 914 in _bootstrap_inner
File "/usr/lib/python3.5/threading.py", line 882 in _bootstrap
Thread 0x00007f91e4362700 (most recent call first):
File "/home/Administrator/iffi/Projects/machin/test/parallel/distributed/test_world.py", line 145 in subproc_start_world_with_roles
File "/home/Administrator/iffi/Projects/machin/test/parallel/util_run_multi.py", line 16 in process_main
File "/usr/lib/python3.5/multiprocessing/process.py", line 93 in run
File "/home/Administrator/iffi/Projects/machin/machin/parallel/process.py", line 52 in run
File "/usr/lib/python3.5/multiprocessing/process.py", line 249 in _bootstrap
File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 74 in _launch
File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20 in __init__
File "/usr/lib/python3.5/multiprocessing/context.py", line 267 in _Popen
File "/home/Administrator/iffi/Projects/machin/machin/parallel/process.py", line 25 in _Popen
File "/usr/lib/python3.5/multiprocessing/process.py", line 105 in start
File "/home/Administrator/iffi/Projects/machin/test/parallel/util_run_multi.py", line 27 in processes
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/fixtures.py", line 788 in call_fixture_func
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/fixtures.py", line 964 in pytest_fixture_setup
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/fixtures.py", line 914 in execute
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/fixtures.py", line 584 in _compute_fixture_value
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/fixtures.py", line 503 in _get_active_fixturedef
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/fixtures.py", line 487 in getfixturevalue
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/fixtures.py", line 477 in _fillfixtures
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/fixtures.py", line 297 in fillfixtures
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/python.py", line 1483 in setup
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/runner.py", line 373 in prepare
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/runner.py", line 123 in pytest_runtest_setup
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/runner.py", line 217 in <lambda>
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/runner.py", line 244 in from_call
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/runner.py", line 217 in call_runtest_hook
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/runner.py", line 186 in call_and_report
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/runner.py", line 94 in runtestprotocol
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/runner.py", line 85 in pytest_runtest_protocol
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/main.py", line 272 in pytest_runtestloop
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/main.py", line 247 in _main
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/main.py", line 191 in wrap_session
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 87 in <lambda>
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/Administrator/iffi/Projects/machin/venv/lib/python3.5/site-packages/_pytest/config/__init__.py", line 125 in main
File "/data/software/pycharm/pycharm-2020.1.2/plugins/python/helpers/pycharm/_jb_pytest_runner.py", line 43 in <module>
Is there any clean way to deal with the first three conditions? The fourth one is simple. And why is pybind11 not converting the std::runtime_error from the third condition into a catchable Python RuntimeError?
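For reference, here is a minimal sketch of how the four conditions might be told apart on the caller side. It assumes each failure actually reaches Python as an exception when the future is waited on, which, as described above, is not what happens for the timeout in my setup. `wait_and_classify` and the category labels are hypothetical names, `concurrent.futures.Future` only stands in for the object returned by the RPC call, and matching on the message text is a fragile heuristic rather than a documented contract.

```python
from concurrent.futures import Future


def wait_and_classify(fut, wait=lambda f: f.result()):
    """Wait on a future-like object and label the outcome.

    Returns a (label, payload) tuple. The labels are illustrative only:
      "ok"           -> payload is the return value
      "timeout"      -> RuntimeError whose message mentions "timed out"
      "transport"    -> any other RuntimeError (e.g. "peer reset")
      "remote_error" -> any other exception, assumed to come from the callee
    """
    try:
        return "ok", wait(fut)
    except RuntimeError as exc:
        # Heuristic: the timeout message quoted in this post contains
        # "timed out". A RuntimeError raised by the remote function itself
        # would also land in this branch and be mislabeled.
        if "timed out" in str(exc):
            return "timeout", exc
        return "transport", exc
    except Exception as exc:
        return "remote_error", exc


if __name__ == "__main__":
    demo = Future()
    demo.set_exception(RuntimeError("peer reset"))
    print(wait_and_classify(demo)[0])
```

With a future returned by the RPC call, the `wait` argument would presumably be `lambda f: f.wait()` instead of the default.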
It’s really sad to hear that PyTorch RPC currently cannot handle the 1st and 2nd conditions, since handling them is what my application code is designed to do. I will try to reproduce the 3rd condition with simpler code, but that might be very difficult, since there is currently no way to log all events just before the “fatal abort” happens.
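One possible way to keep some evidence from just before the abort is to run the failing code in a child interpreter and capture its stderr, so that the abort kills only the child. This is a rough sketch under stated assumptions: `run_isolated` is a hypothetical helper, `os.abort()` only stands in for the std::terminate path, and the negative return code convention applies to POSIX systems.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=30.0):
    """Run a Python snippet in a child interpreter and report how it ended.

    Returns (status, stderr_text). On POSIX, a negative return code means
    the child was killed by that signal number.
    """
    proc = subprocess.run(
        # -X faulthandler makes the child dump its thread stacks on abort.
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        status = "ok"
    elif proc.returncode < 0:
        status = "killed by " + signal.Signals(-proc.returncode).name
    else:
        status = "exit code {}".format(proc.returncode)
    return status, proc.stderr


if __name__ == "__main__":
    # Stand-in for the crashing RPC test: log something, then abort.
    snippet = (
        "import os, sys\n"
        "sys.stderr.write('about to call rpc\\n')\n"
        "sys.stderr.flush()\n"
        "os.abort()\n"
    )
    status, log = run_isolated(snippet)
    print(status)
    print(log)
```

The parent survives the abort and still holds whatever the child wrote to stderr before dying, which could then be compared across runs.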
