I am attempting to use PyTorch Distributed RPC with a large number of requests in flight, using multiprocessing on a single machine. I am on the latest PyTorch nightly build.
I am finding a bottleneck in an unexpected place in the PyTorch code: the vast majority of the time is spent in _recursive_compile_class. This seems wrong to me, because nothing should need to be compiled on every RPC (every RPC is for the same function).
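For context, the shape of the workload is roughly the following. This is only a minimal sketch, not the linked test case; the worker names, tensor size, and request count are placeholders:

```python
# Minimal sketch: two processes on one machine, where the caller keeps a large
# number of rpc_async() requests to the same function in flight at once.
import os
import torch
import torch.multiprocessing as mp
import torch.distributed.rpc as rpc


def work(x: torch.Tensor) -> torch.Tensor:
    # Trivial remote function; every RPC targets this same callable.
    return x * 2


def run(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # Launch a large batch of concurrent requests, then wait on all of them.
        futures = [
            rpc.rpc_async("worker1", work, args=(torch.ones(8),))
            for _ in range(10_000)
        ]
        results = [f.wait() for f in futures]
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```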
Here is the most common stack trace:
__pthread_cond_timedwait (/lib/x86_64-linux-gnu/libpthread-2.31.so:0)
> exists (/usr/lib/python3.9/genericpath.py:19)
> getsourcefile (/usr/lib/python3.9/inspect.py:706)
> findsource (/usr/lib/python3.9/inspect.py:817)
> getsourcelines (/usr/lib/python3.9/inspect.py:1006)
> getsource (/usr/lib/python3.9/inspect.py:1024)
> get_type_hint_captures (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/_jit_internal.py:321)
> createResolutionCallbackForClassMethods (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/_jit_internal.py:376)
> _recursive_compile_class (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/jit/_script.py:1164)
> pybind11::detail::simple_collector<(pybind11::return_value_policy)1>::call (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> torch::jit::tryToInferType (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> c10::ivalue::ConcretePyObjectHolder::tryToInferType (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> c10::IValue::getSubValues (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so:0)
> at::cuda::CUDAFuture::extractDataPtrs (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> at::cuda::CUDAFuture::preMarkCompletedHook (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> c10::ivalue::Future::markCompleted (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> std::_Function_handler<void (), c10::ivalue::Future::then(std::function<c10::IValue()>, std::shared_ptr<c10::Type>)::{lambda()#1}>::_M_invoke (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> std::_Function_handler<void (), at::cuda::CUDAFuture::wrapCallback(std::function<void ()>)::{lambda()#1}>::_M_invoke (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> c10::ivalue::Future::markCompletedWithDataPtrs (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> std::_Function_handler<void (), torch::distributed::rpc::toPyJitFuture(std::shared_ptr<c10::ivalue::Future> const&, bool)::{lambda()#1}>::_M_invoke (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> std::_Function_handler<void (), std::function<void ()> at::wrapPropagateTLSState<void>(std::function<void ()>)::{lambda()#1}>::_M_invoke (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> std::_Function_handler<void (), at::cuda::CUDAFuture::wrapCallback(std::function<void ()>)::{lambda()#1}>::_M_invoke (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> std::_Function_handler<void (), torch::distributed::rpc::TensorPipeAgent::markFutureAsComplete(std::shared_ptr<torch::distributed::rpc::TensorPipeAgent::AtomicJitFuture>, torch::distributed::rpc::Message, std::shared_ptr<torch::distributed::rpc::LazyStreamContext>)::{lambda()#1}>::_M_invoke (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so:0)
> c10::ThreadPool::main_loop (/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/lib/libc10.so:0)
> 0x7f8b6e3eeed0 (/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28:0)
My full test case is available on GitHub in the JDBumgardner/stone_ground_hearth_battles repository, at stone_ground_hearth_battles/test_pytorch_distributed.py (commit 15534b50902c52d0be39700f783d18655083a794).