Torch RPC core dumped: INTERNAL ASSERT FAILED at "../c10/cuda/CUDAStream.cpp":254, please report a bug to PyTorch

fedml@ip-172-31-46-221:/home/ec2-user/FedML/fedml_core/distributed/test/test_rpc$ sh run_rpc.sh TRPC 0
rank - 0 - 2021-10-13,03:21:54.277 main.py[line:86] INFO Namespace(backend='TRPC', enable_cuda_rpc=False, gpu_mapping_file='gpu_mapping.yaml', gpu_mapping_key='mapping_default', grpc_ipconfig_path='grpc_ipconfig.csv', rank=0, trpc_master_config_path='trpc_master_config.csv')
rank - 0 - 2021-10-13,03:21:54.277 trpc_comm_manager.py[line:38] INFO using TRPC backend
Worker rank 0 initializing RPC
Creating the object
rank - 0 - 2021-10-13,03:21:54.277 trpc_comm_manager.py[line:58] INFO /home/ec2-user/FedML/fedml_core/distributed/test/test_rpc
rank - 0 - 2021-10-13,03:21:54.277 trpc_comm_manager.py[line:76] INFO str_init_method = tcp://172.31.46.221:9999
terminate called after throwing an instance of 'c10::Error'
  what():  device_index >= 0 && device_index < num_gpusINTERNAL ASSERT FAILED at "../c10/cuda/CUDAStream.cpp":254, please report a bug to PyTorch. 
Exception raised from check_gpu at ../c10/cuda/CUDAStream.cpp:254 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdbf4447a22 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x5f (0x7fdbf44444af in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::getStreamFromPool(bool, signed char) + 0x177 (0x7fdbf469e187 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x8d05 (0x7fdbf46a0d05 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0xe40cc4 (0x7fdc4b3b4cc4 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xe40e48 (0x7fdc4b3b4e48 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xe93170 (0x7fdc4b407170 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xe9334f (0x7fdc4b40734f in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #8: tensorpipe::PipeImpl::callReadDescriptorCallback(tensorpipe::OpsStateMachine<tensorpipe::PipeImpl, tensorpipe::ReadOperation>::Iter) + 0x209 (0x7fdc4b409ac9 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xea2608 (0x7fdc4b416608 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #10: tensorpipe::PipeImpl::advanceReadOperation(tensorpipe::OpsStateMachine<tensorpipe::PipeImpl, tensorpipe::ReadOperation>::Iter, tensorpipe::ReadOperation::State) + 0xf3 (0x7fdc4b400d33 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0xea69c2 (0x7fdc4b41a9c2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0xe9857d (0x7fdc4b40c57d in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #13: tensorpipe::ContextImpl::deferToLoop(std::function<void ()>) + 0x154 (0x7fdc4b3e43a4 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0xe8ef31 (0x7fdc4b402f31 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0xf10c95 (0x7fdc4b484c95 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0xf11dc3 (0x7fdc4b485dc3 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #17: tensorpipe::transport::uv::ConnectionImpl::readCallbackFromLoop(long, uv_buf_t const*) + 0x420 (0x7fdc4b502680 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0xf929cf (0x7fdc4b5069cf in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x108482f (0x7fdc4b5f882f in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #20: <unknown function> + 0x1084e6c (0x7fdc4b5f8e6c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #21: uv__io_poll + 0x356 (0x7fdc4b5fd646 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #22: uv_run + 0x107 (0x7fdc4b5f2f27 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #23: tensorpipe::transport::uv::Loop::eventLoop() + 0x1d (0x7fdc4b50bf5d in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #24: <unknown function> + 0xf8074c (0x7fdc4b4f474c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #25: std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> > >::_M_run() + 0x41 (0x7fdc4b4f3ff1 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #26: <unknown function> + 0xbd6df (0x7fdc4e7b96df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #27: <unknown function> + 0x76db (0x7fdc630026db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #28: clone + 0x3f (0x7fdc6333ba3f in /lib/x86_64-linux-gnu/libc.so.6)


[ip-172-31-46-221:01840] *** Process received signal ***
[ip-172-31-46-221:01840] Signal: Aborted (6)
[ip-172-31-46-221:01840] Signal code:  (-6)
[ip-172-31-46-221:01840] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3efd0)[0x7fdc63258fd0]
[ip-172-31-46-221:01840] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fdc63258f47]
[ip-172-31-46-221:01840] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fdc6325a8b1]
[ip-172-31-46-221:01840] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7fdc4e788957]
[ip-172-31-46-221:01840] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6)[0x7fdc4e78eae6]
[ip-172-31-46-221:01840] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b21)[0x7fdc4e78eb21]
[ip-172-31-46-221:01840] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d54)[0x7fdc4e78ed54]
[ip-172-31-46-221:01840] [ 7] /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so(_ZN3c106detail14torchCheckFailEPKcS2_jS2_+0x8a)[0x7fdbf44444da]
[ip-172-31-46-221:01840] [ 8] /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so(_ZN3c104cuda17getStreamFromPoolEba+0x177)[0x7fdbf469e187]
[ip-172-31-46-221:01840] [ 9] /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so(+0x8d05)[0x7fdbf46a0d05]
[ip-172-31-46-221:01840] [10] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xe40cc4)[0x7fdc4b3b4cc4]
[ip-172-31-46-221:01840] [11] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xe40e48)[0x7fdc4b3b4e48]
[ip-172-31-46-221:01840] [12] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xe93170)[0x7fdc4b407170]
[ip-172-31-46-221:01840] [13] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xe9334f)[0x7fdc4b40734f]
[ip-172-31-46-221:01840] [14] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(_ZN10tensorpipe8PipeImpl26callReadDescriptorCallbackENS_15OpsStateMachineIS0_NS_13ReadOperationEE4IterE+0x209)[0x7fdc4b409ac9]
[ip-172-31-46-221:01840] [15] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xea2608)[0x7fdc4b416608]
[ip-172-31-46-221:01840] [16] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(_ZN10tensorpipe8PipeImpl20advanceReadOperationENS_15OpsStateMachineIS0_NS_13ReadOperationEE4IterENS2_5StateE+0xf3)[0x7fdc4b400d33]
[ip-172-31-46-221:01840] [17] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xea69c2)[0x7fdc4b41a9c2]
[ip-172-31-46-221:01840] [18] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xe9857d)[0x7fdc4b40c57d]
[ip-172-31-46-221:01840] [19] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(_ZN10tensorpipe11ContextImpl11deferToLoopESt8functionIFvvEE+0x154)[0x7fdc4b3e43a4]
[ip-172-31-46-221:01840] [20] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xe8ef31)[0x7fdc4b402f31]
[ip-172-31-46-221:01840] [21] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xf10c95)[0x7fdc4b484c95]
[ip-172-31-46-221:01840] [22] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xf11dc3)[0x7fdc4b485dc3]
[ip-172-31-46-221:01840] [23] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(_ZN10tensorpipe9transport2uv14ConnectionImpl20readCallbackFromLoopElPK8uv_buf_t+0x420)[0x7fdc4b502680]
[ip-172-31-46-221:01840] [24] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0xf929cf)[0x7fdc4b5069cf]
[ip-172-31-46-221:01840] [25] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0x108482f)[0x7fdc4b5f882f]
[ip-172-31-46-221:01840] [26] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(+0x1084e6c)[0x7fdc4b5f8e6c]
[ip-172-31-46-221:01840] [27] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(uv__io_poll+0x356)[0x7fdc4b5fd646]
[ip-172-31-46-221:01840] [28] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(uv_run+0x107)[0x7fdc4b5f2f27]
[ip-172-31-46-221:01840] [29] /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so(_ZN10tensorpipe9transport2uv4Loop9eventLoopEv+0x1d)[0x7fdc4b50bf5d]
[ip-172-31-46-221:01840] *** End of error message ***
Aborted (core dumped)

Looks like it’s trying to get a stream on an invalid CUDA device. Can you share the code that hits this error? And what’s the hardware setup, e.g. how many GPUs on each side and what type of GPUs?

cc @lcw have you seen similar errors before?

and which PyTorch version are you using?

This looks a lot like the memory corruption issue I spent an entire week chasing and fixing. The fix is in https://github.com/pytorch/pytorch/pull/60470, hence it’s only available in PyTorch 1.10 for now, sorry.

If memory serves, this problem occurs when an object contained in an RRef is mutated in place, in particular when one of its tensors is removed (imagine, for example, an RRef holding a dict of tensors and one of those items being popped from the dict), as in the sketch below.
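To make the scenario concrete, here is a minimal, purely illustrative sketch of that pattern; `make_state` and `drop_key` are hypothetical names, not taken from the report above:

```python
import torch
import torch.distributed.rpc as rpc

def make_state():
    # The RRef returned by rpc.remote() will own this dict of tensors.
    return {"weights": torch.randn(4, 4), "bias": torch.randn(4)}

def drop_key(state_rref, key):
    # In-place mutation of the object behind the RRef: one of its tensors
    # is removed, which is the kind of mutation described above.
    state_rref.local_value().pop(key)

# On the caller, assuming init_rpc() has already been run on both workers:
# state_rref = rpc.remote("worker1", make_state)
# rpc.rpc_sync("worker1", drop_key, args=(state_rref, "bias"))
```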

My fix ensures that in those cases we at least don’t crash; however, it’s not a “full” fix for this kind of scenario, because the desired behavior for mutating RRefs is, in my view, poorly specified. In general, I’d strongly recommend treating RRefs as immutable. If you need to modify an RRef, you should always be able to extract its value, modify it as you want, and then re-wrap the new version of that value in a new RRef (and stop using the old one). This would be fully safe.
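A minimal sketch of that “extract, modify, re-wrap” pattern, again with illustrative function names:

```python
import torch
import torch.distributed.rpc as rpc

def make_state():
    return {"weights": torch.randn(4, 4), "bias": torch.randn(4)}

def updated_state(state_rref, key):
    # Copy the current value, modify the copy, and return it; the caller
    # wraps the result in a fresh RRef instead of mutating the old object.
    state = dict(state_rref.to_here())
    state.pop(key, None)
    return state

# On the caller, assuming init_rpc() has already been run on both workers:
# old_rref = rpc.remote("worker1", make_state)
# new_rref = rpc.remote("worker1", updated_state, args=(old_rref, "bias"))
# ...then stop using old_rref and use new_rref from here on.
```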