I am running into this error when using PyTorch (PT) 1.10.0:
/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/smplib.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZTVN4c10d17AllReduceCommHookE
The same issue does not exist in PT 1.8. This hook is defined in default_comm_hooks.hpp.
So I checked the symbols exported by libtorch_python.so.
In PT 1.8:
nm -gDC /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so | grep AllReduceCommHook
0000000000a06ee0 T c10d::AllReduceCommHook::runHook(c10d::GradBucket&)
000000000125d518 V typeinfo for c10d::AllReduceCommHook
0000000000dd7ca0 V typeinfo name for c10d::AllReduceCommHook
000000000125d568 V vtable for c10d::AllReduceCommHook
But in PT 1.10.0, the same command
nm -gDC /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so | grep AllReduceCommHook
returns nothing.
I also observed the same behavior for FP16CompressCommHook, which is defined in the same file. Could someone let me know where I can find those symbols in PT 1.10.0?
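In case it helps anyone reproduce this, here is a rough sketch of how the same nm check could be run over every shared object in torch/lib instead of only libtorch_python.so, in case the symbols now live in a different library. The helper names defined_matches and scan_lib_dir are my own (hypothetical), and the sketch assumes nm from binutils is on the PATH. Matching is case-insensitive, so a casing slip in the search string does not hide a hit.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def defined_matches(nm_output: str, needle: str) -> List[str]:
    """Return nm lines that define a symbol whose name contains `needle`.

    Hypothetical helper. The match is case-insensitive, so
    "allreducecommhook" also finds "AllReduceCommHook".
    """
    hits = []
    for line in nm_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        try:
            int(fields[0], 16)  # defined symbols start with a hex address
        except ValueError:
            continue  # undefined entries ("U name") carry no address
        if needle.lower() in fields[2].lower():
            hits.append(line.strip())
    return hits


def scan_lib_dir(lib_dir: str, needle: str) -> Dict[str, List[str]]:
    """Run `nm -gDC` on each shared object in lib_dir; map file name to hits."""
    results = {}
    for so_path in sorted(Path(lib_dir).glob("*.so*")):
        proc = subprocess.run(
            ["nm", "-gDC", str(so_path)],
            capture_output=True,
            text=True,
            check=False,  # nm exits non-zero on files without dynamic symbols
        )
        hits = defined_matches(proc.stdout, needle)
        if hits:
            results[so_path.name] = hits
    return results
```

Calling scan_lib_dir("/opt/conda/lib/python3.8/site-packages/torch/lib", "AllReduceCommHook") would then list which library, if any, defines the vtable that smplib is looking for.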