'unhandled system error' when training with multiple nodes

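Here’s one way to check whether NCCL is installed on the node and, if so, which version. The command prints the version suffix of the last libnccl.so file that locate finds, and prints nothing if there is no match. Note that locate reads a prebuilt database, so a recently installed library may not show up until updatedb has run: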

locate nccl | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
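
If locate is unavailable or its database is stale, a similar check can be sketched against the dynamic linker cache instead. This is only a rough alternative, assuming a Linux node with GNU coreutils and ldconfig on the PATH (on some distributions it lives in /sbin). The function names nccl_version_from_path and find_nccl are made up for this example, and a library installed outside the linker's search paths (for example inside a conda environment) will not be found this way.

```shell
#!/bin/sh
# Hypothetical helper: strip everything up to and including ".so." from a
# library path, e.g. /usr/lib/libnccl.so.2.7.8 -> 2.7.8
nccl_version_from_path() {
    printf '%s\n' "$1" | sed -E 's/^.*\.so\.//'
}

# Hypothetical helper: print the last libnccl entry in the linker cache.
# Prints nothing if ldconfig is missing or NCCL is not registered.
find_nccl() {
    ldconfig -p 2>/dev/null | awk '/libnccl\.so/ {print $NF}' | tail -n1 || true
}

path="$(find_nccl)"
if [ -n "$path" ]; then
    # The cache usually points at a symlink such as libnccl.so.2;
    # resolve it to the real file to recover the full version suffix.
    real="$(readlink -f "$path")"
    echo "NCCL found: $real (version $(nccl_version_from_path "$real"))"
else
    echo "NCCL not found in the linker cache"
fi
```

Running this on every node and comparing the output is a quick way to confirm that each node has NCCL installed and that the versions match.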