Any chance it could be related to NCCL? I found this thread with a similar stack trace: Distributed training hangs, which references this GitHub issue: https://github.com/pytorch/pytorch/issues/20630.
I have PyTorch 1.4.0+cu100, and `apt search nccl` returned:

```
libnccl-dev/unknown 2.7.6-1+cuda11.0 amd64
  NVIDIA Collectives Communication Library (NCCL) Development Files

libnccl2/unknown 2.7.6-1+cuda11.0 amd64
  NVIDIA Collectives Communication Library (NCCL) Runtime
```
Could that be a version mismatch? (The `+cu100` build targets CUDA 10.0, while the libnccl packages above are built against CUDA 11.0.) On the other hand, I'm not using multiple GPUs, which is what NCCL seems to be meant for.
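For what it's worth, here's a quick sanity-check snippet (just a sketch; as far as I know the pip/conda wheels bundle their own NCCL, so the apt packages may not even be the ones PyTorch uses) to see which versions the PyTorch build itself reports:

```python
import torch
import torch.distributed as dist

# Versions baked into the PyTorch build (may differ from the apt packages)
print("PyTorch:", torch.__version__)        # e.g. 1.4.0+cu100
print("CUDA (build):", torch.version.cuda)  # e.g. 10.0

# NCCL version this PyTorch build was compiled against
if torch.cuda.is_available():
    print("NCCL:", torch.cuda.nccl.version())

# Whether the NCCL backend is available to torch.distributed at all
print("NCCL backend available:", dist.is_nccl_available())
```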