Training Job Stalls with no Logs & GPU Usage Spike

Jake_Williams · July 27, 2020, 2:59pm

Any chance it could be related to NCCL? I found a thread with a similar stack trace, "Distributed training hangs", which references this GitHub issue: https://github.com/pytorch/pytorch/issues/20630

I have PyTorch 1.4.0+cu100, and apt search nccl returned:

libnccl-dev/unknown 2.7.6-1+cuda11.0 amd64
  NVIDIA Collectives Communication Library (NCCL) Development Files

libnccl2/unknown 2.7.6-1+cuda11.0 amd64
  NVIDIA Collectives Communication Library (NCCL) Runtime

Could that be a version mismatch, since my PyTorch build targets CUDA 10.0 (cu100) while these NCCL packages target CUDA 11.0? On the other hand, I'm not using multiple GPUs, which is what NCCL seems to be meant for.
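
One way to narrow this down is to ask PyTorch which CUDA and NCCL versions it was built with, rather than relying on apt. As far as I know, the official pip/conda binaries bundle their own NCCL, and apt search lists available packages whether or not they are installed, so the system libnccl2 may not be involved at all. Below is a minimal sketch, assuming torch is importable in the training environment. The helper names format_nccl_version and report_versions are made up for illustration. torch.cuda.nccl.version() returns an integer code on older releases such as 1.4 and a tuple on newer ones, so the helper handles both.

```python
def format_nccl_version(v):
    """Turn the value of torch.cuda.nccl.version() into 'major.minor.patch'.

    Older PyTorch releases return an integer code (e.g. 2408 for 2.4.8);
    newer ones return a tuple such as (2, 10, 3).
    """
    if isinstance(v, tuple):
        return ".".join(str(part) for part in v)
    # Assumption based on NCCL's version macro: codes up to 2.8.x use
    # major*1000 + minor*100 + patch, later ones use major*10000 + ...
    if v < 10000:
        major, rest = divmod(v, 1000)
    else:
        major, rest = divmod(v, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def report_versions():
    """Collect the versions PyTorch itself reports (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,   # CUDA version torch was built with
        "gpu_count": torch.cuda.device_count(),
    }
    try:
        info["nccl"] = format_nccl_version(torch.cuda.nccl.version())
    except Exception:
        # CPU-only builds or builds without NCCL cannot report a version.
        info["nccl"] = None
    return info


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

If the NCCL version printed here differs from the 2.7.6 that apt shows, that would suggest the apt packages are not what PyTorch is using. With a single GPU and no torch.distributed process group, NCCL is probably not on the code path anyway, so the stall may have a different cause.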