What causes unpredictable NCCL behavior, and how can it be fixed?

I have an issue where the same script, with the same settings (memory, number of GPUs, etc.), runs correctly on one run (I can see outputs and errors) but suddenly hangs when I rerun it (no output, no errors). This has slowed me down significantly, since I have no clue what the issue is.

  • The port number is updated for each execution.
  • I do wait for some time before rerunning it, and I do execute only one job at a time (just in case).
  • The code properly initializes and closes the NCCL connections.
  • With export NCCL_DEBUG=INFO set, the log stops at
    [0] NCCL INFO comm 0x7f7b7c002e10 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE,
    but none of the print statements in the code are executed afterward.
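
For reference, the per-run setup described in the list above could be sketched roughly as follows. This is only an illustration, not the actual script: it assumes a single-node run and a launcher that reads the MASTER_ADDR and MASTER_PORT environment variables (the convention used by PyTorch's env:// initialization), and the helper names find_free_port and configure_launch_env are made up here. Only NCCL_DEBUG=INFO is taken from the setup above.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is unused on this machine right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


def configure_launch_env() -> dict:
    """Set per-run environment variables before any communicator is created."""
    env = {
        "MASTER_ADDR": "127.0.0.1",            # assumption: single-node run
        "MASTER_PORT": str(find_free_port()),  # assumption: launcher reads MASTER_PORT
        "NCCL_DEBUG": "INFO",                  # same setting as in the list above
    }
    os.environ.update(env)
    return env


if __name__ == "__main__":
    # flush=True so the line shows up even if the process hangs right afterward
    print(configure_launch_env(), flush=True)
```

Asking the OS for a port avoids reusing one that is still held by a previous run, although another process could in principle take the port between the check and the actual bind.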

Is there something I can do to avoid getting stuck like this? Am I missing something? The same issue also sometimes happens within a single run: part of the code executes and then the rest hangs.
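
One way to narrow down where such a hang occurs, sketched here with Python's standard library only, is to flush every print and to arm a watchdog that dumps all thread stacks if the script is still running after a timeout. The wrapper name run_with_hang_watchdog, the 300-second default, and the placeholder main are assumptions for illustration. This does not fix a hang; it only shows which line the process is blocked on, and it also rules out the possibility that the prints ran but were still sitting in a buffer.

```python
import faulthandler
import sys


def run_with_hang_watchdog(fn, timeout_s: float = 300.0):
    """Run fn(); while it is still running, dump all thread stacks to stderr
    every timeout_s seconds so a silent hang shows where it is blocked."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=sys.stderr)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def main() -> int:
    # Placeholder for the real script body (hypothetical).
    print("before the distributed section", flush=True)
    print("after the distributed section", flush=True)
    return 0


if __name__ == "__main__":
    sys.exit(run_with_hang_watchdog(main, timeout_s=300.0))
```

If the second message never appears and a traceback is printed after the timeout, the top frames of that traceback indicate the call that is blocking.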