DDP with SLURM hangs

I am getting this error when training with DDP on a single node with 4 GPUs. The error occurs at random iterations, and while the process is hung, GPU utilization stays at 100%.

```
ProcessGroupNCCL’s watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
```
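For reference, this is roughly how I plan to apply the suggestions from that message while debugging. It is only a sketch: the specific values, the `NCCL_DEBUG` / `TORCH_DISTRIBUTED_DEBUG` variables, and the assumption of a torchrun-style launcher are my own guesses, not something the error requires.

```python
import datetime
import os

import torch
import torch.distributed as dist

# Must be set before the NCCL process group is created.
# Values are guesses for debugging, not recommendations.
os.environ.setdefault("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", "1200")  # from the error message above
os.environ.setdefault("NCCL_DEBUG", "INFO")                        # verbose NCCL logs to see the last collective
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")         # extra collective-mismatch checks/logging


def setup_distributed() -> None:
    """Init the NCCL process group with a longer collective timeout."""
    # Assumes a torchrun-style launcher that sets LOCAL_RANK; with plain
    # srun you may need SLURM_LOCALID instead.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=30),  # longer than the default collective timeout
    )


if __name__ == "__main__":
    setup_distributed()
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()
```

If the hang reproduces with those logs enabled, comparing what each rank was doing when the watchdog fired should at least narrow down where the ranks diverge.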

I am having a similar issue, though for my script the hang occurs at a consistent point in training. Does your problem occur regardless of hardware? For me it only happens on H100s, while the same script runs fine on A40s.