DDP (via Lightning/Fabric) training hang with 100% GPU utilization

To future readers:

I solved this thanks to this thread: Single Machine DDP Issue on A6000 GPU

The tl;dr is that, for me, setting this env var was enough to fix the problem entirely: `NCCL_P2P_DISABLE=1`. It tells NCCL to skip peer-to-peer GPU transfers and route traffic through host memory instead, which sidesteps the broken P2P path on these cards.
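If you'd rather set it from Python than from the shell, something like this should work (a minimal sketch assuming Lightning 2.x; `devices=2` is just an example). The important part is setting the variable before the Trainer/Fabric initializes distributed communication, since the DDP worker processes inherit the parent's environment:

```python
import os

# Must be set before NCCL initializes, i.e. before the Trainer sets up
# the process group. Disables peer-to-peer GPU transfers.
os.environ["NCCL_P2P_DISABLE"] = "1"

import lightning as L

trainer = L.Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer.fit(model) as usual
```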

If you are using conda, you probably want to add this export to an activation script, which runs every time you activate the env, like so:

mkdir -p "$CONDA_PREFIX/etc/conda/activate.d"
echo 'export NCCL_P2P_DISABLE=1' >> "$CONDA_PREFIX/etc/conda/activate.d/env_vars.sh"
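(Note the `mkdir -p`, since `activate.d` may not exist yet, and the `>>` append so you don't clobber an existing `env_vars.sh`.) To sanity-check that the script is picked up, re-activate the env and print the variable (`myenv` is a placeholder for your env's name):

conda deactivate && conda activate myenv
echo "$NCCL_P2P_DISABLE"    # should print 1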

After doing the above, the problem should be fixed for you forever (“forever”)… at least if you’re as lucky as I was (: