To future readers:
I solved this problem thanks to this thread: Single Machine DDP Issue on A6000 GPU
The tl;dr is that, for me, setting this env var was enough to fix the problem entirely: NCCL_P2P_DISABLE=1
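If you are using conda, you probably want to put the export in an activation script, which conda runs every time you activate the env. The file name env_vars.sh is just an example; conda sources every .sh file in that directory (create the directory with mkdir -p if it doesn't exist yet):
echo 'export NCCL_P2P_DISABLE=1' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh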
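In case it helps, here is a slightly more defensive version of that command as a rough sketch. The function name install_nccl_p2p_hook and the file name nccl_p2p.sh are my own (hypothetical) choices, not anything conda requires. It creates the hook directories if they are missing and also adds a deactivate hook so the variable is unset again when you leave the env:

```shell
# Hypothetical helper: install conda activate/deactivate hooks for NCCL_P2P_DISABLE.
# Usage: install_nccl_p2p_hook [env_prefix]   (defaults to the active env's $CONDA_PREFIX)
install_nccl_p2p_hook() {
    prefix="${1:-${CONDA_PREFIX:-}}"
    if [ -z "$prefix" ]; then
        echo "no conda env is active and no prefix was given" >&2
        return 1
    fi
    # The hook directories are not guaranteed to exist in a fresh env.
    mkdir -p "$prefix/etc/conda/activate.d" "$prefix/etc/conda/deactivate.d"
    # nccl_p2p.sh is an arbitrary name; conda sources every *.sh file in these directories.
    echo 'export NCCL_P2P_DISABLE=1' > "$prefix/etc/conda/activate.d/nccl_p2p.sh"
    echo 'unset NCCL_P2P_DISABLE' > "$prefix/etc/conda/deactivate.d/nccl_p2p.sh"
}
```

With the env active, run install_nccl_p2p_hook once, then deactivate and reactivate the env so the hook gets picked up.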
After doing the above, the problem should be fixed for you forever (“forever”)… at least if you’re as lucky as I was (: