Why does init_process_group hang with world size > 1?

I remember reading about issues with IOMMU and peer-to-peer (P2P) communication on RTX 4090 cards, which can cause NCCL to hang while Gloo works fine. Can you try the solution in this thread: `torch.distributed.init_process_group` hangs with 4 gpus with `backend="NCCL"` but not `"gloo"` - #6 by ritwik.m07 ?
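If it is the same P2P issue I recall, a common workaround is to disable NCCL's peer-to-peer path before initializing the process group. Here is a minimal sketch; `NCCL_P2P_DISABLE` and `NCCL_DEBUG` are real NCCL environment variables, but whether disabling P2P actually resolves your particular hang is an assumption worth verifying:

```python
import os

# Assumption: the hang comes from NCCL attempting GPU peer-to-peer transfers
# that the RTX 4090 / IOMMU setup does not support. Disabling P2P makes NCCL
# fall back to copies through host/shared memory instead.
# These must be set BEFORE torch.distributed.init_process_group is called.
os.environ["NCCL_P2P_DISABLE"] = "1"  # turn off direct GPU peer-to-peer
os.environ["NCCL_DEBUG"] = "INFO"     # print NCCL setup logs to diagnose hangs

# Then initialize as usual, e.g.:
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")
```

Equivalently, you can export the variables in the shell before launching (e.g. `NCCL_P2P_DISABLE=1 NCCL_DEBUG=INFO torchrun ...`), which avoids touching the training script. If this makes the hang go away, that confirms P2P as the culprit; you can then look into the IOMMU/ACS settings mentioned in the linked thread for a proper fix.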