Training freezes at “initializing ddp: GLOBAL_RANK...”

I’m trying to train LoFTR on 4 RTX 3090 GPUs on Ubuntu 18.04. When I start training, the output gets stuck on “initializing ddp: GLOBAL_RANK” and the terminal freezes (Ctrl + C won’t work anymore).

I saw that others had this problem with certain PyTorch / PyTorch Lightning versions; however, I’m using pytorch-lightning==1.3.5 and pytorch=1.8.1, which no one else seemed to have problems with. Also, the authors trained LoFTR with the same environment, so I think the problem has to be somewhere else.

Does anyone have an idea what, apart from the PyTorch / PyTorch Lightning versions, could be the problem?

Thanks!

P.S. I ran the CUDA p2pBandwidthLatencyTest sample and it shows that the GPUs can’t access each other directly. However, I’m not sure whether peer-to-peer access is actually needed here or whether they can just pass the data through the CPU.

Try disabling P2P access via NCCL_P2P_DISABLE=1 and check if that helps.
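Not from the thread above, just a sketch of one way to apply this, assuming the job is started as a plain Python script; exporting NCCL_P2P_DISABLE=1 in the shell before launching works just as well:

import os

# Disable NCCL peer-to-peer transfers; must be set before the NCCL process
# group / DDP plugin is initialized, e.g. at the very top of the training script.
# Shell equivalent: NCCL_P2P_DISABLE=1 python train.py ...
os.environ["NCCL_P2P_DISABLE"] = "1"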

When I ran torch.tensor(1).cuda(), it returned

NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
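For reference (not part of the original error output), the mismatch can also be read out directly; the exact values depend on the installed wheel:

import torch

print("PyTorch:", torch.__version__, "built with CUDA:", torch.version.cuda)
# Architectures the installed binary was compiled for, e.g. ['sm_37', ..., 'sm_75']
print("Compiled for:", torch.cuda.get_arch_list())
# Compute capability of the first GPU; an RTX 3090 reports (8, 6), i.e. sm_86
major, minor = torch.cuda.get_device_capability(0)
print("GPU needs:", f"sm_{major}{minor}")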

So it seems I have to use a PyTorch build with a newer CUDA version. Thank you!

Yes, the error shows you’ve installed a PyTorch binary with CUDA 10.2, which is not compatible with your 3090. Update PyTorch to any binary in the current stable or nightly release and it will work (as all new binaries use CUDA 11.7, 11.8, or 12.1).
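After reinstalling, a quick way to confirm that multi-GPU NCCL communication itself works, before going back to the full LoFTR training run, is a minimal all_reduce test. This is a standalone sketch, independent of LoFTR / Lightning; the address and port are arbitrary:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Minimal single-node rendezvous; 29500 is just an arbitrary free port.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # every rank should end up with world_size (4.0 on 4 GPUs)
    print(f"rank {rank}: {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If this small test also hangs, the NCCL_P2P_DISABLE=1 suggestion above can be tried with it as well.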