I’m trying to train LoFTR on 4 RTX 3090 GPUs on Ubuntu 18.04. When I start training, the output gets stuck on “initializing ddp: GLOBAL_RANK” and the terminal freezes (Ctrl + C won’t work anymore).
I saw that others had that problem with certain Pytorch / Pytorch Lightning versions, however i’m using pytorch-lightning==1.3.5 and pytorch=1.8.1, which noone else seemed to have problems with. Also, the authors trained LoFTR with the same environment, so I think the problem has to be somewhere else.
Does anyone have an idea what else apart from the Pytorch / Pytorch Lightning versions could be the problem?
Thanks!
P.S. I ran the cuda p2pBandwithLatencyTest and it shows that GPUs can’t access each other. However I’m not sure if this is needed in this case or if they can just pass the information via CPU.