YOLOX multi-node training hangs after GPU initialization

Hi, I understand this issue belongs in the YOLOX repo, but the project no longer seems to be active and no one is answering there, so I decided to try my luck here.

I am trying to train YOLOX on 2 nodes, each with 8 GPUs. Both servers can reach each other over SSH. I am following the multi-node training instructions from the YOLOX repo, but after I start the multi-node script it initializes the GPUs and then hangs without making any progress (PyTorch DDP). Distributed training with Horovod works fine.
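
A minimal sketch of a check that takes YOLOX out of the picture and only exercises plain PyTorch DDP over NCCL, assuming the processes are started with torchrun on both nodes. The file name ddp_check.py, the helper expected_sum, and the port used in the launch command below are placeholders, not anything from the YOLOX repo.

```python
# ddp_check.py (hypothetical file name): one NCCL all-reduce across all ranks.
import os


def expected_sum(world_size: int) -> int:
    """Sum of ranks 0..world_size-1, the value every rank should hold after all_reduce."""
    return world_size * (world_size - 1) // 2


def main() -> None:
    # Imported here so the helper above stays usable on a machine without torch.
    import torch
    import torch.distributed as dist

    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes its own rank number. If this call never returns,
    # the problem is in inter-node NCCL communication rather than in YOLOX.
    value = torch.tensor([rank], device="cuda")
    dist.all_reduce(value, op=dist.ReduceOp.SUM)

    print(f"rank {rank}/{world_size}: got {value.item()}, "
          f"expected {expected_sum(world_size)}")
    dist.destroy_process_group()


# Only run the check when launched by torchrun (which sets LOCAL_RANK).
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

It would be launched on the first node with something like `torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 --master_addr=<node0-ip> --master_port=29500 ddp_check.py`, and with `--node_rank=1` on the second node. Setting `NCCL_DEBUG=INFO` prints which network interface NCCL selects, and `NCCL_SOCKET_IFNAME` can pin it to a specific interface if the wrong one is picked.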


python 3.8.10
torch 1.13.1+cu116
torchvision 0.14.1+cu116
cuda 11.6