Training hangs in DDP mode


I’m currently training a model with custom ops, and I found that training hangs randomly in DDP mode at loss.backward(), while computing the backward gradient of the custom ops. After a few experiments, I now suspect there may be an issue with the synchronisation of the model during backpropagation.
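For context, the shape of my setup is roughly the following. This is a minimal sketch, not my real op: `MyCustomOp` is a hypothetical stand-in for the actual custom autograd Function whose backward is where the hang occurs.

```python
import torch
from torch.autograd import Function

class MyCustomOp(Function):
    """Hypothetical stand-in for the real custom op."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        # In my real training run, the hang happens while DDP's Reducer
        # is waiting on gradients produced by a backward like this one.
        (x,) = ctx.saved_tensors
        return grad_out * 2

# Standalone forward/backward works fine; the hang only shows up under DDP.
x = torch.tensor(3.0, requires_grad=True)
y = MyCustomOp.apply(x)
y.backward()
print(x.grad)  # tensor(2.)
```

The model wrapping this op is then wrapped in torch.nn.parallel.DistributedDataParallel as usual.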
On the official PyTorch DDP page I saw that DDP requires Reducer instances on all processes to invoke allreduce in exactly the same order, which is done by always running allreduce in bucket index order rather than the actual bucket ready order, and that a mismatched allreduce order across processes can lead to wrong results or a DDP backward hang. I wonder if I could get some help from the community on how to inspect the bucket indices and the allreduce order.
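What I have tried so far to get visibility into the Reducer is the sketch below, based on the documented debug environment variables; note that `_get_ddp_logging_data()` is a private, undocumented API, so its presence and the exact keys it returns may vary across PyTorch versions.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# These must be set before init_process_group. DETAIL enables extra
# collective desync checks; INFO-level C++ logs surface Reducer/bucket info.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Single-process gloo group just to demonstrate; the real run uses multiple ranks.
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 4))
loss = model(torch.randn(2, 8)).sum()
loss.backward()

# Private API: dumps Reducer statistics (bucket sizes, rebuild counts, ...).
stats = model._get_ddp_logging_data()
print({k: v for k, v in stats.items() if "bucket" in k})

dist.destroy_process_group()
```

With TORCH_DISTRIBUTED_DEBUG=DETAIL the desync checker should, in principle, report which collective each rank is stuck on instead of hanging silently, but I’d like to confirm that this is the right way to see the bucket index and allreduce order, or whether there is a better hook.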