Combining torchrun and torch.multiprocessing

Hi all,

I’ve worked myself into a bit of a pickle that I was hoping someone here might be able to help me out of. I have training code that requires DDP through torchrun. At inference time, I had a pipeline built on DataParallel, but it turned out to be incredibly brittle, with repeated and consistent CUDA blocking errors across two different systems. I rebuilt the inference code with torch.multiprocessing, which works fine on its own. However, calling a function that spawns a torch.multiprocessing worker group from within a torchrun launch (even when calling it from local_rank = 0 only) hangs after attempting to start the first process. I presume this is because two different launchers end up trying to set and access local_rank on the backend.
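For reference, the inference side follows the standard per-rank worker pattern. The sketch below uses the stdlib multiprocessing module (which torch.multiprocessing wraps) and a CPU placeholder in place of the model, purely to show the structure; the worker and function names are just illustrative of my setup. My real code uses the "spawn" start method, as CUDA requires.

```python
import multiprocessing as mp

def _inference_worker(rank, world_size, inputs, results):
    # Each rank processes its shard of the inputs. In my real code this
    # pins a CUDA device per rank and runs the model; the doubling below
    # is just a placeholder so the sketch runs anywhere.
    shard = inputs[rank::world_size]
    results.put((rank, [x * 2 for x in shard]))

def run_inference(inputs, world_size=2):
    # My real code uses the "spawn" start method (required for CUDA);
    # "fork" here only keeps the sketch runnable as a plain snippet.
    ctx = mp.get_context("fork")
    results = ctx.Queue()
    procs = [
        ctx.Process(target=_inference_worker,
                    args=(rank, world_size, inputs, results))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    # Drain the queue before join() so workers can flush and exit.
    gathered = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    # Reassemble outputs in the original input order.
    out = [None] * len(inputs)
    for rank, shard_out in gathered.items():
        out[rank::world_size] = shard_out
    return out

if __name__ == "__main__":
    print(run_inference([1, 2, 3, 4, 5]))  # [2, 4, 6, 8, 10]
```

Called on its own this works; it is only inside a torchrun launch that the first worker hangs on start.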

I can call either function independently, but I’d like to wrap everything together into a single piece of code, and I was wondering whether anybody had any suggestions.
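Concretely, the way I gate the inference call on local rank 0 looks roughly like this (run_fn stands in for my mp-based inference entry point; the helper name is just illustrative):

```python
import os

def run_if_rank_zero(run_fn):
    # torchrun sets LOCAL_RANK in the environment of every worker it
    # launches; I only invoke the mp-based inference entry point from
    # local rank 0. Even with this guard, the spawn inside run_fn
    # hangs when invoked under torchrun.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank == 0:
        return run_fn()
    return None
```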

Hopefully this is an interesting enough question for those with knowledge of torchrun and torch.multiprocessing. Thanks in advance!

This is a good question for @ptrblck