Hi,
I’m trying to understand synchronization options when using torchrun.
multiprocessing.Lock and multiprocessing.Condition rely on process inheritance (fork / mp.spawn), but torchrun launches ranks via spawn + exec, so these primitives don’t seem to work across ranks.
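For context, here is a minimal sketch of the inheritance-based pattern I mean. It only works because the parent creates the Condition and starts the children itself, which is exactly what I can't do under torchrun. The names worker, run_workers, and ready are made up for illustration, and I'm forcing the "fork" start method (so this assumes a POSIX system) to make the inheritance explicit.

```python
import multiprocessing as mp


def worker(cond, ready, results, rank):
    # Block until the parent flips the shared flag and notifies.
    # wait_for checks the predicate first, so a notify that happened
    # before we got here is not lost.
    with cond:
        cond.wait_for(lambda: ready.value == 1, timeout=10)
    results.put(rank)


def run_workers(n=2):
    # Assumption: "fork" is available (POSIX only). The children receive
    # the Condition because the parent created it and started them.
    ctx = mp.get_context("fork")
    cond = ctx.Condition()
    ready = ctx.Value("i", 0)
    results = ctx.Queue()

    procs = [
        ctx.Process(target=worker, args=(cond, ready, results, rank))
        for rank in range(n)
    ]
    for p in procs:
        p.start()

    # Wake every waiting child.
    with cond:
        ready.value = 1
        cond.notify_all()

    # Drain the queue before joining so no child blocks on a full pipe.
    ranks = sorted(results.get(timeout=10) for _ in range(n))
    for p in procs:
        p.join()
    return ranks
```

Under torchrun each rank is started by the launcher rather than by my own parent process, so there is no point at which I can hand cond to the ranks like this.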
Questions:
- Is it fundamentally unsupported to use multiprocessing.Lock/Condition with torchrun?
- Would creating a multiprocessing.Manager before torchrun and having all ranks connect to it be considered supported or recommended?
- What is the intended torch-native replacement for Condition-like (wait/notify) semantics in torchrun jobs?
Thanks!