How do I launch multi-node training via torchrun when my script contains relative imports? Earlier, my script spawned workers itself, so I could bypass torch.distributed/torchrun and launch training like this:
python -m parent.train
But now I want to use torchrun.
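For what it's worth, torchrun accepts a `-m`/`--module` flag that mirrors `python -m`, so a package entry point with relative imports can usually be kept as-is. A rough sketch (the rendezvous endpoint `$MASTER_ADDR` is a placeholder you would set to your head node's address; adjust node/GPU counts to your cluster):

```shell
# Run from the directory that contains the `parent` package,
# on each participating node.
torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:29500" \
  -m parent.train
```

The key point is `-m parent.train` instead of a path to `train.py`: launching by module keeps the package context, so relative imports inside `parent/train.py` still resolve.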
Are you seeing torchrun fail specifically because of the relative imports? Could you post a simple repro script and paste the error output?
Hi @kimihailv,
Have you tried using absolute imports instead of relative ones? That's general practice when running on HPCs.