Hi all, I’m working on a single-node, distributed object detection codebase where we currently launch a separate training script for each training run using torchrun. The “classic”
torchrun --nproc-per-node=<num_gpus> /my/training/script.py
setup.
I’m wondering if it is possible/practical to instead run this as part of a larger Python service. What we’d like to do is take the functionality that exists in script.py, move it into a function, call it script(), and then have a service that can spin off a training subprocess which launches one worker process per GPU and runs that script() function instead of a script.py file.
This would let us chain training and evaluation processes together more easily, and stack multiple different types of training runs when it makes sense for an experiment.
Follow-up: we were able to come up with a solution to this. The basic idea is to bypass the run function used by torchrun and call the elastic launcher directly. Below is the key bit we had to come up with:
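For reference, here is a minimal sketch of that approach. The helper name launch_training and the rendezvous endpoint/run-id values are our own choices; it assumes a recent PyTorch where torch.distributed.launcher.api exposes LaunchConfig and elastic_launch:

```python
import torch
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def launch_training(entrypoint, *args):
    """Run `entrypoint` once per GPU, the way torchrun would run a script.

    Hypothetical helper: config values below are illustrative, not required.
    """
    nproc = max(1, torch.cuda.device_count())
    config = LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=nproc,
        run_id="local_training_run",        # arbitrary id for this job
        rdzv_backend="c10d",                # same default rendezvous torchrun uses
        rdzv_endpoint="localhost:29400",
        max_restarts=0,
    )
    # elastic_launch(config, fn) returns a callable; invoking it spawns the
    # workers and returns a dict mapping global rank -> fn's return value.
    return elastic_launch(config, entrypoint)(*args)
```

A service can then call launch_training(script, ...) from a subprocess, and chain another call (e.g. an evaluation entrypoint) once the dict of per-rank results comes back.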
The training entrypoint function (not shown here) is then our standard training loop. That function would have been the script.py file if we were using torchrun.
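For completeness, a sketch of what such an entrypoint can look like. The elastic launcher exports the same environment variables torchrun does (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), so the function body is identical to a torchrun script; the model and loop here are placeholders:

```python
import os

import torch
import torch.distributed as dist


def train(num_epochs: int = 1) -> int:
    """Per-worker entrypoint; one copy runs in each spawned process."""
    # torchelastic sets these for every worker, exactly as torchrun would.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)

    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(10, 2).to(device)  # placeholder for the real detector
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)

    # ... optimizer, DataLoader with DistributedSampler, the actual loop ...

    dist.destroy_process_group()
    return rank
```

Returning a value (here the rank) is optional, but it is what shows up in the per-rank results dict on the launching side.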