I have 50 completely independent models that I want to train in parallel on 8 GPUs. Each training run lives in a script that I launch like:
python training_script.py device_num
The simple way to do this is
for group in groups:
    processes = [subprocess.Popen(f'python training_script.py {device}'.split()) for device in range(8)]
    [p.wait() for p in processes]
where groups is the 50 training runs split into groups of 8.
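(For concreteness, groups here is just the run indices chunked into eights, something along these lines — the names are placeholders:)

```python
# 50 independent training runs, split into chunks of at most 8
runs = list(range(50))
groups = [runs[i:i + 8] for i in range(0, len(runs), 8)]
# 7 groups total; the last group only has 2 runs
```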
The downside of this is that some models take longer to train than others, so every model in a group has to finish before the next group can start, which leaves GPUs sitting idle.
I was hoping to do something like multiprocessing's spawn, but I need each finishing process to report its device number so it's clear which device is free for the next run. I tried using Queue and Process from multiprocessing, but I can't get more than one process to run at once.
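To be concrete, this is the shape of solution I'm imagining: a queue holding the free device IDs, and a pool of 8 workers that each grab a device, run one job, and put the device back. Since the heavy work would happen in a subprocess anyway, threads should be enough. This is only a sketch — the constants are placeholders, and a short sleep stands in for the real subprocess call:

```python
import queue
import random
import time
from concurrent.futures import ThreadPoolExecutor

NUM_DEVICES = 8   # placeholder: number of GPUs
NUM_MODELS = 50   # placeholder: number of independent training runs

# Queue of free GPU ids; a worker blocks on .get() until one is available.
free_devices = queue.Queue()
for d in range(NUM_DEVICES):
    free_devices.put(d)

def train_one(model_idx):
    device = free_devices.get()  # blocks until some GPU frees up
    try:
        # The real version would launch the training subprocess, e.g.:
        # subprocess.run(['python', 'training_script.py', str(device)], check=True)
        time.sleep(random.uniform(0.001, 0.01))  # stand-in for training time
        return model_idx, device
    finally:
        free_devices.put(device)  # hand the GPU back as soon as this run ends

# At most 8 jobs run at once; as soon as one finishes, the next starts
# on whichever device was just returned to the queue.
with ThreadPoolExecutor(max_workers=NUM_DEVICES) as pool:
    results = list(pool.map(train_one, range(NUM_MODELS)))
```

Because each worker only releases its device after its job finishes, no device is ever used by two jobs at the same time, and a slow model on one GPU never blocks the other seven.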
Any help would be greatly appreciated. Thanks!