Training multiple independent models at once

I have 50 completely independent models that I want to train in parallel on 8 GPUs. The training for a single model is wrapped in a script that I run like

python training_script.py device_num

The simple way to do this is

import subprocess

for group in groups:
    processes = [subprocess.Popen(f'python training_script.py {device}'.split()) for device in range(8)]
    [p.wait() for p in processes]

where groups is the list of 50 training runs split into groups of 8.

The downside of this is that some models take longer to train than others, so every model in a group has to finish before the next group starts, and GPUs sit idle in the meantime.

I was hoping to do something like torch.multiprocessing.spawn, but I need each finished process to return its device number so it is clear which device is free for the next run. I tried using Queue and Process from multiprocessing, but I can't get more than one process to run at once.
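Roughly, this is the structure I'm after (an untested sketch, still launching the same python training_script.py {device} command): keep a queue of free device ids, and have each worker put its device id back on the queue when its run finishes, so the next model can start immediately on that device.

import subprocess
from multiprocessing import Process, Queue

def run_one(device, free_devices):
    # run one training job, then hand the device back
    subprocess.run(f'python training_script.py {device}'.split())
    free_devices.put(device)

if __name__ == '__main__':
    free_devices = Queue()
    for device in range(8):          # all 8 devices start out free
        free_devices.put(device)

    workers = []
    for _ in range(50):              # 50 independent models
        device = free_devices.get()  # blocks until a device is free
        p = Process(target=run_one, args=(device, free_devices))
        p.start()
        workers.append(p)

    for p in workers:
        p.join()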

Any help would be very much appreciated. Thanks.


Was this issue ever solved? I'm having a similar problem, but I'm using torch.multiprocessing.Pool and pass in the id of the accelerator device, similar to your setup.
The problem now is that I have to use num_workers=0 for the DataLoader.
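Roughly, my setup looks like this (a simplified sketch; the dataset, batch size, and training step are placeholders). If I set num_workers above 0, the workers crash, since the Pool processes are daemonic and are not allowed to spawn child processes of their own:

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

def train_one(args):
    model_idx, device_id = args
    device = torch.device(f"cuda:{device_id}")
    dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))
    # num_workers has to be 0 here: the Pool workers are daemonic
    # processes and cannot spawn the DataLoader's own worker processes
    loader = DataLoader(dataset, batch_size=32, num_workers=0)
    for inputs, targets in loader:
        pass  # actual training step on `device` goes here

if __name__ == "__main__":
    jobs = [(i, i % 8) for i in range(8)]  # one model per device
    with mp.Pool(processes=8) as pool:
        pool.map(train_one, jobs)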