Suppose we want to train 50 models independently. Even with access to an online GPU cluster, you can probably only submit, say, 10 jobs at a time. I want to figure out whether it is possible to launch all 50 training runs from a single script via multiprocessing and train them concurrently. The models are completely independent, so there is no information exchange between them.
I have read some posts discussing this topic, and some people seem to believe it is simply impossible, because only the CPU does multiprocessing while a GPU can only train one model at a time. I still tried two approaches, but both behave strangely. Both approaches try to train the 50 models on GPUs cuda:0 to cuda:3.
- The first method I tried is the built-in torch.multiprocessing. Suppose there is a command-line argument args.num_gpu recording the number of currently available GPUs:
import torch
import torch.multiprocessing as mp

num_gpu = args.num_gpu
# id_list: a list of 50 elements, each recording the id of the model we want to train
DEVICE = 'cuda:{}'
processes = []
mp.set_start_method("spawn")   # run under `if __name__ == "__main__":` so spawned children can safely re-import this module
for pid, id_info in enumerate(id_list):
    gpu_id = pid % num_gpu     # round-robin assignment of models to GPUs
    process = mp.Process(target=run,
                         args=(cfg, args, id_info, torch.device(DEVICE.format(gpu_id))))
    process.start()
    processes.append(process)
for process in processes:
    process.join()             # wait for all training processes to finish
The run function takes the assigned GPU and the model information, initializes the model, optimizer, loss function, dataset, dataloader, etc., and trains the model for a number of epochs.
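For context, run is roughly the sketch below. This is a simplified version: SmallNet, make_loader, cfg.lr, and cfg.num_epochs are placeholders rather than my actual code; the point is that the model and every batch are moved to the device that is passed in.

# Simplified sketch of run; SmallNet and make_loader stand in for my real model and data pipeline.
def run(cfg, args, id_info, device):
    model = SmallNet(cfg).to(device)              # move this model to its assigned GPU
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg.lr)
    criterion = torch.nn.MSELoss()
    loader = make_loader(cfg, id_info)            # per-model dataset and dataloader
    for epoch in range(cfg.num_epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)     # move each batch to the same GPU
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()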
What I observe is that some models train fine, some training processes never get launched, and some crash in the forward pass with an error saying two tensors are on different devices, cuda:0 and cuda:X, where X can be any of the GPUs assigned in this run. The overall training behavior is messy and unpredictable.
- The next method I tried is joblib:
from joblib import Parallel, delayed, parallel_backend

with parallel_backend('loky', n_jobs=-1):
    parallel = Parallel(n_jobs=-1)
    print(f"Number of jobs running: {parallel.n_jobs}")
    n_jobs = parallel.n_jobs
    parallel(
        delayed(run)(cfg, args, id_info, torch.device(DEVICE.format(pid % num_gpu)))
        for pid, id_info in enumerate(id_list)
    )
If I set the number of GPUs to 1, it only trains one model. If I increase the number of GPUs, it trains that many models simultaneously, but during the forward pass I again get an error saying the tensors are on two different devices, and the reported devices vary from run to run: the first time it might report cuda:0 and cuda:1, and after rerunning the script it might report cuda:0 and cuda:3.
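One thing I am considering but have not tested yet: pinning the current CUDA device at the top of run, so that any tensor created without an explicit device (e.g. via a bare .cuda() call) lands on the assigned GPU instead of defaulting to cuda:0. A minimal sketch:

import torch

def run(cfg, args, id_info, device):
    # Make the assigned GPU this process's current CUDA device, so tensors created
    # without an explicit device argument do not silently fall back to cuda:0.
    torch.cuda.set_device(device)
    # ... model / optimizer / dataloader setup and training loop as sketched above ...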
So my question is: does this idea of training a large number of models simultaneously make sense at all? Or, as the discussions I read suggest, do PyTorch and the GPU simply not support it?