I’m trying to implement the following paper, Population Based Training, for a simple CIFAR classifier.
As part of this I need to train multiple models, with different hyperparameters, in parallel (they will be fed the same data). Each of these models would then update a global dict with its validation accuracy as well as its parameters. These models would then periodically and asynchronously explore and exploit using this dict.
Is there any way to simultaneously train different models (models with different hyperparameters), each on a separate GPU, in parallel? Additionally how can they be made to update a global list or dictionary? Communication doesn’t have to be synchronous.
I came across this post on the forum but the implementation seems to be sequential rather than parallel. https://discuss.pytorch.org/t/model-parallelism-in-pytorch-for-large-r-than-1-gpu-models/778/2
Any help would be greatly appreciated.
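For context, the exploit/explore step mentioned above can be sketched roughly as follows. This is a minimal sketch of truncation selection with multiplicative perturbation, not a definitive implementation. The function name `exploit_and_explore`, the layout of the `population` dict, and the default values for `frac` and `factors` are assumptions made for illustration.

```python
import copy
import random


def exploit_and_explore(population, member_id, rng, frac=0.2, factors=(0.8, 1.2)):
    """Possibly replace one member's weights/hyperparameters with a better one's.

    `population` is a hypothetical shared dict:
        member_id -> {"score": float, "hparams": dict, "weights": object}
    Returns True if the member was overwritten, False otherwise.
    """
    # Rank members from best to worst validation score.
    ranked = sorted(population, key=lambda k: population[k]["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]

    # Only members in the bottom fraction (and not also in the top) are replaced.
    if member_id not in bottom or member_id in top:
        return False

    donor = population[rng.choice(top)]
    member = population[member_id]

    # Exploit: copy the donor's weights.
    member["weights"] = copy.deepcopy(donor["weights"])
    # Explore: perturb each of the donor's hyperparameters by a random factor.
    member["hparams"] = {
        name: value * rng.choice(factors) for name, value in donor["hparams"].items()
    }
    return True
```

Each worker would call this every few epochs with its own `member_id`, after writing its latest score into the shared dict.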
It seems that using multiprocessing should work for you. Is there any problem with it?
multiprocessing will make it a bit hard to share a global dict (it is possible using managers, but it’s not very intuitive). You can try spawning a few threads first, and see if you are bottlenecked by Python’s GIL.
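To make the thread suggestion concrete, here is a minimal sketch of a few threads writing into one shared dict. The names `train_worker`, `run_threads`, and `results` are hypothetical, and the accuracy update is a placeholder standing in for a real training and validation step.

```python
import threading

# Shared state: threads live in one process, so a plain dict plus a lock is enough.
results = {}
results_lock = threading.Lock()


def train_worker(worker_id, hparams, n_steps=3):
    """Pretend to train for n_steps and publish the latest result after each step."""
    acc = 0.0
    for step in range(n_steps):
        # Placeholder for a real training + validation step (assumption).
        acc = min(1.0, acc + hparams["lr"])
        with results_lock:
            results[worker_id] = {
                "val_acc": acc,
                "hparams": dict(hparams),
                "step": step,
            }


def run_threads(configs):
    """Start one thread per hyperparameter config and wait for all of them."""
    threads = [
        threading.Thread(target=train_worker, args=(i, hp))
        for i, hp in enumerate(configs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with results_lock:
        return dict(results)
```

If the training loop spends most of its time inside GPU kernels, the GIL may not be the bottleneck; if it is, the same worker function can be moved to processes with a manager dict instead.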
Thanks! It turns out multiprocessing seems to be what I’m looking for. I checked out a few tutorials on managers and will give it a try.
I have a question, though: would I need to specify GPUs, or does the multiprocessing module automatically take care of distributing processes across them? Also, can I use the Pool class? Will PyTorch throw an error if I try to map to too many processes?
You will need to write your own function that runs the experiments. Just make sure that they each use a different GPU and it should be fine.
I’m reconsidering my strategy after coming across this though:
For now I’ll focus on multiprocessing without GPUs.
I thought you had multiple GPUs, one for each model. Am I wrong?
Sorry, my bad. I didn’t read the link properly.
Yes, I do have 1 GPU per model. I’m still confused about how to use multiprocessing with GPUs though. The links I’ve seen use something like:
model = torch.nn.DataParallel(model, device_ids=[0,1]).cuda()
replicas = nn.parallel.replicate(module, device_ids)
outputs = nn.parallel.parallel_apply(replicas, inputs)
However, what I want to do is run training functions in parallel and change the hyperparameters. Could you tell me how we can assign GPUs to functions and run them in parallel with multiprocessing?
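One way to do this is to start one process per model, pass each process its GPU index, and let them all write into a manager dict. Below is a minimal sketch under some assumptions: the names `train_worker` and `launch` are hypothetical, the training itself is replaced by a placeholder, and the GPU is selected by setting `CUDA_VISIBLE_DEVICES` inside the child before CUDA is initialised there.

```python
import multiprocessing as mp
import os


def train_worker(gpu_id, hparams, shared):
    """Train one model on one GPU and publish its result in the shared dict."""
    # Make only this GPU visible to the process. This has to happen before
    # CUDA is initialised in this process, i.e. before the first torch.cuda call.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

    # In a real worker (assumption, not run here):
    #   import torch
    #   device = torch.device("cuda:0")  # the single visible GPU
    #   ... build the model with `hparams`, train, validate ...
    val_acc = 1.0 - hparams["lr"]  # placeholder for a real validation accuracy

    shared[gpu_id] = {"val_acc": val_acc, "hparams": dict(hparams)}


def launch(configs):
    """Run one worker process per config, using GPU index i for config i."""
    # "spawn" gives each child a fresh interpreter, which avoids inheriting
    # CUDA state from the parent process.
    ctx = mp.get_context("spawn")
    with ctx.Manager() as manager:
        shared = manager.dict()
        procs = [
            ctx.Process(target=train_worker, args=(gpu_id, hp, shared))
            for gpu_id, hp in enumerate(configs)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Copy into a plain dict before the manager shuts down.
        return dict(shared)
```

Call `launch([{"lr": 0.1}, {"lr": 0.01}])` from under an `if __name__ == "__main__":` guard, which the spawn start method requires. Instead of the environment variable, each worker could also move its model and data to `torch.device(f"cuda:{gpu_id}")` explicitly. `DataParallel` is not needed here, since it splits one model's batch across GPUs rather than running separate models.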