It should work! You have to make sure the Variables/Tensors are located on the right GPU.
Could you explain a bit more about your use case?
Are you merging the outputs somehow or are the models completely independent from each other?
Hi ptrblck, thanks for your reply. The models are completely independent from each other, but in some training steps they transfer information between each other, so I need to train these models simultaneously. By the way, if I want to train all the models simultaneously, how should I write the code? Currently my code looks like the following, but I guess the models are trained sequentially:
model1 = model1.cuda(0)
model2 = model2.cuda(1)
models = [model1, model2]
optimizers = [optimizer1, optimizer2]  # one optimizer per model
devices = ['cuda:0', 'cuda:1']
for (input, label) in data_loader:
    for m, optimizer, device in zip(models, optimizers, devices):
        m.train()
        optimizer.zero_grad()
        output = m(input.to(device))
        loss = criterion(output, label.to(device))
        loss.backward()
        optimizer.step()
I think in your current implementation you would indeed have to wait until the optimization was done on each GPU.
If you just have two models, you could push each input and target tensor to the appropriate GPU and call the forward passes after each other.
Since these calls are performed asynchronously, you could achieve a speedup in this way.
The code should look like this:
input1 = input.to('cuda:0')
input2 = input.to('cuda:1')
# same for label
optimizer1.zero_grad()
optimizer2.zero_grad()
output1 = model1(input1)  # should be an async call
output2 = model2(input2)
...
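Filling in the rest of the iteration, a rough self-contained sketch could look like the following (the toy models, optimizers, criterion, and data are just placeholders, not your actual setup):

import torch
import torch.nn as nn

# placeholder setup so the snippet runs on its own (needs two GPUs)
model1 = nn.Linear(10, 2).to('cuda:0')
model2 = nn.Linear(10, 2).to('cuda:1')
optimizer1 = torch.optim.SGD(model1.parameters(), lr=0.1)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
data_loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(5)]

for input, label in data_loader:
    input1, label1 = input.to('cuda:0'), label.to('cuda:0')
    input2, label2 = input.to('cuda:1'), label.to('cuda:1')

    optimizer1.zero_grad()
    optimizer2.zero_grad()

    # CUDA kernels are launched asynchronously, so the second forward pass
    # can be queued before the first one has finished on its GPU
    output1 = model1(input1)
    output2 = model2(input2)

    loss1 = criterion(output1, label1)
    loss2 = criterion(output2, label2)

    loss1.backward()
    loss2.backward()

    optimizer1.step()
    optimizer2.step()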
Unfortunately I cannot test it at the moment. Would you run it and check if it’s suitable for your use case?
Hi!
I am still interested in the topic. I am very new to PyTorch and would like to perform parallel training of different models on different GPUs (i.e. one model per GPU), either for hyperparameter search or simply to get results for different weight initializations. I know there is a lot of documentation on multiprocessing and existing frameworks for hyperparameter tuning, which I have already checked, but I only have a limited amount of time and am thus looking for the very simplest way to achieve this. Any help would be extremely appreciated, thank you for your attention.
Hi, in the end I used the library mpi4py to implement this. With MPI, you can assign each rank to train one model on one GPU. MPI also supports communication across ranks, with which you can implement some special operations.
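For reference, a minimal sketch of what this looks like with mpi4py (the toy model, data, and the loss gathering are just placeholders; launch it with something like mpiexec -n 2 python your_script.py):

from mpi4py import MPI
import torch
import torch.nn as nn

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # one rank per model/GPU

device = torch.device(f'cuda:{rank}')
model = nn.Linear(10, 2).to(device)  # placeholder model for this rank
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for step in range(10):
    input = torch.randn(8, 10, device=device)      # placeholder data
    label = torch.randint(0, 2, (8,), device=device)

    optimizer.zero_grad()
    loss = criterion(model(input), label)
    loss.backward()
    optimizer.step()

    # example of communication across ranks: gather the losses on rank 0
    losses = comm.gather(loss.item(), root=0)
    if rank == 0:
        print(f'step {step}: mean loss {sum(losses) / len(losses):.4f}')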
MPI is not necessary here; the torch.distributed package now provides MPI-style and RPC-style distributed APIs. Moreover, it supports gloo, mpi, and nccl backends (MPI style only), so if you don't want more hassle, they should be sufficient.
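For example, a minimal one-process-per-GPU sketch with torch.multiprocessing and the gloo backend (the toy model, data, master address/port, and the all_reduce at the end are just illustrative assumptions) could look like this:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def train(rank, world_size):
    # placeholder rendezvous settings for a single machine
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    device = torch.device(f'cuda:{rank}')
    model = nn.Linear(10, 2).to(device)  # placeholder model for this rank
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    for step in range(10):
        input = torch.randn(8, 10, device=device)      # placeholder data
        label = torch.randint(0, 2, (8,), device=device)

        optimizer.zero_grad()
        loss = criterion(model(input), label)
        loss.backward()
        optimizer.step()

        # MPI-style communication, e.g. averaging the loss across ranks
        loss_cpu = loss.detach().cpu()
        dist.all_reduce(loss_cpu, op=dist.ReduceOp.SUM)
        if rank == 0:
            print(f'step {step}: mean loss {(loss_cpu / world_size).item():.4f}')

    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(train, args=(2,), nprocs=2)  # one process per GPU, two GPUs assumed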