Train multiple models on multiple GPUs

I think in your current implementation you would indeed have to wait until the optimization was done on each GPU.
If you just have two models, you could push each input and target tensor to the appropriate GPU and call the forward passes one after the other.
Since CUDA operations are launched asynchronously, the second forward pass can be queued while the first GPU is still working, so you could achieve a speedup in this way.
The code should look like this:

input1 = input.to('cuda:0')
input2 = input.to('cuda:1')
# same for label
optimizer1.zero_grad()
optimizer2.zero_grad()

output1 = model1(input1) # CUDA ops are launched asynchronously
output2 = model2(input2) # so this call doesn't wait for the first one to finish
...

Unfortunately I cannot test it at the moment. Would you run it and check if it’s suitable for your use case?
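For reference, here is a minimal self-contained sketch of the same idea; the models, loss, data, and hyperparameters are just placeholders for illustration, so swap in your own:

# Minimal sketch: two independent models trained on two GPUs in one step.
# The Linear models, SGD settings, and random data are placeholders.
import torch
import torch.nn as nn

model1 = nn.Linear(10, 2).to('cuda:0')
model2 = nn.Linear(10, 2).to('cuda:1')

optimizer1 = torch.optim.SGD(model1.parameters(), lr=0.1)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)

criterion = nn.CrossEntropyLoss()

input = torch.randn(8, 10)
label = torch.randint(0, 2, (8,))

input1, label1 = input.to('cuda:0'), label.to('cuda:0')
input2, label2 = input.to('cuda:1'), label.to('cuda:1')

optimizer1.zero_grad()
optimizer2.zero_grad()

# CUDA kernels are launched asynchronously, so the second forward pass
# can be queued while the first GPU is still busy.
output1 = model1(input1)
output2 = model2(input2)

loss1 = criterion(output1, label1)
loss2 = criterion(output2, label2)

loss1.backward()
loss2.backward()

optimizer1.step()
optimizer2.step()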