How to implement model parallelism?

In my program, I have to build two different models for training. However, this immediately overflows my CUDA memory, so I want to train the two models on different GPUs. My final loss consists of two parts: one part is independent and the other is joint.

Does anybody have some suggestions?


Could you explain how your independent and joint losses are created?
You could use something like this as a starter:

import torch
import torch.nn as nn
import torch.optim as optim

modelA = nn.Linear(10, 2).to('cuda:0')
modelB = nn.Linear(10, 2).to('cuda:1')

criterion = nn.CrossEntropyLoss()

optimizerA = optim.SGD(modelA.parameters(), lr=1e-3)
optimizerB = optim.SGD(modelB.parameters(), lr=1e-3)

# loader is assumed to yield (data, target) batches
for data, target in loader:
    # Get losses for the separate models
    data, target = data.to('cuda:0'), target.to('cuda:0')
    output = modelA(data)
    lossA = criterion(output, target)

    data, target = data.to('cuda:1'), target.to('cuda:1')
    output = modelB(data)
    lossB = criterion(output, target)

I’m not sure how you would like to create the joint loss, i.e. just summing lossA and lossB wouldn’t change anything compared to training the two models separately.
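
To make that last point concrete, here is a minimal sketch (my own illustration with made-up toy data, run on CPU, not code from this thread) showing that backpropagating the sum of two independent losses gives the same gradients as backpropagating each loss on its own, because neither model's parameters influence the other loss:

import torch
import torch.nn as nn

modelA = nn.Linear(10, 2)
modelB = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 10)
y = torch.randint(0, 2, (4,))

# Backward through the summed loss ...
lossA = criterion(modelA(x), y)
lossB = criterion(modelB(x), y)
(lossA + lossB).backward()
grad_from_sum = modelA.weight.grad.clone()

# ... matches backward through lossA alone, since modelA does not appear in lossB
modelA.zero_grad()
criterion(modelA(x), y).backward()
print(torch.allclose(grad_from_sum, modelA.weight.grad))  # True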


Thanks in advance.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

modelA = nn.Linear(10, 2).to('cuda:0')
modelB = nn.Linear(10, 2).to('cuda:1')

criterion = nn.CrossEntropyLoss()

optimizerA = optim.SGD(modelA.parameters(), lr=1e-3)
optimizerB = optim.SGD(modelB.parameters(), lr=1e-3)

for (dataA, targetA), (dataB, targetB) in zip(loader_A, loader_B):
    # Get losses for the separate models
    dataA, targetA = dataA.to('cuda:0'), targetA.to('cuda:0')
    outputA = modelA(dataA)
    lossA = criterion(outputA, targetA)

    dataB, targetB = dataB.to('cuda:1'), targetB.to('cuda:1')
    outputB = modelB(dataB)
    lossB = criterion(outputB, targetB)

    # Joint term: cosine similarity between the two outputs, reduced to a scalar
    lossC = F.cosine_similarity(outputA, outputB).mean()

    final_loss = lossA + lossB + lossC

In the above snippet, we aim to eliminate the discrepancy between two independent systems, so we should train them jointly.


In your code snippet you’ll most likely get an error stating some tensors are not on the same device.
Since outputA and outputB are on GPU0 and GPU1, respectively, you should push them to the same device.
Could you try the following:

...
lossC = F.cosine_similarity(outputA, outputB.to('cuda:0')).mean()  # lossC is now on cuda:0
final_loss = lossA + lossB.to('cuda:0') + lossC
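
For completeness, here is a sketch of how the full training step could look with this fix in place. This is my own illustration under the same setup as the snippets above (loader_A / loader_B are the two dataloaders from that code); also note that whether you add the cosine similarity or something like (1 - cosine similarity) depends on whether you want to decrease or increase the agreement between the two outputs, so treat the sign of lossC as an assumption:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

modelA = nn.Linear(10, 2).to('cuda:0')
modelB = nn.Linear(10, 2).to('cuda:1')
criterion = nn.CrossEntropyLoss()
optimizerA = optim.SGD(modelA.parameters(), lr=1e-3)
optimizerB = optim.SGD(modelB.parameters(), lr=1e-3)

# loader_A / loader_B are the two dataloaders from the snippet above
for (dataA, targetA), (dataB, targetB) in zip(loader_A, loader_B):
    optimizerA.zero_grad()
    optimizerB.zero_grad()

    outputA = modelA(dataA.to('cuda:0'))
    lossA = criterion(outputA, targetA.to('cuda:0'))

    outputB = modelB(dataB.to('cuda:1'))
    lossB = criterion(outputB, targetB.to('cuda:1'))

    # Joint term computed on cuda:0; if the goal is to reduce the discrepancy
    # between the two outputs, you may want e.g. (1 - cosine_similarity) instead
    lossC = F.cosine_similarity(outputA, outputB.to('cuda:0')).mean()

    # Sum everything on cuda:0 and backpropagate through both models at once
    final_loss = lossA + lossB.to('cuda:0') + lossC
    final_loss.backward()

    optimizerA.step()
    optimizerB.step()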

Yes, you are right. But lossB has been moved to GPU 0, and I wonder whether that could affect the gradients of modelB in the backward pass. Will modelB's parameters still be updated correctly?

Thanks again for your reply.

The gradients for modelB should stay on GPU1, so it shouldn’t be a problem. You can check this with:

print(modelB.weight.grad.device)
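
To see this end to end, here is a small self-contained check (my own sketch, assuming at least two visible GPUs and using a toy squared-output loss purely for illustration) that the gradient of modelB stays on cuda:1 even though the loss is reduced on cuda:0:

import torch
import torch.nn as nn

modelB = nn.Linear(10, 2).to('cuda:1')
out = modelB(torch.randn(4, 10, device='cuda:1'))

# Reduce the loss on cuda:0, as in the joint loss above; .to() is tracked by autograd
loss = out.to('cuda:0').pow(2).mean()
loss.backward()

print(loss.device)                 # cuda:0
print(modelB.weight.grad.device)   # cuda:1 -- gradients live with the parameters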