How to implement model parallelism?

In my program, I have to build two different models for training. However, this immediately overflows my CUDA memory, so I want to train the two models on different GPUs. My final loss consists of two parts: one part is independent and the other is joint.

Does anybody have some suggestions?


Could you explain how your independent and joint losses are created?
You could use something like this as a starter:

import torch
import torch.nn as nn
import torch.optim as optim

modelA = nn.Linear(10, 2).to('cuda:0')
modelB = nn.Linear(10, 2).to('cuda:1')

criterion = nn.CrossEntropyLoss()

optimizerA = optim.SGD(modelA.parameters(), lr=1e-3)
optimizerB = optim.SGD(modelB.parameters(), lr=1e-3)

# loader is assumed to yield (data, target) batches
for data, target in loader:
    # Get losses for the separate models
    data, target = data.to('cuda:0'), target.to('cuda:0')
    output = modelA(data)
    lossA = criterion(output, target)

    data, target = data.to('cuda:1'), target.to('cuda:1')
    output = modelB(data)
    lossB = criterion(output, target)

I’m not sure how you would like to create the joint loss, i.e. just summing lossA and lossB wouldn’t change anything compared to training the two models separately.
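
To make that last point concrete, here is a minimal sketch (my own illustration with made-up toy data, run on CPU, not code from this thread) showing that backpropagating the sum of two independent losses gives the same gradients as backpropagating each loss on its own, because neither model's parameters influence the other loss:

import torch
import torch.nn as nn

modelA = nn.Linear(10, 2)
modelB = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 10)
y = torch.randint(0, 2, (4,))

# Backward through the summed loss ...
lossA = criterion(modelA(x), y)
lossB = criterion(modelB(x), y)
(lossA + lossB).backward()
grad_from_sum = modelA.weight.grad.clone()

# ... matches backward through lossA alone, since modelA does not appear in lossB
modelA.zero_grad()
criterion(modelA(x), y).backward()
print(torch.allclose(grad_from_sum, modelA.weight.grad))  # True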


Thanks in advance.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

modelA = nn.Linear(10, 2).to('cuda:0')
modelB = nn.Linear(10, 2).to('cuda:1')

criterion = nn.CrossEntropyLoss()

optimizerA = optim.SGD(modelA.parameters(), lr=1e-3)
optimizerB = optim.SGD(modelB.parameters(), lr=1e-3)

for (dataA, targetA), (dataB, targetB) in zip(loader_A, loader_B):
    # Get losses for the separate models
    dataA, targetA = dataA.to('cuda:0'), targetA.to('cuda:0')
    outputA = modelA(dataA)
    lossA = criterion(outputA, targetA)

    dataB, targetB = dataB.to('cuda:1'), targetB.to('cuda:1')
    outputB = modelB(dataB)
    lossB = criterion(outputB, targetB)

    # Joint term: cosine similarity between the two outputs, reduced to a scalar
    lossC = F.cosine_similarity(outputA, outputB).mean()

    final_loss = lossA + lossB + lossC

In the above snippet, we aim to eliminate the discrepancy between two independent systems, so we should train them jointly.


In your code snippet you’ll most likely get an error stating some tensors are not on the same device.
Since outputA and outputB are on GPU0 and GPU1, respectively, you should push them to the same device.
Could you try the following:

...
lossC = F.cosine_similarity(outputA, outputB.to('cuda:0')).mean()  # lossC is now on cuda:0
final_loss = lossA + lossB.to('cuda:0') + lossC
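
For completeness, here is a sketch of how the full training step could look with this fix in place. This is my own illustration under the same setup as the snippets above (loader_A / loader_B are the two dataloaders from that code); also note that whether you add the cosine similarity or something like (1 - cosine similarity) depends on whether you want to decrease or increase the agreement between the two outputs, so treat the sign of lossC as an assumption:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

modelA = nn.Linear(10, 2).to('cuda:0')
modelB = nn.Linear(10, 2).to('cuda:1')
criterion = nn.CrossEntropyLoss()
optimizerA = optim.SGD(modelA.parameters(), lr=1e-3)
optimizerB = optim.SGD(modelB.parameters(), lr=1e-3)

# loader_A / loader_B are the two dataloaders from the snippet above
for (dataA, targetA), (dataB, targetB) in zip(loader_A, loader_B):
    optimizerA.zero_grad()
    optimizerB.zero_grad()

    outputA = modelA(dataA.to('cuda:0'))
    lossA = criterion(outputA, targetA.to('cuda:0'))

    outputB = modelB(dataB.to('cuda:1'))
    lossB = criterion(outputB, targetB.to('cuda:1'))

    # Joint term computed on cuda:0; if the goal is to reduce the discrepancy
    # between the two outputs, you may want e.g. (1 - cosine_similarity) instead
    lossC = F.cosine_similarity(outputA, outputB.to('cuda:0')).mean()

    # Sum everything on cuda:0 and backpropagate through both models at once
    final_loss = lossA + lossB.to('cuda:0') + lossC
    final_loss.backward()

    optimizerA.step()
    optimizerB.step()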

Yes, you are right. But lossB has been moved to GPU 0, and I wonder whether that could affect the gradients of modelB in the backward pass. Will modelB's parameters still be updated correctly?

Thanks again for your reply.

The gradients for modelB should stay on GPU1, so it shouldn’t be a problem. You can check this with:

print(modelB.weight.grad.device)
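
To see this end to end, here is a small self-contained check (my own sketch, assuming at least two visible GPUs and using a toy squared-output loss purely for illustration) that the gradient of modelB stays on cuda:1 even though the loss is reduced on cuda:0:

import torch
import torch.nn as nn

modelB = nn.Linear(10, 2).to('cuda:1')
out = modelB(torch.randn(4, 10, device='cuda:1'))

# Reduce the loss on cuda:0, as in the joint loss above; .to() is tracked by autograd
loss = out.to('cuda:0').pow(2).mean()
loss.backward()

print(loss.device)                 # cuda:0
print(modelB.weight.grad.device)   # cuda:1 -- gradients live with the parameters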