The bottleneck of my training routine is the data augmentation, which is already “sufficiently” optimized. To speed up hyperparameter search, I thought it would be a good idea to train two models simultaneously, each on a different GPU, fed from a single dataloader.
As far as I understand, this could be seen as a form of model parallelism. However, my implementation doesn't work as expected.
Below is a minimal example. After the first epoch, I would expect the two networks' weights to be identical; however, loss1 equals loss2 only in the first iteration. Detaching and cloning the batch before moving it to the graphics cards didn't change anything (see the sketch after the example).
```python
import torch

# SomeModel, SomeLoss and train_loader are placeholders for my actual
# model, loss and dataloader.

# seed before each construction so both models start from identical weights
torch.manual_seed(42)
model1 = SomeModel()
torch.manual_seed(42)
model2 = SomeModel()

dev1 = torch.device("cuda:0")
dev2 = torch.device("cuda:1")

# one optimizer and one loss instance per model
o1 = torch.optim.AdamW(model1.parameters())
o2 = torch.optim.AdamW(model2.parameters())
l1 = SomeLoss()
l2 = SomeLoss()

model1 = model1.to(dev1)
model2 = model2.to(dev2)

for batch in train_loader:
    o1.zero_grad()
    o2.zero_grad()

    # the same batch is sent to both GPUs
    logits1 = model1(batch.to(dev1))
    logits2 = model2(batch.to(dev2))

    loss1 = l1(logits1)
    loss2 = l2(logits2)

    loss1.backward()
    loss2.backward()

    o1.step()
    o2.step()
```
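For completeness, the detach/clone variant I mentioned looked roughly like this (only the loop body changes; this is a sketch, not verbatim what I ran):

```python
for batch in train_loader:
    o1.zero_grad()
    o2.zero_grad()

    # give each model its own copy of the batch, cut off from any shared graph
    batch1 = batch.detach().clone().to(dev1)
    batch2 = batch.detach().clone().to(dev2)

    loss1 = l1(model1(batch1))
    loss2 = l2(model2(batch2))

    loss1.backward()
    loss2.backward()

    o1.step()
    o2.step()
```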
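In case it's useful, this is roughly how I'd check whether the weights themselves diverge after an epoch (a sketch; it just compares the parameters of the two models on the CPU):

```python
# compare the two models parameter by parameter after an epoch
params1 = dict(model1.named_parameters())
params2 = dict(model2.named_parameters())

for name, p1 in params1.items():
    p2 = params2[name]
    # move both to CPU so tensors living on different GPUs can be compared
    a, b = p1.detach().cpu(), p2.detach().cpu()
    if not torch.allclose(a, b):
        print(f"{name}: max abs diff {(a - b).abs().max().item():.3e}")
```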
Do you have any hints as to what's going on? I suspect the computation graph is doing something funny…
The system runs Debian 11.1, PyTorch 1.9.1, and CUDA 11.12.