Training multiple models with one dataloader

Hi,

The bottleneck of my training routine is its data augmentation, which is already “sufficiently” optimized. To speed up hyperparameter search, I thought it’d be a good idea to train two models simultaneously, each on its own GPU, using a single dataloader.

As far as I understand, this could be seen as model parallel. However, my implementation failed.

Below is an example. Since both models start from the same weights and see the same batches, I expect the network weights to be identical after the first epoch. However, loss1 equals loss2 only in the first iteration. Detaching and cloning the batch before moving it to the GPUs didn’t change anything.

    torch.manual_seed(42)
    model1 = SomeModel()
    torch.manual_seed(42)
    model2 = SomeModel()

    dev1 = torch.device("cuda:0")
    dev2 = torch.device("cuda:1")

    o1 = torch.optim.AdamW(model1.parameters())
    o2 = torch.optim.AdamW(model2.parameters())

    l1 = SomeLoss()
    l2 = SomeLoss()

    model1 = model1.to(dev1)
    model2 = model2.to(dev2)

    for batch in train_loader:
        o1.zero_grad()
        o2.zero_grad()
        logits1 = model1(batch.to(dev1))
        logits2 = model2(batch.to(dev2))
        loss1 = l1(logits1)
        loss2 = l2(logits2)
        loss1.backward()
        loss2.backward()
        o1.step()
        o2.step()

Do you have any hints as to what’s going on? I suspect the computation graph is doing funny things…

Edit:
The system runs Debian 11.1, PyTorch 1.9.1 and CUDA 11.12.

Far apart or really close?

I consider them far apart (the weights differ in the second decimal place after a couple of iterations).

You don’t have dropout in your models, right?
This could also be due to numerical precision and nondeterminism; it’s hard to tell with the information at hand. One indication of this would be if you cannot pinpoint where they differ. Otherwise, you could compare after the first iteration and find which forward activations or gradients differ.
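A minimal sketch of such a comparison, assuming both models share the same architecture and parameter names. The helper max_grad_diff and the tiny stand-in networks are hypothetical, not part of the original code; the gradients are copied to the CPU so that models living on different GPUs can be compared directly.

```python
import torch
import torch.nn as nn


def max_grad_diff(model_a: nn.Module, model_b: nn.Module) -> dict:
    """Return the maximum absolute gradient difference per parameter name."""
    diffs = {}
    params_b = dict(model_b.named_parameters())
    for name, p_a in model_a.named_parameters():
        p_b = params_b[name]
        if p_a.grad is None or p_b.grad is None:
            continue  # parameter did not take part in the backward pass
        # Copy both gradients to the CPU so devices do not have to match.
        delta = p_a.grad.detach().cpu() - p_b.grad.detach().cpu()
        diffs[name] = delta.abs().max().item()
    return diffs


# Hypothetical stand-ins for the two identically seeded models.
torch.manual_seed(42)
net_a = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
torch.manual_seed(42)
net_b = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# One forward/backward pass on the same input for both models.
x = torch.randn(8, 4)
net_a(x).sum().backward()
net_b(x).sum().backward()

grad_diffs = max_grad_diff(net_a, net_b)
for name, diff in grad_diffs.items():
    print(f"{name}: {diff:.3e}")
```

Calling it right after the first backward() shows which layer's gradients diverge first; the same idea works for forward activations if you collect them with forward hooks.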

Thanks for your reply!!

There’s no dropout in my models. However, I’ve re-run the code with a model consisting of a single linear layer. Surprisingly, it works.
My model consists of conv, batch/instance norm, ReLU, AdaptiveAveragePooling, MaxPooling and linear layers, including skip connections. It’s essentially a ResNet.

Again, I really appreciate your feedback!

Edit:
Just noticed that the gradients of the input layer are different right from the first iteration. The maximum difference between the gradients of that layer is 8.5e-5.

That likely is numerical precision. You could try to use double (experimentally) and see if the difference gets smaller.
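A minimal sketch of that experiment, assuming a small convolutional model and MNIST-sized inputs (both are hypothetical stand-ins, not the poster's network). Both the parameters and the inputs have to be cast to float64, otherwise the forward pass fails with a dtype mismatch.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Hypothetical stand-in model; .double() casts all parameters and buffers to float64.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1),
    nn.Flatten(),
    nn.Linear(8 * 28 * 28, 10),
).double()

# The inputs must be cast as well; the dataloader usually yields float32.
images = torch.randn(4, 1, 28, 28)
out = model(images.double())

print(out.dtype)
```

If the discrepancy comes purely from rounding, one would expect it to shrink by several orders of magnitude in float64. If it does not, a nondeterministic kernel is a more likely cause.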

Surprisingly, using double precision for models and inputs results in a bigger maximum difference in gradients: 1.3e-4.

So I came up with this minimal example, consisting of just a Conv Layer and a Linear layer. The code exits after the first iteration.

import torch
import torchvision
import torch.nn as nn
import copy

train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST("./", train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=4096, shuffle=True)

class Net(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(1, 8, 3, stride=1, padding=1)
        self.head = nn.Linear(6272, 10)

    def forward(self, image):
        out = self.encoder(image)
        out = out.flatten(start_dim=1)
        print(out.shape)
        return self.head(out)

torch.manual_seed(42)
model1 = Net()
torch.manual_seed(42)
model2 = Net()
assert all([torch.equal(x[1], y[1]) for x, y in zip(model1.state_dict().items(), model2.state_dict().items())])


optim1 = torch.optim.AdamW(model1.parameters())
optim2 = torch.optim.AdamW(model2.parameters())
loss1 = nn.MSELoss()
loss2 = nn.MSELoss()
dev1 = torch.device("cuda:0")
dev2 = torch.device("cuda:1")
cpu = torch.device("cpu")

model1 = model1.to(dev1)
model2 = model2.to(dev2)

model1.train()
model2.train()

for i, (images, targets) in enumerate(train_loader):
    batch1 = copy.deepcopy(images).to(dev1)
    batch2 = copy.deepcopy(images).to(dev2)

    t1 = copy.deepcopy(targets).to(dev1)
    t2 = copy.deepcopy(targets).to(dev2)
    t1 = nn.functional.one_hot(t1, num_classes=10).float()
    t2 = nn.functional.one_hot(t2, num_classes=10).float()

    optim1.zero_grad()
    result1 = model1.forward(batch1)
    l1 = loss1(result1, t1)
    l1.backward()
    optim1.step()

    optim2.zero_grad()
    result2 = model2.forward(batch2)
    l2 = loss2(result2, t2)
    l2.backward()
    optim2.step()

    # Compare CPU copies of the gradients so both models stay on their GPUs.
    grad1 = model1.encoder.weight.grad.cpu()
    grad2 = model2.encoder.weight.grad.cpu()
    if not torch.equal(grad1, grad2):
        print(f"Nope nope nope - Chuck Testa!\n @Iteration {i}")
        break

Yeah, but when I run both nets on the same device, I get an error of about 1e-8, which seems to be within numerical precision.
When disabling cuDNN, the error goes to 0, but I don’t know exactly what causes the difference.


Yeah, that’s it! Thanks a lot for checking it out.

I experimented with different settings of cudnn and found that calling

torch.backends.cudnn.deterministic = True

was sufficient to solve the issue.
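For future readers, a short sketch of the cuDNN switches mentioned in this thread. The deterministic flag is the setting reported above; setting benchmark to False and using the torch.backends.cudnn.flags context manager are additions of mine and were not part of the reported fix.

```python
import torch

# The setting reported above: restrict cuDNN to deterministic algorithms.
torch.backends.cudnn.deterministic = True

# Assumption (not from the thread): also turn off the cuDNN autotuner,
# which may otherwise select different algorithms from run to run.
torch.backends.cudnn.benchmark = False

# cuDNN can also be switched off for a limited region only, which is handy
# for checking whether it is the source of a discrepancy.
with torch.backends.cudnn.flags(enabled=False):
    enabled_inside = torch.backends.cudnn.enabled  # False within this block

print(torch.backends.cudnn.deterministic, enabled_inside)
```

These flags only affect operations dispatched to cuDNN, such as convolutions on CUDA devices; other operations can still be nondeterministic.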

Some additional info on the runtime per batch for future readers (ii and iii solve the issue):

i: default settings (i.e. non-deterministic) -> 0.51 s
ii: torch.backends.cudnn.enabled = False -> 0.14 s
iii: torch.backends.cudnn.deterministic = True -> 0.002 s

Note the speed-up for this model. The reasons why deterministic algorithms can be faster were discussed here.
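How the timings above were measured is not stated. One possible way to time a single batch is sketched below, assuming a hypothetical helper time_per_call and a stand-in convolution layer. Because CUDA kernels run asynchronously, the sketch synchronizes before reading the clock.

```python
import time

import torch


def time_per_call(fn, n_warmup: int = 2, n_runs: int = 5) -> float:
    """Return the average wall-clock time in seconds of one call to fn()."""
    for _ in range(n_warmup):
        fn()  # warm-up calls absorb one-off costs such as algorithm selection
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for pending kernels on the current device
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs


# Hypothetical stand-in workload; falls back to the CPU if no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
conv = torch.nn.Conv2d(1, 8, 3, padding=1).to(device)
batch = torch.randn(64, 1, 28, 28, device=device)

seconds = time_per_call(lambda: conv(batch))
print(f"{seconds:.6f} s per call")
```

torch.cuda.synchronize() without arguments only waits for the current device, so with models on two GPUs each device would need its own call.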