Training multiple models with one dataloader

Hi,

The bottleneck of my training routine is its data augmentation, which is already “sufficiently” optimized. To speed up the hyperparameter search, I thought it would be a good idea to train two models simultaneously, each on a different GPU, using a single dataloader.

As far as I understand, this could be seen as a form of model parallelism. However, my implementation failed.

Below is an example. After the first epoch, I expect the network weights to be identical. However, loss1 equals loss2 only in the first iteration. Detaching and cloning the batch before moving it to the graphics cards didn’t change anything.

    torch.manual_seed(42)
    model1 = SomeModel()
    torch.manual_seed(42)
    model2 = SomeModel()

    dev1 = torch.device("cuda:0")
    dev2 = torch.device("cuda:1")

    o1 = torch.optim.AdamW(model1.parameters())
    o2 = torch.optim.AdamW(model2.parameters())

    l1 = SomeLoss()
    l2 = SomeLoss()

    model1 = model1.to(dev1)
    model2 = model2.to(dev2)

    for batch in train_loader:
        o1.zero_grad()
        o2.zero_grad()
        logits1 = model1(batch.to(dev1))
        logits2 = model2(batch.to(dev2))
        loss1 = l1(logits1)
        loss2 = l2(logits2)
        loss1.backward()
        loss2.backward()
        o1.step()
        o2.step()

Do you have any hints as to what’s going on? I suspect the computation graph is doing funny things…

Edit:
The system runs Debian 11.1, PyTorch 1.9.1, and CUDA 11.12.

Far apart or really close?

I consider them far apart (the weights differ at the second decimal place after a couple of iterations).

You don’t have dropout in your models, right?
This could also be due to numerical precision and nondeterminism; it’s hard to tell with the information at hand. One indication of this would be if you cannot pinpoint where they diverge. Otherwise, you could compare the two models after the first iteration and find which forward activations or gradients differ.
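For example, the gradients could be compared parameter by parameter right after the first backward pass. A rough sketch (report_grad_diffs is a hypothetical helper; model1 and model2 are the two replicas from your snippet):

    import torch

    def report_grad_diffs(model1, model2):
        # Walk both replicas in parallel and print the largest absolute
        # difference per parameter gradient; everything is moved to CPU so
        # the models can live on different GPUs.
        for (name, p1), (_, p2) in zip(model1.named_parameters(), model2.named_parameters()):
            if p1.grad is None or p2.grad is None:
                continue
            diff = (p1.grad.detach().cpu() - p2.grad.detach().cpu()).abs().max().item()
            print(f"{name}: max abs grad diff = {diff:.3e}")

Calling report_grad_diffs(model1, model2) right after the first loss1.backward()/loss2.backward() (and before the next zero_grad()) would show which layers diverge first.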

Thanks for your reply!!

There’s no dropout in my models. However, I’ve re-run the code with a model consisting of a single linear layer, and surprisingly, that works.
My actual model consists of conv, batch/instance norm, ReLU, adaptive average pooling, max pooling, and linear layers, including skip connections. It’s essentially a ResNet.

Again, I really appreciate your feedback!

Edit:
Just noticed that the gradients of the input layer differ right from the first iteration. The maximum difference between the gradients of that layer is 8.5e-5.

That is likely numerical precision. You could (experimentally) try using double precision and see if the difference gets smaller.
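For reference, a rough sketch of that double-precision check, reusing the names from the first snippet (model1/model2, dev1/dev2, train_loader, l1/l2); this just illustrates the suggestion, it is not the exact code that was run:

    # Cast both replicas to float64; the inputs have to be cast as well.
    model1 = model1.double()
    model2 = model2.double()

    batch = next(iter(train_loader))
    logits1 = model1(batch.to(dev1).double())
    logits2 = model2(batch.to(dev2).double())
    l1(logits1).backward()
    l2(logits2).backward()
    # ...then compare the gradients of the two replicas as before.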

Surprisingly, using double precision for the models and inputs results in a bigger maximum gradient difference: 1.3e-4.

So I came up with this minimal example, consisting of just a conv layer and a linear layer. The code exits after the first iteration.

import torch
import torchvision
import torch.nn as nn
import copy

train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST("./", train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=4096, shuffle=True)

class Net(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(1, 8, 3, stride=1, padding=1)
        self.head = nn.Linear(6272, 10)

    def forward(self, image):
        out = self.encoder(image)
        out = out.flatten(start_dim=1)
        print(out.shape)
        return self.head(out)

torch.manual_seed(42)
model1 = Net()
torch.manual_seed(42)
model2 = Net()
assert all([torch.equal(x[1], y[1]) for x, y in zip(model1.state_dict().items(), model2.state_dict().items())])


optim1 = torch.optim.AdamW(model1.parameters())
optim2 = torch.optim.AdamW(model2.parameters())
loss1 = nn.MSELoss()
loss2 = nn.MSELoss()
dev1 = torch.device("cuda:0")
dev2 = torch.device("cuda:1")
cpu = torch.device("cpu")

model1 = model1.to(dev1)
model2 = model2.to(dev2)

model1.train()
model2.train()

for i, (images, targets) in enumerate(train_loader):
    batch1 = copy.deepcopy(images).to(dev1)
    batch2 = copy.deepcopy(images).to(dev2)

    t1 = copy.deepcopy(targets).to(dev1)
    t2 = copy.deepcopy(targets).to(dev2)
    t1 = nn.functional.one_hot(t1, num_classes=10).float()
    t2 = nn.functional.one_hot(t2, num_classes=10).float()

    optim1.zero_grad()
    result1 = model1.forward(batch1)
    l1 = loss1(result1, t1)
    l1.backward()
    optim1.step()

    optim2.zero_grad()
    result2 = model2.forward(batch2)
    l2 = loss2(result2, t2)
    l2.backward()
    optim2.step()

    if not (model1.to(cpu).encoder.weight.grad == model2.to(cpu).encoder.weight.grad).all():
        print(f"Nope nope nope - Chuck Testa!\n @Iteration {i}")
        break
    else:
        model1 = model1.to(dev1)
        model2 = model2.to(dev2)
        model1.train()
        model2.train()

Yeah, but when I run both nets on the same device, I get an error that is around 1e-8, which seems to be within numerical precision.
When disabling cuDNN, the error goes to 0, but I don’t know exactly what the cause is.


Yeah, that’s it! Thanks a lot for checking it out.

I experimented with different cuDNN settings and found that calling

torch.backends.cudnn.deterministic = True

was sufficient to solve the issue.

Some additional info on runtime per batch for future readers (settings ii and iii both solve the issue):

i: default settings (i.e. non-deterministic) → 0.51 s
ii: torch.backends.cudnn.enabled = False → 0.14 s
iii: torch.backends.cudnn.deterministic = True → 0.002 s

Note the speed-up for this model. The reasons for the speed-up with deterministic algorithms were discussed here.
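For future readers, a rough sketch of how these switches might be set, once, before the models are built and the training loop starts (the benchmark flag is an extra precaution, not something that was tested in this thread):

    import torch

    torch.manual_seed(42)                       # identical initialization for both replicas
    torch.backends.cudnn.deterministic = True   # setting iii above: deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False      # extra precaution: disable cuDNN autotuning
    # Setting ii above would instead disable cuDNN entirely:
    # torch.backends.cudnn.enabled = False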

Hi, I’m using 8x 3090s to train 8 models with one dataloader.
However, the forward and backward passes of the 8 models don’t seem to run in parallel.
Do you know how to accelerate the training?

How are you parallelizing the workload? DistributedDataParallel would already run the forward and backward passes in parallel.
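For context, a minimal sketch of what that could look like, assuming one process per GPU (launched e.g. via torchrun) and reusing the Net class from the minimal example above:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")   # rank/world size come from the launcher's env vars
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = Net().to(rank)                    # Net from the minimal example above
    ddp_model = DDP(model, device_ids=[rank])
    # Forward/backward on ddp_model runs in one process per GPU and
    # all-reduces the gradients, so the replicas stay identical.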

Thanks, I followed the original poster of this thread to implement 8 models on 8 GPUs.
I tried using threads to accelerate it, but it seems even slower.

Please show the code, or follow the DDP tutorial: Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.11.0+cu102 documentation
If you need to split the dataloader, please try DistributedSampler - https://medium.com/codex/a-comprehensive-tutorial-to-pytorch-distributeddataparallel-1f4b42bb1b51
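As a rough illustration of the sampler suggestion (assuming the process group is already initialized, dataset is e.g. the MNIST dataset from the minimal example, and num_epochs is a placeholder):

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    sampler = DistributedSampler(dataset, shuffle=True)   # splits the indices across ranks
    loader = DataLoader(dataset, batch_size=4096, sampler=sampler)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)   # reshuffles the per-rank split each epoch
        for images, targets in loader:
            ...                    # usual forward/backward on this rank's model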

As I understand it, that would split the data within the dataloader evenly among the N model replicas.

What if one wishes to process the same data on N different model replicas (one per GPU)?

Thank you.

You could implement a custom sampler or just use the default sampler (note that you might want to seed the code in case you are shuffling the dataset).
However, I don’t fully understand your use case, since you would just repeat the same operation n_gpus times. This wouldn’t be considered distributed data parallel training, since each forward/backward would create identical results, wouldn’t it?
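For completeness, a rough, hypothetical sketch of the same-data variant asked about above: one process per GPU, each with its own independent model and its own dataloader over the full dataset, sharing a seed so every replica sees the same shuffled batches (run_replica and build_loader are made-up names; Net is the class from the minimal example):

    import torch
    import torch.multiprocessing as mp

    def run_replica(rank, num_epochs):
        torch.manual_seed(42)                  # same seed -> same shuffle order in every process
        device = torch.device(f"cuda:{rank}")
        model = Net().to(device)               # each GPU trains its own independent model
        optim = torch.optim.AdamW(model.parameters())
        loss_fn = torch.nn.MSELoss()
        loader = build_loader()                # hypothetical helper returning the full MNIST loader

        for _ in range(num_epochs):
            for images, targets in loader:
                images = images.to(device)
                targets = torch.nn.functional.one_hot(targets, num_classes=10).float().to(device)
                optim.zero_grad()
                loss_fn(model(images), targets).backward()
                optim.step()

    if __name__ == "__main__":
        mp.spawn(run_replica, args=(1,), nprocs=torch.cuda.device_count())

As noted above, with identical seeds and hyperparameters the replicas would produce essentially identical results, so this pattern mainly makes sense when each replica uses different hyperparameters, as in the original hyperparameter-search use case.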