How do I make one GPU do more work than the other in DDP when the GPUs' performance is not the same? (Please answer this, it's a really unusual question)

Hello everyone, hope you're all doing well out there.

I was hoping someone could clear up this doubt of mine.

I have a setup with 2 GPUs: GPU0 is a Quadro RTX 8000 and GPU1 is an RTX A6000.

As you can see, GPU1 is more powerful than GPU0 (almost twice as fast when I benchmarked ResNet-50 on the ImageNet dataset).

The problem is that DDP shares the load evenly between the two GPUs. That is normally what you want, but in my case it would be better if I could utilize each GPU to its full extent. How do I do that exactly?

  • I first tried sending a bigger batch to GPU1 (say, twice the batch size sent to GPU0) by overriding the DistributedSampler class's __iter__ method. But I don't want this approach: both GPUs have the same amount of VRAM, so giving GPU1 a bigger batch means it hits the memory limit first if I keep scaling both batch sizes up proportionally.

  • This is what I want the community to focus on → I tried having GPU1 process twice as much data as GPU0 while keeping the batch size the same, so GPU1 processes two batches for every one batch GPU0 processes. For now I handle this with some if conditions; the code is shared below (and a sketch of a weighted sampler that produces this kind of uneven split follows this list). Please tell me what changes I can make, and why, as I am new to all of these topics.

  • If you have any other technique, idea, or approach to solve this, please share it and we can have a good discussion. I think this will help a lot of people, since this topic isn't discussed in much detail anywhere.
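
For reference, here is a minimal sketch of what such an uneven split in the sampler could look like. The class name UnevenDistributedSampler and its weights argument are made up for illustration (they are not an existing PyTorch API); with weights (1, 2), rank 1 receives twice the share of indices, so with the same batch size it iterates over roughly twice as many batches.

import torch
from torch.utils.data import Dataset, Sampler


class UnevenDistributedSampler(Sampler):
    """Illustrative sketch: split the dataset across ranks by weight, not evenly."""

    def __init__(self, dataset: Dataset, weights, rank: int, seed: int = 0):
        self.dataset = dataset
        self.weights = weights  # e.g. (1, 2) for (Quadro RTX 8000, RTX A6000)
        self.rank = rank
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch  # change the shuffle every epoch, like DistributedSampler

    def _bounds(self):
        total = sum(self.weights)
        n = len(self.dataset)
        start = int(n * sum(self.weights[:self.rank]) / total)
        end = int(n * sum(self.weights[:self.rank + 1]) / total)
        return start, end

    def __iter__(self):
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)  # same permutation on every rank
        indices = torch.randperm(len(self.dataset), generator=g).tolist()
        start, end = self._bounds()
        return iter(indices[start:end])

    def __len__(self):
        start, end = self._bounds()
        return end - start

One caveat with any uneven split: plain DDP expects every rank to fire the same number of gradient all-reduces, otherwise one rank's backward() blocks waiting for the other. So this has to be paired with something like the no_sync() idea discussed further down, or with the Join context manager (torch.distributed.algorithms.Join) for uneven inputs.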

Here’s the code:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.nn.parallel import DistributedDataParallel as DDP


class Trainer:
    def __init__(
            self,
            model: torch.nn.Module,
            train_data: DataLoader,
            optimizer: torch.optim.Optimizer,
            gpu_id: int,
            save_every: int,
    ) -> None:
        self.gpu_id = gpu_id
        self.model = model.to(gpu_id)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.model = DDP(model, device_ids=[gpu_id])
        # Counter used only on GPU1 (the RTX A6000) to decide when to zero
        # the gradients and when to step the optimizer.
        self.batch_for_rtx6000 = 0

    def _run_batch(self, source, targets):
        # GPU1 (the faster A6000) accumulates gradients over two batches:
        # zero the gradients at the start of each pair, step at its end.
        if self.gpu_id == 1 and self.batch_for_rtx6000 == 3:
            self.optimizer.zero_grad()
            self.batch_for_rtx6000 = 1
        elif self.gpu_id == 0:
            self.optimizer.zero_grad()
        output = self.model(source)
        loss = F.cross_entropy(output, targets)
        loss.backward()  # accumulates gradients locally; DDP also all-reduces them across ranks here
        # Caveat: because the all-reduce still happens on every backward() while
        # the two ranks call step() at different times, the replicas' parameters
        # can drift apart; the no_sync() suggestion further down avoids this.
        if self.gpu_id == 1 and self.batch_for_rtx6000 == 2:
            self.optimizer.step()
        elif self.gpu_id == 0:
            self.optimizer.step()

    def _run_epoch(self, epoch):
        b_sz = len(next(iter(self.train_data))[0])  # peek at one batch to report the batch size
        # len(self.train_data) is already the number of batches (steps) per rank
        print(f"[GPU{self.gpu_id}] Epoch {epoch} | Batchsize: {b_sz} | Steps: {len(self.train_data)}")
        self.train_data.batch_sampler.set_epoch(epoch)  # reshuffle the (customized) distributed sampler
        for source, targets in self.train_data:
            source = source.to(self.gpu_id)
            targets = targets.to(self.gpu_id)
            if self.gpu_id == 1:
                self.batch_for_rtx6000 += 1
            self._run_batch(source, targets)

    def _save_checkpoint(self, epoch):
        ckp = self.model.module.state_dict()
        PATH = "checkpoint.pt"
        torch.save(ckp, PATH)
        print(f"Epoch {epoch} | Training checkpoint saved at {PATH}")

    def train(self, max_epochs: int):
        for epoch in range(max_epochs):
            print("......", self.gpu_id)
            self._run_epoch(epoch)
            if self.gpu_id == 0 and epoch % self.save_every == 0:
                # print(self.model.module.linear.weight)
                self._save_checkpoint(epoch)
        print(self.model.module.linear.weight)  # go through .module to reach the wrapped model's parameters

This is my Trainer class (just an overview).
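
For context, a Trainer like this runs with one process per GPU (rank 0 on the Quadro RTX 8000, rank 1 on the RTX A6000). Below is a rough sketch of what the launcher around it could look like; build_objects is just a stand-in for whatever constructs the model, dataloader, and optimizer in your setup.

import os
import torch
import torch.multiprocessing as mp
from torch.distributed import init_process_group, destroy_process_group


def main(rank: int, world_size: int, total_epochs: int, save_every: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model, train_loader, optimizer = build_objects(rank)  # stand-in helper, not shown
    trainer = Trainer(model, train_loader, optimizer, gpu_id=rank, save_every=save_every)
    trainer.train(total_epochs)

    destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 2 in this setup
    mp.spawn(main, args=(world_size, 10, 5), nprocs=world_size)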

I don't know what the best approach would be, but you could try torch.nn.parallel.DistributedDataParallel.no_sync() on the faster rank to run an additional gradient accumulation step before DDP synchronizes the gradients.
Note that the gradients will be accumulated (summed), so you might need to average them.
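
Roughly, on the faster rank it could look something like the sketch below; the helper name _run_two_batches_fast_rank and the fixed two-batch pairing are just an illustration, not code from your Trainer.

    def _run_two_batches_fast_rank(self, batch_a, batch_b):
        """Illustration only: rank 1 (RTX A6000) consumes two batches per optimizer
        step while rank 0 (Quadro RTX 8000) keeps consuming one batch per step."""
        self.optimizer.zero_grad()

        # Extra local batch: no_sync() disables DDP's gradient all-reduce for this
        # backward(), so the gradients only accumulate in the local .grad buffers.
        with self.model.no_sync():
            source, targets = batch_a
            loss = F.cross_entropy(self.model(source), targets)
            loss.backward()

        # Second batch: this backward() runs the usual all-reduce, which synchronizes
        # the accumulated gradients with rank 0's single-batch gradients.
        source, targets = batch_b
        loss = F.cross_entropy(self.model(source), targets)
        loss.backward()

        # After the all-reduce both ranks hold (g0 + g1a + g1b) / 2, i.e. a sum over
        # three batches divided by two ranks. To make it a mean over all three
        # batches you could, for example, scale each loss by 2/3 before backward().
        self.optimizer.step()

For the step counts to line up, rank 1's dataloader also has to yield twice as many batches per epoch as rank 0's, which is where an uneven sampler like the one sketched earlier in the thread comes in.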

Thank you so much, sir! I will definitely try this and will report back to the forum if I get any fruitful results.