How do I make one GPU do more work than the other in DDP when the GPUs' performance is not the same? (Please answer this, it's a really unusual question)

Hello everyone, hope you're all doing well out there.

I was hoping someone could clear up this doubt of mine.

I have a setup with 2 GPUs: GPU0 is a Quadro RTX 8000 and GPU1 is an RTX A6000.

As you can see, GPU1 is more powerful than GPU0 (almost twice as fast when I benchmarked ResNet-50 on the ImageNet dataset).

The problem is that DDP shares the load evenly between the two GPUs. That is normally what you want, but in my case it would be better if I could utilize each GPU to its full extent. How do I do that exactly?

  • I first tried sending a bigger batch to GPU1 (say, twice the batch size sent to GPU0) by overriding the DistributedSampler class's __iter__ method. But I don't want this approach: both GPUs have the same amount of VRAM, so giving GPU1 a bigger batch means it hits the memory limit first if I keep scaling both batch sizes up proportionally.

  • This is what I want the community to focus on → I tried having GPU1 process twice as much data as GPU0 while keeping the batch size the same, so GPU1 processes two batches for every one batch GPU0 processes. For now I handle this with some if conditions; the code is shared below (and a sketch of a weighted sampler that produces this kind of uneven split follows this list). Please tell me what changes I can make, and why, as I am new to all of these topics.

  • If you have any other technique, idea, or approach to solve this, please share it and we can have a good discussion. I think this will help a lot of people, since this topic isn't discussed in much detail anywhere.
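
For reference, here is a minimal sketch of what such an uneven split in the sampler could look like. The class name UnevenDistributedSampler and its weights argument are made up for illustration (they are not an existing PyTorch API); with weights (1, 2), rank 1 receives twice the share of indices, so with the same batch size it iterates over roughly twice as many batches.

import torch
from torch.utils.data import Dataset, Sampler


class UnevenDistributedSampler(Sampler):
    """Illustrative sketch: split the dataset across ranks by weight, not evenly."""

    def __init__(self, dataset: Dataset, weights, rank: int, seed: int = 0):
        self.dataset = dataset
        self.weights = weights  # e.g. (1, 2) for (Quadro RTX 8000, RTX A6000)
        self.rank = rank
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch  # change the shuffle every epoch, like DistributedSampler

    def _bounds(self):
        total = sum(self.weights)
        n = len(self.dataset)
        start = int(n * sum(self.weights[:self.rank]) / total)
        end = int(n * sum(self.weights[:self.rank + 1]) / total)
        return start, end

    def __iter__(self):
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)  # same permutation on every rank
        indices = torch.randperm(len(self.dataset), generator=g).tolist()
        start, end = self._bounds()
        return iter(indices[start:end])

    def __len__(self):
        start, end = self._bounds()
        return end - start

One caveat with any uneven split: plain DDP expects every rank to fire the same number of gradient all-reduces, otherwise one rank's backward() blocks waiting for the other. So this has to be paired with something like the no_sync() idea discussed further down, or with the Join context manager (torch.distributed.algorithms.Join) for uneven inputs.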

Here’s the code:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.nn.parallel import DistributedDataParallel as DDP


class Trainer:
    def __init__(
            self,
            model: torch.nn.Module,
            train_data: DataLoader,
            optimizer: torch.optim.Optimizer,
            gpu_id: int,
            save_every: int,
    ) -> None:
        self.gpu_id = gpu_id
        self.model = model.to(gpu_id)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.model = DDP(model, device_ids=[gpu_id])
        # Counter used only on GPU1 (the RTX A6000) to decide when to zero
        # the gradients and when to step the optimizer.
        self.batch_for_rtx6000 = 0

    def _run_batch(self, source, targets):
        # GPU1 (the faster A6000) accumulates gradients over two batches:
        # zero the gradients at the start of each pair, step at its end.
        if self.gpu_id == 1 and self.batch_for_rtx6000 == 3:
            self.optimizer.zero_grad()
            self.batch_for_rtx6000 = 1
        elif self.gpu_id == 0:
            self.optimizer.zero_grad()
        output = self.model(source)
        loss = F.cross_entropy(output, targets)
        loss.backward()  # accumulates gradients locally; DDP also all-reduces them across ranks here
        # Caveat: because the all-reduce still happens on every backward() while
        # the two ranks call step() at different times, the replicas' parameters
        # can drift apart; the no_sync() suggestion further down avoids this.
        if self.gpu_id == 1 and self.batch_for_rtx6000 == 2:
            self.optimizer.step()
        elif self.gpu_id == 0:
            self.optimizer.step()

    def _run_epoch(self, epoch):
        b_sz = len(next(iter(self.train_data))[0])  # peek at one batch to report the batch size
        # len(self.train_data) is already the number of batches (steps) per rank
        print(f"[GPU{self.gpu_id}] Epoch {epoch} | Batchsize: {b_sz} | Steps: {len(self.train_data)}")
        self.train_data.batch_sampler.set_epoch(epoch)  # reshuffle the (customized) distributed sampler
        for source, targets in self.train_data:
            source = source.to(self.gpu_id)
            targets = targets.to(self.gpu_id)
            if self.gpu_id == 1:
                self.batch_for_rtx6000 += 1
            self._run_batch(source, targets)

    def _save_checkpoint(self, epoch):
        ckp = self.model.module.state_dict()
        PATH = "checkpoint.pt"
        torch.save(ckp, PATH)
        print(f"Epoch {epoch} | Training checkpoint saved at {PATH}")

    def train(self, max_epochs: int):
        for epoch in range(max_epochs):
            print("......", self.gpu_id)
            self._run_epoch(epoch)
            if self.gpu_id == 0 and epoch % self.save_every == 0:
                # print(self.model.module.linear.weight)
                self._save_checkpoint(epoch)
        print(self.model.module.linear.weight)  # go through .module to reach the wrapped model's parameters

This is my Trainer class (just an overview).
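
For context, a Trainer like this runs with one process per GPU (rank 0 on the Quadro RTX 8000, rank 1 on the RTX A6000). Below is a rough sketch of what the launcher around it could look like; build_objects is just a stand-in for whatever constructs the model, dataloader, and optimizer in your setup.

import os
import torch
import torch.multiprocessing as mp
from torch.distributed import init_process_group, destroy_process_group


def main(rank: int, world_size: int, total_epochs: int, save_every: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model, train_loader, optimizer = build_objects(rank)  # stand-in helper, not shown
    trainer = Trainer(model, train_loader, optimizer, gpu_id=rank, save_every=save_every)
    trainer.train(total_epochs)

    destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 2 in this setup
    mp.spawn(main, args=(world_size, 10, 5), nprocs=world_size)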

I don't know what the best approach would be, but you could try torch.nn.parallel.DistributedDataParallel.no_sync() on the faster rank to run an additional gradient accumulation step before DDP synchronizes the gradients.
Note that the gradients will be accumulated (summed), so you might need to average them.
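
Roughly, on the faster rank it could look something like the sketch below; the helper name _run_two_batches_fast_rank and the fixed two-batch pairing are just an illustration, not code from your Trainer.

    def _run_two_batches_fast_rank(self, batch_a, batch_b):
        """Illustration only: rank 1 (RTX A6000) consumes two batches per optimizer
        step while rank 0 (Quadro RTX 8000) keeps consuming one batch per step."""
        self.optimizer.zero_grad()

        # Extra local batch: no_sync() disables DDP's gradient all-reduce for this
        # backward(), so the gradients only accumulate in the local .grad buffers.
        with self.model.no_sync():
            source, targets = batch_a
            loss = F.cross_entropy(self.model(source), targets)
            loss.backward()

        # Second batch: this backward() runs the usual all-reduce, which synchronizes
        # the accumulated gradients with rank 0's single-batch gradients.
        source, targets = batch_b
        loss = F.cross_entropy(self.model(source), targets)
        loss.backward()

        # After the all-reduce both ranks hold (g0 + g1a + g1b) / 2, i.e. a sum over
        # three batches divided by two ranks. To make it a mean over all three
        # batches you could, for example, scale each loss by 2/3 before backward().
        self.optimizer.step()

For the step counts to line up, rank 1's dataloader also has to yield twice as many batches per epoch as rank 0's, which is where an uneven sampler like the one sketched earlier in the thread comes in.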

Thank you so much, sir! I will definitely try this and will report back to the forum if I get any fruitful results.