Hello everyone, hope y’all are doing good out there.
I was hoping someone could help clear up a doubt of mine.
I have a setup with 2 GPUs: GPU0, a Quadro RTX 8000, and GPU1, an RTX A6000.
As you can see, GPU1 is more powerful than GPU0 (almost twice as fast when I benchmarked ResNet-50 on the ImageNet dataset).
The problem is that DDP splits the load evenly between the two GPUs, which is fine in general, but in my case it would be better if I could utilize each GPU to its full extent. How do I do that exactly?
-
I first tried sending a bigger batch to GPU1 (say, twice the batch size sent to GPU0) by overriding the DistributedSampler class's __iter__ method; a rough sketch of what I mean is below. But I don't want this approach: both GPUs have the same amount of VRAM, so giving one GPU a bigger batch stops working once I keep scaling both batch sizes up in proportion, because GPU1 hits its memory limit first.
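For reference, this is roughly the kind of override I mean (a simplified sketch, not my exact code; the class name WeightedDistributedSampler and the weights argument are just names I'm using for illustration):

import math
import torch
from torch.utils.data import DistributedSampler

class WeightedDistributedSampler(DistributedSampler):
    """Splits the dataset across ranks by a per-rank weight instead of evenly."""
    def __init__(self, dataset, num_replicas, rank, weights, shuffle=True, seed=0):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle, seed=seed)
        # weights[r] = fraction of the dataset that rank r should see,
        # e.g. [1/3, 2/3] gives GPU1 twice as much data as GPU0
        self.weights = weights

    def __iter__(self):
        if self.shuffle:
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()
        else:
            indices = list(range(len(self.dataset)))
        # carve up the (shuffled) indices according to the weights instead of evenly
        boundaries = [0]
        for w in self.weights:
            boundaries.append(boundaries[-1] + math.floor(w * len(indices)))
        return iter(indices[boundaries[self.rank]:boundaries[self.rank + 1]])

    def __len__(self):
        return math.floor(self.weights[self.rank] * len(self.dataset))

Each rank then builds its DataLoader with its own batch size (e.g. 64 on GPU0 and 128 on GPU1), which is exactly where the VRAM problem above comes from.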
-
This is what I want the community to focus on → I tried having GPU1 process double the data of GPU0: the batch size stays the same, but GPU1 processes, say, two batches in the time GPU0 processes one. For now I am handling this with a few if conditions; the code is shared below. Please tell me what changes I should make, and why (I am new to all of these topics).
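Before the full class, here is a stripped-down sketch of what I think this boils down to, written as per-rank gradient accumulation (run_step and micro_batches are just names for illustration, and I have not verified this keeps DDP's all-reduces matched, so please correct me if it's wrong):

import torch.nn.functional as F

def run_step(model, optimizer, micro_batches, gpu_id):
    # micro_batches: a list of (source, targets) pairs,
    # 1 pair on rank 0 (GPU0) and 2 pairs on rank 1 (GPU1)
    optimizer.zero_grad()
    for i, (source, targets) in enumerate(micro_batches):
        source, targets = source.to(gpu_id), targets.to(gpu_id)
        if i < len(micro_batches) - 1:
            # extra micro-batch on the faster GPU: accumulate gradients locally
            # and skip DDP's gradient synchronization for this backward pass
            with model.no_sync():
                loss = F.cross_entropy(model(source), targets)
                loss.backward()
        else:
            # last micro-batch: this backward triggers the all-reduce that
            # synchronizes the accumulated gradients across both ranks
            loss = F.cross_entropy(model(source), targets)
            loss.backward()
    optimizer.step()

If I understand correctly, this also means GPU1 has to be fed twice as many batches per epoch (which ties back to the sampler idea above), otherwise the two ranks would not perform the same number of synchronized steps.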
-
If you have any other technique, idea, or approach to solve this, please share it and we can have a good discussion about it. I think this will help a lot of people, as this topic isn't discussed in much detail anywhere.
Here’s the code:
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.nn.parallel import DistributedDataParallel as DDP

class Trainer:
    def __init__(
        self,
        model: torch.nn.Module,
        train_data: DataLoader,
        optimizer: torch.optim.Optimizer,
        gpu_id: int,
        save_every: int,
    ) -> None:
        self.gpu_id = gpu_id
        self.model = model.to(gpu_id)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.model = DDP(model, device_ids=[gpu_id])
        self.batch_for_rtx6000 = 0  # counts batches seen on GPU1 (the A6000)

    def _run_batch(self, source, targets):
        # GPU1 zeroes its gradients only when its batch counter wraps around
        # (i.e. every second batch); GPU0 zeroes them on every batch
        if self.gpu_id == 1 and self.batch_for_rtx6000 == 3:
            self.optimizer.zero_grad()
            self.batch_for_rtx6000 = 1
        elif self.gpu_id == 0:
            self.optimizer.zero_grad()
        output = self.model(source)
        loss = F.cross_entropy(output, targets)
        loss.backward()  # this step accumulates (sums) the gradients
        # GPU1 steps only after its second accumulated batch; GPU0 steps every batch
        if self.gpu_id == 1 and self.batch_for_rtx6000 == 2:
            self.optimizer.step()
        elif self.gpu_id == 0:
            self.optimizer.step()

    def _run_epoch(self, epoch):
        b_sz = len(next(iter(self.train_data))[0])
        print(f"[GPU{self.gpu_id}] Epoch {epoch} | Batchsize: {b_sz} | Steps: {len(self.train_data)}")
        self.train_data.batch_sampler.set_epoch(epoch)
        for source, targets in self.train_data:
            source = source.to(self.gpu_id)
            targets = targets.to(self.gpu_id)
            if self.gpu_id == 1:
                self.batch_for_rtx6000 += 1
            self._run_batch(source, targets)

    def _save_checkpoint(self, epoch):
        ckp = self.model.module.state_dict()
        PATH = "checkpoint.pt"
        torch.save(ckp, PATH)
        print(f"Epoch {epoch} | Training checkpoint saved at {PATH}")

    def train(self, max_epochs: int):
        for epoch in range(max_epochs):
            print("......", self.gpu_id)
            self._run_epoch(epoch)
            if self.gpu_id == 0 and epoch % self.save_every == 0:
                self._save_checkpoint(epoch)
                print(self.model.module.linear.weight)
This is my Trainer class (just an overview).