Increased GPU memory usage on GPU 0 when using AMP

When using AMP, there is a large GPU memory increase on GPU 0. Below is pseudo code showing my training pipeline. Before self.scaler.update() runs, 19 GB of memory is used on each of the 8 GPUs; after that line finishes, GPU 0's usage grows to 29 GB. This happens in the first training iteration. Is this normal, or is there a way to distribute the extra memory evenly across all the GPUs?

More info:

  • I am using DistributedDataParallel.
  • A pretrained checkpoint is loaded onto the different GPUs (see the loading sketch after the pseudo code below).

class Trainer:
    def __init__(self):
        self.model = ...
        self.optimizer = ...
        self.scaler = GradScaler()

    def train(self, input):
        with autocast():
            output = self.model(input)
            loss = get_loss(output)
        self.scaler.scale(loss).backward()
        self.scaler.step(self.optimizer)
        # Before update(), each of the 8 GPUs uses ~19 GB.
        self.scaler.update()
        # After update(), GPU 0 grows to ~29 GB while the others stay at ~19 GB.
        self.optimizer.zero_grad()
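
Regarding the checkpoint bullet above: by default torch.load restores tensors onto the device they were saved from, so a checkpoint saved from cuda:0 and loaded without an explicit map_location ends up on GPU 0 in every process. A minimal sketch of rank-aware loading; load_pretrained and ckpt_path are illustrative names, not the actual code from this thread:

def load_pretrained(model, ckpt_path, rank):
    # Map every tensor in the checkpoint onto this process's own GPU.
    # (map_location='cpu' followed by model.to(rank) works as well.)
    state_dict = torch.load(ckpt_path, map_location=f'cuda:{rank}')
    model.load_state_dict(state_dict)
    return model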

How did you measure the memory? Is the increase purely in the cache or in the allocated memory? Were you seeing OOM issues when using AMP?
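
For reference, the allocated vs. cached distinction can be checked directly with PyTorch's counters on each rank; a minimal sketch:

import torch

device = torch.cuda.current_device()
# Memory currently occupied by live tensors on this GPU.
allocated = torch.cuda.memory_allocated(device)
# Memory held by the caching allocator (roughly what nvidia-smi reports,
# excluding the CUDA context), even if parts of it are currently unused.
reserved = torch.cuda.memory_reserved(device)
print(f"cuda:{device}: allocated {allocated / 1e9:.2f} GB, reserved {reserved / 1e9:.2f} GB")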

Hi, thanks for your reply. I double-checked the code and found that the issue is unrelated to AMP or the grad scaler. Instead, it is related to the distributed data loader. Below is the code that causes the problem; please see the comments on GPU memory usage. If I use random input data instead of the data loader, all the GPUs have the same memory usage (12.9 GB). I am wondering why the distributed data loader introduces extra memory usage on GPU 0.

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import SGD
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


class Trainer:
    def __init__(self, rank, world_size, args):
        self.args = args
        self.init_distributed(rank, world_size)
        self.init_model()
        self.init_data()
        self.train()

    def init_distributed(self, rank, world_size):
        self.rank = rank
        self.world_size = world_size
        self.log('Initializing distributed')
        os.environ['MASTER_ADDR'] = self.args.distributed_addr
        os.environ['MASTER_PORT'] = self.args.distributed_port
        dist.init_process_group("nccl", rank=rank, world_size=world_size)

    def init_model(self):
        self.model = Model().to(self.rank)
        self.model_ddp = DDP(self.model, device_ids=[self.rank], broadcast_buffers=False, find_unused_parameters=True)
        self.optimizer = SGD([{'params': self.model_ddp.parameters(), 'lr': self.args.learning_rate_backbone}])

    def init_data(self):
        self.dataset = Data()
        self.sampler = DistributedSampler(dataset=self.dataset)
        self.dataloader = DataLoader(self.dataset, batch_size=self.args.batch_size, sampler=self.sampler)

    def train(self):
        for epoch in range(0, 100):

            # for i in range(100):
            #     input = torch.rand([1, 3, 512, 512])
            #     gt = torch.rand([1, 3, 512, 512])
            for input, gt in self.dataloader:
                input = input.to(self.rank, non_blocking=True)
                gt = gt.to(self.rank, non_blocking=True)

                pred = self.model_ddp(input)
                loss = compute_loss(pred, gt)
                loss.backward()
                # Distributed training on 8 GPUs.
                # Here, all the GPUs have the same GPU memory usage (12.9GB).
                self.optimizer.step()
                # Here, GPU 0 has 16.1GB usage, while the others still have 12.9GB.
                self.optimizer.zero_grad()
                # Here, GPU 0 has 17.2GB usage, while the others still have 12.9GB.
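
One quick way to test the data-loader hypothesis is to check which device the batches are on before the explicit .to(self.rank) calls. A DataLoader normally yields CPU tensors, so if the Dataset itself already returns CUDA tensors, every process would place them on the default GPU, which is usually cuda:0. A minimal check (not part of the original code):

# Drop this in before the .to(self.rank) calls in the loop above.
for input, gt in self.dataloader:
    # Expected: both tensors report device 'cpu'. Seeing 'cuda:0' here on
    # every rank would explain extra allocations on GPU 0.
    print(f"rank {self.rank}: input on {input.device}, gt on {gt.device}")
    break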

I don’t know what might be causing the imbalanced memory usage: you are pointing towards the DistributedSampler as the potential root cause, while the code snippet shows the memory increase happens after the optimizer.step() call.
Assuming optimizer.step() does indeed increase the memory on the default device only, it could mean the optimizer is storing its internal running stats only on that device, which wouldn’t fit the DDP use case. If you were using nn.DataParallel, an imbalance would be expected, but then you wouldn’t be using a DistributedSampler.
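
One way to check that on your setup is to print, on every rank right after optimizer.step(), where the parameters and any optimizer state buffers live; a minimal sketch against the snippet above (not from the original code):

# Run on every rank right after self.optimizer.step().
for group in self.optimizer.param_groups:
    for p in group['params']:
        state = self.optimizer.state.get(p, {})
        for name, buf in state.items():
            if torch.is_tensor(buf):
                # e.g. SGD's momentum_buffer; it should sit on the same
                # device as its parameter, i.e. cuda:<rank>.
                print(f"rank {self.rank}: param on {p.device}, {name} on {buf.device}")

Note that plain SGD without momentum keeps no per-parameter state at all, so if nothing is printed, the extra memory after step() has to come from somewhere else.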
