GPU running out of memory in the middle of validation

Hi all, I’m working on a super-resolution CNN model and I’m running into GPU memory issues. I’m using the following training and validation loops in separate functions, and I take care to detach tensors where appropriate so that the computational graph isn’t retained needlessly (as discussed in many other threads on this forum):
Training Function:

    def run_train(self, x, y, *args, **kwargs):
        if self.eval_mode:
            raise RuntimeError('Model initialized in eval mode, training not possible.')
        self.net.train()  # sets model to training mode (activates appropriate procedures for certain layers)
        x, y = x.to(device=self.device), y.to(device=self.device)
        out = self.run_model(x, **kwargs)  # run data through model

        loss = self.criterion(out, y)   # compute loss

        self.optimizer.zero_grad()  # set all weight grads from previous training iters to 0
        loss.backward()  # backpropagate to compute gradients for current iter loss
        if self.grad_clip is not None:  # gradient clipping
            nn.utils.clip_grad_norm_(self.net.parameters(), self.grad_clip)
        self.optimizer.step()  # update network parameters

        if self.learning_rate_scheduler is not None:
            self.learning_rate_scheduler.step()

        return loss.detach().cpu().numpy()

Validation Function:

    def run_eval(self, x, y=None, request_loss=False, tag=None, *args, **kwargs):
        self.net.eval()  # sets the model to eval mode (e.g. disables dropout, uses batch-norm running stats)
        with torch.no_grad():
            x = x.to(device=self.device)
            out = self.run_model(x, image_names=tag, **kwargs)  # forward the data in the model
            if request_loss:
                y = y.to(device=self.device)
                loss = self.criterion(out, y).detach().cpu().numpy()  # compute loss
            else:
                loss = None
        return out.detach().cpu(), loss

For some reason, the GPU runs out of memory only in the middle of either the training run or the validation run, i.e. after a number of images have already been fed through the model without issue. This seems to be due to memory building up throughout training/validation. I have tried to probe the issue by clearing the PyTorch cache and deleting variables before exiting the function, but nothing seems to help.

The GPU buildup reaches a certain limit before stopping, and that limit seems to be dependent on the training batch size (the validation batch size is always 1). For example, if I set the training batch size to 16, GPU memory builds up to ~8GB during validation and then stays there for the remainder of the training run; if I set the batch size to 8, the buildup stops at ~4GB and sticks there (these are both hypothetical examples). This means I have to severely limit my batch size just to allow training to proceed, which is too much of a tradeoff for me.
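
For reference, the cleanup I tried looks roughly like this (handler, x and y are placeholder names, not my actual code; the two print calls are only there to cross-check what nvidia-smi reports):

    import torch

    # Rough sketch of the cleanup I attempted after each validation step
    # (handler, x, y are placeholder names, not my actual variables).
    out, loss = handler.run_eval(x, y, request_loss=True)
    del x, y, out                 # drop references to the batch tensors
    torch.cuda.empty_cache()      # release cached blocks back to the driver

    # PyTorch's own counters, to cross-check against what nvidia-smi shows
    print(torch.cuda.memory_allocated() / 1024 ** 2, 'MiB allocated by tensors')
    print(torch.cuda.memory_reserved() / 1024 ** 2, 'MiB reserved by the caching allocator')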

Do you have any further insight into what could be going wrong? Thanks for your help!

I’m not sure I understand the issue completely: you mention a memory increase in the first part, but also say that the memory usage settles at a certain level for a smaller batch size.
Is this memory peak stable for the smaller batch sizes, or does it keep increasing over multiple training and validation epochs?

Does the GPU memory build-up stay at those particular levels (8 and 4 GB) once it has reached them, or does it keep increasing?

If you’re not doing so already, you can run nvidia-smi -l 1 in a terminal to continuously monitor GPU memory usage.
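
It can also help to log what PyTorch itself is holding from inside your loops, since nvidia-smi additionally counts the CUDA context and the blocks the caching allocator keeps around for reuse. Something along these lines (the helper name is just a suggestion):

    import torch

    def log_gpu_memory(tag=''):
        # PyTorch's view of GPU memory; nvidia-smi will usually report more,
        # since it also includes the CUDA context and cached-but-free blocks.
        alloc = torch.cuda.memory_allocated() / 1024 ** 2
        reserved = torch.cuda.memory_reserved() / 1024 ** 2
        peak = torch.cuda.max_memory_allocated() / 1024 ** 2
        print(f'[{tag}] allocated={alloc:.0f} MiB | reserved={reserved:.0f} MiB | peak={peak:.0f} MiB')

    # e.g. log_gpu_memory('after training pass'), log_gpu_memory('after validation pass')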

I apologize, I might not have been entirely clear in my explanation. The situation is this:

  • When I run a super-res model with a particular training batch size (say 4) and a constant validation batch size (always 1), the model slowly accumulates GPU memory with each batch of training and validation inputs until it hits a stable limit (say 4GB), where it remains until the end of training. This limit is normally reached by the end of my first epoch (a single run through all training and validation images).
  • When I increase the training batch size (to, say, 8), this memory limit rises to, say, 8GB, which is still fine for an 11GB card. It is of course normal to consume more memory with a higher batch size.
  • However, when I increase the training batch size to 16, the out-of-memory error does not trigger during training but during validation, where the batch size has not changed at all. This means that technically my GPU could handle a batch size of 16, but something is holding it back during the validation pass. I tried to flush the GPU memory using torch.cuda.empty_cache() after my training pass (right before my validation pass), but this seemingly wasn’t enough to prevent the error from occurring (a rough sketch of where that call sits in my loop is below this list).
  • I do observe the GPU memory increasing when I send my model to the GPU, but I also observe that, after training, the GPU memory does not drop back to the value it had before training (with just the model loaded in). This is measured using nvidia-smi, after I run the empty cache command.
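
To make the ordering concrete, my outer loop looks roughly like this (train_loader, val_loader and handler are placeholder names, but the empty_cache() call sits exactly between the two passes as described above):

    # Rough sketch of my outer loop; the names are placeholders, but the
    # ordering of the passes and the empty_cache() call matches my setup.
    for epoch in range(num_epochs):
        for x, y in train_loader:        # training pass (batch size 4 / 8 / 16)
            train_loss = handler.run_train(x, y)

        torch.cuda.empty_cache()         # flush cached memory before validating

        for x, y in val_loader:          # validation pass (batch size always 1)
            out, val_loss = handler.run_eval(x, y, request_loss=True)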