GPU running out of memory in the middle of validation

Hi all, I’m working on a super-resolution CNN model and I’m running into GPU memory issues that I can’t explain. My training and validation steps live in separate functions (shown below), and I take care to detach tensor data where appropriate so that the computational graph isn’t kept around needlessly (as discussed in many other issues flagged on this forum):
Training Function:

    def run_train(self, x, y, *args, **kwargs):
        if self.eval_mode:
            raise RuntimeError('Model initialized in eval mode, training not possible.')
        self.net.train()  # sets model to training mode (activates appropriate procedures for certain layers)
        x, y = x.to(device=self.device), y.to(device=self.device)
        out = self.run_model(x, **kwargs)  # run data through model

        loss = self.criterion(out, y)   # compute loss

        self.optimizer.zero_grad()  # set all weight grads from previous training iters to 0
        loss.backward()  # backpropagate to compute gradients for current iter loss
        if self.grad_clip is not None:  # gradient clipping
            nn.utils.clip_grad_norm_(self.net.parameters(), self.grad_clip)
        self.optimizer.step()  # update network parameters

        if self.learning_rate_scheduler is not None:
            self.learning_rate_scheduler.step()

        return loss.detach().cpu().numpy()

Validation Function:

    def run_eval(self, x, y=None, request_loss=False, tag=None, *args, **kwargs):
        self.net.eval()  # sets the system to validation mode
        with torch.no_grad():
            x = x.to(device=self.device)
            out = self.run_model(x, image_names=tag, **kwargs)  # forward the data in the model
            if request_loss:
                y = y.to(device=self.device)
                loss = self.criterion(out, y).detach().cpu().numpy()  # compute loss
            else:
                loss = None
        return out.detach().cpu(), loss

For some reason, the GPU only runs out of memory in the middle of a training or validation run (i.e. after a number of images have already been fed through the model without issue). This appears to be caused by memory gradually building up throughout training/validation. I have tried probing the issue by clearing the PyTorch cache and deleting variables before exiting the function, but nothing seems to help. The buildup grows to a certain limit and then stops, and that limit seems to be dependent on the training batch size (the validation batch size is always 1). For example (hypothetical numbers): with a training batch size of 16, the GPU memory builds up to ~8GB during validation and then stays there for the remainder of the training run; with a batch size of 8, the buildup stops at ~4GB and sticks there. This means I have to severely limit my batch size just to let training run, which is too much of a tradeoff for me.
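
For reference, here is a rough sketch of the kind of per-batch memory logging that could be wrapped around the two functions above to pin down where the buildup happens (the handler and loader names are placeholders for my own training driver; torch.cuda.memory_allocated/max_memory_allocated report memory managed by PyTorch's caching allocator):

    import torch

    def log_cuda_memory(stage, step):
        # Memory currently held by live tensors vs. the peak since the last reset
        alloc = torch.cuda.memory_allocated() / 1024 ** 2
        peak = torch.cuda.max_memory_allocated() / 1024 ** 2
        print(f'{stage} step {step}: allocated {alloc:.1f} MiB, peak {peak:.1f} MiB')

    # Illustrative loops around the functions above (handler/train_loader/val_loader are placeholders)
    torch.cuda.reset_peak_memory_stats()
    for step, (x, y) in enumerate(train_loader):
        loss = handler.run_train(x, y)
        log_cuda_memory('train', step)

    torch.cuda.reset_peak_memory_stats()
    for step, (x, y) in enumerate(val_loader):
        out, loss = handler.run_eval(x, y, request_loss=True)
        log_cuda_memory('val', step)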

Do you have any further insight into what could be going wrong? Thanks for your help!

I’m not sure I understand the issue completely, since you mention a memory increase in the first part, but also say that the memory usage settles at a certain level for a smaller batch size.
Is this memory peak stable for the smaller batch sizes, or does it also keep increasing over multiple training and validation epochs?

Does the GPU memory build-up stay at those particular values, i.e. 8 and 4 GB, once it has reached them, or does it keep increasing?

If you’re not already, you can try running nvidia-smi -l 1 in the terminal to continuously monitor the GPU memory usage.

I apologize, I might not have been so clear in my explanation. The situation is this:

  • When I run a super-res model with a particular training batch size (say 4) and a constant validation batch size (always 1), the model slowly accumulates GPU memory with each batch of training and validation inputs until it reaches a certain stable limit (say 4GB), where it remains until the end of training. This limit is normally reached by the end of my first epoch (a single run through all training and validation images).
  • When I increase the training batch size (to say 8), this memory limit increases to, say, 8GB, which is also fine for an 11GB card. It is of course normal to consume more memory with a higher batch size.
  • However, when I increase the training batch size to 16, the out-of-memory error does not trigger during training but during validation, where the batch size has not changed at all. This means that technically my GPU could handle a batch size of 16, but something holds it back during the validation pass. I tried to flush the GPU memory with torch.cuda.empty_cache() after my training pass (right before the validation pass), but this apparently wasn’t enough to prevent the error (a rough sketch of what I tried is below this list).
  • I do observe the GPU memory increasing when I send my model to the GPU, but I also observe that, after training, the GPU memory does not drop back to the value it had before training (with just the model loaded in). This is measured using nvidia-smi, after I run the empty cache command.
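
For concreteness, a rough sketch of that flush between the two passes, with some extra diagnostics comparing what PyTorch currently has allocated against what it has reserved (the gc.collect() call and the printouts are just illustrative additions; nvidia-smi roughly corresponds to the reserved/cached amount):

    import gc
    import torch

    # ... training pass finishes here ...

    gc.collect()              # drop any lingering Python references to training tensors
    torch.cuda.empty_cache()  # ask PyTorch to return cached blocks to the driver

    # nvidia-smi reports reserved (cached) memory, so compare it with what PyTorch actually holds
    print(f'allocated: {torch.cuda.memory_allocated() / 1024 ** 2:.1f} MiB')
    print(f'reserved:  {torch.cuda.memory_reserved() / 1024 ** 2:.1f} MiB')

    # ... validation pass starts here ...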

I observe the same thing and am surprised that no one has responded to this thread in the meantime.
Apparently, PyTorch by default already starts the validation run before the training run is finished. This unnecessarily increases GPU memory usage.
Is this also what you are experiencing @mattaq31 ?
Since memory is the main limitation when training large models on GPUs, you would expect PyTorch to take specific care not to keep memory in use without an actual need.
At least I would expect all memory to be cleaned up completely when switching between training and validation runs, as this only happens once every few minutes/hours/days (depending on the model and dataset size).

Is there a way now to do this more carefully?
Model sizes are increasing much more rapidly than GPU memory sizes, so it would be nice to think about this a bit more…

I found that when using PyTorch Lightning, there are two parameters that control the early start of validation (which obviously causes the GPU to hold both training and validation data in memory):

  • val_check_interval
  • check_val_every_n_epoch

see: Speed Up Model Training — PyTorch Lightning 2.2.1 documentation
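
For completeness, a minimal sketch of where these two arguments go (the values shown are just examples, not a recommendation):

    import pytorch_lightning as pl

    trainer = pl.Trainer(
        max_epochs=10,
        val_check_interval=1.0,      # how often to run validation within a training epoch (fraction or number of steps)
        check_val_every_n_epoch=1,   # only run validation every n training epochs
    )
    # trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)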

That’s not the case since by default all CUDA operations will be enqueued and executed in the default stream. PyTorch does not start the validation run before the Python script reaches it.

PyTorch uses an internal CUDA caching mechanism and is able to reuse the GPU memory directly after it is released. I guess you see a different behavior on the host using Python’s garbage collector?

The training and validation runs are automated in PyTorch Lightning, but I guess I could try it by implementing one of the callbacks for the start of the validation run.
I did not notice the slow increase of GPU memory that was reported by @mattaq31, though.
My practical issue was resolved, as mentioned, by starting the validation run only after each training epoch has fully finished (the PyTorch Lightning configuration thingy).
What do you refer to explicitly with “Python’s garbage collector”?
I am probably not able to delete objects myself (they are all managed by PyTorch Lightning), but I was hoping there was a kind of memory-free function in PyTorch/CUDA that allows all gradient information from the training epoch to be removed so as to free GPU memory for the validation run. But as said, I’m not running into practical issues anymore.

Which is already the case, since the internal caching allocator will move GPU memory back to its cache once all references to the corresponding tensor are freed.
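
A quick self-contained way to see this behavior (assuming a CUDA device is available; the printed values are approximate):

    import torch

    x = torch.randn(1024, 1024, 1024, device='cuda')    # ~4 GiB of float32
    print(torch.cuda.memory_allocated() // 1024 ** 2)   # ~4096 MiB held by the tensor
    print(torch.cuda.memory_reserved() // 1024 ** 2)    # ~4096 MiB held by the caching allocator

    del x  # free the last reference; the memory moves into the allocator's cache
    print(torch.cuda.memory_allocated() // 1024 ** 2)   # ~0 MiB
    print(torch.cuda.memory_reserved() // 1024 ** 2)    # still ~4096 MiB (nvidia-smi also still shows it)

    torch.cuda.empty_cache()  # return the cached blocks to the driver
    print(torch.cuda.memory_reserved() // 1024 ** 2)    # ~0 MiB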

Hi both, I posted this query a few years ago now, so I do not know if the issue is still present today. I ended up not being able to resolve it and had to work around it by keeping the batch size low when using 11GB cards and using higher-memory cards for higher batch-size runs.