RuntimeError: CUDA out of memory ONLY for validation but NOT for training

Nathane_Berrebi · December 12, 2021, 10:09pm

Hello, I’m trying to train a pre-trained Faster R-CNN in Google Colab Pro. My training function is OK and everything works very well.

But I tried to launch a validation step using the exact same function, except that I don’t run backpropagation after computing the loss.
When I skip the backpropagation step, I get a CUDA out-of-memory error, whereas when I apply backpropagation (during training) I don’t get any such error, even without calling torch.cuda.empty_cache(). This is my code:

def train_one_epoch(model, optimizer, data_loader, device, validation):    
    model.train()
    for i, values in enumerate(data_loader):
        images, targets = values
        images = list(image.to(device) for image in images)
        targets = [{'boxes' : t['boxes'] , 'labels' : t['labels'].to(device)} for t in targets ]
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        optimizer.zero_grad()

        if validation == False :               
            losses.backward()
            optimizer.step()

        torch.cuda.empty_cache()

The only thing that changes between training and validation is that during training I set validation = False, so I enter the if block. During validation I set validation = True, so I don’t enter the if block.

So, setting validation to True gave me this error:

RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 15.90 GiB total capacity; 14.53 GiB already allocated; 25.75 MiB free; 14.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Does anyone have an idea how to fix it, please?

JosephC · December 12, 2021, 11:39pm

I think if you want to disable the gradients for validation, you should wrap the validation step in torch.no_grad() rather than just conditionally skipping the calls to losses.backward() and optimizer.step(). Or, alternatively, call model.eval() before running the evaluation step and then call model.train() when you’re done.

Nathane_Berrebi · December 13, 2021, 1:10pm

Hello @JosephC, thanks for your answer! But if I call model.eval(), I can no longer pass the targets in the call model(images, targets).

Also, if you know how to do it, I would like to get both the predictions and the loss in training mode. But when I call model(images, targets) I only get the loss.

Thanks for your help !

mMagmer · December 13, 2021, 2:38pm

Because autograd records a graph on every forward pass (so that it can compute gradients later, even across multiple passes),
as @JosephC said, you should always use

with torch.no_grad():
    metric = evaluate(model)

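In the training pass, because you call backward(), autograd frees the intermediate data it saved for the gradient as soon as the gradient has been computed. In your validation pass backward() is never called, so that data stays in memory for as long as the loss tensors are still referenced.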
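A minimal sketch of the difference, using a small nn.Linear as a stand-in for the real model (the layer and tensor sizes are arbitrary choices for illustration). It only shows that a forward pass inside torch.no_grad() records no autograd graph, which is why it holds on to less memory:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
x = torch.randn(8, 4)

# Normal forward pass: autograd records a graph and keeps the saved
# intermediates alive until backward() runs or the output is released.
out_with_graph = model(x)

# Forward pass under no_grad: no graph is recorded for this output.
with torch.no_grad():
    out_no_graph = model(x)

print(out_with_graph.requires_grad, out_with_graph.grad_fn is not None)
print(out_no_graph.requires_grad, out_no_graph.grad_fn is None)
```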

JosephC · December 13, 2021, 11:03pm

Without seeing more of the code it’s a little hard to say for sure, but hearing that model is taking both the inputs and the targets is a little unusual. Normally, we want the model to only take the inputs. The loss function should be something separate.

This is some pseudo-code with a little more of the best practice. Note that the loss_fn is separate from the model. The reason for this is we don’t really care about keeping the loss function when storing the model – we just want to use it.

Example:

def train_one_epoch(model, optimizer, data_loader, device, validation):
    # Pseudo-code: assumes a model whose forward() takes only the inputs.
    model.train()
    loss_fn = torch.nn.BCELoss()  # placeholder; use the loss that fits your task
    for i, values in enumerate(data_loader):
        images, targets = values
        images = list(image.to(device) for image in images)
        targets = [{'boxes' : t['boxes'] , 'labels' : t['labels'].to(device)} for t in targets ]

        if validation:
            torch.cuda.empty_cache()
            model.eval()  # Optional, or use the torch.no_grad()
            with torch.no_grad():
                predictions = model(images)
                # The loss compares the predictions (not the images) with the targets.
                losses = loss_fn(predictions, targets)
                # Do something with losses.
            model.train()
        else:
            optimizer.zero_grad()
            predictions = model(images)
            losses = loss_fn(predictions, targets)
            losses.backward()
            optimizer.step()

That being said, you may be able to work around this issue with only a few code changes by doing any of the following:

Reduce your batch size to something smaller.
Perform a validation on a smaller set of items.

Those would be band-aids, but they’re something to try. I’d do the no_grad thing first.

Nathane_Berrebi · December 22, 2021, 1:02pm

Thank you very much @JosephC! But when I ran this code in training mode (validation = False), I got this error:

2 frames
<ipython-input-19-3c77990fe1dc> in train_one_epoch_forum_pytorch_(model, optimizer, data_loader, device, validation)
     20         else:
     21             optimizer.zero_grad()
---> 22             predictions = model(images)
     23             print(predictions)
     24             loss_dict = loss_fn(images, targets)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
     55         """
     56         if self.training and targets is None:
---> 57             raise ValueError("In training mode, targets should be passed")
     58         if self.training:
     59             assert targets is not None

ValueError: In training mode, targets should be passed

How can I get the predictions in train mode?

JosephC · December 23, 2021, 4:49am

@Nathane_Berrebi, looking at the error:

/usr/local/lib/python3.7/dist-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
     55         """
     56         if self.training and targets is None:
---> 57             raise ValueError("In training mode, targets should be passed")
     58         if self.training:
     59             assert targets is not None

ValueError: In training mode, targets should be passed

If self.training is True, then targets cannot be None. We want targets to be None, so self.training must be set to false. To do this, we must call model.eval() at the start of the validation section.
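One possible way to get predictions in the middle of a training run is a small helper that switches to eval mode, runs the forward pass without gradients, and then restores whatever mode the model was in. The name predict is hypothetical, not part of torchvision; the sketch only assumes a model that returns predictions from model(images) in eval mode, as the traceback above suggests:

```python
import torch


def predict(model, images):
    """Run inference, then restore the model's previous train/eval mode."""
    was_training = model.training
    model.eval()  # in eval mode the model accepts images without targets
    try:
        with torch.no_grad():  # no graph is recorded, so memory use stays low
            return model(images)
    finally:
        model.train(was_training)
```

Inside the training loop this could be called as predictions = predict(model, images) right after the optimizer step, at the cost of a second forward pass per batch.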

my3bikaht · December 23, 2021, 6:31am

Just an opinion here. I prefer to write the validation loop separately from the training loop. The resulting code may not be as compact as a combined loop, but it is much easier to see what goes where.
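A separate validation loop could look roughly like the sketch below. The function name validate_one_epoch is hypothetical. It assumes a detection model like the one in this thread, which returns a dict of losses from model(images, targets) only in train mode, so the model is kept in train mode and gradients are disabled with torch.no_grad() instead. Because the model stays in train mode, any dropout or non-frozen batch norm layers still behave as in training, so treat the resulting number as an approximation:

```python
import torch


def validate_one_epoch(model, data_loader, device):
    """Return the mean summed loss over the loader without building a graph."""
    was_training = model.training
    model.train()  # assumption: losses are only returned in train mode
    total, batches = 0.0, 0
    with torch.no_grad():
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)
            # .item() converts to a Python float, so no GPU tensor is kept around.
            total += sum(loss for loss in loss_dict.values()).item()
            batches += 1
    model.train(was_training)
    return total / max(batches, 1)
```

Accumulating Python floats rather than loss tensors matters here: holding on to the tensors themselves would keep GPU memory allocated across iterations.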