Evaluation runs out of CUDA memory on the evaluation step

Hi all,

I am creating a Mask R-CNN model to detect and mask different sections of dried plants in images. The images we are dealing with are quite large. My model trains without running out of memory, but runs out of memory during evaluation, specifically on the outputs = model(images) inference step. My training and evaluation steps are in separate functions, and the evaluation function has the @torch.no_grad() decorator. The batch size for both training and evaluation is 1.

I’m not sure why my model would be able to train without running out of memory but fail during evaluation.

I have generally followed the steps here, using the same structure, engine and such.

actual error:

  File "at025_main.py", line 307, in <module>
    main()
  File "at025_main.py", line 238, in main
    evaluate(model, data_loader_test, device=device)
  File "/home/a.kia5/.conda/envs/at025/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/lustrehome/home/a.kia5/at025/mask_rcnn_pytorch/engine.py", line 93, in evaluate
    outputs = model(images)
  File "/home/a.kia5/.conda/envs/at025/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/a.kia5/.conda/envs/at025/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 99, in forward
    detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
  File "/home/a.kia5/.conda/envs/at025/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 233, in postprocess
    masks = paste_masks_in_image(masks, boxes, o_im_s)
  File "/home/a.kia5/.conda/envs/at025/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 479, in paste_masks_in_image
    ret = torch.stack(res, dim=0)[:, None]
RuntimeError: CUDA out of memory. Tried to allocate 6.84 GiB (GPU 0; 15.78 GiB total capacity; 7.63 GiB already allocated; 6.59 GiB free; 8.00 GiB reserved in total by PyTorch)

Any help would be appreciated.

Python uses function scoping, so unless you wrap the training and validation steps into their own functions, the tensors from previous iterations might be kept alive.
Since you are running OOM during the validation I would guess that you are still holding references to some training tensors (and maybe even the computation graph), which would thus need additional memory to be able to perform the validation run.
Let me know, if this might be the case.
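
To illustrate the scoping point, here is a minimal sketch. It uses a hypothetical FakeTensor class as a stand-in for a real tensor (it is not a PyTorch API) and weak references to check what is still alive. It assumes CPython's reference counting: an object created inside a function is released when the function returns, while a variable assigned in a top-level loop stays referenced after the loop ends.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a large tensor (not a PyTorch class)."""


def step_in_function():
    # 'batch' is local, so its only strong reference disappears on return.
    batch = FakeTensor()
    return weakref.ref(batch)


ref_from_function = step_in_function()
gc.collect()
function_local_freed = ref_from_function() is None

# Outside a function, the loop variable survives the loop itself.
for _ in range(3):
    batch = FakeTensor()
ref_from_loop = weakref.ref(batch)
gc.collect()
loop_variable_alive = ref_from_loop() is not None

print(function_local_freed, loop_variable_alive)
```

With real tensors the same would apply to names such as the last loss or output left over from a top-level training loop, which is why wrapping each phase in its own function helps.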

Hi ptrblck,

Thanks for the suggestion. Unfortunately I don’t think it’s that, as my training loop is this:

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, data_loader_train, device, epoch, print_freq=1)
    print('Epoch done - Beginning evaluation')
    evaluate(model, data_loader_test, device=torch.device('cpu'))

I have added empty cache statements to try to fix this problem, and the train and evaluate functions are available here.

OK, so it shouldn’t be a “scope” issue.
What happens if you swap the training and validation calls? If you are also running OOM, this would mean that the validation call uses too much memory and there wouldn’t be any interaction with the training loop.

I have isolated the evaluation step and it still runs out of memory in the same way, regardless of the training step. I don’t understand, though, why the evaluation step would use more memory than training.

It depends on your setup, e.g. are you increasing the batch size during evaluation? Are you wrapping the code in a with torch.no_grad()/inference_mode() block to save memory?

Both my dataloaders have a batch size of 1

    data_loader_train = torch.utils.data.DataLoader(

    data_loader_test = torch.utils.data.DataLoader(

Also, the evaluation function has the @torch.no_grad() decorator, as per the torchvision detection engine code.

Can you try removing the lr_scheduler()? I was having issues with that before. You may also need to consider adding .detach() to your model outputs before any evaluation metrics. Additionally, in an RNN, if I recall, you should be detaching the hidden layers between runs or the graph keeps getting expanded.

Lastly, is there a reason for moving the evaluation to the cpu?

That’s strange indeed. Are you using the same sample dimensions (in case you are resizing the inputs) or is there any other difference between the DataLoaders? If not, could you just swap the DataLoaders and see if the evaluation method would still fail while training works?


I’m running tests without the learning rate scheduler now. The OOM error occurs on the inference step (i.e. output = model(image)), so detaching the outputs wouldn’t make a difference, as that would only happen after inference is complete. Also, this error occurs even when only the evaluation is run.

Yeah sorry, that was because I am currently running tests where evaluation takes place on the CPU only and hadn’t changed the code back properly. FYI the training loop does work without OOM errors when the model is moved to the CPU and then evaluated there (ofc, as that has nothing to do with CUDA).


All the images are uniform (taken from the same camera in the same conditions). I have a custom dataset class that gets instantiated twice (once for test, once for train), so they are functionally identical. I will try running now with the dataloaders swapped and see if it makes a difference. There is high demand on the cluster at the moment, so I will get back to you when it completes.


Switching the DataLoaders made no difference: both failed during the evaluation step and training worked fine, so it must be something wrong with the evaluation code. It shouldn’t be any specific example, as the test/train examples are chosen randomly at the initialisation stage. There must be some kind of fundamental difference between the inference and training of the model that is resulting in the OOM error.

Also removing the LR scheduler made no difference.

Can you share the training/validation code? I think you should compare train_one_epoch vs. evaluate; maybe there is some difference there.
So the images are the same size for both train and evaluate, right?
Are there different transformations you use before training and before evaluation?


The training/validation engine code I’m using is here. I have edited it slightly to include some torch.cuda.empty_cache() statements, but it is functionally the same code.

All of the images are of identical size and from the same camera. There are differences in the number of objects per image, but not by a huge amount. Also, which images are used for training and which for testing is chosen randomly each time at runtime, so I don’t believe it is one sample causing issues. The test and train dataloaders are now both using shuffle=True, and the OOM always occurs the first time evaluation is called.

I did have a thought.

The images are quite large, and at the moment I’m having the Mask R-CNN model perform resizing through the GeneralizedRCNNTransform that gets called through forward() line 130. Could it be that forward is not getting called in the inference step, so the resizing isn’t occurring and the images are going in at full size (3600*5100), therefore OOM?

I’m relatively new to PyTorch and so don’t fully understand the use of every function, and I’m not sure if forward is only used in training or what.
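
A quick back-of-the-envelope check may help with this question. forward() is what runs whenever model(images) is called, in training and in eval mode alike, and the traceback above fails inside transform.postprocess, in paste_masks_in_image, which is called with the original image sizes. The sketch below estimates the size of the stacked full-resolution masks under two assumptions that are not confirmed in this thread: float32 masks, and 100 detections per image (I believe that is the torchvision default for box_detections_per_img). The function name pasted_masks_gib is made up for illustration.

```python
def pasted_masks_gib(height, width, num_detections=100, bytes_per_element=4):
    """Estimate the size in GiB of num_detections full-size masks.

    Assumes one (height x width) mask per detection, stored with
    bytes_per_element bytes per pixel (4 for float32).
    """
    total_bytes = height * width * num_detections * bytes_per_element
    return total_bytes / 2**30


estimate = pasted_masks_gib(3600, 5100)
print(round(estimate, 2))  # 6.84
```

Under those assumptions the estimate lines up with the "Tried to allocate 6.84 GiB" in the error message, which would suggest the memory goes into pasting the masks back to full resolution rather than into the network itself. The estimate scales linearly with the number of detections.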

Can you take the work with threads out of the evaluation?

You are not working with threads in training.

I would try making the evaluation exactly the same as training, just adding torch.no_grad() and making sure the model is in eval mode.

Also, I think you should define the COCO API once, and not every time you enter validation:
coco = get_coco_api_from_dataset(data_loader.dataset)

In training, why do you define the loss as the result of the model (loss_dict = model(images, targets))?
You should do the same thing in both training and eval:
outputs = model(inputs)
and then calculate the losses.
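
On the loss_dict question: as far as I know, torchvision detection models return a dict of losses when called in training mode with targets, and a list of per-image detections in eval mode, which would explain the loss_dict = model(images, targets) line in the training code. The toy class below is a hypothetical pure-Python stand-in (ToyDetector and its return values are made up, it is not torchvision code) that only mimics this mode-dependent return type.

```python
class ToyDetector:
    """Hypothetical stand-in that mimics a mode-dependent forward pass."""

    def __init__(self):
        self.training = True

    def train(self):
        self.training = True
        return self

    def eval(self):
        self.training = False
        return self

    def __call__(self, images, targets=None):
        if self.training:
            if targets is None:
                raise ValueError("targets are required in training mode")
            # Training mode: only losses come back, no detections.
            return {"loss_classifier": 0.0, "loss_mask": 0.0}
        # Eval mode: one dict of predictions per image, no losses.
        return [{"boxes": [], "labels": [], "masks": []} for _ in images]


model = ToyDetector()
train_out = model.train()(["img"], targets=[{}])
eval_out = model.eval()(["img"])
print(type(train_out).__name__, type(eval_out).__name__)  # dict list
```

If the real model behaves this way, the training and evaluation calls cannot be made identical, since the losses are only available from the training-mode call.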