RuntimeError: CUDA out of memory. Tried to allocate 450.00 MiB

Hello PyTorch-Community,

I am very new to PyTorch and am trying to get this segmentation network running on my notebook with a GeForce GTX 1650. I’d be glad if you could give me some hints.

The following error occurs:

RuntimeError: CUDA out of memory. Tried to allocate 450.00 MiB (GPU 0; 3.82 GiB total capacity; 2.08 GiB already allocated; 182.75 MiB free; 609.42 MiB cached)

It obviously means that I don’t have enough memory on my GPU. But I don’t understand why, because 3.8 GB - 2 GB - 600 MB = 1.2 GB of free space, not 180 MB. In similar questions people say that this is due to fragmentation. But how does that make sense when I only load a pretrained model onto my GPU? Where do the 2 GB of occupied space come from?
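One way to sanity-check the numbers in the error message is to parse them and do the subtraction explicitly. The sketch below is plain Python and only does arithmetic on the message quoted above. The helper names (`to_mib`, `parse_oom_message`) are made up for illustration, and reading the remainder as memory held outside PyTorch's allocator (for example the CUDA context or other processes) is an assumption, not something the message itself states.

```python
import re

# Error message copied verbatim from the post above.
MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 450.00 MiB "
    "(GPU 0; 3.82 GiB total capacity; 2.08 GiB already allocated; "
    "182.75 MiB free; 609.42 MiB cached)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a (number, unit) pair from the message into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom_message(message):
    """Return the sizes named in a CUDA OOM message, all in MiB."""
    size = r"([\d.]+) (KiB|MiB|GiB)"
    patterns = {
        "requested": rf"Tried to allocate {size}",
        "total": rf"{size} total capacity",
        "allocated": rf"{size} already allocated",
        "free": rf"{size} free",
        "cached": rf"{size} cached",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{key}' in message")
        result[key] = to_mib(match.group(1), match.group(2))
    return result


stats = parse_oom_message(MESSAGE)
# What the subtraction in the post expects to be free.
expected_free = stats["total"] - stats["allocated"] - stats["cached"]
# Memory that none of the reported figures account for (assumed to be
# held outside PyTorch's allocator).
unaccounted = expected_free - stats["free"]

print(f"expected free: {expected_free:.2f} MiB")
print(f"reported free: {stats['free']:.2f} MiB")
print(f"unaccounted:   {unaccounted:.2f} MiB")
```

With the figures from the message, the remainder comes out at roughly 990 MiB that neither the allocated nor the cached number covers.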

Then I open up nvidia-smi, which makes me wonder even more, as it says that only 10% are occupied:

| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 1650    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8     1W /  N/A |    285MiB /  3914MiB |     10%      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0      1214      G   /usr/lib/xorg/Xorg                            28MiB |
|    0      1720      G   /usr/lib/xorg/Xorg                           105MiB |
|    0      1984      G   /usr/bin/gnome-shell                         103MiB |

I’m also using PyTorch 1.1 with Torchvision 0.3; the network does not work with newer versions due to some boolean changes.

Here is a short summary of what is called (I execute this script):

    # (...)
    model = DistributedDataParallel(model.cuda(device), device_ids=[device_id], output_device=device_id)
    # (...)
    # (...)
    # Output of torch.cuda.memory_allocated: 512089088 (512,089,088 bytes)
    for it, batch in enumerate(dataloader):  # Batch Size 1
        with torch.no_grad():
            # (...)
            torch.cuda.empty_cache() # Does not change allocated memory

            # Here RuntimeError occurs:
            _, pred, _ = model(img=img, do_loss=False, do_prediction=True)

I use pretrained weights and don’t have the resources to train it from scratch with a different architecture. Maybe the model is too large for my GPU, but it only needs 500 MB in the beginning plus 450 MB when doing the prediction. How can it be that it does not fit on my GPU?

I’d be glad if you have any further tips I can look into.

The intermediate tensors, which are needed to calculate the gradients, the gradients themselves, as well as the CUDA context will all use memory on your device, which could explain the OOM issue.

You could try to decrease the batch size, if possible, and check the memory usage for a smaller batch size. Alternatively, you could use torch.utils.checkpoint to trade compute for memory.

Also, you could try to run your code in e.g. Colab, which would give you a GPU with more memory for a limited time, and check the memory usage there for your current code.

Thank you very much for your answer. I got it running in Colab; they use a GPU with 8 GB of memory, and the network is using nearly all of it.

The batch size was already 1. torch.utils.checkpoint is used when training the network, if I understood correctly. I’m using a pretrained network and I’m only doing the forward pass in torch.no_grad() mode, so it should have no effect, right?

Yes, checkpoints are normally used while training the model. If you’re not training your model, then there shouldn’t be any gradients to compute, and hence GPU memory usage should be low. However, what would still be on the GPU is your pretrained model, the input tensors, the intermediate tensors and the output tensors. If you’re only using batch size 1, there shouldn’t be such a problem unless you’ve mistakenly kept all your input data on the GPU or your input tensors are unreasonably large, but that sounds improbable.
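To get a feel for how large intermediate tensors can be even at batch size 1, their size can be estimated from the shape alone. The sketch below is plain Python. The feature-map shape is hypothetical and was chosen only because it happens to come out at 450 MiB; it is not taken from the network in question. It also assumes float32 activations (4 bytes per element).

```python
from math import prod

BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Estimate the memory of a dense tensor with the given shape, in MiB."""
    return prod(shape) * bytes_per_element / MIB


# Hypothetical feature map: batch 1, 256 channels, 480 x 960 pixels.
feature_map = (1, 256, 480, 960)
print(f"{tensor_mib(feature_map):.2f} MiB")

# The value reported by torch.cuda.memory_allocated in the post, in MiB.
print(f"{512089088 / MIB:.2f} MiB")
```

A single activation of that size would already match the failed 450 MiB request, so a segmentation network running on high-resolution images can plausibly need several such tensors at once during the forward pass.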

What you can do is monitor the GPU memory usage after each line of code that you think would take up memory. You can use the GPUtil library:

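    import GPUtil
    GPUtil.showUtilization()  # GPU load and memory before the forward pass
    out = model(input)
    GPUtil.showUtilization()  # and again afterwards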
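If installing GPUtil is not an option, the same device-wide numbers can be read from nvidia-smi directly. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the function names are made up for illustration. The parser is demonstrated on a sample line that matches the nvidia-smi table shown earlier in the thread, so the block itself runs without a GPU.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'used, total' lines (MiB) into a list of (used, total) ints."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return one (used MiB, total MiB) pair per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(out.stdout)


# Sample output for a single GPU, using the figures from the table above.
sample = "285, 3914\n"
for used, total in parse_memory_csv(sample):
    print(f"{used} MiB / {total} MiB ({100 * used / total:.1f}% used)")
```

Calling `query_gpu_memory()` before and after the forward pass would show how much the device-wide usage grows. Unlike torch.cuda.memory_allocated, this figure also includes the CUDA context and memory used by other processes.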

Thank you, I will try it!

You may also be interested in this:

Large Model Support allows you to overcommit GPU memory by using host memory as a swap space for inactive tensors.