CUDA OOM despite having gigabytes of free space

I’m getting an OOM error even though the error message itself says I have plenty of room:

RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0;
7.93 GiB total capacity; 1.26 GiB already allocated; 5.95 GiB free; 1.28 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory
try setting max_split_size_mb to avoid fragmentation.  See documentation for
Memory Management and PYTORCH_CUDA_ALLOC_CONF

This is happening for me in a pretty normal training loop, starting from a pretrained torchvision model (it doesn’t matter whether I use a ResNet or a ViT). I can include it here if that’s useful, but first I’d really like to understand how it’s even possible to see an error like this.
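
In case it helps frame the question, here’s how I’ve been interpreting the numbers in that message (a rough sketch using the standard torch.cuda queries; if I’ve understood correctly, memory_allocated/memory_reserved are the caching allocator’s own counters, while mem_get_info asks the driver for free/total, so these should line up with the figures in the error):

import torch


def print_memory_state():
    gib = 2 ** 30
    # Bytes currently occupied by live tensors, according to PyTorch's caching allocator.
    allocated = torch.cuda.memory_allocated()
    # Bytes the caching allocator has reserved on the device (live tensors + cached blocks).
    reserved = torch.cuda.memory_reserved()
    # Free/total device memory as reported by the driver itself.
    free, total = torch.cuda.mem_get_info()
    print(
        f"allocated={allocated / gib:.2f} GiB, "
        f"reserved={reserved / gib:.2f} GiB, "
        f"free={free / gib:.2f} GiB, total={total / gib:.2f} GiB"
    )


if __name__ == "__main__":
    print_memory_state()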

version info

  • PyTorch 1.12.1
  • torchvision 0.13.1
  • CUDA 11.4
  • driver version 470.141.03

Could you post an executable code snippet that reproduces the issue, as well as more information about your setup (in particular, which GPU you are using and which CUDA runtime is in the binaries)?

I’m able to reproduce this error very easily:

import torch
import torchvision.models


def main():
    device = "cuda:0"
    model = torchvision.models.resnet50().to(device=device)

    data = torch.rand(64, 3, 224, 224, device=device)

    model(data)


if __name__ == "__main__":
    main()

Running that produces exactly the following:

RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 7.93
GiB total capacity; 2.05 GiB already allocated; 5.16 GiB free; 2.07 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation.  See documentation for
Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have a GTX 1070; here are more details:

output of lspci
$ sudo lspci -v | grep -A 19 VGA
02:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Micro-Star International Co., Ltd. [MSI] GP104 [GeForce GTX 1070]
	Physical Slot: 6
	Flags: bus master, fast devsel, latency 0, IRQ 55, NUMA node 0
	Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Memory at f0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

You asked for more details about my CUDA runtime version, but I’m not quite sure what you mean beyond what I already reported above.
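
If what you mean is the CUDA version the binaries were built against (as opposed to the driver’s CUDA version that nvidia-smi reports), this is what I can run to check; happy to post the output:

import torch

print("torch:", torch.__version__)
# CUDA runtime the binaries were built against, not the driver version from nvidia-smi.
print("CUDA (binaries):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("device:", torch.cuda.get_device_name(0))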

I’ve tried restarting the machine a few times, and nothing has changed.

Was this system working before at all, or is this a new one?
Did you update any drivers recently?

Based on the error message, I assume the setup is somehow corrupt and would recommend reinstalling the latest drivers. As another test, you could also run some workloads in Docker containers and see whether the behavior changes or the same error is raised.

This was a working setup as of a few weeks ago, but I haven’t used it in a while, so I’m not sure whether I’ve updated any drivers since it last worked.

I tried running the same script in a Docker container and it worked! How do I narrow down the problem now? When I run nvidia-smi, I see exactly the same driver and CUDA versions on the host and in the container.

To be clear, it fails on the host whether I use a venv with torch from pip or a conda env with torch installed from conda, and it succeeds in the container in both of those scenarios.
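
To try to narrow it down, my plan is to run PyTorch’s stock environment report in both places and diff the output. It’s normally invoked as python -m torch.utils.collect_env; calling it from a script like this is just for convenience, assuming main() stays callable in this version:

# Run this once in the host env and once inside the container, then diff the outputs.
from torch.utils import collect_env

collect_env.main()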