RuntimeError: CUDA out of memory with a huge amount of free memory

While training the model for image colorization, I encountered the following problem:

RuntimeError: CUDA out of memory. Tried to allocate 304.00 MiB (GPU 0; 8.00 GiB total capacity; 142.76 MiB already allocated; 6.32 GiB free; 158.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

As we can see, the error occurs while trying to allocate 304 MiB even though 6.32 GiB is reported as free! What is the problem? The suggested option is to set max_split_size_mb to avoid fragmentation. Will it help, and how do I set it correctly?
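From what I understand, the option is set through the PYTORCH_CUDA_ALLOC_CONF environment variable before CUDA is initialized; a minimal sketch (the 128 MiB value is only an example, not a tuned recommendation):

import os

# Must be set before the first CUDA call in the process;
# 128 is a placeholder value in MiB, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

x = torch.randn(1, device="cuda")  # the allocator reads the setting on first CUDA use

On Windows the variable can also be set in the shell before starting Python, e.g. set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128.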
My batch size = 40
This is my version of PyTorch:

torch==1.10.2+cu113
torchvision==0.11.3+cu113
torchaudio==0.10.2+cu113

Could you post a minimal, executable code snippet to reproduce this issue as well as the output of python -m torch.utils.collect_env, please?

Thank you for the quick response, and I hope you can help; this is important for my university thesis.
This is the python -m torch.utils.collect_env console output:

PyTorch version: 1.10.2+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19041-SP0
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2070
Nvidia driver version: 511.23
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.10.2+cu113
[pip3] torchaudio==0.10.2+cu113
[pip3] torchvision==0.11.3+cu113
[conda] blas 1.0 mkl
[conda] cpuonly 1.0 0 pytorch
[conda] cudatoolkit 11.3.1 h59b6b97_2
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py36h196d8e1_0
[conda] mkl_fft 1.3.0 py36h46781fe_0
[conda] mkl_random 1.1.1 py36h47e9c7a_0
[conda] numpy 1.19.2 py36hadc3359_0
[conda] numpy-base 1.19.2 py36ha3acd2a_0
[conda] pytorch-mutex 1.0 cpu pytorch
[conda] torch 1.10.2+cu113 pypi_0 pypi
[conda] torchaudio 0.10.2+cu113 pypi_0 pypi
[conda] torchvision 0.11.3 pypi_0 pypi

I don’t know if I’m allowed to post GitHub links here, but that would be the best way to answer your request. I’m trying to run code from this repository, specifically this script:
GitHub repository

Thanks for the update. Could you try to narrow down the code and post a minimal, executable snippet that we could use to reproduce the issue?

I think I ran into a similar issue when loading two models.
Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB. GPU 0 has a total capacty of 23.99 GiB of which 12.54 GiB is free. Of the allocated memory 9.92 GiB is allocated by PyTorch, and 6.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Minimal code that reproduces the error:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0 to this process

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoModelForSeq2SeqLM

quantized_model_dir = 'f:/models/quantized_models/agptq_orca13b_8bit_noDesc_act'

# Translation model, intentionally kept on the CPU
model1 = AutoModelForSeq2SeqLM.from_pretrained(
    "f:/models/facebook/nllb-200-distilled-600M", device_map='cpu'
)
# Quantized causal LM on GPU 0; loading this second triggers the OOM
model2 = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

Notice that I set device_map to cpu for model1, but it still caused the problem.
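A small sanity check along these lines can confirm where model1 actually lives (my own addition, not part of the original script):

import torch

# Confirm that none of model1's parameters landed on the GPU
print({p.device.type for p in model1.parameters()})  # expected: {'cpu'}
print(f"{torch.cuda.memory_allocated(0) / 2**20:.1f} MiB allocated on GPU 0")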

I got the problem solved by loading model2 before model1. I am not sure whether "loading the larger model first" is good practice.
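For reference, this is the load order that worked for me, using the same paths and calls as in the snippet above:

# Load the GPU-resident quantized model first ...
model2 = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
# ... and only then the CPU-resident translation model.
model1 = AutoModelForSeq2SeqLM.from_pretrained(
    "f:/models/facebook/nllb-200-distilled-600M", device_map='cpu'
)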

Thanks