CUDA out of memory when increasing num_workers in DataLoader

Hi,

I am facing a problem with DataLoader. I am training a classification model; the code runs normally with num_workers equal to 0, but it raises a CUDA out of memory error when I increase num_workers.

My GPU: RTX 3090
Pytorch version: 1.8.0.dev20201104 - pytorch-nightly
Python version: 3.7.9
Operating system: Windows
CUDA version: 10.2

This case consumes 19.5 GB of GPU VRAM:

train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=16,
                              shuffle=True,
                              num_workers=0)

This case returns: RuntimeError: CUDA out of memory. Tried to allocate 90.00 MiB (GPU 0; 24.00 GiB total capacity; 13.09 GiB already allocated; 5.75 GiB free; 13.28 GiB reserved in total by PyTorch)

train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=16,
                              shuffle=True,
                              num_workers=8)

I could understand running out of PC memory, but running out of CUDA memory seems strange. Could it be because of the nightly version?

Update:

I just installed CUDA 11.1.
The GPU memory usage is the same with num_workers = 0, 2, or 4, but it still runs out of CUDA memory with 8.

Are you pushing the data to the GPU inside the Dataset (in the __init__ or __getitem__)?
If so, increasing the number of workers would also increase the GPU memory usage, since each worker would push the data to the device.
If that's not the case, could you post an executable code snippet so that we can reproduce the issue, as the device memory shouldn't increase while the data is loaded on the CPU.
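For illustration, a minimal sketch with two hypothetical dataset classes showing the difference:

import torch
from torch.utils.data import Dataset

class GpuPushDataset(Dataset):
    # Anti-pattern: every DataLoader worker allocates CUDA memory here.
    def __init__(self, data, labels):
        self.data, self.labels = data, labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        # Avoid this: .to('cuda') inside __getitem__ runs in each worker process.
        return self.data[i].to('cuda'), self.labels[i]

class CpuOnlyDataset(Dataset):
    # Preferred: return CPU tensors and call .to(device) once per batch in the training loop.
    def __init__(self, data, labels):
        self.data, self.labels = data, labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i], self.labels[i]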


@ptrblck thank you for the quick reply. I don't push the data to the GPU in the Dataset.

This is my Dataset:

import os
import cv2
import numpy as np 
import pandas as pd

import torch
import torch.nn as nn
from torch.utils.data import Dataset

from matplotlib import pyplot as plt
import albumentations.pytorch as AT
from albumentations import (
    RandomRotate90, Flip, Transpose, GaussNoise, Blur, VerticalFlip, HorizontalFlip,  \
    HueSaturationValue, RGBShift, RandomBrightness, Resize, Normalize, Compose, CenterCrop)
from PIL import Image

def transforms(size_image):
    return Compose([
        Resize(height = size_image[0], width = size_image[1]),
        # Red - Green - Blue right now
        Normalize(mean=(0.406, 0.515, 0.323), std=(0.195, 0.181, 0.178)),
        AT.ToTensor()
    ])

def augmentation(size_image,p=0.5):
    return Compose([
        RandomRotate90(),
        Flip(),
        Transpose(),
        GaussNoise(),
        Blur(),
        VerticalFlip(),
        HorizontalFlip(),
        HueSaturationValue(hue_shift_limit=5, sat_shift_limit=15, val_shift_limit=10),
        RGBShift(r_shift_limit=10, g_shift_limit=10, b_shift_limit=10),
        RandomBrightness(limit = 0.05),
        CenterCrop(height = 150, width = 150, p = 0.5)    
    ], p=p)

class PyTorchImageDataset(Dataset):
    def __init__(self, image_list, train, labels, size_image, **kwargs):
        self.image_list = image_list
        self.transforms = transforms(size_image)
        self.labels = labels
        self.train = train
        self.augment = augmentation(size_image)
        import json
        with open('C:/Users/user/name2path.json', 'r') as f: 
            self.image_path_dir = json.load(f)

    def __len__(self):
        return len(self.image_list)
    
    def __getitem__(self, i):
        image_path = self.image_path_dir[self.image_list[i]]
        image = np.array(Image.open(image_path))
        label = self.labels[i]
        if self.train:
            image = self.augment(image=image)['image']   
        image = self.transforms(image = image)['image']
        return image, label

    def isImage(self, path):
        all_image_ext = ["jpg", "gif", "png", "tga", "jpeg"]
        return path.split('.')[-1].lower() in all_image_ext

And my training code:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
import sys

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim as optim
from torch.utils.data import DataLoader

from models import se_resnext50_32x4d
from tqdm import tqdm

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from utilities import *

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
params = {'workers': 4,
            'batch_size': 16,
            'num_epochs': 100,
            'lr': 0.001,
            'size_image': [480, 768],
            'checkpoint': True,
            'save_path': 'C:/Users/user/save',
            'dataFrame_path': 'C:/Users/user/train.csv',
            'max_patience': 5}

checkDir(params['save_path'])
df = pd.read_csv(params['dataFrame_path'])


IDs = df.id.to_list()
labels = df.landmark_id.to_list()
train ,val, y_train,y_val = train_test_split(IDs, labels, test_size = 0.2, random_state = 42, shuffle = True)

train_dataset = PyTorchImageDataset(image_list=train,
                                    labels=y_train,
                                    train=True,
                                    size_image=params['size_image'])
train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=params['batch_size'],
                              shuffle=True,
                              num_workers=params['workers'])

val_dataset = PyTorchImageDataset(image_list=val,
                                  labels=y_val,
                                  train=False,
                                  size_image=params['size_image'])
val_dataloader = DataLoader(dataset=val_dataset,
                            batch_size=params['batch_size'],
                            shuffle=True,
                            num_workers=params['workers'])

dataloader = {'train': train_dataloader, 'valid': val_dataloader}

model = se_resnext50_32x4d()
model.to(device)
model.train()

criterion = nn.CrossEntropyLoss()
softmax = nn.Softmax()
optimizer = optim.Adam(model.parameters(), lr=params['lr'], betas=(0.9, 0.999))

patience = 0
last_acc = -1
best_acc = 0.3

for epoch in range(params['num_epochs']):
    for phase in ['train','valid']:
        if phase == 'train':
            running_loss = 0
            for i, data in enumerate(tqdm(dataloader[phase])):
                optimizer.zero_grad()
                data_batch = data[0].to(device)
                b_size = data_batch.size(0)                
                label = data[1].type(torch.long)
                label = label.to(device)
                output = model(data_batch)
                prob = softmax(output)
                loss = criterion(output, label)
                running_loss+= loss.item()
                loss.backward()
                optimizer.step()  
            print('epoch %d train loss: %.3f' %(epoch + 1, float(running_loss)/(1+i)))

        else:
            running_loss = 0
            with torch.no_grad():
                for i, data in enumerate(tqdm(dataloader[phase])):
                    data_batch = data[0].to(device)
                    b_size = data_batch.size(0)
                    label = data[1].type(torch.long)
                    label = label.to(device)
                    output = model(data_batch)
                    prob = softmax(output)
                    loss = criterion(output, label)
                
                    if phase == 'valid' and i == 0:

                        valid_label = label.cpu().detach().numpy()
                        
                        valid_prob = prob.cpu().detach().numpy()
                                                    
                        running_loss+= loss.item()
                    else:
                        torch.cuda.synchronize() 

                        temp_label = label.cpu().detach().numpy()

                        valid_label = np.concatenate((valid_label, temp_label))

                        torch.cuda.synchronize() 

                        temp_prob = prob.cpu().detach().numpy()

                        valid_prob = np.concatenate((valid_prob, temp_prob), axis = 0)

                        running_loss+= loss.item()
    if phase == 'valid':
        
        last = 0
        torch.cuda.synchronize() 
        predict_labels = np.argmax(valid_prob, axis=1)
        acc = accuracy(valid_label, predict_labels)
        loss = round(float(running_loss)/(i + 1), 4)
        print('epoch %d valid acc: %.3f' %(epoch + 1, acc))
        print('epoch %d valid loss: %.3f' %(epoch + 1, loss))
        
        path_name = os.path.join(params['save_path'], 'se_resnext_' + str(last + epoch + 1) + '_loss_' + str(loss) + '_acc_' + str(acc) + '.pth')
        if acc > best_acc:
            torch.save(model.state_dict(), path_name)
            best_acc = acc
        if acc < last_acc:
            patience+=1
        elif acc >= last_acc:
            patience = 0
        if patience == params['max_patience']:
            torch.save(model.state_dict(), path_name)
            sys.exit()
        last_acc = acc

@ptrblck
Hi, can you confirm whether this is my mistake or something wrong in PyTorch?

Thanks for the code.
I've removed the training part, since based on your initial post the DataLoader workers alone should already increase the memory usage.
Using this code snippet:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
import sys

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim as optim
from torch.utils.data import DataLoader

import numpy as np

from torch.utils.data import Dataset

from matplotlib import pyplot as plt
import albumentations.pytorch as AT
from albumentations import (
    RandomRotate90, Flip, Transpose, GaussNoise, Blur, VerticalFlip, HorizontalFlip,  \
    HueSaturationValue, RGBShift, RandomBrightness, Resize, Normalize, Compose, CenterCrop)
from PIL import Image

def transforms(size_image):
    return Compose([
        Resize(height = size_image[0], width = size_image[1]),
        # Red - Green - Blue right now
        Normalize(mean=(0.406, 0.515, 0.323), std=(0.195, 0.181, 0.178)),
        AT.ToTensor()
    ])

def augmentation(size_image,p=0.5):
    return Compose([
        RandomRotate90(),
        Flip(),
        Transpose(),
        GaussNoise(),
        Blur(),
        VerticalFlip(),
        HorizontalFlip(),
        HueSaturationValue(hue_shift_limit=5, sat_shift_limit=15, val_shift_limit=10),
        RGBShift(r_shift_limit=10, g_shift_limit=10, b_shift_limit=10),
        RandomBrightness(limit = 0.05),
        CenterCrop(height = 150, width = 150, p = 0.5)
    ], p=p)

class PyTorchImageDataset(Dataset):
    def __init__(self, data, labels, size_image, **kwargs):
        self.transforms = transforms(size_image)
        self.data = data
        self.labels = labels
        self.augment = augmentation(size_image)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        image = self.data[i]
        label = self.labels[i]

        image = self.augment(image=image.permute(1, 2, 0).numpy())['image']
        image = self.transforms(image = image)['image']
        return image, label


data = torch.randn(10, 3, 255, 255)
labels = torch.randint(0, 1000, (10,))
train_dataset = PyTorchImageDataset(data=data,
                                    labels=labels,
                                    size_image=(224, 224))

num_workers=20
train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=5,
                              shuffle=True,
                              num_workers=num_workers)

print('num_workers={}'.format(num_workers))
device = 'cuda:0'
for epoch in range(10):
    for i, data in enumerate(train_dataloader):
        data_batch = data[0].to(device)
        label = data[1].to(device)
        print('{}MB allocated'.format(torch.cuda.memory_allocated()/1024**2))

yields the same memory usage for different numbers of workers:

num_workers=0
2.87158203125MB allocated
2.87158203125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated

num_workers=2
2.87158203125MB allocated
2.87158203125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated

num_workers=20
2.87158203125MB allocated
2.87158203125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated

Could you check, if this code snippet reproduces the issue on your system?

Thank you for your answer. Here are the results:

With num_workers = 0 to 7, it works well:

num_workers=0
2.87158203125MB allocated
2.87158203125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated

From 8 to 11, it sometimes raises an error while running, for example:

num_workers=11
2.87158203125MB allocated
2.87158203125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\giang\Desktop\DACON_landmark\test.py", line 5, in <module>
    import torch
  File "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\lib\cudnn_adv_infer64_8.dll" or one of its dependencies.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\giang\Desktop\DACON_landmark\test.py", line 5, in <module>
    import torch
  File "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\giang\Desktop\DACON_landmark\test.py", line 5, in <module>
    import torch
  File "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\lib\cudnn_adv_infer64_8.dll" or one of its dependencies.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\giang\anaconda3\envs\working\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\giang\Desktop\DACON_landmark\test.py", line 5, in <module>
    import torch
  File "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Traceback (most recent call last):
  File "test.py", line 82, in <module>
    for i, data in enumerate(train_dataloader):
  File "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\utils\data\dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\giang\anaconda3\envs\working\lib\site-packages\torch\utils\data\dataloader.py", line 885, in __init__
    w.start()
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\giang\anaconda3\envs\working\lib\multiprocessing\popen_spawn_win32.py", line 72, in __init__
    None, None, False, 0, env, None, None)
OSError: [WinError 1455] The paging file is too small for this operation to complete

From 12 to 14, the memory is sometimes allocated for 2 or 3 epochs before the error is raised.
From 14 to 16, it fails from the beginning.

It shows CUDA out of memory several times, but when I check nvidia-smi the GPU memory is still mostly empty.

My CPU: Intel core i7-10700K

This Windows error seems to indicate you might be running out of CPU RAM. Could this be the case?
I'm not a Windows expert, but a quick search for this error gave some results pointing in this direction.
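As a rough check, a minimal sketch that reads the system RAM statistics (this assumes the optional psutil package is installed):

import psutil

mem = psutil.virtual_memory()
print('total: %.1f GB, available: %.1f GB, used: %.1f%%'
      % (mem.total / 1024**3, mem.available / 1024**3, mem.percent))

Calling this inside the batch loop would show whether the available RAM drops as more worker processes start.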


My PC has 32 GB of RAM and it reaches at most 20 GB when I run that code. I will install Ubuntu and try again to make sure it is not a hardware problem.

I only ran into this problem after switching from a 2080 Ti to a 3090.

I'm not particularly skilled, just running projects from GitHub and such, but one thing I keep encountering is errors related to multiple DataLoader workers, and most of the time just putting the executing code inside an if __name__ == '__main__' guard helps (see the sketch below). I can only assume that something behaves differently on non-Windows systems for those errors not to appear there.
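For reference, a minimal sketch of that guard (on Windows the DataLoader workers are spawned processes that re-import the main script, so the module's top level must be safe to re-run; the dummy TensorDataset just stands in for a real dataset):

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Dummy dataset standing in for the real one from the posts above.
    dataset = TensorDataset(torch.randn(100, 3, 224, 224),
                            torch.randint(0, 10, (100,)))
    loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=8)
    for images, labels in loader:
        pass  # the training step would go here

if __name__ == '__main__':
    # Without this guard, every spawned worker re-executes the module's top
    # level and tries to build its own DataLoader, which fails on Windows.
    main()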

I have encountered similar errors to the above, but I think those were issues with lmdb environments set to take more space than the drive had, so perhaps my suggestion is unhelpful for this problem. But I can at least say that this seems to be related to RAM and/or storage. Perhaps the drive is full enough that memory cannot be paged, or something is artificially taking up storage during program execution. In my case, I had two lmdb environments each set at 200 GB, which was more than what was available on the drive, leading to my programs crashing during startup.

Thank you for your explanation,

In my experience, omitting if __name__ == "__main__" stops the code from executing at all, and I did use it while running my code. However, the code runs normally in this case; the problem is the increase in RAM.

Now we don't have to worry about this anymore. At the time I raised this question I was using the nightly version, but the stable release of torch 1.7 with CUDA 11 has since been published and it solved the problem.

About your explanation: if I understand correctly, you are talking about the PC's RAM, but we are discussing GPU RAM, which I think is a different thing.

Good that it's resolved!

Given the error in the last message, it seemed more like a PC RAM error than a GPU RAM issue, since it talks about the paging file. GPU memory errors tend to mention GPU RAM explicitly (usually a CUDA error with an allocation failing because no memory is available).

The problem I see here is that the memory allocation varies between batches. Why would that be?
I recently ran into this as well, since my model plus a training batch barely fit into memory, and because of that the double memory allocation on the first two batches is critical.

I don't know which memory usage you are observing, but in case it's the GPU memory you might want to check whether each batch has the same shape, as the memory usage depends on it.
I also don't understand the "double memory allocation on the first two batches is critical" part, so could you explain where and why the memory is doubled?
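As a rough illustration, a minimal sketch showing how the allocated memory tracks the batch shape (the shapes are arbitrary):

import torch

device = 'cuda:0'
for shape in [(16, 3, 224, 224), (16, 3, 480, 768)]:
    batch = torch.randn(*shape, device=device)
    print(shape, '->', torch.cuda.memory_allocated(device) / 1024**2, 'MB allocated')
    del batch  # drop the tensor so the next shape is measured on its own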

It's in the output snippet that you posted:

num_workers=0
2.87158203125MB allocated
2.87158203125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated

num_workers=2
2.87158203125MB allocated
2.87158203125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated

num_workers=20
2.87158203125MB allocated
2.87158203125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated
1.14892578125MB allocated

The first two batches allocate more memory than all the others. Is it maybe because PyTorch preallocates a bigger chunk of memory up front and frees it once it realizes it doesn't need that much? Could that be the case?

Thanks for pointing to the previous post.
I don't think the DataLoader or the Dataset caused any of these increases in GPU memory usage; the allocations of the tensors inside the training loop did.
If you want to keep the memory usage flat, try deleting all unneeded tensors at the end of each training iteration to free their memory before starting the next one, as in the sketch below.
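For example, a minimal sketch of freeing the per-iteration tensors at the end of the loop body (the variable names follow the training code posted above):

running_loss = 0.0
for images, targets in train_dataloader:
    images = images.to(device)
    targets = targets.long().to(device)

    optimizer.zero_grad()
    output = model(images)
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()

    running_loss += loss.item()  # keep the Python float, not the CUDA tensor

    # Drop references to the intermediate tensors before the next iteration
    # so the caching allocator can reuse their memory right away.
    del output, loss, images, targets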