I am trying to load the huge COCO dataset (~120,000 images) and do some training. I am using my Docker container for the task.
To speed up training, I try to load the whole dataset into a Python list (in system memory, not GPU memory) using the PyTorch DataLoader, and then feed the model from that list, so I don't use the DataLoader during training.
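Roughly, the preloading step looks like the sketch below (simplified; the real dataset, paths, transforms, and batch size differ, the values here are only placeholders):

from torch.utils.data import DataLoader
from torchvision.datasets import CocoDetection
from torchvision import transforms

# placeholder paths and settings, just to illustrate the approach
dataset = CocoDetection(
    root="coco/train2017",
    annFile="coco/annotations/instances_train2017.json",
    transform=transforms.ToTensor(),
)

def keep_raw(batch):
    # COCO targets are variable-length lists, so keep the (image, target) pairs as-is
    return batch

loader = DataLoader(dataset, batch_size=32, num_workers=8, collate_fn=keep_raw)

cached = []                    # plain Python list kept in system RAM, not on the GPU
for batch in loader:           # the error below appears after ~10-15 GB are cached
    cached.append(batch)
# training then iterates over cached instead of the DataLoader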
The problem is that after loading part of the data (around 10-15 GB) I encounter this strange error:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
storage = cls._new_shared_fd(fd, size)
RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/aten/src/ATen/MapAllocator.cpp":303, please report a bug to PyTorch. unable to mmap 8 bytes from file <filename not specified>: Cannot allocate memory (12)
I know it's a huge dataset, but the numbers add up. I am using a Linux server with 200 GB of memory, so it's not a lack of system memory. I also set the Docker shared memory to 200 GB to overcome Docker's limit on shared-memory usage, but the problem still exists.
For a smaller dataset (which needs less than 20 GB) my code works fine.
What should I do for this volume of data? What did I miss?
I'm unsure what's causing the issue, but based on e.g. this older bug description it seems as if Python fails to allocate the resident size for a new process.
How much system RAM do you have and how much does the main Python process need?
I also ran into the same problem and solved it by setting num_workers to 0 as well. Even when I changed num_workers to 1, this "cannot allocate memory" problem would recur. I still wonder why this problem happens.
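For reference, the workaround is just the num_workers argument of the DataLoader (sketch with a dummy dataset; the batch size and shapes are arbitrary examples):

import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset just so the snippet runs; replace with your real Dataset
dataset = TensorDataset(torch.randn(100, 3, 224, 224), torch.zeros(100, dtype=torch.long))

# num_workers=0 loads every batch in the main process: no worker processes,
# so no shared-memory file descriptors to rebuild, but loading is much slower.
loader = DataLoader(dataset, batch_size=32, num_workers=0)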
Having the same issue here: setting num_workers to any value above 0 results in "cannot allocate memory", but using num_workers=0 makes the training process extremely slow.
I have the same issue. Is there any news on this bug? Setting num_workers to 0 works as a workaround, but it is not a good solution, as it makes training take a lot longer in my case.
Hi @ptrblck. Do you know if there is any update on the issue? Setting workers to zero is not really a solution for large datasets. I am experiencing the issue even on a system with 880 GB of RAM.
No, sorry, I don't have any updates, as I was never able to reproduce the issue. So far nobody has been able to post a minimal and executable code snippet that raises this error in a current release, so I don't have a way to debug it.
I can reproduce the issue consistently in an environment that is too big to share in a code snippet. I will work on a simplified repro that I can share. In the meantime, do you have additional hints for troubleshooting?
Here is some info about the setup:
Multi-label image classification with >700k images, trained with DistributedDataParallel
Training runs in Docker container
96 cores, 880 GB RAM, 4 A100 GPU
Shared mem set to >200GB
“Cannot allocate memory” occurs in eval pass of first epoch
num_workers = 20
The error does not occur when setting num_workers to zero (as others have experienced), but that makes training very slow
Thanks for any guidance!
I don't know if you are seeing the same stacktrace, but the first posted one shows the failure in storage = cls._new_shared_fd(fd, size), so I would start by checking (a quick Python sketch of these checks follows the list):
how many file descriptors are already open in the process (e.g. via lsof or by reading /proc/<pid>/fd)
the number of allocated and max. file handles (e.g. via sysctl fs.file-nr)
whether your Docker setup somehow limits file descriptor usage.
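A rough sketch of these checks from inside the running container (Linux-only; it reads /proc directly):

import os
import resource

pid = os.getpid()

# file descriptors currently open in this process (same info as lsof -p <pid>)
open_fds = len(os.listdir(f"/proc/{pid}/fd"))
print("open fds in this process:", open_fds)

# per-process limit, i.e. what ulimit -n reports (this is where a Docker
# setup could be imposing a lower cap)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("per-process fd limit (soft, hard):", soft, hard)

# system-wide allocated / free / max handles (same values as sysctl fs.file-nr)
allocated, free, maximum = open("/proc/sys/fs/file-nr").read().split()
print("system-wide file handles (allocated, free, max):", allocated, free, maximum)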
Thank you very much, @ptrblck
Will try your suggestions. Here is the stacktrace (env name redacted):
Traceback (most recent call last):
File "/mnt/azureml/cr/j//exe/wd/train.py", line 256, in
for i, batch_data in enumerate(valid_loader):
File "/azureml-envs//lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/azureml-envs//lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File "/azureml-envs//lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
success, data = self._try_get_data()
File "/azureml-envs//lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/azureml-envs//lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/azureml-envs//lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 300, in rebuild_storage_fd
storage = cls._new_shared_fd(fd, size)
RuntimeError: unable to mmap 256 bytes from file : Cannot allocate memory (12)
Do you mean num_workers in the DataLoader? We need to be careful that the total data being prefetched (num_workers * memory_size_of_one_batch) fits in RAM.
No, I meant that the dataset we use is relatively small, around 10 GiB in total, and we have 1.8 TiB of memory installed, more than 100 times the size of the dataset.
(num_workers * memory_size_of_one_batch) takes less than 0.001% of usable memory.
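For a sense of scale, here is a back-of-the-envelope version of that formula (all numbers below are assumed examples, not the actual settings from this thread):

# assumed example values, not the real job's configuration
num_workers = 20
prefetch_factor = 2                     # DataLoader default: batches prefetched per worker
batch_size = 64
bytes_per_image = 3 * 224 * 224 * 4     # float32 RGB image, 224x224

bytes_per_batch = batch_size * bytes_per_image
prefetched = num_workers * prefetch_factor * bytes_per_batch
print(f"~{prefetched / 2**30:.2f} GiB held by prefetched batches")   # ~1.44 GiB
# a tiny fraction of 1.8 TiB of RAM, which is the point being made above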