I am trying to load the huge COCO dataset (~120,000 images) and do some training. I am using my Docker container for the task.
To speed up training, I try to load the whole dataset into a Python list (in system memory, not GPU memory) using the PyTorch DataLoader, and then feed the model from that list, so I don't use the DataLoader during training.
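Roughly, the preloading step looks like the sketch below (simplified; the real dataset, paths, transforms, and batch size differ, the values here are only placeholders):

from torch.utils.data import DataLoader
from torchvision.datasets import CocoDetection
from torchvision import transforms

# placeholder paths and settings, just to illustrate the approach
dataset = CocoDetection(
    root="coco/train2017",
    annFile="coco/annotations/instances_train2017.json",
    transform=transforms.ToTensor(),
)

def keep_raw(batch):
    # COCO targets are variable-length lists, so keep the (image, target) pairs as-is
    return batch

loader = DataLoader(dataset, batch_size=32, num_workers=8, collate_fn=keep_raw)

cached = []                    # plain Python list kept in system RAM, not on the GPU
for batch in loader:           # the error below appears after ~10-15 GB are cached
    cached.append(batch)
# training then iterates over cached instead of the DataLoader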
The problem is that after loading part of the data (around 10-15 GB) I encounter this strange error:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
storage = cls._new_shared_fd(fd, size)
RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/aten/src/ATen/MapAllocator.cpp":303, please report a bug to PyTorch. unable to mmap 8 bytes from file <filename not specified>: Cannot allocate memory (12)
I know it's a huge dataset, but the numbers add up. I am using a Linux server with 200 GB of memory, so it's not a lack of system memory. I also set the Docker shared memory to 200 GB to overcome Docker's limit on shared-memory usage, but the problem still exists.
For a smaller dataset (which needs less than 20 GB) my code works fine.
What should I do for this volume of data? What did I miss?
I'm unsure what's causing the issue, but based on e.g. this older bug description it seems as if Python fails to allocate the resident size for a new process.
How much system RAM do you have and how much does the main Python process need?
I also ran into the same problem and solved it by setting num_workers to 0 as well. Even when I changed num_workers to 1, this "cannot allocate memory" problem would recur. I still wonder why this problem happens.
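For reference, the workaround is just the num_workers argument of the DataLoader (sketch with a dummy dataset; the batch size and shapes are arbitrary examples):

import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset just so the snippet runs; replace with your real Dataset
dataset = TensorDataset(torch.randn(100, 3, 224, 224), torch.zeros(100, dtype=torch.long))

# num_workers=0 loads every batch in the main process: no worker processes,
# so no shared-memory file descriptors to rebuild, but loading is much slower.
loader = DataLoader(dataset, batch_size=32, num_workers=0)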
Having the same issue here: setting num_workers to any value above 0 results in "cannot allocate memory", but using num_workers=0 makes the training process extremely slow.
I have the same issue. Is there any news on this bug? Setting num_workers to 0 works as a workaround, but it is not a good solution, as it makes training take a lot longer in my case.
Hi @ptrblck. Do you know if there is any update on the issue? Setting workers to zero is not really a solution for large datasets. I am experiencing the issue even on a system with 880 GB of RAM.
No, sorry, I don't have any updates, as I was never able to reproduce the issue. So far nobody has been able to post a minimal and executable code snippet that raises this error in a current release, so I don't have a way to debug it.
I can reproduce the issue consistently in an environment that is too big to share in a code snippet. I will work on a simplified repro that I can share. In the meantime, do you have additional hints for troubleshooting?
Here is some info about the setup:
Multi-label image classification with >700k images, trained with DistributedDataParallel
Training runs in Docker container
96 cores, 880 GB RAM, 4 A100 GPU
Shared mem set to >200GB
“Cannot allocate memory” occurs in eval pass of first epoch
num_workers = 20
The error does not occur when setting num_workers to zero (as others have experienced), but that makes training very slow
Thanks for any guidance!
I don't know if you are seeing the same stacktrace, but the first posted one shows the failure in storage = cls._new_shared_fd(fd, size), so I would start by checking (a quick Python sketch of these checks follows the list):
how many file descriptors are already open in the process (e.g. via lsof or by reading /proc/<pid>/fd)
the number of allocated and max. file handles (e.g. via sysctl fs.file-nr)
whether your Docker setup somehow limits file descriptor usage.
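A rough sketch of these checks from inside the running container (Linux-only; it reads /proc directly):

import os
import resource

pid = os.getpid()

# file descriptors currently open in this process (same info as lsof -p <pid>)
open_fds = len(os.listdir(f"/proc/{pid}/fd"))
print("open fds in this process:", open_fds)

# per-process limit, i.e. what ulimit -n reports (this is where a Docker
# setup could be imposing a lower cap)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("per-process fd limit (soft, hard):", soft, hard)

# system-wide allocated / free / max handles (same values as sysctl fs.file-nr)
allocated, free, maximum = open("/proc/sys/fs/file-nr").read().split()
print("system-wide file handles (allocated, free, max):", allocated, free, maximum)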
Thank you very much, @ptrblck
Will try your suggestions. Here is the stacktrace (env name redacted):
Traceback (most recent call last):
File "/mnt/azureml/cr/j//exe/wd/train.py", line 256, in
for i, batch_data in enumerate(valid_loader):
File "/azureml-envs//lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/azureml-envs//lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File "/azureml-envs//lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
success, data = self._try_get_data()
File "/azureml-envs//lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/azureml-envs//lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/azureml-envs//lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 300, in rebuild_storage_fd
storage = cls._new_shared_fd(fd, size)
RuntimeError: unable to mmap 256 bytes from file : Cannot allocate memory (12)
Do you mean num_workers in the DataLoader? We need to be careful that the total data being prefetched (num_workers * memory_size_of_one_batch) fits in RAM.
No, I meant that the dataset we use is relatively small, around 10 GiB in total, and we have 1.8 TiB of memory installed, more than 100 times the size of the dataset.
(num_workers * memory_size_of_one_batch) takes less than 0.001% of usable memory.
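For a sense of scale, here is a back-of-the-envelope version of that formula (all numbers below are assumed examples, not the actual settings from this thread):

# assumed example values, not the real job's configuration
num_workers = 20
prefetch_factor = 2                     # DataLoader default: batches prefetched per worker
batch_size = 64
bytes_per_image = 3 * 224 * 224 * 4     # float32 RGB image, 224x224

bytes_per_batch = batch_size * bytes_per_image
prefetched = num_workers * prefetch_factor * bytes_per_batch
print(f"~{prefetched / 2**30:.2f} GiB held by prefetched batches")   # ~1.44 GiB
# a tiny fraction of 1.8 TiB of RAM, which is the point being made above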