I am trying to load the large COCO dataset (about 120,000 images) and train on it. I am running the job inside a Docker container.
For faster training, I try to load the whole dataset through the PyTorch DataLoader into a Python array (in system memory, not GPU memory) and feed the model from that array, so the DataLoader is not used during training.
The problem is that after loading part of the data (around 10-15 GB) I encounter this strange error:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
storage = cls._new_shared_fd(fd, size)
RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/aten/src/ATen/MapAllocator.cpp":303, please report a bug to PyTorch. unable to mmap 8 bytes from file <filename not specified>: Cannot allocate memory (12)
I know it’s a lot of data, but the numbers add up. I am using a Linux server with 200 GB of memory, so it is not a lack of memory. I also set the Docker shared memory to 200 GB to get past Docker’s limit on memory usage, but the problem still exists.
For smaller datasets (which need less than 20 GB) my code works fine.
What should I do for this volume of data? What did I miss?
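One thing that may be worth ruling out: on Linux, mmap can fail with "Cannot allocate memory (12)" when a process reaches its limit on the number of memory mappings, even with plenty of free RAM. The thread does not confirm that this is the cause here, so the snippet below is only a diagnostic sketch. The helper names read_max_map_count and count_mappings are made up for illustration, and it assumes a Linux /proc filesystem.

```python
from pathlib import Path


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the kernel's per-process mapping limit, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def count_mappings(pid="self"):
    """Return how many memory mappings a process currently holds, or None."""
    try:
        # Each line of /proc/<pid>/maps describes one mapping.
        with open(f"/proc/{pid}/maps") as f:
            return sum(1 for _ in f)
    except OSError:
        return None


if __name__ == "__main__":
    print("mappings in use:", count_mappings())
    print("vm.max_map_count:", read_max_map_count())
```

Calling count_mappings() from the loading loop every few thousand batches would show whether the count climbs toward the limit as more tensors are kept.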
Where and how did you add this argument?
In the dataloader:
dataloaders = torch.utils.data.DataLoader( image_datasets['train']
in the documentation for more details.
The question is, why doesn’t it work with multiple DataLoader workers?
You did not solve the problem, you circumvented it.
I also experience this with FashionMNIST using the latest PyTorch 1.11 (torchvision 0.12.0).
I’m getting the following error:
unable to mmap 3136 bytes from file </torch_29142_1440317042_64566>: Cannot allocate memory (12)
@ptrblck Do you know what kind of error this is, and why it could occur specifically when multiple data loaders are used?
I’m unsure what’s causing the issue, but based on e.g. this older bug description it seems as if Python fails to allocate the resident size for a new process.
How much system RAM do you have and how much does the main Python process need?
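For anyone answering this, a rough way to get both numbers from inside the training script is sketched below. It assumes Linux (it reads /proc/meminfo, and ru_maxrss is reported in kilobytes there); the function names are illustrative only.

```python
import resource


def mem_available_kib(path="/proc/meminfo"):
    """Return the MemAvailable value from /proc/meminfo in KiB, or None."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    # Line format: "MemAvailable:   123456 kB"
                    return int(line.split()[1])
    except OSError:
        return None
    return None


def peak_rss_kib():
    """Return the peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


if __name__ == "__main__":
    print("available system memory (KiB):", mem_available_kib())
    print("peak RSS of this process (KiB):", peak_rss_kib())
```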
I ran into the same problem and also solved it by setting num_workers to 0. Even with num_workers set to 1, the “cannot allocate memory” error recurs. I still wonder why this happens.
Having the same issue here: setting num_workers to any value above 0 results in “cannot allocate memory”, but using num_workers=0 results in an extremely slow training process.
I believe I hit the same problem and, as other people here reported, only num_workers=0 runs without the error.
I have the same issue. Is there any news on this bug? Setting num_workers to 0 works as a workaround, but it is not a good solution because loading takes a lot of time in my case.
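A possible middle ground for the "preload everything into memory" use case from the original question is to keep the workers but copy each batch before storing it. This is a minimal sketch under an assumption the thread does not confirm: that the failure comes from holding on to many tensors that still live in the workers' shared-memory segments. The preload function and the toy TensorDataset are hypothetical stand-ins for the real COCO pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def preload(dataset, batch_size=64, num_workers=2):
    """Load a whole dataset into two in-memory tensors (inputs, targets)."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    inputs, targets = [], []
    for x, y in loader:
        # clone() copies the batch into ordinary process memory, so the
        # tensor received from the worker can be freed right after this step.
        inputs.append(x.clone())
        targets.append(y.clone())
    return torch.cat(inputs), torch.cat(targets)


if __name__ == "__main__":
    # Toy data standing in for the real image dataset.
    data = TensorDataset(torch.randn(256, 3, 8, 8), torch.randint(0, 10, (256,)))
    x, y = preload(data)
    print(x.shape, y.shape)
```

If the error still appears with the copies in place, the assumption above is probably wrong for that setup.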