I am trying to load the COCO dataset (~120,000 images) and do some training. I am using a Docker container for the task.
For faster training, I try to load the whole dataset into a Python list (in system memory, not GPU memory) using a PyTorch DataLoader, and then feed the model from that list, so I don't have to use the DataLoader during training.
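Roughly, the caching step looks like this (my_dataset, the batch size, and the worker count below are placeholders for my real setup):

from torch.utils.data import DataLoader

# my_dataset is my COCO-style Dataset instance, defined elsewhere.
loader = DataLoader(my_dataset, batch_size=32, num_workers=8, shuffle=False)

# Cache every batch in a plain Python list that lives in system RAM.
cached_batches = []
for images, targets in loader:
    cached_batches.append((images, targets))

# During training I then iterate over cached_batches instead of the DataLoader.
for images, targets in cached_batches:
    ...  # forward/backward pass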
The problem is that after loading part of the data (around 10-15 GB) I encounter this strange error:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
storage = cls._new_shared_fd(fd, size)
RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/aten/src/ATen/MapAllocator.cpp":303, please report a bug to PyTorch. unable to mmap 8 bytes from file <filename not specified>: Cannot allocate memory (12)
I know it is a lot of data, but the numbers add up. I am using a Linux server with 200 GB of memory, so it is not a lack-of-memory problem. I also set the Docker shared memory to 200 GB to get around Docker's limit on memory usage, but the problem still exists.
For a smaller dataset (which needs less than 20 GB) my code works fine.
What should I do for this volume of data? What did I miss?
I'm unsure what's causing the issue, but based on e.g. this older bug description it seems as if Python fails to allocate the resident size for a new process.
How much system RAM do you have and how much does the main Python process need?
I ran into the same problem and also solved it by setting num_workers to 0. Even when I changed num_workers to 1, the "cannot allocate memory" error would recur. I still wonder why this problem happens.
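For reference, the only change on my side was the num_workers argument (the other loader arguments here are placeholders):

from torch.utils.data import DataLoader

# num_workers=0 loads the data in the main process, so no tensors have to be
# passed through shared memory from worker processes, which appears to be
# where the mmap error in the traceback above comes from.
loader = DataLoader(my_dataset,     # placeholder for the actual Dataset
                    batch_size=32,  # placeholder
                    num_workers=0)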
Having the same issue here: setting num_workers to any value above 0 results in "cannot allocate memory", but using num_workers = 0 makes the training process extremely slow.
I have the same issue. Is there any news on this bug? Setting num_workers to 0 works as a workaround, but it is not a good solution since it takes a lot of time in my case.