I’m trying to load all of the dataset’s image files into memory, but I’m facing the following error:
Exception in thread Thread-3 (_handle_results):
Traceback (most recent call last):
  File "/home/mehran/.conda/envs/gtn_env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/mehran/.conda/envs/gtn_env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mehran/.conda/envs/gtn_env/lib/python3.10/multiprocessing/pool.py", line 579, in _handle_results
    task = get()
  File "/home/mehran/.conda/envs/gtn_env/lib/python3.10/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/mehran/.conda/envs/gtn_env/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 514, in rebuild_storage_filename
    storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
RuntimeError: unable to mmap 23104 bytes from file </torch_6855_1827127219_18231>: Cannot allocate memory (12)
I understand that this error is complaining about a lack of memory, but I’m monitoring my memory usage and it’s barely half used. The code is too complex to include here in full, but here are some snippets that might help you help me:
import os
from functools import partial

import torch
import torch.multiprocessing as mp
import torchvision.transforms.functional as TF
from PIL import Image
from torchvision.transforms import v2

torch.multiprocessing.set_sharing_strategy('file_system')

def load_image(file_path, data_path):
    transform = v2.Compose([
        lambda x: x.convert('L'),
        lambda x: TF.to_tensor(x),
        lambda x: x * 255.0,
        lambda x: x.type(torch.int8),
    ])
    sample = Image.open(os.path.join(data_path, file_path))
    image = transform(sample)
    return image
with mp.Pool(processes=16) as pool:
    samples = pool.map(partial(load_image, data_path=data_path),
                       filenames)
print("Dataset loaded")
This error happens when I cast the tensors to int8, which I do because otherwise the dataset won’t fit into my memory. In that scenario, I can see the memory usage reach my physical memory limit, and that’s totally acceptable. But in this case, as I said, it barely reaches half my machine’s capacity before it errors out.
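For context, here is the rough per-image memory math behind that cast (the 1024×1024 grayscale size is a hypothetical stand-in, since my real image dimensions aren’t shown above):

```python
# Hypothetical 1024x1024 grayscale image; my real dataset dimensions differ.
h, w = 1024, 1024
bytes_float32 = h * w * 4  # TF.to_tensor() yields float32: 4 bytes per pixel
bytes_int8 = h * w * 1     # after .type(torch.int8): 1 byte per pixel
print(bytes_float32 // bytes_int8)  # the cast shrinks each tensor 4x
```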
The funny thing is that after this error is raised, the code does not exit. It keeps running, but the CPU usage drops (only one core sits at 100% while previously all cores were engaged) and the script never reaches the line after the pool.map call.
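In case it’s relevant, this is how I’ve been checking shared memory alongside regular RAM; my understanding (an assumption on my part) is that with the file_system strategy the </torch_...> files are created in a tmpfs such as /dev/shm on Linux:

```python
import shutil

# Inspect the tmpfs that backs shared-memory files (Linux-specific path)
total, used, free = shutil.disk_usage('/dev/shm')
print(f"/dev/shm: {used / total:.0%} used, {free // 2**20} MiB free")
```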