Unable to allocate shared memory (shm) for file

RuntimeError: unable to allocate shared memory(shm) for file </torch_206732_4264974204_15>: Resource temporarily unavailable (11)

I am not sure how to get around this issue. I do not have direct control over /dev/shm; it is 4 GB in size. I'd like to keep num_workers as high as RAM allows. I've tried other suggestions I've come across, like setting the torch multiprocessing_context to spawn, but it does not help. file_system is the only sharing strategy available. I haven't seen any retention of batches. My Dataset returns NumPy arrays and my collate function converts them to tensors, if that is relevant, and I have made sure my Dataset stores minimal metadata in memory.
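For anyone hitting the same error: a quick stdlib-only way to see how large and how full /dev/shm is from Python (this assumes a Linux host where /dev/shm is mounted as tmpfs; the 4 GB mentioned above would show up as the total):

```python
import shutil

# Inspect the tmpfs backing /dev/shm (Linux-specific path).
usage = shutil.disk_usage("/dev/shm")
print(f"total: {usage.total / 2**30:.2f} GiB")
print(f"used:  {usage.used / 2**30:.2f} GiB")
print(f"free:  {usage.free / 2**30:.2f} GiB")
```

Watching these numbers while the DataLoader spins up its workers makes it easy to see whether /dev/shm, rather than RAM, is what is filling up.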

Thanks!

Adam

You might need to reduce the number of workers if you cannot increase the shm size.

Thanks @ptrblck. I learned that /dev/shm usage and the size of the data returned per batch are related, and reducing the size of the data returned by the Dataset was a workaround. Each batch contained large tensors that were being written to /dev/shm; RAM was fine, but /dev/shm was not. This was highly confusing. I wish PyTorch gave a warning with the size of each tensor in the batch as it relates to /dev/shm.
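One way to estimate that per-batch footprint before it ever touches /dev/shm is to sum `nbytes` over the arrays a batch would contain. The batch size and array shape below are made-up placeholders for illustration, not the poster's actual data:

```python
import numpy as np

def batch_nbytes(arrays):
    """Total bytes a batch of NumPy arrays would occupy once
    collated and moved between DataLoader processes."""
    return sum(a.nbytes for a in arrays)

# Hypothetical batch: 32 float32 images of shape (3, 224, 224).
batch = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(32)]
print(batch_nbytes(batch))          # 19267584 bytes (32 * 3 * 224 * 224 * 4)
print(batch_nbytes(batch) / 2**20)  # ~18.4 MiB
```

Printing this once from the Dataset or collate function gives the warning the poster wishes PyTorch emitted.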

@ptrblck I’m actually rather confused. I reduced the batch size from ~1 GB to 275 MB. The /dev/shm size is 4 GB. After reducing to 275 MB, I’m able to use 4 GPUs with 12 workers each and a prefetch_factor of 2. That is 275 MB x 4 x 12 x 2 ≈ 26 GB, yet I have no issue with /dev/shm filling up anymore. I was thinking that the total data prefetched by all workers would need to stay under the 4 GB /dev/shm limit, but that seems not to be the case; just reducing the batch size from 1 GB to 275 MB keeps things under the limit.
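For reference, the worst-case arithmetic from the numbers quoted above, worked out explicitly (this is the naive upper bound assuming every prefetched batch sat in /dev/shm at once, which, as the reply below the original thread explains, is not what actually happens):

```python
batch_mb = 275        # MB per batch after the reduction
gpus = 4
workers_per_gpu = 12
prefetch_factor = 2

worst_case_mb = batch_mb * gpus * workers_per_gpu * prefetch_factor
print(worst_case_mb)         # 26400 MB
print(worst_case_mb / 1000)  # 26.4 GB, far above the 4 GB /dev/shm size
```

The gap between this bound and the 4 GB that suffices in practice is exactly the puzzle being asked about.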

The workers will not store their batches directly in /dev/shm. Instead, /dev/shm is used to move the ready batches from the worker processes to the main process, so the usage does not directly correlate with the prefetch factor and batch size.
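That transient hand-off can be illustrated with the stdlib `multiprocessing.shared_memory` module. This is only an analogy for the producer/consumer pattern, not PyTorch's actual implementation of the file_system sharing strategy:

```python
from multiprocessing import shared_memory

# Producer side (analogous to a DataLoader worker): write a
# payload into a named shared-memory segment.
payload = b"batch bytes"
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Consumer side (analogous to the main process): attach to the
# segment by name and copy the data out.
reader = shared_memory.SharedMemory(name=shm.name)
received = bytes(reader.buf[:len(payload)])
reader.close()

# Once the batch has been handed over, the segment is released,
# so /dev/shm only needs to hold batches that are in flight,
# not everything the workers have prefetched.
shm.close()
shm.unlink()

print(received)  # b'batch bytes'
```

Because segments are unlinked as soon as a batch is consumed, peak /dev/shm usage tracks the batches in flight at any moment rather than the full prefetch pipeline, which is consistent with the 275 MB batches working under a 4 GB limit.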