Torch DataLoader multiprocessing, could not unlink shared memory file

Hello,

I’m getting the following error with a multi-worker DataLoader

2022-11-25 23:18:06 [info] RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/aten/src/ATen/MapAllocator.cpp":319, please report a bug to PyTorch. could not unlink the shared memory file /torch_2999188_550744154

I’m using a PyTorch DataLoader to load data on the fly, since my full dataset does not fit into RAM. With multiple workers the data-loading time is hidden behind the model’s forward pass; with num_workers = 0 it is not, so dropping to a single worker isn’t the solution I’m hoping for.
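For context, here is a minimal sketch of the kind of setup I mean (class name, shapes, and batch size are placeholders, not my actual code):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazyDataset(Dataset):
    """Loads each sample on demand instead of holding the full set in RAM."""
    def __init__(self, n_samples, n_features):
        self.n_samples = n_samples
        self.n_features = n_features

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # Stand-in for reading one sample from disk on the fly.
        return torch.full((self.n_features,), float(idx))

def make_loader(num_workers):
    # With num_workers > 0, batches are prepared in background worker
    # processes, which hides the loading time behind the forward pass.
    return DataLoader(LazyDataset(100, 8), batch_size=4,
                      num_workers=num_workers, shuffle=True)
```

The crash only shows up in the multi-worker case, i.e. when `make_loader` is called with `num_workers > 0`.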

It’s a bit strange, since everything is fine for a couple of epochs (each epoch takes a few hours), but then at some point one of the splits crashes with this error.

Does anyone have any idea what this might be caused by, and how to go about fixing it?

My DataLoader wraps a dataset whose elements I have ensured keep a fixed refcount, so that we don’t hit the copy-on-write memory issues from this thread; I don’t think it’s related to that.
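Roughly, what I mean by "fixed refcount" is the following pattern (a sketch with placeholder names and data, not my actual dataset): the per-item metadata lives in one contiguous numpy array rather than a Python list of objects.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class FixedRefcountDataset(Dataset):
    """Stores all per-item data in one numpy array instead of a Python list.

    Reading an element of a Python list bumps that object's refcount,
    which dirties its memory page and makes forked DataLoader workers
    copy it (the copy-on-write growth discussed in the linked thread).
    A single numpy array is one Python object, so per-item reads do not
    touch per-element refcounts.
    """
    def __init__(self, n_samples, n_features):
        self.data = np.arange(n_samples * n_features,
                              dtype=np.float32).reshape(n_samples, n_features)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # torch.from_numpy shares the array's memory; no per-element
        # Python objects are created or refcounted here.
        return torch.from_numpy(self.data[idx])
```

Since I follow this pattern throughout, worker memory stays flat across epochs, which is why I suspect the shared-memory unlink failure has a different cause.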

Thanks!
