Could not unlink the shared memory file, potential race with systemd

I want to be able to add some additional flag in the filenames dumped to /dev/shm to ensure no clashes between workers.

When using a DataLoader with multiple workers, it dumps a load of files to /dev/shm, as explained in this link.

Now suppose I have a 4-GPU box and run a process on each GPU, each spinning up a multiprocessing dataloader. Each dumps information independently to /dev/shm. I’ve found that training can be fine for a few epochs, then crash with the following error:

could not unlink the shared memory file /torch_2999188_550744154

My current hypothesis is that two workers have somehow ended up using the same filename, and one has unlinked it, breaking the other. I’m not sure if this is true, but is there a way to modify the file location to torch_{rank}_2999188_550744154, where rank is the rank of the process?

That’s an interesting thought, but note that the file name is created from the pid, a random number, and an atomic counter, as seen here:

TORCH_API std::string NewProcessWideShmHandle() {
  static std::atomic<uint64_t> counter{0};
  static std::random_device rd;
  std::string handle = "/torch_";
#ifdef _MSC_VER
  handle += c10::guts::to_string(GetCurrentProcessId());
#else
  handle += c10::guts::to_string(getpid());
#endif
  handle += "_";
  handle += c10::guts::to_string(rd());
  handle += "_";
  handle += c10::guts::to_string(counter.fetch_add(1, std::memory_order_relaxed));
  return handle;
}
Based on this, different processes should not be able to create the same file, as their pids would differ (and they would additionally need to sample the same random number).

In any case, it seems you have a code snippet which can reproduce the error (after a few epochs), so you could add the rank to the file name, rebuild PyTorch, and see if this would help.

Thanks for the link.

I see, so it’s unlikely this is true.

Can you explain why the above code looks like it should produce three underscores, but the files are always labelled as ‘torch_2999188_550744154’ with only two numbers? Is the pid definitely being added?

Maybe your PyTorch version is too old and doesn’t include the latest changes?
You can git blame the file to see when these PRs were merged (e.g. this one and this one).

Ah yes, perfect! I should have thought to check that, thanks very much.

Hi @ptrblck, I’ve upgraded to torch 1.13 and it definitely works better, but I’m still getting errors (albeit less frequently):

RuntimeError: could not unlink the shared memory file /torch_1333031_3417509106_15235 : No such file or directory (2)

It now has the PID attached, but the issue persists.

Is there any straightforward way to just ignore this and turn it into a warning? It seems super rare, so I’d be happy to ignore any instances where it happens; I don’t know how one would skip past runtime errors in multiprocessing, though.

Posting a solution in case anyone comes across this error down the line

This solution was found by a colleague, not by me:

When data is loaded using a DataLoader with num_workers > 0, temporary files are created in /dev/shm and used as temporary storage. The filename is made of the pid of the calling process, a random number, and an incrementing counter (pytorch/MapAllocator.cpp at v1.13.0 · pytorch/pytorch · GitHub). Also, as can be seen in pytorch/MapAllocator.cpp at v1.13.0 · pytorch/pytorch · GitHub, the file is immediately unlinked, but because the file descriptor is kept open, it can still be used to store and read data.

By default, systemd cleans up various shared-memory-related resources belonging to non-system users. I think that because the user is not logged into the system, systemd sees the file created in /dev/shm and unlinks it before the code in MapAllocator.cpp#L345 gets to it. So it is a race condition that systemd is very unlikely to win, but sometimes it does. This situation is described much better in PostgreSQL: Documentation: 15: 19.4. Managing Kernel Resources.

The solution was to rebuild the image with RemoveIPC=no set in systemd’s logind.conf.
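For anyone reproducing this, a sketch of how that might be done on a systemd host. The drop-in file name is arbitrary; the paths are the standard systemd locations (adjust for your distro or image build):

```shell
# Stop systemd-logind from cleaning up IPC resources (including /dev/shm
# files) when the owning user is not logged in. Editing RemoveIPC= directly
# in /etc/systemd/logind.conf also works.
sudo mkdir -p /etc/systemd/logind.conf.d
printf '[Login]\nRemoveIPC=no\n' | sudo tee /etc/systemd/logind.conf.d/no-remove-ipc.conf

# Apply the change.
sudo systemctl restart systemd-logind
```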

@ptrblck is there any way to change the title to something more useful

e.g. could not unlink the shared memory file using torch version where name clash bug has been fixed

Thanks for the follow up and description of the issue! Yes, I can change the title for you.