Could not unlink the shared memory file, potential race with systemd

I want to be able to add some additional flag in the filenames dumped to /dev/shm to ensure no clashes between workers.

When a DataLoader is used with multiple workers, it writes a lot of files to /dev/shm, as explained in this link.

Now suppose I have a 4-GPU box and run one process per GPU, each spinning up a multiprocessing DataLoader. Each process writes to /dev/shm independently. I’ve found that training can run fine for a few epochs and then crash with the following error:

could not unlink the shared memory file /torch_2999188_550744154

My current hypothesis is that two workers have somehow ended up using the same filename, and one has unlinked it, which breaks the other. I’m not sure this is true, but is there a way to change the file name to torch_{rank}_2999188_550744154, where rank is the rank of the process?

That’s an interesting thought, but note that the file name is built from the pid, a random number, and a per-process counter, as seen here:

TORCH_API std::string NewProcessWideShmHandle()
{
  static std::atomic<uint64_t> counter{0};
  static std::random_device rd;
  std::string handle = "/torch_";
#ifdef _MSC_VER
  handle += c10::guts::to_string(GetCurrentProcessId());
#else
  handle += c10::guts::to_string(getpid());
#endif
  handle += "_";
  handle += c10::guts::to_string(rd());
  handle += "_";
  handle += c10::guts::to_string(counter.fetch_add(1, std::memory_order_relaxed));
  return handle;
}

Based on this, different processes should not be able to create the same file, as their pids would differ (and they would additionally need to sample the same random number).
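To check which naming scheme a given file in /dev/shm follows, the name can be split on underscores. This is a minimal Python sketch that assumes the layout produced by the C++ above (/torch_<pid>_<random>_<counter>). The helper name parse_shm_handle is hypothetical and not part of PyTorch.

```python
from typing import Optional, Tuple


def parse_shm_handle(name: str) -> Optional[Tuple[int, ...]]:
    """Return the numeric fields of a '/torch_..." handle, or None if it does not match.

    Hypothetical helper for inspection only; not part of PyTorch.
    """
    base = name.lstrip("/")
    if not base.startswith("torch_"):
        return None
    fields = base[len("torch_"):].split("_")
    # Every field appended by the C++ function is a decimal number.
    if not all(field.isdigit() for field in fields):
        return None
    return tuple(int(field) for field in fields)


if __name__ == "__main__":
    for handle in ("/torch_1333031_3417509106_15235", "/torch_2999188_550744154"):
        fields = parse_shm_handle(handle)
        print(handle, "->", len(fields), "numeric fields")
```

A name with three numeric fields matches the function shown above. A name with only two was presumably produced by a different version of that function.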

In any case, it seems you have a code snippet which can reproduce the error (after a few epochs), so you could add the rank to the file name, rebuild PyTorch, and see if this would help.

Thanks for the link.

I see, so it’s unlikely that this is true.

Can you explain why the above code looks like it should produce three underscores and three numbers, but the files are always labelled like ‘torch_2999188_550744154’, with only two numbers? Is the pid definitely being added?

Maybe your PyTorch version is too old and doesn’t use the latest changes?
You can git blame the file to see when these PRs were merged (e.g. this one and this one).

Ah yes, perfect! I should have thought to check that, thanks very much.

Hi @ptrblck, I’ve upgraded to torch 1.13 and it definitely works better; however, I’m still getting errors (albeit less frequently):

RuntimeError: could not unlink the shared memory file /torch_1333031_3417509106_15235 : No such file or directory (2)

It now has the PID attached, but the issue persists.

Is there any straightforward way to just ignore this and turn it into a warning? It seems super rare, so I would be happy to ignore any instances where it happens. I don’t know how one would skip past runtime errors in multiprocessing, though.

Posting a solution in case anyone comes across this error down the line

This solution was found by a colleague, not by me:

When data is loaded using a DataLoader with num_workers > 0, files are created in /dev/shm and used as temporary storage. The filename is made of the pid of the calling process, a random number, and an incrementing counter (pytorch/MapAllocator.cpp at v1.13.0 · pytorch/pytorch · GitHub). As can also be seen in pytorch/MapAllocator.cpp at v1.13.0 · pytorch/pytorch · GitHub, the file is unlinked immediately, but because the file descriptor is kept open, it can still be used to store and read data.

By default, systemd cleans up various shared-memory resources belonging to non-system users. I think that, because the user is not logged into the system, it sees the file created in /dev/shm and unlinks it before the code at MapAllocator.cpp#L345 gets to it. So it is a race condition that systemd is very unlikely to win, but sometimes it does. This situation is described much better in PostgreSQL: Documentation: 15: 19.4. Managing Kernel Resources.
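Both behaviours can be reproduced without PyTorch. An unlinked file stays usable through an open descriptor, and unlinking a name that something else has already removed fails with ENOENT (errno 2), the same errno reported in the RuntimeError. This is a minimal sketch assuming a POSIX system. The temporary file stands in for a /dev/shm entry, and the function name is hypothetical.

```python
import errno
import os
import tempfile
from typing import Optional, Tuple


def unlink_while_open_demo() -> Tuple[bytes, Optional[int]]:
    """Return (data read back after unlink, errno raised by a second unlink)."""
    fd, path = tempfile.mkstemp()  # stand-in for a file under /dev/shm
    try:
        os.unlink(path)  # the name is gone, but the open descriptor stays valid
        os.write(fd, b"tensor bytes")
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)
        try:
            os.unlink(path)  # the name no longer exists, so this fails
            second_errno = None
        except FileNotFoundError as exc:
            second_errno = exc.errno
    finally:
        os.close(fd)
    return data, second_errno


if __name__ == "__main__":
    data, err = unlink_while_open_demo()
    print(data, err == errno.ENOENT)
```

In the scenario described in the thread, the first unlink would come from systemd and the second from PyTorch. Here both are issued by the same process, purely for illustration.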

The solution was to rebuild the image with RemoveIPC=no (a systemd-logind setting, configured in /etc/systemd/logind.conf).
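To confirm what an image is actually configured with, the setting can be read back from the logind configuration. RemoveIPC is a systemd-logind option that lives in the [Login] section, normally in /etc/systemd/logind.conf. The sketch below only parses the text it is given. It is an assumption that no drop-in file under logind.conf.d overrides the value, and the function name is hypothetical.

```python
from typing import Optional


def remove_ipc_setting(conf_text: str) -> Optional[str]:
    """Return the last uncommented RemoveIPC value in [Login], or None if unset."""
    section = None
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (the shipped file comments out defaults).
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, val = line.partition("=")
        if sep and section == "Login" and key.strip() == "RemoveIPC":
            value = val.strip()
    return value


if __name__ == "__main__":
    sample = "[Login]\n#RemoveIPC=yes\nRemoveIPC=no\n"
    print(remove_ipc_setting(sample))
```

On a real machine, the contents of /etc/systemd/logind.conf would be passed in place of the sample string. A result of None means the file does not set the option explicitly, so the built-in default applies.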

@ptrblck is there any way to change the title to something more useful?

e.g. could not unlink the shared memory file using torch version where name clash bug has been fixed

Thanks for the follow-up and the description of the issue! Yes, I can change the title for you.