Throughout my work I have run into multiple issues with hung DataLoaders with num_workers > 1. It turns out this is a known issue with fork(); see, for example, this article. I see lots of material on the Internet suggesting fork is a bad practice (e.g. polars). spawn is rumoured to become the default method for starting child processes in Python 3.14. Wouldn't it make sense to make spawn the default start method for DataLoader workers in PyTorch (on Linux; I understand other OSes already use spawn)? Or at least mention the potential for deadlocks and the workaround with spawn in the docs…
I don’t know what kind of errors you are seeing, but the multiprocessing docs already ask users to use spawn if they want to use CUDA in their workflow (e.g. in data loading):
"The CUDA runtime does not support the fork start method; either the spawn or forkserver start method are required to use CUDA in subprocesses."
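A minimal sketch of what that note means in practice, assuming the subprocess itself touches CUDA (the worker function is just illustrative):

```python
import torch
import torch.multiprocessing as mp

def cuda_worker(rank):
    # CUDA is initialised only inside the subprocess; with the fork
    # start method this is not supported.
    x = torch.ones(4, device="cuda")
    print(f"worker {rank}: sum = {x.sum().item()}")

if __name__ == "__main__":
    # torch.multiprocessing.spawn uses the spawn start method by default,
    # which is what the quoted note requires.
    mp.spawn(cuda_worker, nprocs=2)
```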
Thanks. The errors I am getting are deadlocks, followed by a killed DataLoader worker - see the linked post.
Yes, you are right that the documentation warns against using fork. However, I generally work with CPU tensors in the subprocesses and move them to CUDA in the main process. What is confusing is that in this scenario fork actually works, at least for a very trivial example; with a more complex use case it hangs. At the same time, fork is the default start method on Linux, so it is very easy to run into a hard-to-debug issue just by setting num_workers to a positive value in the DataLoader constructor, without realising what you are doing wrong.
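For concreteness, my setup is roughly the pattern below (dataset details simplified), which is exactly where fork works for the trivial case but hangs once the dataset gets more complex:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

    # num_workers > 0 silently picks fork on Linux by default.
    loader = DataLoader(dataset, batch_size=64, num_workers=4)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    for features, labels in loader:
        # Workers only ever produce CPU tensors; the move to CUDA
        # happens here in the main process.
        features, labels = features.to(device), labels.to(device)
```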
For example, polars raises a warning in a similar scenario. I wonder if PyTorch should raise a warning too, or maybe even an error.