Throughout my work I have run into multiple issues with hung DataLoaders with num_workers > 1. It turns out this is a known issue with fork(); see, for example, this article. I see lots of material on the Internet suggesting fork is a bad practice (e.g. polars). spawn is rumoured to become the default method for starting child processes in Python 3.14. Wouldn't it make sense to make spawn the default start method for DataLoader workers in PyTorch (on Linux; I understand other OSes already use spawn)? Or at least mention the potential for deadlocks and the workaround with spawn in the docs…
I don’t know what kind of errors you are seeing, but the multiprocessing docs already ask users to use spawn if they want to use CUDA in their workflow (e.g. in data loading):
"The CUDA runtime does not support the fork start method; either the spawn or forkserver start method are required to use CUDA in subprocesses."
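A minimal sketch of what that note means in practice, assuming the subprocess itself touches CUDA (the worker function is just illustrative):

```python
import torch
import torch.multiprocessing as mp

def cuda_worker(rank):
    # CUDA is initialised only inside the subprocess; with the fork
    # start method this is not supported.
    x = torch.ones(4, device="cuda")
    print(f"worker {rank}: sum = {x.sum().item()}")

if __name__ == "__main__":
    # torch.multiprocessing.spawn uses the spawn start method by default,
    # which is what the quoted note requires.
    mp.spawn(cuda_worker, nprocs=2)
```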
Thanks. The errors I am getting are deadlocks, followed by a killed DataLoader worker - see the linked post.
Yes, you are right that the documentation warns against using fork. However, I generally work with CPU tensors in the subprocesses and move them to CUDA in the main process. What is confusing is that in this scenario fork actually works, at least for a very trivial example; with a more complex use case it hangs. At the same time, fork is the default start method on Linux, so it is very easy to run into a hard-to-debug issue just by setting num_workers to a positive value in the DataLoader constructor, without realising what you are doing wrong.
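For concreteness, my setup is roughly the pattern below (dataset details simplified), which is exactly where fork works for the trivial case but hangs once the dataset gets more complex:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

    # num_workers > 0 silently picks fork on Linux by default.
    loader = DataLoader(dataset, batch_size=64, num_workers=4)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    for features, labels in loader:
        # Workers only ever produce CPU tensors; the move to CUDA
        # happens here in the main process.
        features, labels = features.to(device), labels.to(device)
```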
For example, polars raises a warning in a similar scenario. I wonder if PyTorch should raise a warning too, or maybe even an error.