I’m currently training (or trying to) a small transformer model, and when looking at the processes running, I don’t just see the name of my Python process, but also several zombie (<defunct>) processes just called [pt_main_thread]. Their number correlates with the number of dataloader workers, and they keep respawning and disappearing, since their PIDs change over time. There are no error messages in the output. I am training the model on a single GPU and have tried varying numbers of dataloader workers.
Maybe this is to be expected, but I wanted to confirm before looking for bugs in my code or reporting any to PyTorch.
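For context, a `<defunct>` entry is just a child process that has exited but hasn't yet been reaped by its parent with `wait()` — which is consistent with dataloader workers being torn down and respawned. Here is a minimal, stdlib-only sketch (Linux-only, since it reads `/proc`; not PyTorch-specific) showing how such a zombie appears and disappears:

```python
import os
import time

def spawn_and_inspect():
    """Fork a child that exits immediately and report its state
    before the parent reaps it. On Linux, the state is 'Z' (zombie),
    which `ps` displays as <defunct>."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)          # child exits right away
    time.sleep(0.2)          # give the child time to terminate
    with open(f"/proc/{pid}/stat") as f:
        # /proc/<pid>/stat is "pid (comm) state ppid ..."; take field 3
        state = f.read().rsplit(")", 1)[1].split()[0]
    os.waitpid(pid, 0)       # reaping removes the <defunct> entry
    return state

print(spawn_and_inspect())   # 'Z' while the child awaits reaping
```

If the `[pt_main_thread]` entries vanish on their own as you observed, the parent is reaping them eventually, which would point to normal worker turnover rather than a leak.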
While I can’t provide a definitive answer, I’ve observed a similar phenomenon when training a large vision model on a WebDataset using torchrun.
During the first epoch, typically midway through, I notice a temporary slowdown in training. Concurrently, a few [pt_main_thread] <defunct> processes appear. After several seconds, training resumes at approximately 90% of the original pace. Full performance is usually restored within a few minutes, and the remainder of the training proceeds without issues.
This behavior might be related to your observation, though I can’t confirm whether it has the same underlying cause. It could be worth investigating whether this is a known issue or expected behavior in PyTorch.
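One way to check whether the `<defunct>` entries really belong to the training process (and are therefore likely exited dataloader workers) is to scan `/proc` for zombie children of a given PID. This is a stdlib-only, Linux-only diagnostic sketch, not a PyTorch API:

```python
import os

def zombie_children(parent_pid):
    """Return PIDs of `parent_pid`'s children currently in zombie state.

    Reads /proc directly (Linux-only). Run with the PID of the training
    process to confirm the <defunct> [pt_main_thread] entries are its
    children, i.e. exited workers awaiting reaping."""
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                # "pid (comm) state ppid ..." -> fields after the comm
                fields = f.read().rsplit(")", 1)[1].split()
        except OSError:
            continue  # process vanished while we were scanning
        state, ppid = fields[0], int(fields[1])
        if ppid == parent_pid and state == "Z":
            zombies.append(int(entry))
    return zombies
```

If the zombies are children of your trainer and their count tracks `num_workers`, that supports the "expected worker turnover" reading rather than a bug in your code.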