I’m encountering a freeze during distributed data parallel (DDP) training on 8 GPUs with multiple data loader workers. Training runs for some time and then hangs with no output, no error messages, and no noticeable resource exhaustion (GPU and CPU utilization look normal). Any insights or troubleshooting suggestions would be greatly appreciated.
Thanks
Do you see this hang only if multiple workers are used?
After investigating the code further, I noticed that I’m currently sharding the input files by rank. When one rank finishes its share of the data early, the whole training run seems to stall, likely because the remaining ranks block in the gradient synchronization, waiting for the rank that has already exited its training loop.
Is this understanding correct?
I’m using TorchRec’s model parallelism in combination with an IterableDataset to stream data into memory.
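For reference, the sharding logic looks roughly like this (a simplified sketch; the file list and the record parsing are placeholders for my actual pipeline):

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info

class ShardedFileDataset(IterableDataset):
    """Simplified sketch: shard a file list by rank, then by dataloader worker."""

    def __init__(self, files):
        self.files = files

    def __iter__(self):
        # Each rank only sees its own slice of the file list.
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        files = self.files[rank::world_size]

        # Split the rank's files further across its dataloader workers.
        info = get_worker_info()
        if info is not None:
            files = files[info.id::info.num_workers]

        for path in files:
            # Placeholder: stream records from a single file.
            yield from self._read_records(path)

    def _read_records(self, path):
        ...
```

If the files contain different numbers of records, the ranks end up producing different numbers of batches, which would match the behavior I’m seeing where one rank finishes early.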
Is there a supported way to combine data parallelism with model parallelism in TorchRec?
Or is it expected that I manually manage data sharding and ensure balanced record distribution across ranks?
Yes, DDP requires the same number of batches on all ranks, which is often achieved with the DistributedSampler. Since you are using an IterableDataset, I would assume you need to make sure the same number of batches is provided on every rank. Alternatively, you could skip (some) synchronizations via the no_sync() context manager if possible (e.g. this might allow you to run a few gradient accumulation steps on the ranks that have more batches).
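Something along these lines might work as a starting point (just a sketch, not a drop-in solution: it assumes you can compute the per-rank batch count up front, e.g. from per-file record counts, that batches come as `(inputs, targets)` pairs, and that `ddp_model`, `loader`, `optimizer`, and `criterion` are your usual training objects):

```python
import torch
import torch.distributed as dist

def train_epoch(ddp_model, loader, optimizer, criterion, device, num_local_batches):
    # Agree on the smallest per-rank batch count so every rank performs the
    # same number of synchronized steps.
    n_min = torch.tensor(num_local_batches, device=device)
    dist.all_reduce(n_min, op=dist.ReduceOp.MIN)
    n_min = int(n_min.item())

    for i, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        if i < n_min:
            # Synchronized step: all ranks participate in the gradient all-reduce.
            optimizer.zero_grad()
            loss = criterion(ddp_model(inputs), targets)
            loss.backward()
            optimizer.step()
        else:
            # Surplus batches on this rank: skip the all-reduce so ranks that
            # have already finished are not blocked. Gradients only accumulate
            # locally here.
            with ddp_model.no_sync():
                loss = criterion(ddp_model(inputs), targets)
                loss.backward()
```

Whether you eventually step on (or simply discard) the locally accumulated gradients from the surplus batches is up to your use case.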
I’m not familiar enough with TorchRec to know the best approach for data and model parallel strategies.