I’m encountering a freeze during distributed data parallel (DDP) training on 8 GPUs with multiple data loader workers. Training runs for some time and then hangs with no output, no error messages, and no noticeable resource exhaustion (GPU and CPU utilization look normal). Any insights or troubleshooting suggestions would be greatly appreciated.
Thanks
Do you see this hang only if multiple workers are used?
After investigating the code further, I noticed that I’m currently sharding the input files by rank. When one rank finishes its share of the data early, the whole training run seems to stall, likely because the remaining ranks block in the gradient synchronization, waiting for the rank that has already exited its training loop.
Is this understanding correct?
I’m using TorchRec’s model parallelism in combination with an IterableDataset to stream data into memory.
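For reference, the sharding logic looks roughly like this (a simplified sketch; the file list and the record parsing are placeholders for my actual pipeline):

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info

class ShardedFileDataset(IterableDataset):
    """Simplified sketch: shard a file list by rank, then by dataloader worker."""

    def __init__(self, files):
        self.files = files

    def __iter__(self):
        # Each rank only sees its own slice of the file list.
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        files = self.files[rank::world_size]

        # Split the rank's files further across its dataloader workers.
        info = get_worker_info()
        if info is not None:
            files = files[info.id::info.num_workers]

        for path in files:
            # Placeholder: stream records from a single file.
            yield from self._read_records(path)

    def _read_records(self, path):
        ...
```

If the files contain different numbers of records, the ranks end up producing different numbers of batches, which would match the behavior I’m seeing where one rank finishes early.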
Is there a supported way to combine data parallelism with model parallelism in TorchRec?
Or is it expected that I manually manage data sharding and ensure balanced record distribution across ranks?
Yes, DDP requires the same number of batches on all ranks, which is often achieved with the DistributedSampler. Since you are using an IterableDataset, I would assume you need to make sure the same number of batches is provided on every rank. Alternatively, you could skip (some) synchronizations via the no_sync() context manager if possible (e.g. this might allow you to run a few gradient accumulation steps on the ranks that have more batches).
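Something along these lines might work as a starting point (just a sketch, not a drop-in solution: it assumes you can compute the per-rank batch count up front, e.g. from per-file record counts, that batches come as `(inputs, targets)` pairs, and that `ddp_model`, `loader`, `optimizer`, and `criterion` are your usual training objects):

```python
import torch
import torch.distributed as dist

def train_epoch(ddp_model, loader, optimizer, criterion, device, num_local_batches):
    # Agree on the smallest per-rank batch count so every rank performs the
    # same number of synchronized steps.
    n_min = torch.tensor(num_local_batches, device=device)
    dist.all_reduce(n_min, op=dist.ReduceOp.MIN)
    n_min = int(n_min.item())

    for i, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        if i < n_min:
            # Synchronized step: all ranks participate in the gradient all-reduce.
            optimizer.zero_grad()
            loss = criterion(ddp_model(inputs), targets)
            loss.backward()
            optimizer.step()
        else:
            # Surplus batches on this rank: skip the all-reduce so ranks that
            # have already finished are not blocked. Gradients only accumulate
            # locally here.
            with ddp_model.no_sync():
                loss = criterion(ddp_model(inputs), targets)
                loss.backward()
```

Whether you eventually step on (or simply discard) the locally accumulated gradients from the surplus batches is up to your use case.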
I’m not familiar enough with TorchRec to know the best approach for data and model parallel strategies.