How to do distributed training when each data loader may have a different # of samples?

Let’s say I have a dataset with 800 samples and I train on 8 GPUs. This launches 8 training processes, each of which consumes a shard of 100 samples.
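For illustration, here’s a toy sketch of that even sharding in plain Python (this mirrors what `torch.utils.data.DistributedSampler` does with `shuffle=False`; the function name here is hypothetical):

```python
NUM_SAMPLES = 800
WORLD_SIZE = 8

def shard_indices(rank, world_size=WORLD_SIZE, num_samples=NUM_SAMPLES):
    """Round-robin shard: rank r gets indices r, r+world_size, ..."""
    return list(range(rank, num_samples, world_size))

shards = [shard_indices(r) for r in range(WORLD_SIZE)]
print([len(s) for s in shards])  # every rank sees exactly 100 indices
```

So as long as nothing is dropped, every rank performs the same number of steps.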

However, there’s a catch: each sample has, let’s say, a 10% chance of being faulty, and faulty samples get skipped.
This means some dataloaders, purely by chance, run out of samples before the others do.
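A toy simulation of what I mean (not my actual training code, just the probability model): with 8 shards of 100 samples and a 10% fault rate, the number of usable samples differs per rank almost every run.

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility of the demo
WORLD_SIZE, SHARD_SIZE, FAULT_RATE = 8, 100, 0.10

# For each shard, count how many samples survive the 10% fault rate.
valid_counts = [
    sum(random.random() >= FAULT_RATE for _ in range(SHARD_SIZE))
    for _ in range(WORLD_SIZE)
]
print(valid_counts)  # counts scatter around 90 and are rarely all equal
```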

Now, I’m not 100% sure, but I think this is causing the following deadlock: Timed out receiving the shared seed from the distribtued store on Rank 2 · Issue #85775 · pytorch/pytorch · GitHub.

Basically, one dataloader finishes before the others, and that seems to trigger some sort of deadlock around shared state (the shared seed exchanged through the distributed store).
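For context, one workaround I’m considering (a hypothetical sketch, not something from the issue above) is to never shrink the dataset: when `__getitem__` hits a faulty sample, retry with a different random index, so every rank still yields exactly `len(dataset)` samples and all dataloaders finish in lockstep.

```python
import random

class SubstituteOnFault:
    """Wraps a list of samples; on a faulty sample, falls back to a
    random other index so __len__ never changes and every rank takes
    the same number of steps. (All names here are hypothetical.)"""

    def __init__(self, samples, is_faulty):
        self.samples = samples
        self.is_faulty = is_faulty  # predicate: sample -> bool

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        for _ in range(len(self.samples)):  # bounded number of retries
            sample = self.samples[idx]
            if not self.is_faulty(sample):
                return sample
            idx = random.randrange(len(self.samples))
        raise RuntimeError("no valid sample found")

# Toy usage: multiples of 3 stand in for "faulty" samples.
data = list(range(10))
ds = SubstituteOnFault(data, is_faulty=lambda x: x % 3 == 0)
out = [ds[i] for i in range(len(ds))]
assert len(out) == len(ds)           # length is preserved
assert all(x % 3 != 0 for x in out)  # only valid samples are returned
```

Alternatively, I believe `torch.distributed.algorithms.join.Join` is meant to handle uneven inputs across ranks for DDP, though I haven’t tried it for this case.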

Does anyone have experience training with dataloaders that yield different numbers of samples per rank?