Let’s say I have a dataset with 800 samples. I train on 8 GPUs. This launches 8 training processes, each of which consumes a shard of 100 samples.
However, there’s an issue where each sample has, say, a 10% chance of being faulty (and gets skipped). This means some dataloaders, purely by chance, end up with fewer usable samples and finish earlier than others.
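To make the imbalance concrete, here’s a quick back-of-the-envelope simulation of the setup above (the function name, the fixed seed, and treating “faulty” as an independent 10% coin flip per sample are just illustrative assumptions, not my actual loader):

```python
import random

def shard_sizes(n_samples=800, n_ranks=8, faulty_rate=0.10, seed=0):
    """Simulate how many usable samples each rank's shard ends up with
    when faulty samples are dropped at load time."""
    rng = random.Random(seed)
    shard = n_samples // n_ranks  # 100 samples per rank
    # Count the samples in each shard that survive the faulty check
    return [sum(rng.random() >= faulty_rate for _ in range(shard))
            for _ in range(n_ranks)]

sizes = shard_sizes()
print(sizes)                    # per-rank usable sample counts
print(max(sizes) - min(sizes))  # ranks run for different numbers of steps
```

So the ranks almost never agree on how many batches they will produce, which is exactly the kind of length mismatch I’m worried about.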
Now, I’m not 100% sure, but I think this is causing the deadlock described in pytorch/pytorch#85775 (“Timed out receiving the shared seed from the distribtued store on Rank 2”).
Basically, one dataloader finishes before the others. My guess is that the rank that runs out of data stops participating in whatever shared state the remaining ranks are still synchronizing on, and everything hangs.
Does anyone have experience training with dataloaders of different lengths?