I’m building an NLP application with a dataloader that builds batches out of sequential blocks of text in a file. I have been using an IterableDataset since my text file won’t fit into memory. However, when I use it with DistributedDataParallel, the dataloader is replicated across processes and each GPU ends up with the same batch of data. How can I give each GPU a different batch of data to take advantage of distributed training?
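For concreteness, here is a minimal sketch of roughly what my dataset and loader look like (class name, file name, and block size are placeholders, not my actual code):

```python
from torch.utils.data import IterableDataset, DataLoader

class BlockTextDataset(IterableDataset):
    """Streams a large text file as fixed-size blocks of lines (simplified placeholder)."""

    def __init__(self, path, block_size=1024):
        self.path = path
        self.block_size = block_size

    def __iter__(self):
        block = []
        with open(self.path) as f:
            for line in f:
                block.append(line)
                if len(block) == self.block_size:
                    yield "".join(block)
                    block = []
        if block:
            yield "".join(block)

# Each DDP process constructs its own copy of this loader, so every rank
# starts reading the file from the beginning and yields identical batches.
loader = DataLoader(BlockTextDataset("corpus.txt"), batch_size=8)
```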
Note: My dataloader can load ~300 batches/second in a single process, and each GPU takes ~2 seconds to process a batch, so dataloader speed should not be a significant limiting factor even if batches were served serially to the different GPUs.