Hello. I am training a language model using DDP across multiple compute nodes. The training data is a large set of millions of text sequences that doesn't fit in memory, so I'll have to load batches on the fly. Since the sequences have different lengths, I want to concatenate short sequences with an EOS token between them to avoid wasting computation on padding tokens. I also need to implement the dynamic loading so that each node gets different data, ideally shuffled as well. Is there a good way of doing this?
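To illustrate what I mean by packing, here's a rough sketch (function name and details are just my own, not from any library):

```python
def pack_sequences(seqs, block_size, eos_id):
    """Concatenate token sequences separated by an EOS token, then slice
    the stream into fixed-length blocks so batches contain no padding."""
    buf = []
    blocks = []
    for seq in seqs:
        buf.extend(seq)
        buf.append(eos_id)
        # Emit full blocks as soon as the buffer is long enough.
        while len(buf) >= block_size:
            blocks.append(buf[:block_size])
            buf = buf[block_size:]
    # Leftover tokens in buf are dropped here; they could instead be
    # carried over to the next shard of data.
    return blocks

# e.g. pack_sequences([[1, 2], [3], [4, 5, 6]], block_size=4, eos_id=0)
# yields [[1, 2, 0, 3], [0, 4, 5, 6]]
```

Is something like this the right idea, or is there a standard utility for it?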
My current thinking is to take the total number of nodes, assign each node a (manageably sized) chunk of the available data, then build a local dataset from that chunk on each node and train on it. Would this approach work? Also, if one of the chunks is shorter than the others and that node finishes early, will that interfere with DDP training?
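Concretely, the sharding I have in mind looks roughly like this (the function and seed handling are just a sketch of my idea):

```python
import random

def shard_for_rank(paths, rank, world_size, seed=0):
    """Give each rank a disjoint subset of the data files.

    Every rank shuffles the full file list with the same seed, so all
    nodes agree on the order, then strides through it so the subsets
    are disjoint and together cover all files.
    """
    rng = random.Random(seed)
    ordered = sorted(paths)  # canonical order before the shared shuffle
    rng.shuffle(ordered)
    return ordered[rank::world_size]
```

One thing I notice is that if `len(paths)` isn't divisible by `world_size`, some ranks get one file more than others, which is exactly the uneven-finishing situation I'm worried about above.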