Feeding Multiprocessing IterableDataset from a single source

tsteffek · April 19, 2020, 8:11am

I’m trying to train a BiLSTM as a language model.

I have a list of files, each containing documents of varying length (from 1 line up to 1000 lines). I’ve been following this tutorial, so I’m creating multiple IterableDatasets, each grabbing a file, streaming its contents until it’s empty, rinse and repeat. This all works fine.

Now, since my files are of such different lengths, I’d like to avoid just splitting the list and passing a shard to each dataset. My initial thought was to pass an Iterator (which just gets copied to each dataset, multiplying my data) or a queue.Queue / queue.SimpleQueue, but that fails when pickling. (Is this a Windows-only error? But pickling seems to be required in CUDA environments anyway, so I guess I’ll have the same problem on Linux.)
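For reference, the pickling failure can be reproduced without a DataLoader at all. This is a minimal sketch using only the standard library; `can_pickle` is a hypothetical helper written for this illustration, not part of any library.

```python
import pickle
import queue


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False if it raises."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# The thread-oriented queues hold locks or C-level state that pickle rejects,
# while a plain list of file paths pickles without trouble.
print("queue.Queue picklable:", can_pickle(queue.Queue()))
print("queue.SimpleQueue picklable:", can_pickle(queue.SimpleQueue()))
print("list of paths picklable:", can_pickle(["a.txt", "b.txt"]))
```

The queues in the `queue` module are meant for threads inside one process, which is why they cannot be sent to a worker process this way.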

TL;DR: Is there a way to pass some sort of single source of truth into the IterableDatasets, or a way for them to communicate? Just sharing which files have already been processed would be enough.

tsteffek · April 19, 2020, 8:29am

Never mind, I just realised that it works with torch.multiprocessing.Queue.
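For anyone finding this later, here is a rough sketch of the shared-queue idea, not my exact code. To keep it dependency-free it uses the standard `multiprocessing` module, which `torch.multiprocessing` wraps, so `torch.multiprocessing.Queue` should be usable as the `queue_factory`. The names `QueueFedLines`, `make_file_queue`, and `SENTINEL` are made up for this example, and in a real setup the class would subclass `torch.utils.data.IterableDataset`.

```python
import multiprocessing as mp

SENTINEL = None  # hypothetical end-of-work marker; one is queued per consumer


class QueueFedLines:
    """Yields lines from files whose paths are pulled from a shared queue.

    Assumption: in practice this would subclass
    torch.utils.data.IterableDataset and be handed to a DataLoader.
    """

    def __init__(self, file_queue):
        self.file_queue = file_queue

    def __iter__(self):
        while True:
            # Blocks until a path (or a sentinel) is available.
            path = self.file_queue.get()
            if path is SENTINEL:
                return  # no more files for this consumer
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield line.rstrip("\n")


def make_file_queue(paths, num_consumers, queue_factory=mp.Queue):
    """Fill a queue with all paths, followed by one sentinel per consumer."""
    q = queue_factory()
    for p in paths:
        q.put(p)
    for _ in range(num_consumers):
        q.put(SENTINEL)
    return q
```

Each consumer takes the next unread file whenever it finishes its current one, so short and long files balance out without sharding the list up front. The per-consumer sentinel is just one way to signal the end; it assumes the number of consumers is known when the queue is filled.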