I’m trying to train a BiLSTM as a language model.
I have a list of files containing documents of varying length (from 1 line up to 1000 lines). I’ve been following this tutorial, so I’m creating multiple
IterableDatasets, each of which grabs a file, streams its contents until it’s empty, then moves on to the next one. This all works fine.
Since my files differ so much in length, I’d like to avoid simply splitting the file list and passing one shard to each dataset. My initial thought was to pass in either an
Iterator (but it just gets copied, multiplying my data) or a
queue.SimpleQueue (but that fails when pickling). Is this a Windows-only error? Pickling seems to be required in CUDA environments anyway, so I guess I’ll have the same problem on Linux.
TL;DR: Is there a way to pass some sort of single source of truth into the
IterableDatasets, or a way for them to communicate? Just sharing which files have already been handed out would be enough.
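
To make the question concrete, here is a rough sketch of the behaviour I’m after, using only the standard library. SharedFileStream and make_file_queue are hypothetical names; in the real code the class would subclass torch.utils.data.IterableDataset. It assumes a multiprocessing.Queue can stand in for the "single source of truth", and I have not verified that it survives DataLoader worker startup on Windows.

```python
import multiprocessing as mp
import queue


class SharedFileStream:
    """Hypothetical stand-in for an IterableDataset.

    Instead of owning a fixed shard, each instance pulls the next
    unclaimed file path from a queue shared by all instances.
    """

    def __init__(self, file_queue, timeout=0.5):
        self.file_queue = file_queue
        self.timeout = timeout  # assumption: an empty queue for this long means "done"

    def __iter__(self):
        while True:
            try:
                # Each path is handed out exactly once across all consumers.
                path = self.file_queue.get(timeout=self.timeout)
            except queue.Empty:
                return
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    yield line.rstrip("\n")


def make_file_queue(paths):
    """Fill a multiprocessing.Queue with every file path up front."""
    file_queue = mp.Queue()
    for path in paths:
        file_queue.put(path)
    return file_queue
```

Every stream built on the same queue would then take whichever file is next, so a stream that draws short files simply claims more of them. Treating a timeout as the end of the data is a simplification that only holds because all paths are enqueued before iteration starts.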