Does anyone have suggestions for handling multiple large nested datasets with more than one worker, and for making them appear as a single dataset?
- I have 10 datasets; each has, say, 5,000 items, and each item has 1,000 (potentially large, memory-wise) values. I want each batch to contain n values drawn from across the combined 10 datasets, BUT I don’t want to load every item into memory.
- I also want to do this with 12 workers and multiple GPUs.
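For concreteness, here is a minimal sketch of how I flatten everything into one index space (the constants are just the illustrative numbers above, and the names are made up):

```python
# Illustrative sizes matching the description above (not real config).
NUM_DATASETS = 10
ITEMS_PER_DATASET = 5_000
VALUES_PER_ITEM = 1_000

TOTAL_VALUES = NUM_DATASETS * ITEMS_PER_DATASET * VALUES_PER_ITEM

def locate(flat_idx):
    """Translate one flat index into (dataset_idx, item_idx, value_idx)."""
    dataset_idx, rest = divmod(flat_idx, ITEMS_PER_DATASET * VALUES_PER_ITEM)
    item_idx, value_idx = divmod(rest, VALUES_PER_ITEM)
    return dataset_idx, item_idx, value_idx

# Note that consecutive flat indices fall inside the same item/file,
# which is what causes the multi-worker problem described next.
```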
I can do this with some crafty indexing, caching each file as needed, and that works fine with one worker. With multiple workers, though, I have no way of controlling which worker gets which file, so each worker ends up loading the same file just to fetch nearby values.
For example, with 10 workers, worker one might get index 1, worker two index 2, and so on, but all of those indices come from the same file. Each worker then loads that same file independently, which is rather slow.
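Here is a rough sketch of the per-worker caching I mean; `load_file` is a placeholder for the expensive disk read, and the key point is that with multiprocessing each worker holds its own cache, so nothing stops several workers from loading the same file:

```python
from functools import lru_cache

ITEMS_PER_DATASET = 5_000  # illustrative sizes, as above
VALUES_PER_ITEM = 1_000

@lru_cache(maxsize=4)  # each worker PROCESS gets its own copy of this cache
def load_file(dataset_idx, item_idx):
    # Placeholder for reading one item/file from disk; in reality this
    # returns ~1,000 large values and is expensive.
    return [f"d{dataset_idx}-i{item_idx}-v{v}" for v in range(VALUES_PER_ITEM)]

def get_value(flat_idx):
    """Fetch a single value by flat index, loading its file on demand."""
    dataset_idx, rest = divmod(flat_idx, ITEMS_PER_DATASET * VALUES_PER_ITEM)
    item_idx, value_idx = divmod(rest, VALUES_PER_ITEM)
    return load_file(dataset_idx, item_idx)[value_idx]

# With num_workers > 1, adjacent flat indices can land on different
# workers, so several processes each call load_file(0, 0) and fill
# their own caches with the same file.
```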
Hope this makes sense. I can restructure my dataset, but I’d like to avoid that if possible.
FWIW, I am doing a word embedding here, so each dataset is one language, and each item/file is a corpus, and each value is a word+context.