Best way to handle a data pipeline where training inputs are randomly sampled and retrieved from a 2nd dataset by index?

Here is an example of what I am trying to do. The first dataset may look like this:

2   | 34, 64, 2243, 55678, 323, 778, 4454, 23, 4433, 3445, 455, 32
23  | 343, 56, 2, 5, 675, 34, 232, 677, 7, 54, 436, 77, 85, 33
592 | 343, 54, 4, 6
23  | 34
123 | 2, 4, 54, 38, 6643, 67, 3

Each of these numbers is an index (not the actual model input/target data) pointing to the actual data that will be fed into the model. The actual model data lives in a separate dataset, keyed by the same indexes, so it looks something like this:

1 | data1
2 | data2
3 | data3

For the first set, the numbers on the left are indexes to the targets: 2, 23, 592, 23, 123. The numbers on the right are indexes to the potential inputs, and for each target, 4 inputs are picked at random. For targets with fewer than 4 available inputs, inputs are repeated.
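The sampling rule can be sketched like this (the function name is mine; I'm assuming sampling without replacement when there are enough candidates, with repetition only when there aren't):

```python
import random

def sample_inputs(candidates, k=4):
    """Pick k input indexes for one target: without replacement when there
    are at least k candidates, repeating entries when there are fewer."""
    if len(candidates) >= k:
        return random.sample(candidates, k)
    return random.choices(candidates, k=k)  # repeats when fewer than k

sample_inputs([343, 54, 4, 6])  # some ordering of all four
sample_inputs([34])             # [34, 34, 34, 34]
```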

So taking the first training example in the first set, the model data would look like this:

target = torch.tensor(data2)
input = torch.mean(torch.stack([torch.tensor(data34), torch.tensor(data2243), torch.tensor(data23), torch.tensor(data32)]), dim=0)

(Note: torch.mean doesn't accept multiple tensors, so the sampled inputs are stacked first and averaged along dim 0.)

Both sets are too big to hold in memory at the same time, and I don't have enough disk space to save targets with every permutation of inputs.
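Assuming the records are fixed-size, one way to keep the big index-to-data store on disk while still getting fast random lookups is a memory-mapped array: the OS pages in only the rows you actually touch, so nothing close to the whole file has to live in memory. The path, shape, and fill values here are made up for illustration:

```python
import os
import tempfile
import numpy as np

dim, n_rows = 16, 1000
path = os.path.join(tempfile.mkdtemp(), "store.dat")  # hypothetical location

# one-time build: stream the real records to disk row by row
store = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, dim))
for i in range(n_rows):
    store[i] = i  # stand-in for the real record at index i
store.flush()

# training time: open read-only; rows are paged in from disk on demand
store = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, dim))
row = store[592]  # reads only this row's page, not the whole file
```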

I think the best strategy is to have X batches prepared ahead of time, using multithreading to set up future batches so that the data pipeline doesn't become the bottleneck in training, but I am wondering how to go about doing that.
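One way to get that prefetching without hand-rolling threads is PyTorch's DataLoader: a map-style Dataset does the per-item random sampling and disk lookup, and setting num_workers > 0 keeps future batches prepared in background worker processes (prefetch_factor batches per worker). A minimal sketch, with a toy in-memory dict standing in for the on-disk store; IndexPairDataset and lookup are names I made up:

```python
import random
import torch
from torch.utils.data import Dataset, DataLoader

class IndexPairDataset(Dataset):
    """Each item: sample input indexes for one target and fetch the data.

    `pairs` is a list of (target_idx, [candidate input idxs]); `lookup(i)`
    returns the tensor for index i (in practice a disk read, e.g. a memmap).
    """
    def __init__(self, pairs, lookup, n_inputs=4):
        self.pairs = pairs
        self.lookup = lookup
        self.n_inputs = n_inputs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        tgt_idx, candidates = self.pairs[i]
        if len(candidates) >= self.n_inputs:
            chosen = random.sample(candidates, self.n_inputs)
        else:
            chosen = random.choices(candidates, k=self.n_inputs)  # repeats
        inputs = torch.stack([self.lookup(c) for c in chosen])
        return inputs.mean(dim=0), self.lookup(tgt_idx)

# toy stand-in for the on-disk store: index -> 8-dim vector
store = {i: torch.full((8,), float(i)) for i in range(100)}
pairs = [(2, [34, 64, 23, 3]), (23, [3, 5, 2]), (5, [1])]
ds = IndexPairDataset(pairs, store.__getitem__)
# in real use, num_workers > 0 prefetches batches in background processes
loader = DataLoader(ds, batch_size=2, num_workers=0)
xb, yb = next(iter(loader))  # averaged inputs, targets
```

Since the workers re-sample inputs every epoch, no pre-materialized permutations ever touch the disk.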


I think my description is a bit hazy, so here is an analogy for the kind of data pipeline I'm trying to develop.

An analogy would be some kinds of word-embedding training, where the number of input words is smaller than the context window. For a particular target word, the input words are chosen at random from within that target word's context window.

For example, take this sentence:

There once was a very fast dog who could out run any other animal in the city.

Say the target word is 'dog', the context window size is 6, and the input size is 2. That gives a choice of 'a', 'very', 'fast', 'who', 'could', 'out', from which we need to pick two inputs at random. So one training example would be the word embedding for 'dog' as the target, and the word embeddings of 'fast' and 'out' as the inputs.
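Concretely, picking the two inputs from that window could look like this (splitting the window evenly around the target is my assumption):

```python
import random

sentence = ("There once was a very fast dog who could out run "
            "any other animal in the city").split()
t = sentence.index("dog")
window, n_inputs = 6, 2

# up to window/2 words on each side of the target, target itself excluded
context = sentence[max(0, t - window // 2):t] + sentence[t + 1:t + 1 + window // 2]
inputs = random.sample(context, n_inputs)  # e.g. ['fast', 'out']
```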

So in word-embedding training, many different inputs are used with many different targets. But there, all the word embeddings can be held in live memory, since the vocabulary is only six figures. In my case, the inputs can't all be in live memory.

So what I'm looking to do is essentially word2vec-style training where most of the embeddings have to stay on disk at any particular instant.