Hi, I have some particular requirements for how to sample my training data, and I'm hoping someone can suggest a way to meet them while still benefiting from the built-in utilities:
Dataset: N recordings, each consisting of a variable number of data points.
Network input: L consecutive data points (L is much smaller than the length of any recording). Let's call this an L-sequence.
Ideally, for each batch I would like to draw m L-sequences from random places in my dataset. I would like each L-sequence to have a random midpoint, so overlaps between sequences within the same epoch are allowed. For this reason, I would also like to be able to stop an epoch before all possible L-sequences have been drawn (otherwise each data point is passed to the network roughly L times, which is probably overkill).
I also need my L-sequences to not cross boundaries between recordings.
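To make the boundary constraint concrete: each recording of length n admits n - L + 1 valid start positions, so all valid windows can be enumerated with a single flat index that never spans two recordings. A toy illustration of the mapping I have in mind (the lengths are made up):

```python
from bisect import bisect_right
from itertools import accumulate

L = 16
lengths = [1000, 650, 2300]            # hypothetical recording lengths
counts = [n - L + 1 for n in lengths]  # valid window starts per recording
offsets = [0] + list(accumulate(counts))  # flat-index boundary of each recording

def flat_to_window(idx):
    """Map a flat window index to (recording, start); a window stays inside one recording."""
    rec = bisect_right(offsets, idx) - 1
    return rec, idx - offsets[rec]

print(flat_to_window(0))    # (0, 0)  first window of recording 0
print(flat_to_window(985))  # (1, 0)  first window of recording 1, no boundary crossing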
L is small enough that setting m=1 incurs a hefty performance cost (each epoch takes about three times as long as with the optimal batch size), so I would like to pass several L-sequences to the GPU at a time.
Is there a way to get most of this while still using the DataLoader class?
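For reference, here is roughly what I am imagining, assuming torch.utils.data (the class and variable names are mine, and all the numbers are placeholders): a Dataset that enumerates every valid window, combined with RandomSampler(replacement=True, num_samples=...) so each epoch is a capped number of independent uniform draws, and batch_size=m for the GPU.

```python
import torch
from bisect import bisect_right
from itertools import accumulate
from torch.utils.data import Dataset, DataLoader, RandomSampler

class WindowDataset(Dataset):
    """Enumerates every valid length-L window; windows never cross recording boundaries."""

    def __init__(self, recordings, L):
        self.recordings = recordings  # list of [time, features] tensors
        self.L = L
        # Each recording of length n contributes n - L + 1 valid start positions.
        counts = [rec.shape[0] - L + 1 for rec in recordings]
        self.offsets = [0] + list(accumulate(counts))  # flat-index boundaries

    def __len__(self):
        return self.offsets[-1]  # total number of valid windows

    def __getitem__(self, idx):
        rec = bisect_right(self.offsets, idx) - 1  # recording owning this flat index
        start = idx - self.offsets[rec]
        return self.recordings[rec][start:start + self.L]

# Dummy data: three recordings of different lengths, 8 features each.
recordings = [torch.randn(n, 8) for n in (1000, 650, 2300)]
dataset = WindowDataset(recordings, L=16)

# replacement=True gives independent uniform draws, so overlapping windows
# within an epoch are allowed; num_samples stops the epoch long before all
# len(dataset) possible windows have been seen.
sampler = RandomSampler(dataset, replacement=True, num_samples=len(dataset) // 16)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)  # m = 32 windows per batch

for batch in loader:
    print(batch.shape)  # [32, 16, 8], i.e. [m, L, features]
    break
```

Is something like this the intended pattern, or is there a cleaner built-in way to do it?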