Custom data sampling (sequences from different recordings)

Hi, I have some particular needs for how to sample my training data, and am hoping someone can suggest a way to do it while still benefitting from the built-in utilities:

Dataset: N recordings, each consisting of a variable number of data points

Network input: L consecutive data points (L much smaller than each recording). Let’s call this an L-sequence.

Ideally, for each batch I would like to draw m L-sequences, from random places in my data set. I would like each L-sequence to have a random midpoint, so sequence overlaps within the same epoch would be allowed. For this reason, I would like to be able to stop an epoch before all possible L-sequences have been drawn (otherwise I am passing each data point to the network L times, which is probably overkill).
I also need my L-sequences to not cross boundaries between recordings.

L is small enough that setting m=1 incurs a pretty hefty performance cost (each epoch takes about 3 times as long, compared to the optimal batch size), so I would like to pass several L-sequences to the GPU at a time.

Is there a way to get most of this while still using the data loader class?

This post explains a simple windowed Dataset, where non-overlapping windows would be created, while this post is a bit more complicated in order to avoid “mixing” data samples from different sources.
Maybe one of these two approaches could be useful for you.
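
For what it's worth, here is a minimal sketch of one way this could look with the built-in utilities. It is not taken from either linked post. It assumes the recordings fit in memory as a list of tensors, and the names `WindowDataset`, `recordings`, and `L` are made up for illustration. The dataset indexes every window start that keeps the window inside a single recording, and `RandomSampler` with `replacement=True` and `num_samples` draws a fixed number of random (possibly overlapping) windows per epoch, so the epoch ends before all windows are visited.

```python
import bisect

import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler


class WindowDataset(Dataset):
    """All length-L windows that lie entirely inside one recording (hypothetical helper)."""

    def __init__(self, recordings, L):
        self.recordings = recordings
        self.L = L
        # Number of valid window starts per recording (0 if it is shorter than L).
        counts = [max(len(r) - L + 1, 0) for r in recordings]
        # offsets[i] is the global index of the first window in recording i.
        self.offsets = [0]
        for c in counts:
            self.offsets.append(self.offsets[-1] + c)

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, idx):
        # Map the global window index to (recording, start within recording).
        rec = bisect.bisect_right(self.offsets, idx) - 1
        start = idx - self.offsets[rec]
        return self.recordings[rec][start:start + self.L]


# Toy data: two recordings of different lengths (assumed 1-D here).
recordings = [torch.arange(10.0), torch.arange(100.0, 107.0)]
dataset = WindowDataset(recordings, L=4)

# Draw 20 random windows per epoch, with replacement, in batches of m=5.
sampler = RandomSampler(dataset, replacement=True, num_samples=20)
loader = DataLoader(dataset, batch_size=5, sampler=sampler)

batches = list(loader)  # each batch has shape (5, 4)
```

Indexing windows by start rather than midpoint gives the same set of windows. For data that does not fit in memory, `__getitem__` would need to load the slice from disk instead, which this sketch does not cover.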
