I'm trying to gather some suggestions about how to implement a video loader implementing the class `torch.utils.data.Dataset`, so that it can be fed to a `DataLoader`. The data is a directory of multiple folders containing multiple MP4s, with the minimum size scaled to a fixed value.
I could think of my multiple videos as one big, long video of length `tot_nb_frames`, which will be reshaped into something like `tot_nb_batches x batch_size x height x width x 3`, where `tot_nb_batches = tot_nb_frames // batch_size`.
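To make the arithmetic concrete, here is a minimal sketch with made-up numbers (`tot_nb_frames = 1003` is hypothetical):

```python
# Layout arithmetic for the "one big long video" view (hypothetical sizes).
batch_size = 4
tot_nb_frames = 1003                                    # length of the concatenated video
tot_nb_batches = tot_nb_frames // batch_size            # number of full batches: 250
leftover = tot_nb_frames - tot_nb_batches * batch_size  # frames that don't fill a batch: 3
```

The `leftover` frames are exactly the "missing data" problem discussed further down.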
Now I have that, if I feed a list (why not a tuple???) of ordered numbers `list(range(t * batch_size, (t + 1) * batch_size))` for `t` in the interval `[0, tot_nb_batches)`, then `dataset[i] for i in indices` should return the correct next frame for each row of the batch. So `dataset` should have an appropriate internal mapping, which is based on `batch_size`, an attribute of the `DataLoader` and not of the `Dataset`.
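A sketch of that internal mapping, assuming each row of the batch is a contiguous temporal stream of the long video (the helper name `index_to_frame` is mine, not part of any API):

```python
def index_to_frame(i, batch_size, tot_nb_batches):
    # Flat index i, as produced by the ordered list above, belongs to
    # batch t, row r.  Row r owns a contiguous chunk of the long video,
    # so consecutive batches hand each row its next frame.
    t, r = divmod(i, batch_size)
    return r * tot_nb_batches + t
```

With `batch_size = 4` and `tot_nb_batches = 250`, indices 0..3 of batch 0 map to frames 0, 250, 500, 750, and indices 4..7 of batch 1 map to frames 1, 251, 501, 751: each row advances by one frame per batch.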
Questions / advice
Can anyone provide feedback on this strategy? Does it sound reasonable, or am I missing something?
Given that the mapping is based on `batch_size`, I am now wondering whether this should be performed by the `Sampler`. Nevertheless, given a specific initial mapping, the video readers should be initialised with different seeks. So, `DataLoaderIter` should call an initialisation method of the `Dataset`, but I think that this is not currently supported.
Oh, well, I could have a lazy approach: initialise the reader the first time a specific frame is requested. And, yeah, the indexing should be done from the `DataLoaderIter` side, since the `Dataset` should not care about batching at all.
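A minimal sketch of the lazy approach, with a fake reader standing in for a real decoder (all names here are hypothetical; a real version would subclass `torch.utils.data.Dataset`):

```python
class FakeReader:
    """Hypothetical stand-in for a real video reader (e.g. an FFmpeg wrapper)."""
    def __init__(self, path):
        self.path = path

    def read_frame(self, f):
        return (self.path, f)          # real code: a height x width x 3 array


class LazyVideoDataset:
    """Readers are opened only when one of their frames is first requested."""

    def __init__(self, video_paths, frames_per_video):
        self.video_paths = video_paths
        self.frames_per_video = frames_per_video
        self._readers = {}             # video index -> reader, created lazily

    def __len__(self):
        return len(self.video_paths) * self.frames_per_video

    def __getitem__(self, i):
        v, f = divmod(i, self.frames_per_video)
        if v not in self._readers:     # lazy initialisation + seek
            self._readers[v] = FakeReader(self.video_paths[v])
        return self._readers[v].read_frame(f)
```

This sidesteps the missing initialisation hook: no `DataLoaderIter` cooperation is needed, since each reader is created on first use inside `__getitem__`.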
Say I perform the mapping with the `Sampler`: I will have variable batch sizes, dropping from `batch_size` to `batch_size - 1` as soon as a video runs out of frames.

[diagram of the frame layout; only the column indices 0 5 10 15 20 25 remain]

After asking for batch 23, the batch size should decrease by one. This is a mess, since `_next_indices()` is still going to ask for a `batch_size` amount of data, screwing up everything.
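The variable-batch-size problem can already be seen in a toy sequential batcher (my own sketch, not PyTorch's `_next_indices()`):

```python
def sequential_batches(n_items, batch_size):
    # Yields full batches, then one short remainder batch -- exactly the
    # variable batch size that breaks a consumer expecting a fixed size.
    for t in range(0, n_items, batch_size):
        yield list(range(t, min(t + batch_size, n_items)))
```

For `n_items = 10` and `batch_size = 4` this yields `[0, 1, 2, 3]`, `[4, 5, 6, 7]` and then the short `[8, 9]`; anything downstream that assumes four items per batch breaks on the last one.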
Hacky solution 1
I could have the `DataLoader`'s `batch_size = 1`, have the `Dataset` object return the batches (columns) itself, and `squeeze()` the singleton dimension later on. But it looks nasty...
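A pure-Python sketch of this workaround, with lists standing in for tensors (in PyTorch the last step would be `batch.squeeze(0)`; `ColumnDataset` and `squeeze0` are hypothetical names):

```python
class ColumnDataset:
    """Hypothetical dataset whose items are whole batches (columns)."""

    def __init__(self, nb_batches, batch_size):
        self.nb_batches = nb_batches
        self.batch_size = batch_size

    def __len__(self):
        return self.nb_batches

    def __getitem__(self, t):
        # Frame indices of column t; a real version would return the
        # decoded batch_size x height x width x 3 block of frames.
        return [r * self.nb_batches + t for r in range(self.batch_size)]


def squeeze0(wrapped):
    """Stand-in for squeezing the singleton dimension a batch_size=1 loader adds."""
    assert len(wrapped) == 1
    return wrapped[0]


ds = ColumnDataset(nb_batches=3, batch_size=4)
batch = squeeze0([ds[1]])    # what DataLoader(ds, batch_size=1) would yield
```

The dataset does all the batching itself, so the loader's only job is iteration, which is exactly why it feels like an abuse of the API.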
Hacky solution 2
Given that the missing data (bottom-right corner) is `< batch_size`, and this usually gets as big as 128, I could simply return a duplicate of the last frame, at worst 127 times, which is roughly 2 seconds of video, compared to the hours of data. So... I think it's just fine. Otherwise, I could use the beginning of video 0. I think I'll opt for this way.
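A sketch of this padding, using the beginning of video 0 as filler (pass duplicates of the last frame instead for the other variant; `complete_tail` is a hypothetical helper):

```python
def complete_tail(tail, batch_size, filler_frames):
    # Pad the incomplete final column (the bottom-right corner) up to
    # batch_size items; at worst batch_size - 1 fillers are needed.
    missing = batch_size - len(tail)
    return tail + filler_frames[:missing]
```

For example, `complete_tail(['f998', 'f999'], 4, ['v0_f0', 'v0_f1', 'v0_f2'])` returns `['f998', 'f999', 'v0_f0', 'v0_f1']`, and a full column passes through unchanged.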
The whole thing should look like this:

[diagram of the frame layout; only the column indices 0 5 10 15 20 25 remain]

Here `>` represents the head of a generic video and `x` its subsequent frames; `0` represents the head of video zero, and `o` its frames.