Speeding up a dataset


I have a dataset of video frames in a folder with the directory structure looking like this:

- parent_dir

My code, which gathers frames into a PyTorch dataset and splits them into patches, is quite slow in the __init__ method when scaled up to a large number of frames/videos. Is there a more efficient way to create the dataset?


import glob

import numpy as np
import torch
import torch.nn.functional as F
import torchvision.io as Tvio


class Dataset(torch.utils.data.Dataset):
    def __init__(self, directory='../data/*', get_subdirs=True, size=(16,16), max_ctx_length=4096, dry_run=False):
        print("Loading dataset...")
        self.data = glob.glob(directory)
        if get_subdirs:
            data_temp = []
            for p, i in enumerate(self.data):
                print("Loading data from {0}. {1} more to go...".format(i, len(self.data)-p))
                file_data = glob.glob(i+"/*")
                # Sort the frame files numerically by the digits in their names.
                file_data.sort(key=lambda r: int(''.join(x for x in r if x.isdigit())))
                data_temp.extend(file_data)
                if dry_run:
                    break  # only load the first video directory
            self.data = data_temp
        self.max_ctx_length = max_ctx_length
        self.size = size
    def __len__(self):
        return len(self.data)*self.size[0] - self.max_ctx_length - 1

    def __getitem__(self, key):
        frame_start = int(np.floor(key / self.size[0]))
        patch_start = int(np.mod(key, self.size[0]))
        patches = []
        i_frame = frame_start

        while len(patches) <= self.max_ctx_length+1:
            frame = (Tvio.read_image(self.data[i_frame], mode=Tvio.ImageReadMode.RGB).float() / 255).unsqueeze(0)
            if len(patches) == 0:
                # First frame: skip the patches before the starting offset.
                patches.extend(F.unfold(frame, self.size, stride=self.size).transpose(1,2).split(1,1)[patch_start:])
            else:
                patches.extend(F.unfold(frame, self.size, stride=self.size).transpose(1,2).split(1, 1))
            i_frame += 1
        patches = patches[:self.max_ctx_length+1]

        data_x = patches[0:-1]
        data_y = patches[1:]

        return torch.cat(data_x, dim=1).squeeze(0), torch.cat(data_y, dim=1).squeeze(0)

As it is written, it looks like the __init__ work could be parallelized across each item in self.data. Would using something like the multiprocessing module (multiprocessing — Process-based parallelism) speed this up?
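A minimal sketch of that idea, assuming the per-directory work is just the glob plus numeric sort from __init__ (the `load_video_dir` helper and the `'../data/*'` pattern are illustrative, not from the original code):

```python
import glob
from multiprocessing import Pool


def load_video_dir(path):
    # Collect the frame files in one video directory and sort them
    # numerically by the digits in their names (same key as __init__).
    files = glob.glob(path + "/*")
    files.sort(key=lambda r: int(''.join(x for x in r if x.isdigit())))
    return files


if __name__ == "__main__":
    video_dirs = glob.glob('../data/*')
    with Pool() as pool:
        # One worker per video directory; results come back in input order,
        # so the overall frame ordering matches the serial version.
        per_dir = pool.map(load_video_dir, video_dirs)
    data = [f for files in per_dir for f in files]
```

Whether this wins depends on how much of the time is spent in the sort versus filesystem metadata calls; for a purely I/O-bound listing, a thread pool would avoid the process start-up cost.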

Yes, I think so. You could probably also create a cache file that stores the frame listing per video directory as well…
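A sketch of that caching idea, assuming a JSON cache of the sorted file list is acceptable (the `load_file_list` helper and `frame_cache.json` name are hypothetical): the expensive glob-and-sort runs once, and later runs just read the cache back.

```python
import glob
import json
import os


def load_file_list(directory, cache_path="frame_cache.json"):
    # Reuse the cached listing if it exists; otherwise glob + sort once
    # and write the result out for the next run.
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    files = []
    for video_dir in glob.glob(directory):
        frames = glob.glob(video_dir + "/*")
        frames.sort(key=lambda r: int(''.join(x for x in r if x.isdigit())))
        files.extend(frames)
    with open(cache_path, "w") as f:
        json.dump(files, f)
    return files
```

The cache would need to be invalidated (deleted) whenever videos are added or removed; storing a per-directory frame count alongside the list would also let __len__ be computed without touching the filesystem at all.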