Custom dataloader structure

juhyung · November 9, 2018, 6:53am

I have a dataset which has images and corresponding texts.

I want to slice part of the dataset into inputs that each consist of a sequence of 4 consecutive images with their corresponding texts,
like data = [ [image1, image2, image3, image4], [text1, ...text4] ]
So if I have 6 images, I can make 3 inputs.
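
As a quick check of that count: a sequence of n items contains n - k + 1 contiguous windows of length k, so 6 images with k = 4 give 3 inputs. A minimal sketch in plain Python; the helper name `sliding_windows` is hypothetical and not part of PyTorch:

```python
def sliding_windows(items, k=4):
    """Return every contiguous window of length k (stride 1) from items."""
    # A sequence of n items has max(n - k + 1, 0) such windows;
    # range() is simply empty when there are fewer than k items.
    return [items[i:i + k] for i in range(len(items) - k + 1)]


images = ["image1", "image2", "image3", "image4", "image5", "image6"]
windows = sliding_windows(images)
print(len(windows))  # 3
print(windows[0])    # ['image1', 'image2', 'image3', 'image4']
```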

My dataset loader looks like this:

import h5py as h5
from torch.utils.data import Dataset

class dataset(Dataset):
    def __init__(self, path, start, end):
        self.path = path
        self.data = self._getsubsequence(start, end)

    def _getMegabatch(self, start, end):
        # simplified: stands for slicing the stored data to [start:end]
        file = h5.File(self.path)[start:end]
        return file

    def _getsubsequence(self, start, end):
        """
        returns the set of length-4 image subsequences
        """
        megabatch = self._getMegabatch(start, end)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        data = self.data
        return data[index]

I’m wondering whether this is a common way of generating mini-batches from mega-batch data.
Could you give me some advice?

ptrblck · November 9, 2018, 1:05pm

I’m not sure I understand the implementation of your Dataset fully.
Are you creating a new Dataset for each sequence, i.e. when are you passing new start and end values?
It looks like you are returning one sample from the sequence. As far as I’ve understood your use case, I thought you would like to return 4 samples holding the images and text data?

Could you post the shapes of your whole dataset and each sample?

juhyung · November 9, 2018, 1:42pm

@ptrblck Okay, I think the code above is too simplified.

Overall, _getMegabatch takes the mega-batch dataset[start:end] from my whole dataset,
and from that mega-batch I want to return mini-batches.

Specifically, my whole dataset is a list [images, texts] with images.shape = [20000, 224, 224, 3] and texts.shape = [20000, 30].

_getMegabatch() slices the whole data into mega-batch data [images, texts] with images.shape = [6, 224, 224, 3] and texts.shape = [6, 30]. The mega-batch size of 6 is just an example.

To make this easier to follow, let’s represent the images (shape [6, 224, 224, 3]) and texts of this mega-batch as [a, b, c, d, e, f] and [1, 2, 3, 4, 5, 6], where each letter is an image and each number is a text.

From this mega-batch I am going to get all length-4 subsequences with _getsubsequence, like self.data = [ [ [a, b, c, d], [b, c, d, e], [c, d, e, f] ], [ [1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6] ] ]

Finally, __getitem__ will return self.data[0][index], self.data[1][index] e.g. [a, b, c, d], [1, 2, 3, 4]
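
The layout described above can be sketched roughly as follows. This is only an illustration under assumptions: the class name `SubsequenceDataset` and the `seq_len` parameter are hypothetical, the mega-batch is passed in already sliced (loading it from the HDF5 file is left out), and lists of letters and numbers stand in for the image and text arrays.

```python
try:
    from torch.utils.data import Dataset  # base class when PyTorch is installed
except ImportError:
    Dataset = object  # fallback so the sketch also runs without PyTorch


class SubsequenceDataset(Dataset):
    """Hypothetical sketch: all length-seq_len windows of one mega-batch."""

    def __init__(self, images, texts, seq_len=4):
        assert len(images) == len(texts)
        n_windows = max(len(images) - seq_len + 1, 0)
        # self.data = [image_windows, text_windows], as described in the post.
        self.data = [
            [images[i:i + seq_len] for i in range(n_windows)],
            [texts[i:i + seq_len] for i in range(n_windows)],
        ]

    def __len__(self):
        # Number of windows, not len(self.data), which is always 2 here.
        return len(self.data[0])

    def __getitem__(self, index):
        return self.data[0][index], self.data[1][index]


ds = SubsequenceDataset(list("abcdef"), [1, 2, 3, 4, 5, 6])
print(len(ds))  # 3
print(ds[0])    # (['a', 'b', 'c', 'd'], [1, 2, 3, 4])
```

With real arrays of the shapes given above, wrapping such a dataset in a DataLoader with batch_size=B should produce image batches of shape [B, 4, 224, 224, 3], since the default collation stacks the samples along a new first dimension.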

I’m not sure whether this dataloader structure, where mini-batches are drawn from within a mega-batch, is efficient or not.
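
On the efficiency question, one possible alternative (a sketch, not something proposed in this thread) is to skip building self.data up front and slice each window on demand in __getitem__, so that only the mega-batch itself is stored. The class name `LazySubsequenceDataset` and the `seq_len` parameter are hypothetical, and lists again stand in for the image and text arrays.

```python
try:
    from torch.utils.data import Dataset  # base class when PyTorch is installed
except ImportError:
    Dataset = object  # fallback so the sketch also runs without PyTorch


class LazySubsequenceDataset(Dataset):
    """Hypothetical sketch: windows are sliced when requested, not stored."""

    def __init__(self, images, texts, seq_len=4):
        assert len(images) == len(texts)
        self.images = images
        self.texts = texts
        self.seq_len = seq_len

    def __len__(self):
        # n - seq_len + 1 windows, or 0 if the mega-batch is too short.
        return max(len(self.images) - self.seq_len + 1, 0)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        stop = index + self.seq_len
        return self.images[index:stop], self.texts[index:stop]


lazy = LazySubsequenceDataset(list("abcdef"), [1, 2, 3, 4, 5, 6])
print(len(lazy))  # 3
print(lazy[1])    # (['b', 'c', 'd', 'e'], [2, 3, 4, 5])
```

Both variants return the same samples for the same index; the difference is only whether the windows are precomputed in __init__ or computed per access.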