Writing a fast Dataset class

Shalabh_Gupta · July 3, 2020, 7:02am

Hi all! I am relatively new to PyTorch and would like some help writing a dataset class, ideally one that uses PyTorch's built-in dataset utilities such as DatasetFolder.

I am working on the ModelNet10 dataset, and my directory structure is as follows:
The root directory contains 10 folders (bathtub,bedroom,toilet,…). Each of these contains 2 folders (train and test), which in turn contain around 500 .stl files. Right now, my dataset class looks like this:

import os

import numpy as np
import torch
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset

# input_stl is my own STL parser (not shown); it returns a dict or False.


class MeshData(Dataset):

    def __init__(self, root_dir):
        self.classes = set()
        self.X = []
        self.Y = []
        self.final_dataset = []
        self.classes_codec = LabelEncoder()

        lst_of_classes = os.listdir(root_dir)
        lst_of_classes.sort()

        self.classes_codec.fit(lst_of_classes)

        for x in lst_of_classes:
            print(x)

            path = os.path.join(root_dir, x, 'train')
            lst_of_objects = os.listdir(path)
            for y in lst_of_objects:
                file_path = os.path.join(path, y)

                # Every file is parsed here, inside __init__ (eager loading).
                dict_para = input_stl(file_path)
                if dict_para == False:
                    print('Input file', y, 'has problems')
                    continue

                neigh = dict_para["neigh_index"]
                corner = dict_para["corners"]
                center = dict_para["centroids"]
                normal = dict_para["normals"]

                self.X.append(np.concatenate((center, corner, normal, neigh), axis=1))
                self.Y.append(self.one_hot_encode(self.classes_codec, [x]))

        self.final_dataset = [self.X, self.Y]

    def one_hot_encode(self, codec, values):
        # Despite the name, this returns the class index (argmax of the
        # one-hot row), not the one-hot vector itself.
        value_idxs = codec.transform(values)
        _, idx = torch.max(torch.eye(len(codec.classes_))[value_idxs], 1)
        return idx

    def __len__(self):
        # final_dataset is [X, Y], so its length is always 2; count samples instead.
        return len(self.X)

    def __getitem__(self, idx):
        return torch.from_numpy(self.X[idx]), self.Y[idx]

Here, I brute-force over all folders and files, load each file in the format I want, and one-hot encode the labels. But this is far too time-consuming.
I looked through the documentation and found that torchvision.datasets.DatasetFolder does a similar job, but it expects a directory structure like this:

root/class_x/xxx.ext
root/class_x/xxy.ext
root/class_x/xxz.ext

root/class_y/123.ext
root/class_y/nsdf3.ext
root/class_y/asd932_.ext

The only difference is that I have train and test folders inside class_x and class_y, and I only want to load the training .stl files.

Thanks a lot.

ptrblck · July 3, 2020, 8:05am

The main difference between the ImageFolder and your approach would be the lazy vs. eager loading.
While the ImageFolder dataset loads each sample lazily in the __getitem__ method, it seems you are preloading all samples in the __init__ method.

You could try to refactor your code such that you are only creating all valid paths (and targets, if it’s cheap to calculate them) in the __init__ and execute the loading using these paths in __getitem__.
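A minimal sketch of that refactor, under some assumptions: the class name LazyMeshData and the parameters load_fn, split and ext are hypothetical, and load_fn stands in for whatever turns one .stl path into a feature array (for example, a wrapper around input_stl plus the concatenation above). Only paths and integer targets are collected in __init__; parsing is deferred to __getitem__.

```python
import os

import torch
from torch.utils.data import Dataset


class LazyMeshData(Dataset):
    """Hypothetical lazy variant: index paths in __init__, load in __getitem__."""

    def __init__(self, root_dir, load_fn, split="train", ext=".stl"):
        # load_fn is an assumed callable: file path -> array-like features.
        self.load_fn = load_fn
        self.classes = sorted(
            d for d in os.listdir(root_dir)
            if os.path.isdir(os.path.join(root_dir, d))
        )
        self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

        # One (path, target) pair per sample keeps paths and labels aligned.
        self.samples = []
        for name in self.classes:
            split_dir = os.path.join(root_dir, name, split)
            if not os.path.isdir(split_dir):
                continue
            for fname in sorted(os.listdir(split_dir)):
                if fname.lower().endswith(ext):
                    path = os.path.join(split_dir, fname)
                    self.samples.append((path, self.class_to_idx[name]))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, target = self.samples[idx]
        # The expensive parse happens here, once per requested sample.
        features = torch.as_tensor(self.load_fn(path))
        return features, target
```

Wrapped in a DataLoader with num_workers > 0, the per-sample loading then runs in the worker processes rather than up front in the main process. Note that files that fail to parse can no longer be skipped in __init__ without reading them, so load_fn would need to handle that case itself.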

Let me know if this helps.

Shalabh_Gupta · July 3, 2020, 8:49am

Hi @ptrblck, thanks for the reply. If I want to process the data efficiently in the __getitem__ method, I would have to store the path and class name for each index idx in __init__ itself, right? That way we could just look them up by idx.

Thanks.

ptrblck · July 4, 2020, 2:34am

Yes, that is correct.
The usual approach would be to store the image paths and the targets in different lists and just index them in the __getitem__ method with the provided index argument.

However, you would have to make sure that the lists containing the image paths and targets keep their correspondence. Sorting only one of them would mismatch samples and labels and break your training.
Alternatively, you could keep a single list of dicts and append each item as {'img_path': path, 'target': target}, but it depends on your coding style.
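To illustrate the correspondence point, here is a small sketch with made-up paths and class indices (all values are hypothetical). With parallel lists, any reordering has to go through the same index permutation for both lists; with a list of dicts, each path travels with its target, so sorting is safe.

```python
# Hypothetical sample paths and their class indices (bathtub=0, sofa=1, toilet=2).
paths = ["toilet/train/b.stl", "bathtub/train/a.stl", "sofa/train/c.stl"]
targets = [2, 0, 1]

# Option 1: parallel lists. Reorder both through one shared permutation,
# never by sorting a single list on its own.
order = sorted(range(len(paths)), key=lambda i: paths[i])
paths_sorted = [paths[i] for i in order]
targets_sorted = [targets[i] for i in order]

# Option 2: one list of dicts. Each entry carries its own target,
# so sorting the list cannot separate a path from its label.
samples = [{"img_path": p, "target": t} for p, t in zip(paths, targets)]
samples.sort(key=lambda s: s["img_path"])
```

In both cases, index i still refers to the same (path, target) pair after sorting, which is what __getitem__ relies on.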