Dataset from pandas without folder structure


I’m a complete beginner trying to do image classification. My image data and label data come from two parquet files, one for images and one for labels. I convert them to a big pandas dataframe (30000 rows x 900 columns plus 30000 rows x 1 column, where each row represents a 30 x 30 picture). However, all the dataset examples I find use pictures stored in a tree/folder structure. How can I adapt them to my case? Has anyone done it this way?



You can create a custom Dataset with a __getitem__ method that reads from your pandas dataframe.

The example in this tutorial may be helpful: replace the part that reads from the file system with code that reads from your pandas dataframe instead.

Subsequently, you can pass that custom Dataset into DataLoader and begin your training.

Hi, thank you for replying. I have read the PyTorch tutorial already. However, I cannot figure out how to change the custom dataset class to suit my original data, as all the image and label data of different kinds are stored together in one giant dataframe.

Do you have additional context to provide? The basic idea is that when your Dataset receives an index, you want to read something from the pandas DataFrame and return a sample. For example:

import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe

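    def __getitem__(self, index):
        # Fetch one row by position and convert it to a NumPy array.
        row = self.dataframe.iloc[index].to_numpy()
        # Column 0 holds the label; the remaining columns are the features.
        features = row[1:]
        label = row[0]
        return features, label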

    def __len__(self):
        return len(self.dataframe)

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['label', 'feature_0', 'feature_1'])
data = CustomDataset(dataframe=df)
dataloader = DataLoader(data)
for sample in dataloader:
    # Each sample is a (features, label) pair, batched by the DataLoader.
    print(sample)
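For the layout described in the question (one frame of 900 pixel columns and one frame with a single label column), the same idea might look roughly like the sketch below. The class name `PixelFrameDataset`, the `side` parameter, the column name `label`, and the random stand-in data are all hypothetical; it also assumes the images are single-channel and the labels are integer class ids, which the thread does not confirm.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class PixelFrameDataset(Dataset):
    """Hypothetical Dataset over a pixel DataFrame and a label DataFrame."""

    def __init__(self, images_df, labels_df, side=30):
        # Convert once up front, so __getitem__ only indexes tensors
        # instead of calling .iloc on every access.
        # Assumption: each row is a flattened side x side grayscale image.
        pixels = torch.tensor(images_df.to_numpy(), dtype=torch.float32)
        self.images = pixels.reshape(-1, 1, side, side)
        # Assumption: labels are integer class ids in a single column.
        self.labels = torch.tensor(labels_df.to_numpy().ravel(), dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # Returns one (1, side, side) image tensor and its scalar label.
        return self.images[index], self.labels[index]


# Stand-in data; in practice both frames would come from pd.read_parquet(...).
rng = np.random.default_rng(0)
images_df = pd.DataFrame(rng.integers(0, 256, size=(100, 900)))
labels_df = pd.DataFrame({"label": rng.integers(0, 10, size=100)})

dataset = PixelFrameDataset(images_df, labels_df)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Pull one batch to check the shapes a model would receive.
images, labels = next(iter(loader))
print(images.shape, labels.shape)
```

With `batch_size=32`, each batch of images has shape (32, 1, 30, 30) and each batch of labels has shape (32,), which is the layout convolutional layers such as `torch.nn.Conv2d` expect. Converting the whole frame to tensors in `__init__` is a design choice that suits data of this size; for data that does not fit in memory, reading per index inside `__getitem__` would be the alternative.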

Thank you for your code example, I find it easier to understand than the tutorial. I’ll try to provide more details later.