Dataset from pandas without folder structure

Hi,

I’m a complete beginner trying to do image classification. My image data and label data come from two parquet files, one for images and one for labels. I convert them into two big pandas dataframes (30000 rows x 900 columns plus 30000 rows x 1 column, where each row represents a 30 x 30 picture). However, all the dataset examples I find use pictures stored in a tree/folder structure. How can I adapt them to my case? Has anyone done it this way?

Regards,

Erik


You can create a custom Dataset with a __getitem__ method that reads from your pandas dataframe.

The example in this tutorial may be helpful; replace the part that reads from the file system with reading from your pandas dataframe instead.

Subsequently, you can pass that custom Dataset into DataLoader and begin your training.

Hi, thank you for replying. I have read the PyTorch tutorial already. However, I cannot figure out how to change the custom dataset class to suit my original data, since the image and label data are stored together in giant dataframes rather than in separate files.


Do you have additional context to provide? The basic idea is that when your Dataset receives an index, you want to read something from the pandas DataFrame and return a sample. For example:

import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe

    def __getitem__(self, index):
        # Look up one row by position and split it into features and label.
        row = self.dataframe.iloc[index].to_numpy()
        features = row[1:]
        label = row[0]
        return features, label

    def __len__(self):
        # The number of samples is the number of rows in the dataframe.
        return len(self.dataframe)


# Toy dataframe: the first column is the label, the rest are features.
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['label', 'feature_0', 'feature_1'])
data = CustomDataset(dataframe=df)
dataloader = DataLoader(data)
for sample in dataloader:
    print(sample)
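For your specific case (one dataframe of 900 pixel columns and a second dataframe of labels), the same idea applies; the only extra step is reshaping each flat row into a 30 x 30 image tensor. A rough sketch, using synthetic data in place of your parquet-derived dataframes (names like ImageFrameDataset are just illustrative):

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class ImageFrameDataset(Dataset):
    """Pairs a pixel dataframe (one 900-column row per 30x30 image)
    with a label dataframe of the same length."""

    def __init__(self, image_df, label_df):
        assert len(image_df) == len(label_df)
        self.image_df = image_df
        self.label_df = label_df

    def __getitem__(self, index):
        # Reshape the flat 900-value row into a 1 x 30 x 30 float tensor
        # (channel first, as most PyTorch vision models expect).
        pixels = self.image_df.iloc[index].to_numpy(dtype=np.float32)
        image = torch.from_numpy(pixels).reshape(1, 30, 30)
        label = int(self.label_df.iloc[index, 0])
        return image, label

    def __len__(self):
        return len(self.image_df)


# Small synthetic stand-ins for the two parquet-derived dataframes.
image_df = pd.DataFrame(np.random.rand(8, 900))
label_df = pd.DataFrame({'label': np.random.randint(0, 10, size=8)})

dataset = ImageFrameDataset(image_df, label_df)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([4, 1, 30, 30])
print(labels.shape)  # torch.Size([4])
```

Whether you keep the labels in a separate dataframe or join them onto the pixel dataframe first is mostly a matter of taste; the Dataset just needs to return one (image, label) pair per index.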

Thank you for your code example, I find it easier to understand than the tutorial. I’ll try to provide more details later.