I’m a complete beginner trying to do image classification. My image data and label data come from two parquet files, one for images and one for labels. I convert them to big pandas DataFrames (30000 rows x 900 columns for the images plus 30000 rows x 1 column for the labels, where each row represents a 30 x 30 picture). However, all the dataset examples I find use pictures stored in a tree/folder structure. How can I adapt them to my case? Has anyone done it this way?
Hi, thank you for replying. I have already read the PyTorch tutorial. However, I cannot figure out how to change the custom dataset class to suit my original data, as the image and label data of all the different classes are stored together in one giant DataFrame.
Do you have additional context to provide? The basic idea is that when your Dataset receives an index, you want to read something from the pandas DataFrame and return a sample. For example:
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe

    def __getitem__(self, index):
        # Fetch one row by position and split it into label and features.
        row = self.dataframe.iloc[index].to_numpy()
        features = row[1:]
        label = row[0]
        return features, label

    def __len__(self):
        return len(self.dataframe)


# Toy DataFrame: the first column is the label, the rest are features.
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['label', 'feature_0', 'feature_1'])
data = CustomDataset(dataframe=df)
dataloader = DataLoader(data)  # default batch_size=1

for sample in dataloader:
    print(sample)
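For the two-DataFrame case described in the question, the same idea could look roughly like the sketch below. It is only a sketch under some assumptions: the class name `ParquetImageDataset` is made up for illustration, the image and label rows are assumed to be in the same order, each 900-value row is assumed to be a flattened single-channel 30 x 30 image, and small random DataFrames stand in for the real parquet files (which you would load with `pd.read_parquet`).

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class ParquetImageDataset(Dataset):
    """Hypothetical dataset serving (image, label) pairs from two row-aligned DataFrames."""

    def __init__(self, images_df, labels_df, height=30, width=30):
        # Assumption: row i of images_df belongs to row i of labels_df.
        assert len(images_df) == len(labels_df), "row counts must match"
        # Convert once up front so __getitem__ is just a tensor index.
        pixels = images_df.to_numpy(dtype=np.float32)
        # Each flat row becomes a (1, height, width) single-channel image.
        self.images = torch.tensor(pixels).reshape(-1, 1, height, width)
        labels = labels_df.iloc[:, 0].to_numpy(dtype=np.int64)
        self.labels = torch.tensor(labels)

    def __getitem__(self, index):
        return self.images[index], self.labels[index]

    def __len__(self):
        return len(self.labels)


# Synthetic stand-in for the real data: 8 images of 900 pixels, 10 classes.
# With real files this would be pd.read_parquet(<your image/label paths>).
rng = np.random.default_rng(0)
images_df = pd.DataFrame(rng.integers(0, 256, size=(8, 900)))
labels_df = pd.DataFrame({"label": rng.integers(0, 10, size=8)})

dataset = ParquetImageDataset(images_df, labels_df)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

images, labels = next(iter(loader))
print(images.shape, labels.shape)
```

Each batch should then come out as an image tensor of shape (4, 1, 30, 30) and a label tensor of shape (4,), which is the layout a typical convolutional model expects. Converting the whole DataFrame to a tensor in `__init__` avoids a pandas lookup on every sample; if your pixel values are in the 0 to 255 range you may also want to scale them before training.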