I’m a complete beginner trying to do image classification. My image data and label data come from two parquet files, one for images and one for labels. I convert them to pandas dataframes (30000 rows x 900 columns for the images plus 30000 rows x 1 column for the labels, where each row represents a 30 x 30 picture). However, all the dataset examples I find use pictures stored in a tree/folder structure. How can I adapt them to my case? Has anyone done it this way?
You can create a custom Dataset with a __getitem__ method that reads from your pandas dataframe.
The example in this tutorial may be helpful: replace the part that reads from the file system with code that reads from your pandas dataframe instead.
You can then pass that custom Dataset to a DataLoader and begin your training.
Hi, thank you for replying. I have read the PyTorch tutorial already. However, I cannot figure out how to change the custom dataset class to suit my original data, as all the image and label data of different kinds are stored together in one giant dataframe.
Do you have additional context to provide? The basic idea is that when your Dataset receives an index, you want to read something from the pandas DataFrame and return a sample. For example:
    import numpy as np
    import pandas as pd
    from torch.utils.data import Dataset, DataLoader

    class CustomDataset(Dataset):
        def __init__(self, dataframe):
            self.dataframe = dataframe

        def __len__(self):
            # DataLoader needs to know how many samples there are
            return len(self.dataframe)

        def __getitem__(self, index):
            # Look up one row by position and split it into label and features
            row = self.dataframe.iloc[index].to_numpy()
            features = row[1:]
            label = row[0]
            return features, label

    df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                      columns=['label', 'feature_0', 'feature_1'])
    data = CustomDataset(dataframe=df)
    dataloader = DataLoader(data)
    for sample in dataloader:
        print(sample)
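For the data described in the question (one frame of 900 pixel columns and a separate one-column label frame), the same idea might look roughly like the sketch below. The names ParquetImageDataset, images_df and labels_df are made up for illustration, and random data stands in for the parquet files. It also assumes the pixels are numeric, the images are single-channel (hence the extra dimension of size 1), and the labels are integer class ids, none of which is confirmed in the thread.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class ParquetImageDataset(Dataset):
    """Hypothetical dataset: each row holds the 900 pixels of one 30 x 30 image."""

    def __init__(self, images_df, labels_df):
        # Convert once up front so __getitem__ is just a cheap tensor index.
        pixels = torch.tensor(images_df.to_numpy(), dtype=torch.float32)
        # Assumed layout: (N, channels=1, height=30, width=30)
        self.images = pixels.reshape(-1, 1, 30, 30)
        self.labels = torch.tensor(labels_df.to_numpy().ravel(), dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return self.images[index], self.labels[index]


# Random stand-in data; the real frames would come from pd.read_parquet(...).
rng = np.random.default_rng(0)
images_df = pd.DataFrame(rng.random((100, 900)))
labels_df = pd.DataFrame(rng.integers(0, 10, size=(100, 1)), columns=["label"])

dataset = ParquetImageDataset(images_df, labels_df)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch_images, batch_labels = next(iter(loader))
print(batch_images.shape, batch_labels.shape)
```

Each batch from the loader is a pair of tensors: images of shape (32, 1, 30, 30) and labels of shape (32,). Converting the whole frame to tensors in __init__ avoids a pandas lookup per sample, which should be fine for 30000 small images but may not suit data that does not fit in memory.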
Thank you for your code example, I find it easier to understand than the tutorial. I’ll try to provide more details later.