When should I do this based on the Dataset class? Should I do my processing step in __getitem__()? If yes, would it be parallel and fast?
Yes, the __getitem__ calls are the ones run in parallel (one per DataLoader worker). Here's an example:
import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class CelebaDataset(Dataset):
    """Custom Dataset for loading CelebA face images"""

    def __init__(self, txt_path, img_dir, transform=None):
        df = pd.read_csv(txt_path, sep=" ", index_col=0)
        self.img_dir = img_dir
        self.txt_path = txt_path
        self.img_names = df.index.values
        self.y = df['Male'].values
        self.transform = transform

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.img_dir,
                                      self.img_names[index]))
        if self.transform is not None:
            img = self.transform(img)
        label = self.y[index]
        return img, label

    def __len__(self):
        return self.y.shape[0]
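To see why this parallelizes well: with num_workers > 0, the DataLoader hands out indices to its workers, and each worker calls __getitem__ for its indices independently. Here's a stdlib-only sketch of that idea (it uses threads instead of torch's worker processes, and a hypothetical ToyDataset, just to keep it self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

class ToyDataset:
    """Hypothetical map-style dataset: 'loads' sample i by squaring it."""
    def __init__(self, n):
        self.n = n
    def __getitem__(self, index):
        return index * index   # stand-in for Image.open(...) + transform
    def __len__(self):
        return self.n

ds = ToyDataset(6)
# Each "worker" fetches samples by calling ds[i], just as DataLoader workers do.
with ThreadPoolExecutor(max_workers=4) as pool:
    samples = list(pool.map(lambda i: ds[i], range(len(ds))))
print(samples)  # [0, 1, 4, 9, 16, 25]
```

The takeaway is that any per-sample work you put inside __getitem__ (decoding, transforms) is what gets distributed across workers.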
For the transform argument you can use the torchvision.transforms utilities to "compose" a transform pipeline. E.g.,
from torchvision import transforms

custom_transform = transforms.Compose([
    transforms.Grayscale(),
    # transforms.Lambda(lambda x: x / 255.),  # scaling is already done by ToTensor()
    transforms.ToTensor()])
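transforms.Compose itself is tiny: it just applies the given callables in order. A minimal re-implementation to make that concrete (MyCompose is hypothetical, for illustration only):

```python
class MyCompose:
    """Minimal stand-in for torchvision.transforms.Compose:
    applies each transform to the input in sequence."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        for t in self.transforms:
            x = t(x)
        return x

pipeline = MyCompose([lambda x: x * 2, lambda x: x + 1])
print(pipeline(3))  # 7
```

Because the composed pipeline is just a callable, you can pass any function of one argument as a "transform".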
from torch.utils.data import DataLoader

train_dataset = CelebaDataset(txt_path='celeba_gender_attr_train.txt',
                              img_dir='img_align_celeba/',
                              transform=custom_transform)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=128,
                          shuffle=True,
                          num_workers=4)
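Conceptually, the loader above shuffles the indices each epoch, splits them into batches of 128, and fetches each batch via __getitem__ (spread across the 4 workers). Ignoring the parallelism and tensor collation, the batching logic looks roughly like this (simple_batches is a hypothetical helper for illustration, not part of the torch API):

```python
import random

def simple_batches(dataset, batch_size, shuffle=True, seed=0):
    """Sketch of one DataLoader epoch: shuffle indices, then yield
    lists of batch_size samples fetched via dataset[i]."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

data = list(range(10))            # any object with __len__ and __getitem__
batches = list(simple_batches(data, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2] -- the last batch is smaller
```

The real DataLoader additionally collates each list of (img, label) pairs into stacked tensors, and drop_last=True would discard that final short batch.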