The best way to deal with that would be to create two Dataset classes if the datasets are differently structured, I'd say, and re-use a single Dataset class if the datasets are similarly structured (e.g., in the typical train & test case).
Say I have downloaded the CelebA dataset. I would first make a text file with the file paths of the training samples and labels and a text file with the test samples and labels:
a) "celeba_gender_attr_train.txt"
b) "celeba_gender_attr_test.txt"
A file would look like this:
```
ClassLabel
000001.jpg 0
000002.jpg 0
000003.jpg 1
...
```
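If it helps, here is a minimal sketch of how such files could be generated from CelebA's `list_attr_celeba.txt`; this assumes the usual layout of that file (a count line, a header line of attribute names, then one "filename attr1 attr2 ..." row per image with values in {-1, 1}), and the split point below is arbitrary:

```python
import pandas as pd

# skip the first line (image count); the header line has one fewer
# field than the data rows, so pandas uses the filenames as the index
df = pd.read_csv('list_attr_celeba.txt', sep=r'\s+', skiprows=1)

# keep only the gender attribute and map {-1, 1} -> {0, 1}
df = df[['Male']].rename(columns={'Male': 'ClassLabel'})
df['ClassLabel'] = (df['ClassLabel'] > 0).astype(int)

# arbitrary split for illustration; CelebA also ships an official
# partition file (list_eval_partition.txt) you could use instead
train_df, test_df = df.iloc[:160000], df.iloc[160000:]
train_df.to_csv('celeba_gender_attr_train.txt', sep=' ', index_label=False)
test_df.to_csv('celeba_gender_attr_test.txt', sep=' ', index_label=False)
```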
Then I would create a Dataset class where the "info" text file is an instantiation argument, e.g.:
```python
import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class CelebaDataset(Dataset):
    """Custom Dataset for loading CelebA face images"""

    def __init__(self, txt_path, img_dir, transform=None):
        # the "info" text file: image file names as index, labels as column
        df = pd.read_csv(txt_path, sep=" ", index_col=0)
        self.img_dir = img_dir
        self.txt_path = txt_path
        self.img_names = df.index.values
        self.y = df['ClassLabel'].values
        self.transform = transform

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.img_dir,
                                      self.img_names[index]))

        if self.transform is not None:
            img = self.transform(img)

        label = self.y[index]
        return img, label

    def __len__(self):
        return self.y.shape[0]
```
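Before wiring up the loaders, a quick sanity check of the class could look like this (using the train file from above, with no transform yet):

```python
dataset = CelebaDataset(txt_path='celeba_gender_attr_train.txt',
                        img_dir='img_align_celeba/')
img, label = dataset[0]
print(len(dataset), img.size, label)  # img is a PIL image here, so .size is (width, height)
```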
Then I would maybe add a custom transform:
```python
from torchvision import transforms

custom_transform = transforms.Compose([transforms.Grayscale(),
                                       transforms.ToTensor()])
```
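If your model expects a fixed input size or normalized inputs, the same `Compose` could be extended accordingly; the size and statistics below are just placeholders, not CelebA-specific values:

```python
custom_transform = transforms.Compose([transforms.Grayscale(),
                                       transforms.Resize((128, 128)),   # placeholder size
                                       transforms.ToTensor(),
                                       transforms.Normalize((0.5,), (0.5,))])  # placeholder stats
```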
And finally I would create two data loaders from the Dataset class:
```python
from torch.utils.data import DataLoader

train_dataset = CelebaDataset(txt_path='celeba_gender_attr_train.txt',
                              img_dir='img_align_celeba/',
                              transform=custom_transform)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=128,
                          shuffle=True,
                          num_workers=4)
```
and
```python
test_dataset = CelebaDataset(txt_path='celeba_gender_attr_test.txt',
                             img_dir='img_align_celeba/',
                             transform=custom_transform)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=128,
                         shuffle=False,  # typically no need to shuffle the test set
                         num_workers=4)
```
Then during training, you could do something like

```python
for epoch in range(num_epochs):

    for batch_idx, (features, targets) in enumerate(train_loader):
        # train model on the training dataset
        ...

    for batch_idx, (features, targets) in enumerate(test_loader):
        # evaluate model on the test dataset
        ...
```
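For completeness, here is a minimal sketch of what those inner loops might contain; the model, optimizer, and loss are hypothetical placeholders, and the input size assumes the plain Grayscale + ToTensor transform above applied to aligned CelebA images (178x218 pixels):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(218*178, 2))  # toy classifier for 1x218x178 inputs
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(num_epochs):

    model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):
        logits = model(features)
        loss = F.cross_entropy(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_idx, (features, targets) in enumerate(test_loader):
            logits = model(features)
            correct += (logits.argmax(dim=1) == targets).sum().item()
            total += targets.size(0)
    print(f'Epoch {epoch+1:03d} | test accuracy: {correct/total:.3f}')
```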
The train/test split is just one example; you could do the same thing for multiple training datasets, and so forth. E.g.,
```python
for epoch in range(num_epochs):

    for batch_idx, (features, targets) in enumerate(train_loader_1):
        # train model on training dataset #1
        ...

    for batch_idx, (features, targets) in enumerate(train_loader_2):
        # train model on training dataset #2
        ...

    for batch_idx, (features, targets) in enumerate(test_loader):
        # evaluate model on the test dataset
        ...
```
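The extra loaders would just be additional instances of the same pattern; e.g. (the second txt file name here is hypothetical):

```python
train_dataset_2 = CelebaDataset(txt_path='celeba_gender_attr_train_2.txt',  # hypothetical second label file
                                img_dir='img_align_celeba/',
                                transform=custom_transform)

train_loader_2 = DataLoader(dataset=train_dataset_2,
                            batch_size=128,
                            shuffle=True,
                            num_workers=4)
```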