My dataset folder is organized into a Train folder and a Test folder. When I run experiments, I further split the Train folder data into Train and Validation sets.
However, the transform is applied before my split, so it is the same for both Train and Validation. My question is: how can I apply a different transform to each in this case?
# obtain training indices that will be used for validation
num_train = len(train_data)
indices = list(range(num_train))
np.random.shuffle(indices)
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]
# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)
# prepare data loaders
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                           sampler=train_sampler, num_workers=num_workers)
valid_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                           sampler=valid_sampler, num_workers=num_workers)
Then, instead of applying the transformation when creating the ImageFolder dataset, you can apply it to each individual split using a helper class like this:
class MapDataset(torch.utils.data.Dataset):
    """
    Given a dataset, creates a dataset which applies a mapping function
    to its items (lazily, only when an item is called).

    Note that data is not cloned/copied from the initial dataset.
    """

    def __init__(self, dataset, map_fn):
        self.dataset = dataset
        self.map = map_fn

    def __getitem__(self, index):
        return self.map(self.dataset[index])

    def __len__(self):
        return len(self.dataset)
Note that here the mapping function is applied to the output of the dataset (which might include both the input and the target). You may want either to pass a mapping function that handles this or to modify the class to your needs.
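To illustrate that point, here is a hypothetical helper (make_pair_transform is my name, not from the thread) that wraps an image-only transform so it can be passed as map_fn: it unpacks the (image, label) tuple, transforms only the image, and leaves the label untouched. The toy check uses plain Python values standing in for an image and a label:

```python
def make_pair_transform(image_transform):
    """Wrap an image-only transform so it leaves the label untouched.

    MapDataset passes the whole (image, label) tuple to map_fn,
    so the returned function unpacks it first.
    """
    def pair_transform(sample):
        image, label = sample
        return image_transform(image), label
    return pair_transform

# Toy check: "image" is just an int here, "label" a string.
double = make_pair_transform(lambda img: img * 2)
assert double((3, "cat")) == (6, "cat")
```

With this wrapper, MapDataset(train_subset, make_pair_transform(train_transform)) applies the transform to the image while passing the label through unchanged.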
Should I still use ImageFolder to obtain the variable train_data?
For example, train_data = datasets.ImageFolder(base_path + '/train/') without a transform.
Hi gregunz, apologies for the late reply.
Your code is great, but __getitem__ needed a small change to access the images and labels in my case.
I have used your code together with the code here: Using ImageFolder, random_split with multiple transforms.
The resulting code works for me. Let me know if I did it correctly. You can refine this code if there are any mistakes, and then I will accept it as a solution.
class MapDataset(torch.utils.data.Dataset):
    """
    Given a dataset, creates a dataset which applies a mapping function
    to its items (lazily, only when an item is called).

    Note that data is not cloned/copied from the initial dataset.
    """

    def __init__(self, dataset, map_fn):
        self.dataset = dataset
        self.map = map_fn

    def __getitem__(self, index):
        if self.map:
            x = self.map(self.dataset[index][0])  # apply transform to the image only
        else:
            x = self.dataset[index][0]  # image
        y = self.dataset[index][1]      # label
        return x, y

    def __len__(self):
        return len(self.dataset)
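A minimal standalone check of this class's logic, using a plain list of (image, label) pairs as a stand-in for an ImageFolder dataset and str.upper as a stand-in transform (the torch.utils.data.Dataset base class is omitted here so the sketch runs without torch):

```python
class MapDataset:  # base class omitted for this standalone check
    def __init__(self, dataset, map_fn):
        self.dataset = dataset
        self.map = map_fn

    def __getitem__(self, index):
        if self.map:
            x = self.map(self.dataset[index][0])  # transform the image only
        else:
            x = self.dataset[index][0]  # image, untouched
        y = self.dataset[index][1]      # label
        return x, y

    def __len__(self):
        return len(self.dataset)

raw = [("img0", 0), ("img1", 1)]         # stand-in for an ImageFolder dataset
train_view = MapDataset(raw, str.upper)  # stand-in for a train transform
val_view = MapDataset(raw, None)         # no transform

assert train_view[0] == ("IMG0", 0)  # image transformed, label untouched
assert val_view[1] == ("img1", 1)    # passed through unchanged
assert len(train_view) == 2
```

Note that both views share the same underlying list, so no data is copied, matching the docstring's claim.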
The best solution is actually to load the dataset twice and then apply a different transform to each copy. I haven't tested the following code, but I hope you get the idea. The key point is that train_data and valid_data are exactly the same except for their data_transform.
train_data = datasets.ImageFolder(base_path + '/train/',
                                  transform=data_transform_train)
valid_data = datasets.ImageFolder(base_path + '/train/',
                                  transform=data_transform_val)
# obtain training indices that will be used for validation
num_train = len(train_data)
indices = list(range(num_train))
np.random.shuffle(indices)
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]
# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)
# prepare data loaders
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                           sampler=train_sampler, num_workers=num_workers)
valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=batch_size,
                                           sampler=valid_sampler, num_workers=num_workers)
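This works because both ImageFolder instances point at the same folder, so they index the same samples in the same order; the shuffled index split then guarantees train and validation never overlap even though two dataset objects are involved. A quick sanity check of just the split logic (using valid_size = 0.2 and 100 samples as example values):

```python
import numpy as np

# Example values; in the code above these come from your config.
valid_size = 0.2
num_train = 100

indices = list(range(num_train))
np.random.shuffle(indices)
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]

# The two index sets are disjoint and together cover every sample.
assert len(valid_idx) == 20 and len(train_idx) == 80
assert set(train_idx).isdisjoint(valid_idx)
assert set(train_idx) | set(valid_idx) == set(range(num_train))
```

One caveat with this approach: the shuffle is random, so if you rebuild the loaders between runs you should seed NumPy (or save the indices) to keep the validation split stable.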