I want to create a train+val split from my original train set. The directory is split into train and test. I load the original train set and want to split it into train and val sets so I can evaluate validation loss during training using the train_loader and val_loader.
I’ve gone through other answers on this forum and the following is what I’ve come up with. There isn’t much documentation about this that explains things clearly.
image_transforms simply resizes the images and converts them to tensors with ToTensor.
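For reference, this is roughly what image_transforms looks like; it's a sketch reconstructed from the printed dataset output below, so treat the exact arguments as assumptions:

from torchvision import transforms

# Sketch of the transform dict, inferred from the dataset printout below
image_transforms = {
    "train": transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
}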
hotdog_dataset = datasets.ImageFolder(root=root_dir + "train",
                                      transform=image_transforms["train"])
hotdog_dataset
######### OUTPUT ############
Dataset ImageFolder
Number of datapoints: 498
Root location: ../../../data/computer_vision/image_classification/hot-dog-not-hot-dog/train
StandardTransform
Transform: Compose(
Resize(size=(224, 224), interpolation=PIL.Image.BILINEAR)
ToTensor()
)
Now, I want to create train and val datasets out of this original train dataset. So, I get the length of this dataset as hotdog_dataset_size and create a list of indices called hotdog_dataset_indices. I then compute val_split_index and use it to build two lists containing the train and validation indices, train_idx and val_idx. I then pass these to SubsetRandomSampler to get train_sampler and val_sampler.
VAL_SPLIT_RATIO=0.2
hotdog_dataset_size = len(hotdog_dataset)
hotdog_dataset_indices = list(range(hotdog_dataset_size))
val_split_index = int(np.floor(VAL_SPLIT_RATIO * hotdog_dataset_size))
train_idx, val_idx = hotdog_dataset_indices[val_split_index:], hotdog_dataset_indices[:val_split_index]
train_sampler = SubsetRandomSampler(train_idx)
val_sampler = SubsetRandomSampler(val_idx)
print(train_sampler)
##### OUTPUT #####
<torch.utils.data.sampler.SubsetRandomSampler at 0x7fec33c72d68>
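One thing I wasn't sure about: since ImageFolder lists samples sorted by class, should the indices be shuffled before slicing so the val split isn't dominated by a single class? Something like this is just a guess on my part:

import numpy as np

# Possible extra step: shuffle the indices before slicing so the
# first VAL_SPLIT_RATIO portion isn't all from one class
np.random.seed(42)
np.random.shuffle(hotdog_dataset_indices)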
Q1. What are train_sampler and val_sampler? What does the output of SubsetRandomSampler contain? How do we use it?
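My assumption is that the sampler is just an iterable over the supplied indices in random order, so I'd expect something like this to work (haven't verified):

# Assumption: SubsetRandomSampler is an iterable over the given indices, shuffled
print(len(train_sampler))             # should equal len(train_idx)?
print(list(iter(train_sampler))[:5])  # first few shuffled indices?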
I then use Subset on the original dataset and pass train_sampler and val_sampler as the indices.
hotdog_dataset_train = Subset(dataset=hotdog_dataset, indices=train_sampler)
hotdog_dataset_val = Subset(dataset=hotdog_dataset, indices=val_sampler)
print(hotdog_dataset_train)
### OUTPUT ####
<torch.utils.data.dataset.Subset at 0x7fec2a6720f0>
Q2. Did I use Subset and SubsetRandomSampler correctly above? Or should I have passed train_sampler directly to a DataLoader using the sampler argument?
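Something like this is what I mean by the alternative (not sure if it's the right way; I believe shuffle has to be left off when a sampler is passed):

# Alternative I'm considering: pass the samplers straight to DataLoader
train_loader = DataLoader(dataset=hotdog_dataset, sampler=train_sampler, batch_size=8)
val_loader = DataLoader(dataset=hotdog_dataset, sampler=val_sampler, batch_size=8)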
After this I created dataloaders for train and val.
train_loader = DataLoader(dataset=hotdog_dataset_train, shuffle=False, batch_size=8)
val_loader = DataLoader(dataset=hotdog_dataset_val, shuffle=False, batch_size=8)
Q3. Do we use shuffle=True here?
Q4. Can we use a different batch_size for train and val?
When I try to get a single batch from the train loader, I get an error.
single_batch = next(iter(train_loader))
#### OUTPUT ####
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-19-f4d37beb80cc> in <module>
----> 1 single_batch = next(iter(train_loader))
~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
343
344 def __next__(self):
--> 345 data = self._next_data()
346 self._num_yielded += 1
347 if self._dataset_kind == _DatasetKind.Iterable and \
~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
383 def _next_data(self):
384 index = self._next_index() # may raise StopIteration
--> 385 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
386 if self._pin_memory:
387 data = _utils.pin_memory.pin_memory(data)
~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/dataset.py in __getitem__(self, idx)
255
256 def __getitem__(self, idx):
--> 257 return self.dataset[self.indices[idx]]
258
259 def __len__(self):
TypeError: 'SubsetRandomSampler' object does not support indexing
The overarching question is: what are the different ways to create a train-val split for image datasets, and which of these methods is the recommended way of doing things (especially for the case outlined above)?
It would be great if you could show it using code based on the above case.
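For example, I've also seen random_split mentioned in other answers; is something like this the preferred approach (just a sketch, not sure whether it's equivalent to the above)?

from torch.utils.data import random_split

# One alternative I've come across: random_split on the original dataset
val_size = int(VAL_SPLIT_RATIO * len(hotdog_dataset))
train_size = len(hotdog_dataset) - val_size
hotdog_dataset_train, hotdog_dataset_val = random_split(hotdog_dataset, [train_size, val_size])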