How to create a train-val split in a custom image dataset using Subset and SubsetRandomSampler?

I want to create a train+val split from my original train set. The data directory is split into train and test. I load the original train set and want to split it into train and val sets so that I can evaluate validation loss during training using train_loader and val_loader.

I’ve gone through other answers on this forum, and the following is what I’ve come up with. There isn’t much documentation that explains this clearly.

image_transforms simply resizes the images and converts them to tensors (ToTensor).
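(For completeness, image_transforms is assumed to be something along these lines, matching the transform shown in the output below:)

from torchvision import transforms

image_transforms = {
    "train": transforms.Compose([
        transforms.Resize((224, 224)),   # matches the Resize(size=(224, 224)) in the printed transform
        transforms.ToTensor(),
    ])
}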

hotdog_dataset = datasets.ImageFolder(root = root_dir + "train",
                                      transform = image_transforms["train"]
                                     )

hotdog_dataset


######### OUTPUT ############

Dataset ImageFolder
    Number of datapoints: 498
    Root location: ../../../data/computer_vision/image_classification/hot-dog-not-hot-dog/train
    StandardTransform
Transform: Compose(
               Resize(size=(224, 224), interpolation=PIL.Image.BILINEAR)
               ToTensor()
           )

Now I want to create train and val datasets out of this original train dataset. So I get the length of this dataset as hotdog_dataset_size and create a list of indices called hotdog_dataset_indices. I then compute val_split_index and use it to split the indices into two lists, train_idx and val_idx, containing the train and validation indices. I then pass these to SubsetRandomSampler to get train_sampler and val_sampler.

VAL_SPLIT_RATIO=0.2

hotdog_dataset_size = len(hotdog_dataset)
hotdog_dataset_indices = list(range(hotdog_dataset_size))

val_split_index = int(np.floor(VAL_SPLIT_RATIO * hotdog_dataset_size))

train_idx, val_idx = hotdog_dataset_indices[val_split_index:], hotdog_dataset_indices[:val_split_index]

train_sampler = SubsetRandomSampler(train_idx)
val_sampler = SubsetRandomSampler(val_idx)

print(train_sampler)


##### OUTPUT #####

<torch.utils.data.sampler.SubsetRandomSampler at 0x7fec33c72d68>

Q1. What are train_sampler and val_sampler? What does the output of SubsetRandomSampler contain? How do we use it?

I then use Subset on the original dataset and pass train_sampler and val_sampler.

hotdog_dataset_train = Subset(dataset=hotdog_dataset, indices=train_sampler)
hotdog_dataset_val = Subset(dataset=hotdog_dataset, indices=val_sampler)

print(hotdog_dataset_train)

### OUTPUT ####
<torch.utils.data.dataset.Subset at 0x7fec2a6720f0>

Q2. Did I use Subset and SubsetRandomSampler above correctly? Or should I have directly passed the train_sampler to a dataloader using the sampler argument?

After this I created dataloaders for train and val.

train_loader = DataLoader(dataset=hotdog_dataset_train, shuffle=False, batch_size=8)
val_loader = DataLoader(dataset=hotdog_dataset_val, shuffle=False, batch_size=8)

Q3. Do we use shuffle=True here?
Q4. Can we use different batch_size for train and val?

When I try to get a single batch from the train loader, I get an error.

single_batch = next(iter(train_loader))


#### OUTPUT ####
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-f4d37beb80cc> in <module>
----> 1 single_batch = next(iter(train_loader))

~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    343 
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

~/miniconda3/envs/toothless/lib/python3.6/site-packages/torch/utils/data/dataset.py in __getitem__(self, idx)
    255 
    256     def __getitem__(self, idx):
--> 257         return self.dataset[self.indices[idx]]
    258 
    259     def __len__(self):

TypeError: 'SubsetRandomSampler' object does not support indexing

The overarching question is: what are the different ways to create a train-val split for image datasets, and which of these methods is the recommended one (especially for the case outlined above)?

It would be great if you could show it using code based on the above case. :slight_smile:

SubsetRandomSampler already shuffles the indices it is given, so you can't also pass shuffle=True to a DataLoader that uses it. The TypeError itself comes from passing the samplers to Subset: Subset expects plain index lists, not samplers, so pass train_idx and val_idx instead.

hotdog_dataset_train = Subset(dataset=hotdog_dataset, indices=train_idx)
hotdog_dataset_val = Subset(dataset=hotdog_dataset, indices=val_idx)

train_loader = DataLoader(dataset=hotdog_dataset_train, batch_size=8)
val_loader = DataLoader(dataset=hotdog_dataset_val, batch_size=8)

Should work.
Or you can skip Subset entirely, remove the previous declarations of hotdog_dataset_train and hotdog_dataset_val, and pass the samplers directly to the DataLoaders:

train_loader = DataLoader(dataset=hotdog_dataset, batch_size=8, sampler=train_sampler)
val_loader = DataLoader(dataset=hotdog_dataset, batch_size=8, sampler=val_sampler)
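A side note on the split itself: the index list above was never shuffled before slicing, and ImageFolder lists its samples class by class, so a contiguous 80/20 slice can push whole classes into the validation set. Shuffling the indices first (as done in the next post) avoids that, and torch.utils.data.random_split is another common way to get a randomized split. A minimal sketch, assuming the same hotdog_dataset as above:

from torch.utils.data import DataLoader, random_split

val_size = int(0.2 * len(hotdog_dataset))        # 20% of the samples for validation
train_size = len(hotdog_dataset) - val_size

# random_split shuffles the indices internally before splitting
hotdog_dataset_train, hotdog_dataset_val = random_split(hotdog_dataset, [train_size, val_size])

train_loader = DataLoader(dataset=hotdog_dataset_train, shuffle=True, batch_size=8)
val_loader = DataLoader(dataset=hotdog_dataset_val, shuffle=False, batch_size=8)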

The train_loader and val_loader dataset lengths are the same as the train_data length, but the train_sampler and val_sampler lengths are different. Is that possible?

    TRAIN_DATA_PATH = "./Dataset/Train"
    TEST_DATA_PATH = "./Dataset/Test"
    VALID_DATA_PATH = "./Dataset/Valid"
    BATCH_SIZE = 32

    TRANSFORM_IMG = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    train_data = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH,
                                                  transform=TRANSFORM_IMG)
    
    train_loader_all = data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True,  num_workers=4)

  
    validation_split = .3
    shuffle_dataset = True
    random_seed= 42

    # Creating data indices for training and validation splits:  
    dataset_size = len(train_loader_all.dataset)
    indices = list(range(dataset_size))
    split = int(np.floor(validation_split * dataset_size))
    if shuffle_dataset:
        np.random.seed(random_seed)
        np.random.shuffle(indices)
    train_indices, val_indices = indices[split:], indices[:split]

    # Creating PT data samplers and loaders:
    train_sampler = SubsetRandomSampler(train_indices)
    valid_sampler = SubsetRandomSampler(val_indices)
    print("leng:",len(train_sampler))
    print("leng:",len(valid_sampler))

    train_loader = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE,sampler=train_sampler)
    val_loader = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE,sampler=valid_sampler)
    
    print("val loader:",len(val_loader.dataset))
    print("train loader",len(train_loader.dataset))

This is the output:

leng: 25970
leng: 11130
val loader: 37100
train loader 37100
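For what it's worth, len(loader.dataset) always reports the full dataset object passed to the DataLoader; the sampler only controls which indices are actually drawn, so the split sizes show up in the sampler and in the number of batches, not in .dataset. A minimal sketch of checks that reflect the actual split sizes, assuming the loaders defined above:

# .dataset is the full ImageFolder passed in, regardless of the sampler
print(len(train_loader.dataset), len(val_loader.dataset))   # 37100 37100

# the samplers hold the indices for each split
print(len(train_loader.sampler), len(val_loader.sampler))   # 25970 11130

# batches per epoch follow the sampler length, not the full dataset
print(len(train_loader), len(val_loader))                   # ceil(25970/32), ceil(11130/32)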