SubsetRandomSampler vs sklearn.model_selection.train_test_split?

Hi, I am trying to do a train/test split for my dataset. I found that there is a built-in SubsetRandomSampler which seems to do the job. Does anyone know what the difference is between it and the sklearn.model_selection.train_test_split function provided in sklearn?

Thanks!

The SubsetRandomSampler samples randomly from a list of indices.
As it won't create these indices internally, you might want to use something like train_test_split to create the training, validation, and test indices and then pass them to SubsetRandomSampler, as in the sketch below.
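A minimal sketch of that combination (the dataset, batch size, and split ratios here are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler
from sklearn.model_selection import train_test_split

# dummy dataset of 100 samples with 3 features each
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# let sklearn create the index splits: 60% train, 20% validation, 20% test
indices = list(range(len(dataset)))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)
train_idx, valid_idx = train_test_split(train_idx, test_size=0.25, random_state=42)

# the samplers then draw randomly from their respective index lists
train_loader = DataLoader(dataset, batch_size=10,
                          sampler=SubsetRandomSampler(train_idx))
valid_loader = DataLoader(dataset, batch_size=10,
                          sampler=SubsetRandomSampler(valid_idx))
test_loader = DataLoader(dataset, batch_size=10,
                         sampler=SubsetRandomSampler(test_idx))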


Great, thank you!

I have two other questions:

  1. If I create the train/validation sets by creating two DataLoader objects outside the training loop, does that mean the train/validation sets are fixed during the whole training phase? Do I need to put them inside the training loop if I want different train/validation sets for different epochs?

  2. Does it automatically shuffle the training set when drawing the mini-batches in each epoch?

The DataLoaders will use the Dataset you are passing.
If you created these Datasets before your training loop, they won't change.
I'm not sure it's a good idea to re-create the train/validation split inside the training loop, as this would leak validation samples into the training set across epochs.
If you really want to do that, you could create a new sampler or dataset after each epoch, wrap it in a DataLoader, and start the next epoch.

The DataLoader will shuffle the data if you pass shuffle=True as an argument and don't use a custom Sampler. If you do pass a Sampler, the Sampler alone determines whether the data is shuffled or not.
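Here is a small sketch of both cases on a toy dataset (the names are made up for illustration); both loaders yield the samples in a fresh random order every epoch:

import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

ds = TensorDataset(torch.arange(10))

# shuffle=True makes the DataLoader use an internal RandomSampler
shuffled_loader = DataLoader(ds, batch_size=5, shuffle=True)
# with a custom sampler, shuffle has to stay False (the default)
sampler_loader = DataLoader(ds, batch_size=5,
                            sampler=SubsetRandomSampler(list(range(10))))

for epoch in range(2):
    print([batch[0].tolist() for batch in shuffled_loader])
    print([batch[0].tolist() for batch in sampler_loader])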

Thanks!

In the documentation (https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) it says:

  • sampler (Sampler, optional) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.

Does that mean there would be problems if I define my training/validation sets this way:

train_loader = DataLoader(dataset, batch_size=50,
                          sampler=SubsetRandomSampler(train_idx), shuffle=True)
valid_loader = DataLoader(dataset, batch_size=50,
                          sampler=SubsetRandomSampler(valid_idx), shuffle=True)

Why would it be an issue to shuffle the data when generating the batches if a SubsetRandomSampler is applied?

Since the sampler defines the sampling strategy, a shuffle performed by the DataLoader on top of it would make the sampler meaningless.
Think of a sampler you've created to draw a sequence of images in a particular order: if the DataLoader could shuffle on top of that, the sequence would be broken. That is why, once a sampler is passed, it alone determines how the samples are drawn and whether they are shuffled. In fact, recent PyTorch versions raise a ValueError if you pass both sampler and shuffle=True, since the two options are mutually exclusive.

In your example you don't need to specify shuffle=True, as SubsetRandomSampler already shuffles the data by drawing the subset indices in a random order.
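Applied to the snippet above, dropping the shuffle argument is all that's needed:

train_loader = DataLoader(dataset, batch_size=50,
                          sampler=SubsetRandomSampler(train_idx))
valid_loader = DataLoader(dataset, batch_size=50,
                          sampler=SubsetRandomSampler(valid_idx))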
