Hi, I am trying to do a train-test split for my dataset. I found the built-in SubsetRandomSampler, which seems to do the job. Does anyone know how it differs from the sklearn.model_selection.train_test_split function provided in sklearn?
SubsetRandomSampler samples randomly from a list of indices.
As it won’t create these indices internally, you might want to use something like train_test_split to create the training, eval and test indices and then pass them to the SubsetRandomSampler.
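A minimal sketch of that workflow, assuming a toy TensorDataset and a 60/20/20 split (the dataset, the sizes and the variable names are made up for illustration):

```python
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy dataset: 100 samples with 8 features each (illustrative only).
dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))

# Split the index range rather than the data itself.
indices = list(range(len(dataset)))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)
train_idx, val_idx = train_test_split(train_idx, test_size=0.25, random_state=0)

# Each sampler draws only from its own indices, in random order.
train_loader = DataLoader(dataset, batch_size=10, sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=10, sampler=SubsetRandomSampler(val_idx))
test_loader = DataLoader(dataset, batch_size=10, sampler=SubsetRandomSampler(test_idx))
```

All three loaders wrap the same dataset object; only the index lists differ, so the split stays fixed as long as the lists are not recreated.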
Great, thank you!
I have two other questions:
If I create the train/validation sets by creating two DataLoader objects outside the training loop, does that mean the train/validation sets are fixed during training? Do I need to create them inside the training loop if I want different train/validation sets for different epochs?
Does the DataLoader automatically shuffle the training set when drawing mini-batches in each epoch?
DataLoaders will use the Dataset you are passing. If you created these Datasets before your training loop, they won’t change.
I’m not sure it’s a good idea to re-split the training and validation data inside the training loop, as this will yield a data leak: samples held out for validation in one epoch would be used for training in another.
If you really want to do that, you could create a new sampler or dataset after an epoch, wrap it in a DataLoader and start the next epoch.
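For completeness, here is a rough sketch of what re-splitting per epoch could look like. The make_loaders helper, the 20% validation fraction and the toy dataset are hypothetical, and the leak caveat still applies:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy dataset with 20 samples (illustrative only).
dataset = TensorDataset(torch.arange(20).float().unsqueeze(1))

def make_loaders(dataset, val_fraction=0.2, batch_size=5):
    """Hypothetical helper: draw a fresh random train/validation split."""
    perm = torch.randperm(len(dataset)).tolist()
    n_val = int(len(dataset) * val_fraction)
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    train_loader = DataLoader(dataset, batch_size=batch_size,
                              sampler=SubsetRandomSampler(train_idx))
    val_loader = DataLoader(dataset, batch_size=batch_size,
                            sampler=SubsetRandomSampler(val_idx))
    return train_loader, val_loader, train_idx, val_idx

for epoch in range(3):
    # New samplers and loaders per epoch, so the split changes every time.
    train_loader, val_loader, train_idx, val_idx = make_loaders(dataset)
    for (batch,) in train_loader:
        pass  # training step would go here
    for (batch,) in val_loader:
        pass  # validation step would go here
```

Because a model trained this way eventually sees every sample, the validation score is no longer an estimate on unseen data.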
DataLoader will shuffle the data if you pass shuffle=True as an argument and don’t use your own Sampler. If you do pass your own Sampler, it determines whether the data is shuffled or not.
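A small sketch of the shuffle argument without a custom sampler, assuming a toy dataset of ten integers. As far as I know, for a map-style dataset the DataLoader simply picks a RandomSampler or a SequentialSampler for you:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset

# Toy dataset holding the integers 0..9 (illustrative only).
dataset = TensorDataset(torch.arange(10))

ordered = DataLoader(dataset, batch_size=10, shuffle=False)
shuffled = DataLoader(dataset, batch_size=10, shuffle=True)

# shuffle=False keeps the dataset order.
ordered_batch = next(iter(ordered))[0].tolist()

# Each new iterator (one per epoch) draws a fresh permutation.
epoch_orders = [next(iter(shuffled))[0].tolist() for _ in range(2)]

print(ordered_batch)
print(epoch_orders)
print(type(ordered.sampler).__name__, type(shuffled.sampler).__name__)
```

The two lists in epoch_orders will usually differ, since the permutation is redrawn whenever the loader is iterated again.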
In the documentation (https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) it says:
sampler (Sampler, optional) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.
Does that mean there would be problems if I define my training/validation sets this way:
train_loader = DataLoader(dataset, batch_size=50,
valid_loader = DataLoader(dataset, batch_size=50,
Why would it be an issue to shuffle the data when generating batches if a SubsetRandomSampler is applied?
Since the sampler defines the sampling strategy, the shuffle argument would make the sampler meaningless.
Think about a sampler you’ve created to draw a sequence of images in a particular order. If the DataLoader could also shuffle, that sequence might be broken. That is why the sampler alone now defines how samples are drawn and whether they are shuffled.
In your example, you don’t need to specify shuffle=True, as SubsetRandomSampler automatically shuffles the data using the subset indices.
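To illustrate both points, here is a small sketch with an arbitrary index subset (the indices and the toy dataset are made up). It checks that the sampler only yields the subset indices, and that combining a sampler with shuffle=True is rejected:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy dataset holding the integers 0..9 (illustrative only).
dataset = TensorDataset(torch.arange(10))
train_idx = [0, 2, 4, 6, 8]  # arbitrary subset for illustration

sampler = SubsetRandomSampler(train_idx)

# One pass over the sampler: a random permutation of train_idx.
drawn = list(sampler)
print(drawn)

# Passing a sampler together with shuffle=True raises a ValueError.
try:
    DataLoader(dataset, batch_size=5, sampler=sampler, shuffle=True)
    conflict = False
except ValueError as err:
    conflict = True
    print(err)

# Without shuffle, the sampler alone controls the order.
loader = DataLoader(dataset, batch_size=5, sampler=sampler)
batch = next(iter(loader))[0].tolist()
```

So leaving shuffle at its default and letting the sampler handle the randomization should be enough here.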