Hi, I am trying to do a train/test split for my dataset. I found that there is a built-in SubsetRandomSampler which does the job. Does anyone know what the difference is between this and sklearn.model_selection.train_test_split from sklearn?
SubsetRandomSampler samples randomly from a list of indices.
Since it won't create these indices internally, you might want to use something like train_test_split to create the training, validation, and test indices and then pass them to SubsetRandomSampler.
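A minimal sketch of this combination, assuming a map-style dataset (the toy TensorDataset and all variable names here are just illustrative):

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset
from sklearn.model_selection import train_test_split

# Toy map-style dataset: 100 samples, 10 features each
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

# Let sklearn create the index split; SubsetRandomSampler then draws
# randomly from each index list.
train_idx, val_idx = train_test_split(
    list(range(len(dataset))), test_size=0.2, random_state=42
)

train_loader = DataLoader(dataset, batch_size=16,
                          sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=16,
                        sampler=SubsetRandomSampler(val_idx))
```

So train_test_split decides *which* samples belong to which split, while SubsetRandomSampler only decides the *order* in which its given indices are drawn each epoch.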
If I create the train/validation sets by creating two DataLoader objects outside the training loop, does that mean the train/validation sets are fixed during training? Do I need to put them inside the training loop if I want different train/validation sets for different epochs?
Also, does the DataLoader automatically shuffle the training set when drawing mini-batches in each epoch?
The DataLoaders will use the Dataset you are passing.
If you created these Datasets before your training loop, they won’t change.
I'm not sure it's a good idea to reshuffle the training and validation split inside the training loop, as this would leak validation samples into the training set across epochs.
If you really want to do that, you could create a new sampler or dataset after an epoch and wrap it in a DataLoader and start the next epoch.
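If you do go that route, a minimal sketch could look like the following (all names are illustrative, and note the data-leak caveat above: samples used for training in one epoch can end up in validation in the next):

```python
import random
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
num_epochs = 3

for epoch in range(num_epochs):
    # Re-split the indices at the start of every epoch
    indices = list(range(len(dataset)))
    random.shuffle(indices)
    train_idx, val_idx = indices[:80], indices[80:]

    # Wrap fresh samplers in new DataLoaders for this epoch
    train_loader = DataLoader(dataset, batch_size=16,
                              sampler=SubsetRandomSampler(train_idx))
    val_loader = DataLoader(dataset, batch_size=16,
                            sampler=SubsetRandomSampler(val_idx))
    # ... run training and validation for this epoch ...
```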
The DataLoader will shuffle the data if you pass shuffle=True and don't use a custom Sampler. If you do pass a sampler, the Sampler determines whether the data is shuffled or not.
Since the sampler defines the sampling strategy, the shuffle argument would make the sampler meaningless.
Think of a sampler you've created to draw a sequence of images in a particular order. If the DataLoader could additionally shuffle, that sequence would be broken. That is why the sampler alone defines how samples are drawn and whether they are shuffled.
In your example, you don't need to specify shuffle=True, since SubsetRandomSampler already draws the subset indices in random order.
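A short sketch of both points (the toy dataset here is illustrative): the sampler alone yields a random permutation of its indices, and the DataLoader rejects the combination of a sampler with shuffle=True.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

dataset = TensorDataset(torch.randn(10, 3), torch.zeros(10))
sampler = SubsetRandomSampler(list(range(len(dataset))))

# The sampler already yields its indices in random order,
# so shuffle=True is unnecessary here:
loader = DataLoader(dataset, sampler=sampler, batch_size=4)

# Passing both at once is rejected by the DataLoader with a ValueError.
try:
    DataLoader(dataset, sampler=sampler, shuffle=True)
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```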