How do I split a custom dataset into training and test datasets with SubsetRandomSampler?

Hi everyone, I’m trying to create the train, validation, and test datasets with SubsetRandomSampler, but unfortunately I don’t understand how it actually works. Can anyone please help me?


import os

import pandas as pd
import torch
import torchaudio
from torch.utils.data import DataLoader, Dataset


class SpeakerRecognitionDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the sounds.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        # Read the data of every speaker
        self.speaker_sound = pd.read_csv(csv_file, sep=' ')
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        # Number of all sounds
        return len(self.speaker_sound)

    def __getitem__(self, idx):
        #if torch.is_tensor(idx):
        #    idx = idx.tolist()
        sound_file_name = os.path.join(self.root_dir, self.speaker_sound.iloc[idx, 1])
        # Fail loudly instead of returning an undefined waveform
        if not os.path.isfile(sound_file_name):
            raise FileNotFoundError(sound_file_name)
        waveform, sample_rate = torchaudio.load(sound_file_name)
        if self.transform:
            waveform = self.transform(waveform)
        return waveform

speaker = SpeakerRecognitionDataset(csv_file, root_dir)

dataloader = DataLoader(speaker, batch_size=10,
                        shuffle=True, num_workers=1, pin_memory=True)

It seems you have already created the custom Dataset to load all data.
Now you could create the indices for all samples, e.g. using torch.arange(len(dataset)). You could then split these indices into training, validation, and test indices, and pass each subset to a SubsetRandomSampler. These samplers can then be passed to DataLoaders via the sampler argument (note that sampler and shuffle=True are mutually exclusive) to create the training, validation, and test loaders.
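A minimal sketch of these steps, using a dummy TensorDataset as a stand-in for your SpeakerRecognitionDataset; the 80/10/10 split ratio is just an arbitrary choice for illustration:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Dummy stand-in for the custom dataset (100 fake "waveforms")
dataset = TensorDataset(torch.randn(100, 16000))

# One index per sample, shuffled once up front
num_samples = len(dataset)
indices = torch.randperm(num_samples)

# Split the shuffled indices, e.g. 80% train, 10% val, 10% test
train_end = int(0.8 * num_samples)
val_end = int(0.9 * num_samples)
train_idx = indices[:train_end]
val_idx = indices[train_end:val_end]
test_idx = indices[val_end:]

# Each sampler only draws from its own subset of indices,
# so pass it via the sampler argument (and do NOT set shuffle=True)
train_loader = DataLoader(dataset, batch_size=10,
                          sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=10,
                        sampler=SubsetRandomSampler(val_idx))
test_loader = DataLoader(dataset, batch_size=10,
                         sampler=SubsetRandomSampler(test_idx))
```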


Hi, I guess that would work if I chose my audio files randomly. But in the photo, the ones with number 1 (pink color) should go to train, number 2 to val, and number 3 to test. Could you please suggest how I can create the indices in this case? Thank you in advance 🙂

Since your csv file already contains the split assignments, you could load it e.g. with pandas and create three separate lists of file paths (one list per split).
In that case you could directly create three separate Datasets from these file path lists, without needing samplers at all.
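A small sketch of that approach; the column names ("file", "split") and the example rows are assumptions standing in for your actual csv layout, where 1/2/3 mark train/val/test as in your screenshot:

```python
import pandas as pd

# Hypothetical annotation table: one row per sound file,
# with a "split" column holding 1 (train), 2 (val), or 3 (test)
df = pd.DataFrame({
    "file": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "split": [1, 1, 2, 3],
})

# One list of file paths per split
train_files = df.loc[df["split"] == 1, "file"].tolist()
val_files = df.loc[df["split"] == 2, "file"].tolist()
test_files = df.loc[df["split"] == 3, "file"].tolist()

# Each list could now be passed to its own Dataset instance, e.g.
# train_dataset = SpeakerRecognitionDataset-like class built from train_files
```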