Issues with torch.utils.data.random_split

If you wrap your Dataset in Subsets, you can pass the training and validation indices to them.
Each Subset can then be indexed in the range [0, len(subset)).

The indices used to create the Subsets should, of course, not overlap.
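For example, a minimal sketch (assuming dataset is an already created Dataset with ten samples):

from torch.utils.data import Subset

train_indices = [0, 1, 2, 3, 4, 5, 6, 7]  # non-overlapping indices into the original dataset
val_indices = [8, 9]

train_set = Subset(dataset, train_indices)  # len(train_set) == 8
val_set = Subset(dataset, val_indices)      # len(val_set) == 2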

I have a dataset folder (with .jpg and .xml files) for object detection and I want to split it into train and validation sets with respect to file names, like we would do by creating separate train and test folders. I also tried torch.utils.data.random_split, but it splits by xml object index, not by images. Is there any method to do this?

I assume each image file has a corresponding xml file with the annotations for the object detection task?
If so, I would recommend creating a custom dataset and building the mapping between the image paths and xml paths inside the Dataset.__init__ method.
Once you have this mapping, you can load each image and its annotation in the Dataset.__getitem__ method.
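A rough sketch of such a dataset (the flat folder layout with matching file stems, e.g. cat1.jpg / cat1.xml, and the parse_voc_xml helper are assumptions you would replace with your own logic):

import glob
import os

from PIL import Image
from torch.utils.data import Dataset


class DetectionDataset(Dataset):
    def __init__(self, root, transforms=None):
        # map each image to its annotation file via the shared file stem
        self.image_paths = sorted(glob.glob(os.path.join(root, "*.jpg")))
        self.xml_paths = [p.replace(".jpg", ".xml") for p in self.image_paths]
        self.transforms = transforms

    def __getitem__(self, index):
        image = Image.open(self.image_paths[index]).convert("RGB")
        target = parse_voc_xml(self.xml_paths[index])  # hypothetical parser for your xml format
        if self.transforms is not None:
            image = self.transforms(image)
        return image, target

    def __len__(self):
        return len(self.image_paths)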

To split the dataset into a training, validation, and test set, you could create all indices and shuffle them via torch.randperm(len(dataset)); alternatively, you could use e.g. sklearn.model_selection.train_test_split, which has a few more options.
These indices can then be passed to a Subset, which wraps the dataset to create the dataset splits, as sketched below.
Alternatively, you could use a SubsetRandomSampler for each split and pass these samplers to the DataLoader while creating the training, validation, and test dataloaders.
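In code, something along these lines (just a sketch, assuming dataset is your custom dataset and an 80/10/10 split):

import torch
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler

indices = torch.randperm(len(dataset)).tolist()
train_end = int(0.8 * len(dataset))
val_end = int(0.9 * len(dataset))
train_idx, val_idx, test_idx = indices[:train_end], indices[train_end:val_end], indices[val_end:]

# option 1: wrap the dataset in Subsets
train_loader = DataLoader(Subset(dataset, train_idx), batch_size=16, shuffle=True)
val_loader = DataLoader(Subset(dataset, val_idx), batch_size=16)
test_loader = DataLoader(Subset(dataset, test_idx), batch_size=16)

# option 2: keep the full dataset and pass the index-based samplers instead
train_loader = DataLoader(dataset, batch_size=16, sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=16, sampler=SubsetRandomSampler(val_idx))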

Let me know if that would work for you.

Could you please help me with some sample code?

Could you post one or two dummy inputs in their expected format, please?

Sure. Right now I am using this:

trainset = core.Dataset(dataset_path)
train_len = int(len(trainset) * 0.8)
test_len = len(trainset) - train_len
train_set = torch.utils.data.Subset(trainset, range(0, train_len))
val_set = torch.utils.data.Subset(trainset, range(train_len, len(trainset)))

But I want to shuffle the train and val sets with respect to file names, because if I shuffle them by indices then some indices of a common file might end up in both train and test.

For example: the file 'cat1.jpg' has 3 cats at indices 0, 1, and 2 in 'cat1.xml', so I don't want indices 0 and 1 in train and the third one in the test or validation set. I want all three indices of the same file either in the train or in the test set.

I assume you’ve already created the dataset and are able to load each sample?
If so, you could use sklearn.model_selection.GroupShuffleSplit, which takes an additional groups argument in its split method in order to create the training and test indices.
For the groups you could pass the file names.
Once you have the indices, you can pass them to the Subset.
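A minimal sketch of that idea (assuming file_names is a list holding, for each sample in the dataset, the image file it belongs to):

from sklearn.model_selection import GroupShuffleSplit
from torch.utils.data import Subset

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(gss.split(list(range(len(dataset))), groups=file_names))

train_set = Subset(dataset, train_idx)  # every file ends up entirely in one split
val_set = Subset(dataset, val_idx)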

It worked. Thanks a lot :blush:

I’ve created a script which is given here: How to split dataset into test and validation sets

Does it split each class in an 80:20 ratio, or does it just randomly split the whole dataset 80:20?

train_dataset, test_dataset = torch.utils.data.random_split(ants_dataset, (train_length, test_length))

It splits the data randomly. If you want to apply a stratified split, you could use sklearn.model_selection.train_test_split and provide the stratify argument to create the training and validation indices, which can then be used in a Subset or with a SubsetRandomSampler.
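A quick sketch of the stratified approach (assuming targets is a list or array with one class label per sample):

import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

indices = np.arange(len(dataset))
train_idx, val_idx = train_test_split(indices, test_size=0.2, stratify=targets, random_state=0)

train_set = Subset(dataset, train_idx)
val_set = Subset(dataset, val_idx)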

Should I use this? I also want to split the data by file names.

from sklearn.model_selection import GroupShuffleSplit

train_index, test_index = next(GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=15).split(trainset_df, groups=trainset_df.filename))

train_set = torch.utils.data.Subset(trainset, train_index)
val_set = torch.utils.data.Subset(trainset, test_index)

How can I apply augmentation after I split the dataset into train and validation, since augmentation must be done only on the training set? Or can I do random_split before converting the image data to tensors?

I have a doubt about creating splits of the dataset. Is there any way to split the dataset in such a way that the cross-validation data is a subset of the train split? How can I do this?

You could use e.g. sklearn.model_selection.KFold to create the split indices and, based on these, create Subsets to train the current fold.

Can you please share some sample code? I am new to PyTorch.

You could reuse the example given in the link I’ve posted and use the indices to create Subsets:

import numpy as np
from sklearn.model_selection import KFold
from torch.utils.data import Subset

dataset = MyDataset()
kf = KFold(n_splits=2)
idx = np.arange(len(dataset))
kf.get_n_splits(idx)

print(kf)

for train_index, test_index in kf.split(idx):
    print("TRAIN:", train_index, "TEST:", test_index)
    # wrap the dataset with the indices of the current fold
    train_dataset = Subset(dataset, train_index)
    test_dataset = Subset(dataset, test_index)

I’ve got a problem where I can’t use sklearn to stratify my torch dataset.

I have a custom PyTorch Dataset which I use to create a train_dataset. I have to split that custom-built dataset to use it in a DataLoader.
I tried this (from Stack Overflow) to split:

    train_size = int(train_size * len(dataset))
    val_size = len(dataset) - train_size
    train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
    train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate, shuffle=True)
    return train_loader, val_loader

which gives imbalanced classes in the training and validation sets, and I can’t use sklearn.model_selection.train_test_split on the dataset class. I build it as

dataset = utils.Vocabulary(x_train, y_train)

where Vocabulary is a PyTorch Dataset class:

from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset


class Vocabulary(Dataset):
    """Build custom dataset for dataloader"""
    def __init__(self, df_train, df_labels):
        self.labels = df_labels
        self.word2index, self.tokenizer = self.build_vectorizer(df_train)

        sequences = [self.convert_sequence(sequence, self.word2index, self.tokenizer)
                     for sequence in df_train]
        self.max_seq_len = max([len(seq) for seq in sequences])
        self.sequences = [self.pad_index(sequence, self.max_seq_len, self.word2index)
                          for sequence in sequences]
        
    def build_vectorizer(self, sequences_lists, stop_w='english', min_df=0):
        vectorizer = CountVectorizer(stop_words=stop_w, min_df=min_df)
        vectorizer.fit(sequences_lists)
        word2index = vectorizer.vocabulary_
        word2index['<PAD>'] = max(word2index.values()) + 1
        tokenizer = vectorizer.build_analyzer()
        return word2index, tokenizer
    
    def convert_sequence(self, sequence, word2index, tokenizer_func):
        """encode a sequence  to a list of indexes"""
        return [word2index[word] for word in tokenizer_func(sequence)
               if word in word2index]
    
    def pad_index(self,sequence, max_seq_len, word2index, pad_key='<PAD>'):
        """pads a sequence of indexes to max length """
        return sequence + (max_seq_len - len(sequence)) * [word2index[pad_key]]
        
    def __getitem__(self, i):
        assert len(self.sequences[i]) == self.max_seq_len
        return self.sequences[i], self.labels[i]
    
    def __len__(self):
        return len(self.sequences)
    

How can I solve this problem?
Thanks

Why can’t you use df_labels to create the split indices in train_test_split?
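E.g. something like this sketch, reusing the batch_size and collate from your code and assuming y_train holds one label per sequence:

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Subset

train_idx, val_idx = train_test_split(
    list(range(len(dataset))), test_size=0.2, stratify=y_train, random_state=0)

train_loader = DataLoader(Subset(dataset, train_idx), batch_size=batch_size,
                          collate_fn=collate, shuffle=True)
val_loader = DataLoader(Subset(dataset, val_idx), batch_size=batch_size,
                        collate_fn=collate)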

I thought that, because they are inside the Vocabulary class, I couldn’t use train_test_split in a way that could then be used in the DataLoader.
df_train contains the training sequences and df_labels the targets. While making the dataloader I have done this:

dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate, shuffle=True)
and I was thinking I couldn’t use train_test_split, so I used random_split instead, but I am worried that the validation set might somehow end up with imbalanced classes.

Thanks for helping :slight_smile: