If you wrap your Dataset into a Subset, you could pass the training and validation indices to it. Each Subset will accept indices in the range [0, len(subset) - 1]. The indices passed to create the Subsets should of course not overlap.
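For example, a minimal sketch (using a TensorDataset as a stand-in for your own Dataset):

import torch
from torch.utils.data import TensorDataset, Subset

dataset = TensorDataset(torch.randn(10, 2), torch.arange(10))  # stand-in for your Dataset
indices = torch.randperm(len(dataset)).tolist()                # shuffled, non-overlapping indices

train_set = Subset(dataset, indices[:8])  # 80% of the samples for training
val_set = Subset(dataset, indices[8:])    # remaining 20% for validation

print(len(train_set), len(val_set))  # 8 2
print(train_set[0])                  # valid indices are 0 ... len(train_set) - 1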
I have a dataset folder (with '.jpg' and '.xml' files) for object detection, and I want to split it into a training and a validation set with respect to the file names, the way we usually do by creating separate train and test folders. I also tried torch.utils.data.random_split, but it splits by xml object index rather than by image. Is there any method to do this?
I assume each image file has a corresponding xml file with the annotations for the object detection task?
If so, I would recommend creating a custom dataset and building the mapping between the image paths and the xml paths inside the Dataset.__init__ method. Once you have this mapping, you can load each image and its annotations in the Dataset.__getitem__ method.
To split the dataset into a training, validation, and test set, you could create all indices and shuffle them via torch.randperm(len(dataset)), or alternatively use e.g. sklearn.model_selection.train_test_split, which has a few more options. These indices can then be passed to a Subset, which wraps the dataset to create the dataset splits. Alternatively, you could use a SubsetRandomSampler and pass these samplers to the DataLoader while creating the training, validation, and test dataloaders.
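Something like this rough sketch might work (the TensorDataset here is just a stand-in for your custom detection Dataset):

import torch
from torch.utils.data import TensorDataset, Subset, DataLoader, SubsetRandomSampler

# stand-in for your custom Dataset mapping image paths to xml annotations
dataset = TensorDataset(torch.randn(100, 3, 64, 64), torch.randint(0, 2, (100,)))

# shuffle all indices and split them 80/10/10
indices = torch.randperm(len(dataset)).tolist()
train_end, val_end = int(0.8 * len(dataset)), int(0.9 * len(dataset))
train_idx, val_idx, test_idx = indices[:train_end], indices[train_end:val_end], indices[val_end:]

# option 1: wrap the dataset in Subsets
train_set = Subset(dataset, train_idx)
val_set = Subset(dataset, val_idx)
test_set = Subset(dataset, test_idx)

# option 2: keep the full dataset and pass SubsetRandomSamplers to the DataLoaders
train_loader = DataLoader(dataset, batch_size=8, sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=8, sampler=SubsetRandomSampler(val_idx))
test_loader = DataLoader(dataset, batch_size=8, sampler=SubsetRandomSampler(test_idx))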
Let me know if that would work for you.
Could you please help me with the sample code?
Could you post one or two dummy inputs in the right format, please?
Sure. Right now I am using this:
trainset = core.Dataset(dataset_path)
train_len = int(len(trainset) * 0.8)
test_len = len(trainset) - train_len
train_set = torch.utils.data.Subset(trainset, range(0, train_len))
val_set = torch.utils.data.Subset(trainset, range(train_len, len(trainset)))
But I want to shuffle the train and val sets with respect to the file names, because if I shuffle by index, some indices belonging to the same file might end up in both the train and the test set.
For example, the file 'cat1.jpg' has 3 cats at indices 0, 1, and 2 in 'cat1.xml'. I don't want indices 0 and 1 in the train set and index 2 in the test or validation set; I want all three indices of the same file either in the train set or in the test set.
I assume you've already created the dataset and are able to load each sample?
If so, you could use sklearn.model_selection.GroupShuffleSplit, which takes an additional groups argument in its split method in order to create the training and test indices. For the groups you could pass the file names. Once you have the indices, you can pass them to a Subset.
It worked. Thanks a lot
I’ve created a script which is given here: How to split dataset into test and validation sets
Does it split each class in an 80:20 ratio, or does it just randomly split the whole dataset 80:20?
train_dataset, test_dataset = torch.utils.data.random_split(ants_dataset, (train_length, test_length))
It splits the data randomly. If you want to apply a stratified split, you could use sklearn.model_selection.train_test_split and provide the stratify argument to create the training and validation indices, which can then be used in a Subset or SubsetRandomSampler.
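For example, a rough sketch of a stratified split (the random data and labels are just stand-ins for your dataset and targets):

import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, Subset

data = torch.randn(100, 2)
labels = torch.randint(0, 3, (100,))  # stand-in class labels
dataset = TensorDataset(data, labels)

# stratified train/val indices keep the class ratios in both splits
train_idx, val_idx = train_test_split(
    list(range(len(dataset))),
    test_size=0.2,
    stratify=labels.numpy(),
    random_state=42,
)

train_set = Subset(dataset, train_idx)
val_set = Subset(dataset, val_idx)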
Should I use this? I also want to split the data by file names.
from sklearn.model_selection import GroupShuffleSplit

train_index, test_index = next(
    GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=15)
    .split(trainset_df, groups=trainset_df.filename)
)
train_set = torch.utils.data.Subset(trainset, train_index)
val_set = torch.utils.data.Subset(trainset, test_index)
How can I do augmentation after I split the dataset into training and validation sets, since augmentation must be done only on the training set? Or can I do random_split before converting the image data to tensors?
I have a doubt about creating splits of the dataset. Is there any way to split the dataset in such a way that the cross-validation data is a subset of the train split? How can I do this?
You could use e.g. sklearn.model_selection.KFold to create the split indices and, based on these, create Subsets to train the current fold.
Can you please share some sample code? I am new to PyTorch.
You could reuse the example given in the link I've posted and use the indices to create Subsets:
import numpy as np
from sklearn.model_selection import KFold
from torch.utils.data import Subset

dataset = MyDataset()  # your custom Dataset
kf = KFold(n_splits=2)
idx = np.arange(len(dataset))

kf.get_n_splits(idx)
print(kf)

for train_index, test_index in kf.split(idx):
    print("TRAIN:", train_index, "TEST:", test_index)
    # create the Subsets for the current fold
    train_dataset = Subset(dataset, train_index)
    test_dataset = Subset(dataset, test_index)
I've got a problem where I can't use sklearn to stratify my torch dataset. I have a custom PyTorch Dataset which I use to create a train_dataset, and I have to split that custom dataset so I can use it in a DataLoader. I tried this approach from Stack Overflow to split it:
train_size = int(train_size * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate, shuffle=True)
return train_loader, val_loader
which gives imbalanced classes in the training and validation sets, and I can't use sklearn.model_selection.train_test_split on the dataset class. I build it as
dataset = utils.Vocabulary(x_train, y_train)
where Vocabulary is a PyTorch Dataset class:
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset


class Vocabulary(Dataset):
    """Build custom dataset for dataloader"""
    def __init__(self, df_train, df_labels):
        self.labels = df_labels
        self.word2index, self.tokenizer = self.build_vectorizer(df_train)
        sequences = [self.convert_sequence(sequence, self.word2index, self.tokenizer)
                     for sequence in df_train]
        self.max_seq_len = max([len(seq) for seq in sequences])
        self.sequences = [self.pad_index(sequence, self.max_seq_len, self.word2index)
                          for sequence in sequences]
        self.labels = df_labels

    def build_vectorizer(self, sequences_lists, stop_w='english', min_df=0):
        vectorizer = CountVectorizer(stop_words=stop_w, min_df=min_df)
        vectorizer.fit(sequences_lists)
        word2index = vectorizer.vocabulary_
        word2index['<PAD>'] = max(word2index.values()) + 1
        tokenizer = vectorizer.build_analyzer()
        return word2index, tokenizer

    def convert_sequence(self, sequence, word2index, tokenizer_func):
        """encode a sequence to a list of indexes"""
        return [word2index[word] for word in tokenizer_func(sequence)
                if word in word2index]

    def pad_index(self, sequence, max_seq_len, word2index, pad_key='<PAD>'):
        """pads a sequence of indexes to max length"""
        return sequence + (max_seq_len - len(sequence)) * [word2index[pad_key]]

    def __getitem__(self, i):
        assert len(self.sequences[i]) == self.max_seq_len
        return self.sequences[i], self.labels[i]

    def __len__(self):
        return len(self.sequences)
How can I solve this problem?
Thanks
Why can't you use df_labels to create the split indices in train_test_split?
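A rough sketch of what I mean, reusing your Vocabulary class from above (x_train, y_train, and batch_size are assumed to be defined as in your code, and you would keep your collate_fn when creating the DataLoaders):

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Subset

dataset = Vocabulary(x_train, y_train)

# stratified train/val indices based on the targets
train_idx, val_idx = train_test_split(
    list(range(len(dataset))),
    test_size=0.2,
    stratify=y_train,
    random_state=42,
)

train_loader = DataLoader(Subset(dataset, train_idx), batch_size=batch_size, shuffle=True)
val_loader = DataLoader(Subset(dataset, val_idx), batch_size=batch_size, shuffle=False)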
I thought that because they are in the Vocabulary class, I couldn't use train_test_split in a way that could still be used in the DataLoader.
df_train is the training sequences and df_labels are the targets. While making the dataloader I have done this:
dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate, shuffle=True)
and I was thinking I couldn't use train_test_split, so I used random_split instead, but I am worried the validation set might somehow end up with imbalanced classes.
Thanks for helping