Issues with torch.utils.data.random_split

torch==1.3.1
torchvision==0.4.2

Hi @ptrblck, the issue is resolved. It is working for me also. I was finding the length using:

print(len(train_set.dataset)) 

which gives the length of the parent dataset. I want to convert the Subset object into a Dataset object.
Is there a way to do this conversion?

Subset wraps the Dataset in order to apply the specified indices and yield a subset of the samples.
What is your use case for reverting it?
You can pass the Subset directly to a DataLoader, if that’s the issue.
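
A minimal sketch of the difference, using a TensorDataset as a stand-in for the real dataset:

import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# stand-in dataset with 100 samples
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 10, (100,)))
train_set, val_set = random_split(dataset, [80, 20])

print(len(train_set))          # 80  -> length of the Subset
print(len(train_set.dataset))  # 100 -> length of the parent dataset

# a Subset can be passed to a DataLoader directly
loader = DataLoader(train_set, batch_size=16, shuffle=True)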

I want to fetch the entire dataset as x and y. The Dataset class has .data and .targets attributes, which serve this purpose.
A DataLoader with the batch size set to len(dataset) will be used.

The DataLoader will use the length of the Subset, not of the underlying Dataset.
If you want to fetch the underlying data to process it, or for some other use case, I would recommend doing so before splitting.
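
A minimal sketch of that order of operations, assuming an MNIST-style dataset that exposes .data and .targets:

from torch.utils.data import random_split

# fetch the raw tensors from the parent dataset *before* splitting
# note: indexing .data directly bypasses any transforms on the dataset
x, y = dataset.data, dataset.targets

train_set, val_set = random_split(dataset, [50_000, 10_000])

# if needed afterwards, each Subset still stores the selected indices
train_x = dataset.data[train_set.indices]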

Could you explain your use case a bit more, so that I understand why you need to access the internal data after splitting?

@ptrblck
My use case is to first divide the dataset into two different subsets. Each subset should then have a __getitem__ that returns a pair of samples belonging to the same class, so that a batch of 4 means a total of 8 samples, paired by class.

Example: from the MNIST dataset, a batch would be (1, 1), (2, 2), (7, 7), and (9, 9).

Your post on torch.utils.data.dataset.random_split resolves the issue of dividing the dataset into two subsets and using __getitem__ on the individual subsets. But can you help with a workaround for using the index in __getitem__ to return pairs from the same class?

Thanks.
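
One possible implementation of such paired sampling (a sketch, not from the thread; the PairedDataset name and the random choice of partner are illustrative assumptions):

import random
from collections import defaultdict
from torch.utils.data import Dataset

class PairedDataset(Dataset):
    def __init__(self, subset):
        self.subset = subset
        # map each class label to the positions in the subset holding that class
        self.positions_per_label = defaultdict(list)
        for pos in range(len(subset)):
            _, label = subset[pos]
            self.positions_per_label[int(label)].append(pos)

    def __getitem__(self, index):
        x1, label = self.subset[index]
        # draw a random partner from the same class, so a batch of 4
        # indices yields 4 pairs, i.e. 8 samples in total
        partner = random.choice(self.positions_per_label[int(label)])
        x2, _ = self.subset[partner]
        return (x1, x2), label

    def __len__(self):
        return len(self.subset)

Note that the __init__ scan loads every sample once to read its label; reading labels directly from the underlying dataset would be faster, but is dataset-specific.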

@ptrblck I had the same issue as Ashima. It seems I was checking len(dataloader.dataset). However, the dimensions still don’t look right. I am trying to split 200k rows into 160k for training and 40k for validation.

I am not sure why I see 40k and 10k.

Hi, I would like to split my dataset into a training and a validation part, where the indices of each subset should be in the range 0 to len(train_data) or 0 to len(validation_data), respectively. Is there a method I can use for this?

If you wrap your Dataset into a Subset, you can pass the training and validation indices to it.
Each Subset will then accept indices in the range [0, len(subset) - 1].

The indices passed to create the Subsets should of course not overlap.
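
A sketch of one way to build such non-overlapping Subsets, assuming an 80/20 split:

import torch
from torch.utils.data import Subset

# shuffle all indices once, then carve out disjoint ranges
indices = torch.randperm(len(dataset)).tolist()
split = int(0.8 * len(dataset))
train_set = Subset(dataset, indices[:split])
val_set = Subset(dataset, indices[split:])

# each Subset is indexed from 0 to len(subset) - 1
first_val_sample = val_set[0]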

I have a dataset folder (with .jpg and .xml files) for object detection, and I want to split it into training and validation sets with respect to file names, like we would do by creating separate train and test folders. I also tried torch.utils.data.random_split, but it splits by XML object index, not by image. Is there a method to do this?

I assume each image file has a corresponding xml file with the annotations for the object detection task?
If so, I would recommend creating a custom dataset and building the mapping between the image paths and xml paths inside the Dataset.__init__ method.
Once you have this mapping, you can load each image and its annotation in the Dataset.__getitem__ method.

To split the dataset into a training, validation, and test set, you can create all indices and shuffle them via torch.randperm(len(dataset)), or alternatively use e.g. sklearn.model_selection.train_test_split, which has a few more options.
These indices can then be passed to a Subset, which wraps the dataset to create the splits.
Alternatively, you can use a SubsetRandomSampler and pass one sampler each to the DataLoader while creating the training, validation, and test dataloaders.
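
A rough sketch of such a dataset, assuming every image sits next to an xml file with the same stem; parse_annotation is a hypothetical helper for your annotation format:

import glob
import os
from PIL import Image
from torch.utils.data import Dataset

class DetectionDataset(Dataset):
    def __init__(self, root):
        # build the mapping between image paths and xml paths once
        self.image_paths = sorted(glob.glob(os.path.join(root, "*.jpg")))
        self.xml_paths = [p[:-4] + ".xml" for p in self.image_paths]

    def __getitem__(self, index):
        image = Image.open(self.image_paths[index]).convert("RGB")
        target = parse_annotation(self.xml_paths[index])  # hypothetical xml parser
        return image, target

    def __len__(self):
        return len(self.image_paths)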

Let me know, if that would work for you.

Could you please help me with the sample code?

Could you post one or two dummy inputs with their right format, please?

Sure. Right now I am using this:

trainset = core.Dataset(dataset_path)
train_len = int(len(trainset) * 0.8)
test_len = len(trainset) - train_len
train_set = torch.utils.data.Subset(trainset, range(0, train_len))
val_set = torch.utils.data.Subset(trainset, range(train_len, len(trainset)))

But I want to shuffle the train and val sets with respect to file names, because if I shuffle by raw indices, some indices belonging to a common file might end up in both train and test.

For example: the file ‘cat1.jpg’ has 3 cats at indices 0, 1, and 2 in ‘cat1.xml’. I don’t want indices 0 and 1 in train and index 2 in test or validation; I want all three indices of the same file in either the train set or the test set.

I assume you’ve already created the dataset and are able to load each sample?
If so, you could use sklearn.model_selection.GroupShuffleSplit, which takes an additional groups argument in its split method to create the training and test indices.
For the groups you could pass the file name of each sample.
Once you have the indices, you can pass them to Subsets.
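
A sketch of that idea, assuming the dataset exposes one source file name per sample (the dataset.filenames attribute below is an assumption):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from torch.utils.data import Subset

# one group label per sample: the file the sample came from (assumed attribute)
filenames = dataset.filenames

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(np.arange(len(filenames)), groups=filenames))

train_set = Subset(dataset, train_idx.tolist())
val_set = Subset(dataset, val_idx.tolist())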

It worked. Thanks a lot :blush:

I’ve created a script which is given here: How to split dataset into test and validation sets

Does it split each class in an 80:20 ratio, or just randomly split the whole dataset 80:20?

train_dataset, test_dataset = torch.utils.data.random_split(ants_dataset, (train_length, test_length))

It splits the data randomly. If you want to apply a stratified split, you could use sklearn.model_selection.train_test_split with the stratify argument to create the training and validation indices, which can then be used in a Subset or a SubsetRandomSampler.
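
A sketch of a stratified split, assuming the class labels are available as a list aligned with the dataset (the labels name is an assumption; for MNIST-style datasets it could come from dataset.targets):

from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

indices = list(range(len(dataset)))
train_idx, val_idx = train_test_split(
    indices,
    test_size=0.2,
    stratify=labels,  # keeps the class ratio in both splits
)
train_set = Subset(dataset, train_idx)
val_set = Subset(dataset, val_idx)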

Should I use this? I want to split the data by file names as well.

from sklearn.model_selection import GroupShuffleSplit

train_index, test_index = next(
    GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=15)
    .split(trainset_df, groups=trainset_df.filename)
)

train_set = torch.utils.data.Subset(trainset, train_index)
val_set = torch.utils.data.Subset(trainset, test_index)
