torch.utils.data.dataset.random_split

Hi,

torch.utils.data.dataset.random_split returns a Subset object which has no transforms attribute. How can I split a Dataset object and return another Dataset object with the same transforms attribute?

Thanks

I’m not sure that would be easily achievable without writing your own derived class.
You could split your Dataset as the first step and pass the Subset to your main Dataset with all transformations.
Would that work for you?

How can I pass Subset to Dataset?

Here is a small example:

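import torch
from torch.utils.data import Dataset, TensorDataset, random_split
from torchvision import transforms


class MyDataset(Dataset):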
    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform
        
    def __getitem__(self, index):
        x, y = self.subset[index]
        if self.transform:
            x = self.transform(x)
        return x, y
        
    def __len__(self):
        return len(self.subset)

    
init_dataset = TensorDataset(
    torch.randn(100, 3, 24, 24),
    torch.randint(0, 10, (100,))
)

lengths = [int(len(init_dataset)*0.8), int(len(init_dataset)*0.2)]
subsetA, subsetB = random_split(init_dataset, lengths)
datasetA = MyDataset(
    subsetA, transform=transforms.Normalize((0., 0., 0.), (0.5, 0.5, 0.5))
)
datasetB = MyDataset(
    subsetB, transform=transforms.Normalize((0., 0., 0.), (0.5, 0.5, 0.5))
)

Let me know if that works for you.

Thank you ptrblck. I will try this.
But I would like to ask whether it is a good idea to manually split my dataset and then create two separate Dataset objects from two different image folders. Is this good practice?

Would you like to create separate folders for both splits?
In that case you should manually split the indices and move/copy the files to the corresponding folders.
random_split returns splits from a single Dataset.

It’s usually a good idea to split the data into different folders. However, in that case you won’t need random_split, but just two separate Datasets.
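
If you do go the separate-folders route, the manual split could be sketched roughly as below. This assumes a layout of one subfolder per class under a source directory. The function name split_into_folders, the train/val folder names, and the val_fraction and seed parameters are made up for illustration and are not part of torch or torchvision.

```python
import random
import shutil
from pathlib import Path


def split_into_folders(src_dir, dst_dir, val_fraction=0.2, seed=0):
    """Copy src_dir/<class>/* into dst_dir/train/<class>/ and dst_dir/val/<class>/.

    Hypothetical helper: the files of each class are shuffled separately,
    so both splits keep roughly the original class proportions.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = {"train": 0, "val": 0}
    for class_dir in sorted(p for p in src_dir.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_val = round(len(files) * val_fraction)
        for i, f in enumerate(files):
            split = "val" if i < n_val else "train"
            target = dst_dir / split / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target / f.name)  # copy, so the source stays intact
            counts[split] += 1
    return counts
```

Each of the resulting train and val folders can then be loaded as its own Dataset, for example with torchvision.datasets.ImageFolder, and random_split is no longer needed.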

Hi ptrblck,

Sorry, I have a question. I passed balanced data (4000 positive and 4000 negative samples) as DatasetTrain to random_split, with train_len set to 70% and valid_len to 30%:

TrainData1, ValidationData1 = random_split(DatasetTrain,[train_len, valid_len])

Can TrainData1 and ValidationData1 be unbalanced after that?
I think they can be, since the split is random.

What I need after the 70%/30% split is a training set and a validation set that are both still balanced.

The more samples you use, the lower the likelihood of creating an imbalance.
However, if you need strictly the same distribution, I would recommend creating the training and testing indices with sklearn.model_selection.train_test_split and providing the stratify argument.

Sorry, do you mean that if I use this function on my DatasetTrain, which has 4000 positive and 4000 negative samples, it will give me a balanced 70% training set and a balanced 30% validation set?

train_test_split( DatasetTrain,test_size=.3,train_size=.7, stratify??)

I am using:

DatasetTrain = CMBDataClassifier(root_dirTrain, root_dirTest, split='train', transforms=transform, debug=False, CounterIteration=Iteration, SubID=0, TPID=0)

train_len = int(0.7*len(DatasetTrain))
valid_len = len(DatasetTrain) - train_len

TrainData1, ValidationData1 = random_split(DatasetTrain,[train_len, valid_len])

trainloader = torch.utils.data.DataLoader(TrainData1, batch_size=32, shuffle=True, drop_last=True, num_workers=0)

validationloader = torch.utils.data.DataLoader(ValidationData1, batch_size=6, drop_last=True, num_workers=0)

My concern is that TrainData1 and ValidationData1 may be unbalanced with respect to the positive and negative classes.

If you want to make sure both splits are balanced, you could get the target tensor, create the split indices using train_test_split, and pass the target array to stratify.
The returned train and test indices can then be used in Subset to create the datasets.
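
A minimal sketch of that approach, assuming the targets are already available as a 1-D array. The TensorDataset here is only a stand-in for the real dataset, and the feature size, random_state, and variable names are arbitrary choices for illustration.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

# Stand-in for the real dataset: 4000 negative and 4000 positive samples.
targets = np.array([0] * 4000 + [1] * 4000)
dataset = TensorDataset(torch.randn(len(targets), 8), torch.from_numpy(targets))

# Split the indices (not the data itself) and stratify on the targets,
# so both splits keep the 50/50 class ratio of the full dataset.
train_idx, valid_idx = train_test_split(
    np.arange(len(targets)),
    test_size=0.3,
    stratify=targets,
    random_state=0,
)

# Wrap the index lists in Subset to get the two datasets.
train_data = Subset(dataset, train_idx.tolist())
valid_data = Subset(dataset, valid_idx.tolist())

# Per-class counts in each split.
print(np.bincount(targets[train_idx]))
print(np.bincount(targets[valid_idx]))
```

train_data and valid_data can be passed to DataLoader in the same way as the outputs of random_split.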

Sorry, how can I get the target tensors? I am using this class to load the data:

import os
from os.path import dirname, join as pjoin

import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.integrate as integrate
import scipy.io as sio
from scipy import io
from torch.autograd import Variable

if not os.path.isdir(root_dirDurringTraining11):
   os.mkdir(root_dirDurringTraining11)
   
class CMBDataClassifier():

   def __init__(self, root_dirTrain,root_dirTest,split,transforms,debug,CounterIteration,SubID,TPID):
       self.patches, self.labels = None, None
       if split=='train':
           patch_path = os.path.join(root_dirTrain,"Patches"+str(CounterIteration)+".mat")
           label_path = os.path.join(root_dirTrain,"Labels"+str(CounterIteration)+".mat")
           self.patches = scipy.io.loadmat(patch_path)['TrainPatchFinal_4']
           self.labels = scipy.io.loadmat(label_path)['TargetTrainFinal_4']

       else:
           patch_path = os.path.join(root_dirTest,"Patches"+".mat")
           label_path = os.path.join(root_dirTest,"Labels"+".mat")
           self.patches = scipy.io.loadmat(patch_path)['PatchTestFinal']
           
           self.labels = scipy.io.loadmat(label_path)['TargetTestFinal']

       self.split = split
       self.transforms = transforms
       self.debug = debug
       
   def __getitem__(self, index):

       patchF = self.patches[:, :, : , index]

       labelF = self.labels[index].astype(np.float32)
   
       if self.transforms is not None:
           patchF = self.transforms(patchF)

       return patchF,labelF


   def __len__(self):
       return self.patches.shape[-1]

You could iterate the dataset once and store all targets in e.g. a list or alternatively just load the targets using your current logic outside of the Dataset.
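
Both options could look roughly like this. ToyDataset is a hypothetical stand-in for the class above, and it assumes the labels are stored as an (N, 1) array indexed along the first axis, which may not match the actual .mat files.

```python
import numpy as np
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in that returns (patch, label) pairs."""

    def __init__(self):
        # 10 random patches stacked along the last axis.
        self.patches = np.random.rand(4, 4, 3, 10).astype(np.float32)
        # Assumed label layout: shape (10, 1), alternating classes 0 and 1.
        self.labels = np.array([[0], [1]] * 5)

    def __getitem__(self, index):
        patch = self.patches[:, :, :, index]
        label = self.labels[index].astype(np.float32)
        return patch, label

    def __len__(self):
        return self.patches.shape[-1]


dataset = ToyDataset()

# Option 1: iterate the dataset once and keep only the label of each sample.
targets = np.array([int(dataset[i][1].item()) for i in range(len(dataset))])

# Option 2: read the label array directly, skipping patch loading and transforms.
targets_direct = dataset.labels.reshape(-1).astype(int)
```

Either array can then be passed as the stratify argument. Option 2 is cheaper when the transforms are expensive, since no patch is ever loaded.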

I tried that, but it gives me nothing, no labels:


    def __init__(self, root_dirTrain,root_dirTest,split,transforms,debug,CounterIteration,SubID,TPID):
        self.patches, self.labels = None, None
        if split=='train':


            label_path = os.path.join(root_dirTrain,"Labels"+str(CounterIteration)+".mat")

            self.labels = scipy.io.loadmat(label_path)['TargetTrainFinal_4']

        else:
            patch_path = os.path.join(root_dirTest,"Patches"+".mat")
            label_path = os.path.join(root_dirTest,"Labels"+".mat")
            self.patches = scipy.io.loadmat(patch_path)['PatchTestFinal']

            self.labels = scipy.io.loadmat(label_path)['TargetTestFinal']

        self.split = split
        self.transforms = transforms
        self.debug = debug
        
    def __getitem__(self, index):
        labelF = self.labels[index].astype(np.float32)

        if self.transforms is not None:
            patchF = self.transforms(patchF)

        return labelF


    def __len__(self):
        return self.labels.shape[-1]