Hi,
torch.utils.data.dataset.random_split returns a Subset object which has no transforms attribute. How can I split a Dataset object and return another Dataset object with the same transforms attribute?
Thanks
I’m not sure that would be easily achievable without writing your own derived class.
You could split your Dataset as the first step and pass the Subset to your main Dataset with all transformations.
Would that work for you?
How can I pass Subset to Dataset?
Here is a small example:
import torch
from torch.utils.data import Dataset, TensorDataset, random_split
from torchvision import transforms

class MyDataset(Dataset):
    # Wraps a Subset and applies the transform to each sample on access
    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __getitem__(self, index):
        x, y = self.subset[index]
        if self.transform:
            x = self.transform(x)
        return x, y

    def __len__(self):
        return len(self.subset)

init_dataset = TensorDataset(
    torch.randn(100, 3, 24, 24),
    torch.randint(0, 10, (100,))
)
lengths = [int(len(init_dataset)*0.8), int(len(init_dataset)*0.2)]
subsetA, subsetB = random_split(init_dataset, lengths)
datasetA = MyDataset(
    subsetA, transform=transforms.Normalize((0., 0., 0.), (0.5, 0.5, 0.5))
)
datasetB = MyDataset(
    subsetB, transform=transforms.Normalize((0., 0., 0.), (0.5, 0.5, 0.5))
)
Let me know if that would work for you.
Thank you ptrblck. I will try this.
But I would like to ask whether it is a good idea to manually split my dataset and then create two separate Dataset objects from two different image folders. Is this good practice?
Would you like to create separate folders for both splits?
In that case you should manually split the indices and move/copy the files to the corresponding folders.
random_split returns splits from a single Dataset.
It’s usually a good idea to split the data into different folders. However, in that case you won’t need random_split, but just two separate Datasets.
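As a rough sketch of the "split the indices and copy the files" step, something like the following could work. The function name split_into_folders, its parameters, and the 20% validation fraction are all hypothetical choices for illustration, and it assumes a flat source folder; for a class-per-folder layout you would probably call it once per class folder.

```python
import random
import shutil
from pathlib import Path


def split_into_folders(src_dir, train_dir, valid_dir, valid_fraction=0.2, seed=0):
    """Copy the files in src_dir into train_dir / valid_dir via a random index split.

    Hypothetical helper: names and defaults are illustrative only.
    Returns the number of files copied to (train_dir, valid_dir).
    """
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())

    # Shuffle the indices (not the files) with a fixed seed for reproducibility.
    indices = list(range(len(files)))
    random.Random(seed).shuffle(indices)

    n_valid = int(len(files) * valid_fraction)
    valid_indices = set(indices[:n_valid])

    for dest in (train_dir, valid_dir):
        Path(dest).mkdir(parents=True, exist_ok=True)

    # Copy each file to the folder its index was assigned to.
    for i, path in enumerate(files):
        dest = valid_dir if i in valid_indices else train_dir
        shutil.copy2(path, Path(dest) / path.name)

    return len(files) - n_valid, n_valid
```

Afterwards each folder can be loaded as its own Dataset with its own transforms, so random_split is no longer needed.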
Hi ptrblck,
Sorry, I have a question. I passed balanced data (4000 positive and 4000 negative samples) as DatasetTrain to random_split, with train_len set to 70% and valid_len set to 30%:
TrainData1, ValidationData1 = random_split(DatasetTrain,[train_len, valid_len])
Can TrainData1 and ValidationData1 be unbalanced after that?
I think they can be, since the split is random.
After the 70%/30% split, I need both the training set and the validation set to be balanced again.
The more samples you use, the lower the likelihood of creating an imbalance.
However, if you need strictly the same distribution, I would recommend creating the training and testing indices with sklearn.model_selection.train_test_split and providing the stratify argument.
Sorry, do you mean that if I use this function on my DatasetTrain, which has 4000 positive and 4000 negative samples, it will give me a balanced 70% training set and a balanced 30% validation set?
train_test_split( DatasetTrain,test_size=.3,train_size=.7, stratify??)
I am using:
```
DatasetTrain = CMBDataClassifier(root_dirTrain, root_dirTest, split='train', transforms=transform, debug=False, CounterIteration=Iteration, SubID=0, TPID=0)
train_len = int(0.7 * len(DatasetTrain))
valid_len = len(DatasetTrain) - train_len
TrainData1, ValidationData1 = random_split(DatasetTrain, [train_len, valid_len])
trainloader = torch.utils.data.DataLoader(TrainData1, batch_size=32, shuffle=True, drop_last=True, num_workers=0)
validationloader = torch.utils.data.DataLoader(ValidationData1, batch_size=6, drop_last=True, num_workers=0)
```
My problem is that TrainData1 and ValidationData1 may end up unbalanced between the positive and negative classes.
If you want to make sure both splits are balanced, you could get the target tensor, create the split indices using train_test_split, and pass the target array to stratify.
The returned train and test indices can then be used in Subset to create the datasets.
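A minimal sketch of that approach, assuming a toy TensorDataset as a stand-in for the real dataset. The names dataset, targets, train_idx and valid_idx are illustrative, and random_state=0 is an arbitrary choice to make the split reproducible:

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

# Toy stand-in for a balanced dataset: 4000 negative and 4000 positive samples.
targets = np.array([0] * 4000 + [1] * 4000)
dataset = TensorDataset(torch.randn(len(targets), 8), torch.from_numpy(targets))

# Split the indices (not the data itself), stratifying on the labels so that
# both parts keep the class ratio of the full dataset.
train_idx, valid_idx = train_test_split(
    np.arange(len(targets)),
    test_size=0.3,
    stratify=targets,
    random_state=0,
)

# Wrap the index lists in Subset to get the two datasets.
train_data = Subset(dataset, train_idx.tolist())
valid_data = Subset(dataset, valid_idx.tolist())

# Per-class counts in each split, to check the balance.
print(np.bincount(targets[train_idx]))
print(np.bincount(targets[valid_idx]))
```

train_data and valid_data can then be passed to DataLoader in the same way as the outputs of random_split.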
Sorry, how can I get the target tensors? I am using this class to load the data:
```
import os
from os.path import dirname, join as pjoin

import numpy as np
import scipy
import scipy.integrate as integrate
import scipy.io as sio
from scipy import io
import matplotlib.pyplot as plt
from torch.autograd import Variable

# root_dirDurringTraining11 is assumed to be defined earlier in the script
if not os.path.isdir(root_dirDurringTraining11):
    os.mkdir(root_dirDurringTraining11)

class CMBDataClassifier():
    def __init__(self, root_dirTrain, root_dirTest, split, transforms, debug, CounterIteration, SubID, TPID):
        self.patches, self.labels = None, None
        if split == 'train':
            patch_path = os.path.join(root_dirTrain, "Patches" + str(CounterIteration) + ".mat")
            label_path = os.path.join(root_dirTrain, "Labels" + str(CounterIteration) + ".mat")
            self.patches = scipy.io.loadmat(patch_path)['TrainPatchFinal_4']
            self.labels = scipy.io.loadmat(label_path)['TargetTrainFinal_4']
        else:
            patch_path = os.path.join(root_dirTest, "Patches" + ".mat")
            label_path = os.path.join(root_dirTest, "Labels" + ".mat")
            self.patches = scipy.io.loadmat(patch_path)['PatchTestFinal']
            self.labels = scipy.io.loadmat(label_path)['TargetTestFinal']
        self.split = split
        self.transforms = transforms
        self.debug = debug

    def __getitem__(self, index):
        # The sample index is the last axis of the patches array
        patchF = self.patches[:, :, :, index]
        labelF = self.labels[index].astype(np.float32)
        if self.transforms is not None:
            patchF = self.transforms(patchF)
        return patchF, labelF

    def __len__(self):
        return self.patches.shape[-1]
```
You could iterate the dataset once and store all targets in e.g. a list, or alternatively just load the targets outside of the Dataset using your current logic.
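A small sketch of the first option, using a toy TensorDataset as a hypothetical stand-in for any dataset whose __getitem__ returns a (sample, label) pair:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Hypothetical stand-in for a dataset that returns (sample, label) pairs.
dataset = TensorDataset(torch.randn(10, 4), torch.tensor([0, 1] * 5))

# Iterate once over all indices and keep only the label of each pair.
targets = np.array([int(dataset[i][1]) for i in range(len(dataset))])
```

For a dataset that keeps its labels in an attribute, reading that array directly (possibly flattened with np.ravel, depending on its shape) would avoid loading every sample just to collect the targets.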
I tried that, but it gives me nothing, no labels:
def __init__(self, root_dirTrain, root_dirTest, split, transforms, debug, CounterIteration, SubID, TPID):
    self.patches, self.labels = None, None
    if split == 'train':
        label_path = os.path.join(root_dirTrain, "Labels" + str(CounterIteration) + ".mat")
        self.labels = scipy.io.loadmat(label_path)['TargetTrainFinal_4']
    else:
        patch_path = os.path.join(root_dirTest, "Patches" + ".mat")
        label_path = os.path.join(root_dirTest, "Labels" + ".mat")
        self.patches = scipy.io.loadmat(patch_path)['PatchTestFinal']
        self.labels = scipy.io.loadmat(label_path)['TargetTestFinal']
    self.split = split
    self.transforms = transforms
    self.debug = debug

def __getitem__(self, index):
    labelF = self.labels[index].astype(np.float32)
    #
    if self.transforms is not None:
        patchF = self.transforms(patchF)
    return labelF

def __len__(self):
    return self.labels.shape[-1]