Hi,
torch.utils.data.dataset.random_split returns a Subset object which has no transforms attribute. How can I split a Dataset object and return another Dataset object with the same transforms attribute?
Thanks
I’m not sure that would be easily achievable without writing your own derived class.
You could split your Dataset as the first step and pass the Subset to your main Dataset with all transformations.
Would that work for you?
How can I pass a Subset to a Dataset?
Here is a small example:

```python
import torch
from torch.utils.data import Dataset, TensorDataset, random_split
from torchvision import transforms

class MyDataset(Dataset):
    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __getitem__(self, index):
        x, y = self.subset[index]
        if self.transform:
            x = self.transform(x)
        return x, y

    def __len__(self):
        return len(self.subset)

init_dataset = TensorDataset(
    torch.randn(100, 3, 24, 24),
    torch.randint(0, 10, (100,))
)
lengths = [int(len(init_dataset) * 0.8), int(len(init_dataset) * 0.2)]
subsetA, subsetB = random_split(init_dataset, lengths)
datasetA = MyDataset(
    subsetA, transform=transforms.Normalize((0., 0., 0.), (0.5, 0.5, 0.5))
)
datasetB = MyDataset(
    subsetB, transform=transforms.Normalize((0., 0., 0.), (0.5, 0.5, 0.5))
)
```
Let me know if that would work for you.
Thank you ptrblck. I will try this.
But I would like to ask whether it is a good idea to manually split my dataset and then create two separate Dataset objects from two different image folders. Is this good practice?
Would you like to create separate folders for both splits?
In that case you should manually split the indices and move/copy the files to the corresponding folders. random_split returns splits from a single Dataset.
It’s usually a good idea to split the data into different folders. However, in that case you won’t need random_split, but just two separate Datasets.
Hi ptrblck,
Sorry, I have a question. I passed balanced data (4000 positive and 4000 negative samples) as DatasetTrain to random_split, with train_len as 70% and valid_len as 30%:
TrainData1, ValidationData1 = random_split(DatasetTrain, [train_len, valid_len])
After that, can TrainData1 and ValidationData1 be unbalanced? I think since the split is random, they can be unbalanced.
Indeed. After splitting (70% and 30%), I need the training set and validation set to be balanced again.
The more samples you use, the lower the likelihood of creating an imbalance.
However, if you need strictly the same distribution, I would recommend creating the training and testing indices with sklearn.model_selection.train_test_split and providing the stratify argument.
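A minimal sketch of that stratified index split, using a dummy target array standing in for the real 4000/4000 labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical targets: 4000 positive and 4000 negative samples
targets = np.array([1] * 4000 + [0] * 4000)

# Split the *indices*, not the samples; stratify keeps the class ratio
train_idx, valid_idx = train_test_split(
    np.arange(len(targets)),
    test_size=0.3,
    stratify=targets,
    random_state=42,
)

# Both splits are now exactly 50/50:
# train: 2800 negative, 2800 positive
# valid: 1200 negative, 1200 positive
print(np.bincount(targets[train_idx]), np.bincount(targets[valid_idx]))
```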
Sorry, you mean that if my DatasetTrain has 4000 positive and 4000 negative samples, this function will give me 70% balanced training data and 30% balanced validation data?
train_test_split(DatasetTrain, test_size=.3, train_size=.7, stratify=?)
I am using:

```python
DatasetTrain = CMBDataClassifier(root_dirTrain, root_dirTest, split='train', transforms=transform, debug=False, CounterIteration=Iteration, SubID=0, TPID=0)

train_len = int(0.7 * len(DatasetTrain))
valid_len = len(DatasetTrain) - train_len
TrainData1, ValidationData1 = random_split(DatasetTrain, [train_len, valid_len])

trainloader = torch.utils.data.DataLoader(TrainData1, batch_size=32, shuffle=True, drop_last=True, num_workers=0)
validationloader = torch.utils.data.DataLoader(ValidationData1, batch_size=6, drop_last=True, num_workers=0)
```
My problem is that TrainData1 and ValidationData1 may be unbalanced with respect to the positive and negative classes.
If you want to make sure both splits are balanced, you could get the target tensor, create the split indices using train_test_split, and pass the target array to the stratify argument.
The returned train and test indices can then be used with Subset to create the datasets.
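A small sketch of that workflow, with a dummy TensorDataset standing in for the real dataset (shapes and names are illustrative):

```python
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset, TensorDataset

# Dummy balanced dataset: 50 negative, 50 positive samples
targets = torch.cat([torch.zeros(50), torch.ones(50)]).long()
dataset = TensorDataset(torch.randn(100, 3, 24, 24), targets)

# Stratified 70/30 index split, then wrap the indices in Subsets
train_idx, valid_idx = train_test_split(
    list(range(len(dataset))),
    test_size=0.3,
    stratify=targets.numpy(),
    random_state=0,
)
train_ds = Subset(dataset, train_idx)
valid_ds = Subset(dataset, valid_idx)
```

The two Subsets can then be passed to DataLoaders exactly like the outputs of random_split.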
Sorry, how can I get the target tensors? I am using this class to load the data:
```python
import os

import numpy as np
import scipy.io

ss = os.path.isdir(root_dirDurringTraining11)
if not ss:
    os.mkdir(root_dirDurringTraining11)

class CMBDataClassifier():
    def __init__(self, root_dirTrain, root_dirTest, split, transforms, debug, CounterIteration, SubID, TPID):
        self.patches, self.labels = None, None
        if split == 'train':
            patch_path = os.path.join(root_dirTrain, "Patches" + str(CounterIteration) + ".mat")
            label_path = os.path.join(root_dirTrain, "Labels" + str(CounterIteration) + ".mat")
            self.patches = scipy.io.loadmat(patch_path)['TrainPatchFinal_4']
            self.labels = scipy.io.loadmat(label_path)['TargetTrainFinal_4']
        else:
            patch_path = os.path.join(root_dirTest, "Patches" + ".mat")
            label_path = os.path.join(root_dirTest, "Labels" + ".mat")
            self.patches = scipy.io.loadmat(patch_path)['PatchTestFinal']
            self.labels = scipy.io.loadmat(label_path)['TargetTestFinal']
        self.split = split
        self.transforms = transforms
        self.debug = debug

    def __getitem__(self, index):
        patchF = self.patches[:, :, :, index]
        labelF = self.labels[index].astype(np.float32)
        if self.transforms is not None:
            patchF = self.transforms(patchF)
        return patchF, labelF

    def __len__(self):
        return self.patches.shape[-1]
```
You could iterate the dataset once and store all targets in e.g. a list, or alternatively just load the targets using your current logic outside of the Dataset.
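A minimal sketch of the first option, using a dummy TensorDataset in place of CMBDataClassifier (only the `dataset[i][1]` access pattern matters here):

```python
import torch
from torch.utils.data import TensorDataset

# Stand-in for the real dataset; __getitem__ returns (patch, label)
dataset = TensorDataset(torch.randn(10, 3, 24, 24), torch.randint(0, 2, (10,)))

# Iterate once and collect every target into a plain list
targets = [dataset[i][1].item() for i in range(len(dataset))]
```

The resulting list can then be passed as the stratify argument to train_test_split.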
I used that, but it gives me nothing, no labels:
```python
def __init__(self, root_dirTrain, root_dirTest, split, transforms, debug, CounterIteration, SubID, TPID):
    self.patches, self.labels = None, None
    if split == 'train':
        label_path = os.path.join(root_dirTrain, "Labels" + str(CounterIteration) + ".mat")
        self.labels = scipy.io.loadmat(label_path)['TargetTrainFinal_4']
    else:
        patch_path = os.path.join(root_dirTest, "Patches" + ".mat")
        label_path = os.path.join(root_dirTest, "Labels" + ".mat")
        self.patches = scipy.io.loadmat(patch_path)['PatchTestFinal']
        self.labels = scipy.io.loadmat(label_path)['TargetTestFinal']
    self.split = split
    self.transforms = transforms
    self.debug = debug

def __getitem__(self, index):
    labelF = self.labels[index].astype(np.float32)
    # Note: the original post applied self.transforms to patchF here,
    # but patchF is never defined in this labels-only version, which
    # would raise a NameError.
    return labelF

def __len__(self):
    # loadmat typically returns labels with shape (N, 1), so shape[-1]
    # would be 1 and the dataset would appear to contain a single
    # sample; use the first dimension instead.
    return self.labels.shape[0]
```