self.Train_data only contains the data, not the target:
self.Train_data = Train.data[rand_idx]
If I want both the images and the labels, what should I do? I used rand_idx to select random samples from the train and test sets.
You could draw the data and target separately and then index them in __getitem__:
rand_idx = torch.randperm(len(Train.data))[:10]
data = Train.data[rand_idx]
target = torch.tensor(Train.targets)[rand_idx]
However, the usual approach would be to shuffle the indices that are passed to __getitem__(self, index), so that you don't have to shuffle them manually in your __init__.
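For example, setting shuffle=True lets the DataLoader draw a random permutation of indices for you (a minimal sketch, reusing the Train dataset from your code):
# Minimal sketch: the DataLoader permutes the indices it passes to
# __getitem__, so no manual shuffling is needed in __init__.
loader = torch.utils.data.DataLoader(Train, batch_size=10, shuffle=True)
data, target = next(iter(loader))  # one random batch of images and labels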
So you mean that if I set shuffle=True, I can rewrite my code as:
class MyTestDataset():
    def __init__(self, transform_test=None, transform_train=None):
        self.Train = datasets.CIFAR10(root='~/data', train=True, download=False, transform=transform_train)
        self.Test = datasets.CIFAR10(root='~/data', train=False, download=False, transform=transform_test)

    def __getitem__(self, index):
        x1, y1 = self.Train[index]
        x2, y2 = self.Test[index]
        x = torch.stack((x1, x2))
        y = torch.stack((torch.tensor(y1), torch.tensor(y2)))
        return x, y, index

    def __len__(self):
        return len(self.Train) + len(self.Test)

dataset = MyTestDataset(transform_test=transform_test, transform_train=transform_train)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8)
But how can I limit the index for the test data in __getitem__? Right now it gives this error:
in __getitem__
    img, target = self.data[index], self.targets[index]
IndexError: index 31041 is out of bounds for axis 0 with size 10000
Have you tried torch.utils.data.ConcatDataset to concatenate the two datasets? I haven't used it before, but I think it's meant for exactly this kind of need.
By the way, you are returning the concatenation of one training sample and one testing sample in your __getitem__ method, so you get an error, since the training dataset is larger than the testing dataset. And if you code it this way, the length of the new dataset will not be the sum of the two datasets.
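To see this concretely (a quick check against the class above):
# len(dataset) reports 60000, but self.Test only holds 10000 samples,
# so any shuffled index >= 10000 triggers the IndexError shown above.
print(len(dataset))         # 60000
x, y, idx = dataset[31041]  # IndexError: index 31041 is out of bounds ...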
Thank you for your guidance, @DAlolicorn.
I changed __len__ and the problem is solved.
I also changed x = torch.stack((x1,x2)) to x = torch.cat((x1,x2),2), since torch.stack was generating images with dimension torch.Size([64, 2, 3, 32, 32]).
Regarding ConcatDataset, given my current __getitem__, I am not sure how I should feed the datasets to ConcatDataset.
I was also trying:
for batch_idx, (inputs, targets, idx) in loader:
but it shows an error. Is this for loop wrong?
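(As a side note, the loop most likely fails because the loader yields the (inputs, targets, idx) tuples directly; either of these forms should work:)
# Unpack each batch directly ...
for inputs, targets, idx in loader:
    pass
# ... or wrap the loader in enumerate to also get a batch counter.
for batch_idx, (inputs, targets, idx) in enumerate(loader):
    pass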
Let me confirm: your objective is to get a dataset that contains both the training and testing sets of CIFAR-10?
Train = datasets.CIFAR10(root='~/data', train=True, download=True, transform=transform_train)
Test = datasets.CIFAR10(root='~/data', train=False, download=False, transform=transform_test)
new_set = torch.utils.data.ConcatDataset([Train, Test])
loader = torch.utils.data.DataLoader(new_set, batch_size=64, shuffle=True, num_workers=8)

for inputs, targets in loader:
    # do something
Each (inputs, targets) pair will come from either the training set or the testing set. It's like having one huge dataset containing both training and testing data.
Thank you for your reply, @DAlolicorn.
I want a concatenation of test and train with the size of the test dataset of CIFAR-10. Will torch.utils.data.ConcatDataset provide a dataset with the size of both datasets, i.e. 50000 + 10000?
I am also looking for the index of the images. Is there a way to get the index in __getitem__ for the train and test samples separately, rather than one index for both of them?
Yes, the output of torch.utils.data.ConcatDataset should contain 60000 samples.
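For example (a quick sanity check, reusing Train and Test from above):
new_set = torch.utils.data.ConcatDataset([Train, Test])
print(len(new_set))  # 60000: 50000 train + 10000 test samples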
As for the index, do you mean:
for batch_idx, (inputs, targets) in enumerate(loader):
I am looking for the index of each image, I mean the index here:
def __getitem__(self, index):
    x1, y1 = self.Train[index]
    x2, y2 = self.Test[index]
    x = torch.stack((x1, x2))
    y = torch.stack((torch.tensor(y1), torch.tensor(y2)))
    return x, y, index
but this index is for the concatenated images, if I am correct? I want to track the index of each image separately from their concatenation.
So you want your dataset to output two images (one from the training dataset and one from the testing dataset) with the same index, and output the index as well?
I thought you wanted a huge dataset containing both training and testing data, where you output only one image at a time, from either the training or the testing set.
Yes, I want the dataset to output two images, one from the testing dataset and one from the training dataset. I think I misunderstood the index part; it seems that in __getitem__ the index for both images will be the same, correct?
Is this what you need?
class MyTestDataset():
    def __init__(self, transform_test=None, transform_train=None):
        self.Train = datasets.CIFAR10(root='~/data', train=True, download=False, transform=transform_train)
        self.Test = datasets.CIFAR10(root='~/data', train=False, download=False, transform=transform_test)

    def __getitem__(self, index):
        x1, y1 = self.Train[index]
        x2, y2 = self.Test[index]
        return x1, x2, y1, y2, index

    def __len__(self):
        return len(self.Test)

dataset = MyTestDataset(transform_test=transform_test, transform_train=transform_train)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8)

for inputs_train, inputs_test, targets_train, targets_test, idx in loader:
Then __len__ should be the size of the testing set (10000), since you want the index to be the same.
This is what confuses me, since you are not using the last 40000 samples of the training dataset…
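(If you did want to cover all 50000 training images, one possible variant, just a sketch and not necessarily what you need, is to make the dataset as long as the training set and wrap the test index around:)
def __getitem__(self, index):
    x1, y1 = self.Train[index]
    # Wrap the index so every training image gets a test partner;
    # each test image is then reused five times.
    x2, y2 = self.Test[index % len(self.Test)]
    return x1, x2, y1, y2, index

def __len__(self):
    return len(self.Train)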
Thank you, @DAlolicorn.
I got the answer; I was looking for the index that you denoted by idx.
I think, as you said, len(self.Test) determines the size of the dataset, which is 10000 in this case.
I would also like to know: if I want the testing data together with a shuffled version of it (instead of the training data), is this correct?
class MyTestDataset():
    def __init__(self, transform_test=None):
        self.Test = datasets.CIFAR10(root='~/data', train=False, download=False, transform=transform_test)
        self.cifar_len = 10
        rand_idx = torch.randperm(len(self.Test))[:self.cifar_len]
        self.Test_shuffeled = self.Test[rand_idx]

    def __getitem__(self, index):
        x1, y1 = self.Test_shuffeled[index]
        x2, y2 = self.Test[index]
        x = torch.cat((x1, x2), 2)
        y = torch.stack((torch.tensor(y1), torch.tensor(y2)))
        return x, y, index

    def __len__(self):
        return len(self.Test)

dataset = MyTestDataset(transform_test=transform_test)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=False, num_workers=8)
I think that if no error message comes out, it should be correct.
You can check by printing a pair of outputs; if they are not the same, it has been shuffled correctly.
Yet I would recommend just using x1, y1 = self.Test[self.rand_idx[index]]. In this case, you don't need new memory space for self.Test_shuffeled.
When I used
rand_idx = torch.randperm(len(self.Test))[:self.cifar_len]
self.Test_shuffeled = self.Test[rand_idx]

def __getitem__(self, index):
    x1, y1 = self.Test[index]
    x2, y2 = self.Test_shuffeled[index]

it gives this error:
File "...", line 181, in __init__
self.Test_shuffeled = self.Test[rand_idx]
File "....", line 117, in __getitem__
img, target = self.data[index], self.targets[index]
TypeError: only integer tensors of a single element can be converted to an index
and when I tried self.Test[self.rand_idx[index]], it gives this error:
IndexError: index 10 is out of bounds for dimension 0 with size 10
You can just use numpy:
self.rand_idx = np.arange(len(self.Test))
np.random.shuffle(self.rand_idx)
The second error is because self.cifar_len = 10, so your index should be less than 10.
I don't know why you limited your shuffled version to only 10 samples, but if you need to do so, the len of the dataset should then be 10; since it is now 10000, the dataset tries to request something like self.rand_idx[50], which doesn't exist and gives you the error.
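(A quick way to reproduce that, as a sketch:)
rand_idx = torch.randperm(10000)[:10]  # only 10 indices are kept
rand_idx[50]  # IndexError: index 50 is out of bounds for dimension 0 with size 10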
Thank you, @DAlolicorn.
Sorry, that was my mistake; I shouldn't limit the shuffled version to the first 10 samples. I tried
self.rand_idx = np.arange(len(self.Test))
self.Test_shuffeled = self.Test[np.random.shuffle(self.rand_idx)]

def __getitem__(self, index):
    x1, y1 = self.Test[index]
    x2, y2 = self.Test_shuffeled[index]
it gives this error:
TypeError: list indices must be integers or slices, not NoneType
np.random.shuffle is an in-place operation that shuffles self.rand_idx and returns None, which is why indexing with its return value fails:
self.rand_idx = np.arange(len(self.Test))
np.random.shuffle(self.rand_idx)

def __getitem__(self, index):
    x1, y1 = self.Test[index]
    x2, y2 = self.Test[self.rand_idx[index]]
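Put together, a complete version might look like this (a sketch; it assumes transform_test includes ToTensor() so that torch.cat works on the images):
import numpy as np
import torch
from torchvision import datasets

class MyTestDataset(torch.utils.data.Dataset):
    def __init__(self, transform_test=None):
        self.Test = datasets.CIFAR10(root='~/data', train=False,
                                     download=False, transform=transform_test)
        # Shuffle a full index array in place; no copy of the data is made.
        self.rand_idx = np.arange(len(self.Test))
        np.random.shuffle(self.rand_idx)

    def __getitem__(self, index):
        x1, y1 = self.Test[index]                 # original order
        x2, y2 = self.Test[self.rand_idx[index]]  # shuffled order
        x = torch.cat((x1, x2), 2)                # concat along the width
        y = torch.stack((torch.tensor(y1), torch.tensor(y2)))
        return x, y, index

    def __len__(self):
        return len(self.Test)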