TypeError: 'Dataset' object does not support indexing

Hi,

I want to concatenate the testing and training samples of CIFAR-10 and then use this combined dataset at test time. I used __getitem:

class MyTestDataset():
        def __init__(self, transform_test=None, transform_train=None):
            Train = datasets.CIFAR10(root='~/data', train=True,download=True,transform=transform_train)
            Test = datasets.CIFAR10(root='~/data', train=False,download=False,transform=transform_test)
            self.cifar_len = 10
        
            rand_idx = torch.randperm(len(Train.data))[:self.cifar_len]
            self.Train_data = Train.data[rand_idx]
            
            rand_idxt = torch.randperm(len(Test.data))[:self.cifar_len]
            self.Test_data = Test.data[rand_idxt]

        def __getitem(self, index):    
            x1, y1 = self.Train_data[index]
            
            x2, y2 = self.Test_data[index]
            
            x = torch.stack((x1, x2))
            y = torch.stack((torch.tensor(y1), torch.tensor(y2)))
            
            return x, y

        def __len__(self):
            return self.cifar_len + self.cifar_len
dataset = MyTestDataset(transform_test=transform_test, transform_train=transform_train)

loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=False, num_workers=8)

and in the test, I used:

for inputs, targets in loader:

but this gives an error:

  File "...", line 339, in test
    for inputs, targets in loader:
  File "...", line 637, in __next__
    return self._process_next_batch(batch)
  File "...", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
  File "...", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "...", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
TypeError: 'MyTestDataset' object does not support indexing

The __getitem__ definition is missing the two trailing underscores: you defined __getitem, which is just an ordinary method, so indexing never reaches it.
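As a minimal illustration of why the name matters (my own sketch, not code from this thread): Python only routes dataset[i] through a method named exactly __getitem__, so a mistyped name is silently ignored and indexing fails.

class Broken:
    def __getitem(self, index):    # wrong name: just an ordinary method
        return index

class Fixed:
    def __getitem__(self, index):  # dunder name: this is what enables obj[i]
        return index

print(Fixed()[3])  # 3
Broken()[3]        # raises TypeError ('does not support indexing' in the traceback above)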

Thank you for your reply, @ptrblck.

I fixed that; now it’s giving this error:

line 190, in __getitem__
    x1, y1 = self.Train_data[index]
ValueError: too many values to unpack (expected 2)

self.Train_data only contains the data, not the target:

self.Train_data = Train.data[rand_idx]

If I want both images and labels, what should I do? I used rand_idx to select random samples from the training and test sets.

You could draw the data and target separately and then index them in __getitem__:

rand_idx = torch.randperm(len(Train.data))[:10]
data = Train.data[rand_idx]            
target = torch.tensor(Train.targets)[rand_idx]

However, the usual approach would be to shuffle the indices, which will be passed to __getitem__(self, index), so that you don’t have to shuffle it manually in your __init__.
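A minimal sketch of that approach (my own illustration; the class name and transform are assumptions, not code from this thread): with shuffle=True, the DataLoader’s sampler permutes the indices before they reach __getitem__, so the dataset itself stays unshuffled.

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

class PlainCIFAR(Dataset):  # hypothetical name
    def __init__(self, transform=None):
        self.ds = datasets.CIFAR10(root='~/data', train=True,
                                   download=True, transform=transform)

    def __getitem__(self, index):
        # index arrives already shuffled when the DataLoader uses shuffle=True
        return self.ds[index]

    def __len__(self):
        return len(self.ds)

loader = DataLoader(PlainCIFAR(transforms.ToTensor()),
                    batch_size=64, shuffle=True, num_workers=2)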

So you mean that if I set shuffle=True I can rewrite my code as

class MyTestDataset():
        def __init__(self, transform_test=None, transform_train=None):
            self.Train = datasets.CIFAR10(root='~/data', train=True,download=False,transform=transform_train)
            self.Test = datasets.CIFAR10(root='~/data', train=False,download=False,transform=transform_test)

        def __getitem__(self, index):    
            x1, y1 = self.Train[index]
            x2, y2 = self.Test[index]
            
            x = torch.stack((x1,x2))
            y = torch.stack((torch.tensor(y1), torch.tensor(y2)))
            
            return x, y, index

        def __len__(self):
            return len(self.Train)+len(self.Test)     
        

dataset = MyTestDataset(transform_test=transform_test, transform_train=transform_train)

loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8)

but how can I limit the index for the test data in __getitem__? Now it gives this error:

in __getitem__
    img, target = self.data[index], self.targets[index]
IndexError: index 31041 is out of bounds for axis 0 with size 10000

Have you tried torch.utils.data.ConcatDataset to concatenate the two datasets?
I haven’t used it before, but I think it’s meant for exactly this need.

By the way, you are returning the concatenation of one training sample and one testing sample in your __getitem__ method, so you get an error: the training dataset is larger than the testing one, and any index past the test set’s length is out of range.
And if you code it this way, the length of the new dataset will not be the sum of the two datasets.
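One way to keep the paired lookup in range (a sketch; the modulo wrap, the class name, and the ToTensor transform are my assumptions, not something this thread settles on):

import torch
from torch.utils.data import Dataset
from torchvision import datasets, transforms

class PairedSet(Dataset):  # hypothetical name
    def __init__(self, transform=None):
        self.Train = datasets.CIFAR10(root='~/data', train=True,
                                      download=True, transform=transform)
        self.Test = datasets.CIFAR10(root='~/data', train=False,
                                     download=True, transform=transform)

    def __getitem__(self, index):
        x1, y1 = self.Train[index]                  # index in [0, 50000)
        x2, y2 = self.Test[index % len(self.Test)]  # wrapped into [0, 10000)
        return torch.stack((x1, x2)), torch.tensor([y1, y2])

    def __len__(self):
        return len(self.Train)  # iterate over the larger set

dataset = PairedSet(transform=transforms.ToTensor())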

Thank you for your guidance, @DAlolicorn.

I changed __len__ and the problem is solved.
I also changed x = torch.stack((x1,x2)) to x = torch.cat((x1,x2),2), since torch.stack was producing batches of shape torch.Size([64, 2, 3, 32, 32]).

Regarding ConcatDataset, given my current __getitem__, I am not sure how I should feed the datasets to it.

I was also trying:

for batch_idx, (inputs, targets, idx) in loader:

but it shows an error. Is this for loop wrong?

Let me confirm: your objective is to get a dataset that contains both the training and testing sets of CIFAR-10?

Train = datasets.CIFAR10(root='~/data', train=True,download=True,transform=transform_train)
Test = datasets.CIFAR10(root='~/data', train=False,download=False,transform=transform_test)
new_set = torch.utils.data.ConcatDataset([Train, Test])
loader = torch.utils.data.DataLoader(new_set, batch_size=64, shuffle=True, num_workers=8)
for inputs, targets in loader:
    #do something

The (inputs, targets) pair will come from either the training set or the testing set.
It’s like having one huge dataset containing both training and testing data.
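As a quick sanity check on the sizes (continuing the snippet above; the index layout follows from ConcatDataset simply chaining its inputs):

print(len(Train), len(Test), len(new_set))  # 50000 10000 60000

# Indices 0..49999 fall in Train, 50000..59999 in Test:
img, label = new_set[50000]  # first sample of the test set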

Thank you for your reply, @DAlolicorn.

I want a concatenation of test and train with the size of the test dataset for CIFAR-10. Will torch.utils.data.ConcatDataset provide a dataset with the size of both datasets combined, i.e. 50000+10000?

I am also looking for the index of the images. Is there any way to get the index in __getitem__ for the train and test datasets separately, rather than one index for both?

Yes, the output of torch.utils.data.ConcatDataset should contain 60000 samples.

As for the index, you mean:

for batch_idx, (inputs, targets) in enumerate(loader):

I am looking for the index of each image; I mean the index here:

def __getitem__(self, index):
    x1, y1 = self.Train[index]
    x2, y2 = self.Test[index]

    x = torch.stack((x1, x2))
    y = torch.stack((torch.tensor(y1), torch.tensor(y2)))

    return x, y, index

but this index is for the concatenated images, if I am correct? I want to track the index of each image separately from their concatenation.

So you want your dataset to output two images (one from the training dataset and one from the testing dataset) with the same index, and output the index as well?

I thought you wanted one huge dataset containing both training and testing data, outputting a single image at a time from either set.

Yes, I want the dataset to output two images, one from the testing dataset and one from the training dataset. I think I misunderstood the index part; it seems that in __getitem__ the index for both images will be the same, correct?

Is this what you need?

class MyTestDataset():
        def __init__(self, transform_test=None, transform_train=None):
            self.Train = datasets.CIFAR10(root='~/data', train=True,download=False,transform=transform_train)
            self.Test = datasets.CIFAR10(root='~/data', train=False,download=False,transform=transform_test)

        def __getitem__(self, index):    
            x1, y1 = self.Train[index]
            x2, y2 = self.Test[index]
            return x1, x2, y1, y2, index

        def __len__(self):
            return len(self.Test)     
        

dataset = MyTestDataset(transform_test=transform_test, transform_train=transform_train)

loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8)

for inputs_train, inputs_test, targets_train, targets_test, idx in loader:

Then the len should be the size of the testing set (10000), since you want the index to be the same for both.
This is what confuses me, since you are then not using the last 40000 samples of the training dataset…
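If covering the whole training set matters, one option (my own suggestion, replacing __getitem__ in the class above; not something this thread adopts) is to pair each test image with a freshly drawn random training image:

def __getitem__(self, index):
    # random training index, drawn on every call (assumes the pairing
    # does not need to be reproducible across epochs)
    train_idx = torch.randint(len(self.Train), (1,)).item()
    x1, y1 = self.Train[train_idx]
    x2, y2 = self.Test[index]
    return x1, x2, y1, y2, train_idx, index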

Thank you, @DAlolicorn.

I got the answer; I was looking for the index that you return as idx.
As you said, len(self.Test) determines the size of the dataset, which is 10000 in this case.

I would also like to know: if I want to have the testing data and a shuffled version of it (instead of the training data), is this correct?

class MyTestDataset():
        def __init__(self, transform_test=None):
            self.Test = datasets.CIFAR10(root='~/data', train=False,download=False,transform=transform_test)
            self.cifar_len = 10
         
            rand_idx = torch.randperm(len(self.Test))[:self.cifar_len]
            self.Test_shuffeled = self.Test[rand_idx]
           

        def __getitem__(self, index):    
            x1, y1 = self.Test_shuffeled[index]
            x2, y2 = self.Test[index]
            x = torch.cat((x1,x2),2)
            y = torch.stack((torch.tensor(y1), torch.tensor(y2)))
            
            return x, y, index
        
        def __len__(self):
            return len(self.Test)

dataset = MyTestDataset(transform_test=transform_test)

loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=False, num_workers=8)

I think if no error message comes out, it should be correct.
You can check by printing a pair of outputs; if the two images are not the same, it is shuffled correctly.

Yet I would recommend just using x1, y1 = self.Test[self.rand_idx[index]].
That way you don’t need a new memory space for self.Test_shuffeled.
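Put together, that suggestion might look like this (a sketch; the class name is hypothetical, and ToTensor is assumed so the two images can be concatenated as tensors):

import torch
from torch.utils.data import Dataset
from torchvision import datasets, transforms

class ShuffledPairSet(Dataset):  # hypothetical name
    def __init__(self, transform_test=None):
        self.Test = datasets.CIFAR10(root='~/data', train=False,
                                     download=True, transform=transform_test)
        # one permutation over the full test set, fixed at construction;
        # no extra copy of the data is made
        self.rand_idx = torch.randperm(len(self.Test))

    def __getitem__(self, index):
        x1, y1 = self.Test[self.rand_idx[index].item()]  # shuffled view
        x2, y2 = self.Test[index]                        # original order
        x = torch.cat((x1, x2), 2)  # side by side along the width dimension
        y = torch.stack((torch.tensor(y1), torch.tensor(y2)))
        return x, y, index

    def __len__(self):
        return len(self.Test)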

When I used

            rand_idx = torch.randperm(len(self.Test))[:self.cifar_len]
            self.Test_shuffeled = self.Test[rand_idx]
          
        def __getitem__(self, index):    
            x1, y1 = self.Test[index]
            x2, y2 = self.Test_shuffeled[index]

it gives this error:

 File "...", line 181, in __init__
    self.Test_shuffeled = self.Test[rand_idx]
  File "....", line 117, in __getitem__
    img, target = self.data[index], self.targets[index]
TypeError: only integer tensors of a single element can be converted to an index

and when I tried self.Test[self.rand_idx[index]] it gives this error:

IndexError: index 10 is out of bounds for dimension 0 with size 10

You can just use numpy:

self.rand_idx = np.arange(len(self.Test))
np.random.shuffle(self.rand_idx)

The second error is because self.cifar_len = 10, so your index must be less than 10.
I don’t know why you limited the shuffled version to only the first 10 samples, but if you need to do so, the len of the dataset should then be 10. Since it’s currently 10000, the loader ends up requesting something like self.rand_idx[50], which doesn’t exist and gives you the error.
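In other words (a sketch, assuming you actually want the full test set rather than 10 samples): make the permutation cover the whole dataset, so every index that __len__ promises is valid. These lines would slot into the class discussed above:

import numpy as np

# in __init__: permute the full test set, not just 10 entries
self.rand_idx = np.arange(len(self.Test))
np.random.shuffle(self.rand_idx)

# in __getitem__: now valid for every index in [0, len(self.Test))
x1, y1 = self.Test[self.rand_idx[index]]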
