(how to iterate subset after random_split) TypeError: 'DataLoader' object is not subscriptable

Hi all,

This might be a trivial error, but I could not find a way past it, so I would sincerely appreciate any help.
I run into TypeError: 'DataLoader' object is not subscriptable when trying to iterate through my training dataset after applying random_split to the full set. This is what my full set looks like:

clean_loader.dataset
Dataset ImageFolderWithPaths
    Number of datapoints: 6929
    Root location: ./data/clean_images
    StandardTransform
Transform: Compose(
               Resize(size=128, interpolation=PIL.Image.BILINEAR)
               CenterCrop(size=(128, 128))
               ToTensor()
               Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
           )
for data in clean_loader:
    print(data)
    break
[tensor([[[[-0.3198, -0.3027, -0.5253,  ..., -0.2684,  0.5536,  1.3584],
          [-0.3369, -0.3883, -0.6965,  ..., -0.4739, -0.0629,  1.1187],
          [-0.6623, -0.6965, -0.9534,  ..., -0.4739, -0.6623,  0.1254],
          ...,
          [-0.2684,  0.9132,  1.2899,  ..., -0.5767, -0.2684,  0.2282],
          [ 0.6734,  0.7248,  1.1187,  ..., -0.3541,  0.2624,  0.6734],
          [ 0.8618,  0.6392,  0.9988,  ..., -0.3369,  0.4166,  0.6734]],

         [[-1.1954, -1.1253, -1.3354,  ..., -1.0728, -0.1800,  0.6954],
          [-1.1954, -1.2129, -1.4580,  ..., -1.2479, -0.7402,  0.4853],
          [-1.4930, -1.5105, -1.7206,  ..., -1.2304, -1.3179, -0.4601],
          ...,
          [-1.0028,  0.2927,  0.9055,  ..., -1.3004, -0.9853, -0.4776],
          [ 0.0126,  0.1527,  0.7479,  ..., -1.1429, -0.5126, -0.0924],
          [ 0.2402,  0.0826,  0.6429,  ..., -1.1253, -0.3725, -0.1099]],

         [[ 0.2348,  0.1825, -0.2010,  ...,  0.3393,  1.2805,  2.1346],
          [ 0.1651,  0.0605, -0.3404,  ...,  0.1825,  0.6182,  1.8383],
          [-0.2532, -0.3055, -0.6018,  ...,  0.2348, -0.0267,  0.7402],
          ...,
          [ 0.3742,  1.5420,  1.8905,  ...,  0.2173,  0.5659,  1.0714],
          [ 1.5420,  1.5594,  1.8731,  ...,  0.5136,  1.1585,  1.6117],
          [ 1.8383,  1.5420,  1.8383,  ...,  0.5834,  1.3851,  1.6814]]]]), tensor([0]), ('1',)]

I used random_split to split the full set:

clean_train_set, clean_test_set = torch.utils.data.random_split(clean_loader, [round(len(clean_loader)*0.8),(len(clean_loader) - round(len(clean_loader)*0.8))])

However, after the split, I cannot iterate through the subset:

for data in clean_train_set:
    print(data)
    break
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
----> 1 for data in clean_train_set:
      2     print(data)
      3     break

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataset.py in __getitem__(self, idx)
    255 
    256     def __getitem__(self, idx):
--> 257         return self.dataset[self.indices[idx]]
    258 
    259     def __len__(self):

TypeError: 'DataLoader' object is not subscriptable

I have also tried the enumerate way to do the iteration:

for step, data in enumerate(clean_train_set):
    print(data)
    break

But get the same error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
----> 1 for step, data in enumerate(clean_train_set):
      2     print(data)
      3     break

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataset.py in __getitem__(self, idx)
    255 
    256     def __getitem__(self, idx):
--> 257         return self.dataset[self.indices[idx]]
    258 
    259     def __len__(self):

TypeError: 'DataLoader' object is not subscriptable

I guess this might be an elephant-in-the-room issue, but I have googled thoroughly and cannot find a fix. I urgently need some help to get past this error, thanks a bunch!


I have also tried to pass the Subset to a DataLoader, but still no luck:

clean_train_loader = torch.utils.data.DataLoader(
    clean_train_set,
    batch_size=1,
    shuffle=True,
    num_workers=0,
)
for step, data in enumerate(clean_train_loader):
    print(data)
    break
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
----> 1 for step, data in enumerate(clean_train_loader):
      2     print(data)
      3     break

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    343 
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataset.py in __getitem__(self, idx)
    255 
    256     def __getitem__(self, idx):
--> 257         return self.dataset[self.indices[idx]]
    258 
    259     def __len__(self):

TypeError: 'DataLoader' object is not subscriptable

I can iterate over the dataset using clean_train_loader.dataset.dataset, but it is actually the original full set being enumerated, not the subset:

print(len(clean_loader))
print(len(clean_train_loader))
i = 0
for step, data in enumerate(clean_train_loader.dataset.dataset):
    i += 1
print(i)
6929
5543
6929

I found some discussion mentioning that data_loader.dataset.dataset traces back to the original full set, but if that is the case, how can I iterate through the subset itself? I am really confused here and desperately need some help. :disappointed_relieved:
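To see why indexing the Subset fails, here is a minimal sketch using simplified stand-ins (MiniSubset and MiniLoader are hypothetical toy classes, not the real torch ones, but MiniSubset.__getitem__ mirrors the line shown in the traceback above):

```python
class MiniSubset:
    """Simplified Subset: delegates indexing to the wrapped object,
    exactly like the traceback line `return self.dataset[self.indices[idx]]`."""
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)


class MiniLoader:
    """Simplified DataLoader: iterable only, deliberately no __getitem__."""
    def __init__(self, items):
        self.items = items

    def __iter__(self):
        return iter(self.items)

    def __len__(self):
        return len(self.items)


full = list(range(10))
loader = MiniLoader(full)

# Wrapping the *loader* reproduces the error: indexing delegates to
# loader[...], and MiniLoader has no __getitem__.
bad = MiniSubset(loader, [0, 2, 4])
try:
    bad[0]
except TypeError as e:
    print(e)  # 'MiniLoader' object is not subscriptable

# Wrapping the underlying *dataset* works, because lists (and torch
# Datasets) support indexing.
good = MiniSubset(full, [0, 2, 4])
print(good[1])  # second selected index is 2
```

So the Subset itself is fine; the problem is that the object it delegates to is a DataLoader, which only supports iteration, not indexing.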

BTW, I have also tried out SubsetRandomSampler, but the result is the same:

from torch.utils.data.sampler import SubsetRandomSampler
test_split = .2
shuffle_dataset = True
random_seed= 42

# Creating data indices for training and validation splits:
dataset_size = len(clean_loader)
indices = list(range(dataset_size))
split = int(np.floor(test_split * dataset_size))
if shuffle_dataset :
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, test_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(test_indices)

train_loader = torch.utils.data.DataLoader(clean_loader, batch_size=1, 
                                           sampler=train_sampler)
test_loader = torch.utils.data.DataLoader(clean_loader, batch_size=1,
                                                sampler=test_sampler)
i = 0
for data in train_loader.dataset:
    i += 1
print(i)

i = 0
for step, data in enumerate(train_loader.dataset):
    i += 1
print(i)

i = 0
for (images, labels, index) in train_loader.dataset:
    i += 1
print(i)

for data in train_loader:
    print(data)
6929
6929
6929
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
----> 1 for data in train_loader:
      2     print(data)

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    343 
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

TypeError: 'DataLoader' object is not subscriptable

Basically, I have tried every iteration approach I could find, but still with no luck.

Solved the problem. It turns out I needed to pass clean_loader.dataset (the underlying Dataset) instead of clean_loader to random_split and DataLoader. Just documenting it here in case someone else gets caught by the same mistake.
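For reference, here is the corrected pattern as a self-contained sketch. A small TensorDataset stands in for the ImageFolderWithPaths set from the post (6929 images there, 100 random tensors here); the structure of the fix is the same — split the Dataset, then wrap each Subset in its own DataLoader:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Stand-in for clean_loader.dataset (ImageFolderWithPaths in the post).
full_dataset = TensorDataset(
    torch.randn(100, 3, 8, 8),          # fake images
    torch.randint(0, 2, (100,)),        # fake labels
)

# Split the Dataset, NOT the DataLoader.
train_len = round(len(full_dataset) * 0.8)
test_len = len(full_dataset) - train_len
clean_train_set, clean_test_set = random_split(full_dataset, [train_len, test_len])

# Each Subset gets its own DataLoader.
clean_train_loader = DataLoader(clean_train_set, batch_size=1, shuffle=True, num_workers=0)
clean_test_loader = DataLoader(clean_test_set, batch_size=1, num_workers=0)

# Iteration now covers only the 80-sample subset.
images, labels = next(iter(clean_train_loader))
print(len(clean_train_loader), images.shape)  # 80 batches of shape (1, 3, 8, 8)
```

The same applies to the SubsetRandomSampler attempt: DataLoader's first argument must be the Dataset (clean_loader.dataset), not another DataLoader.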
