Issues with torch.utils.data.random_split

I’m getting an error when I use a DataLoader built from a dataset split with torch.utils.data.random_split().
I have tried creating a DataLoader with the un-split dataset and everything works fine, so I assume it’s something to do with the splitting.

I have a custom dataset defined as:

import os

import matplotlib.pyplot as plt
import pandas as pd
from torch.utils.data import Dataset


class AntsDataset(Dataset):

    def __init__(self, root_dir, csv_file, transform=None):
        """
        Args:
            csv_file (string): Path to the csv_file with rotations
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.rotations = pd.read_csv(csv_file,header=None)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.rotations)

    def __getitem__(self, idx):
        #import ipdb; ipdb.set_trace()
        img_name = os.path.join(self.root_dir,
                                self.rotations.iloc[idx, 0])
        image = plt.imread(img_name,format='RGB')
        rotation = self.rotations.iloc[idx, 1].astype('float')

        if self.transform is not None:
            image=self.transform(image)

        return (image, rotation)

I then create a dataset, split it, and build a DataLoader:

ants_dataset=AntsDataset(ants1_root_dir, ants1_rot_file,
        transform=transforms.Compose([transforms.ToPILImage(),
        transforms.Resize((120,120)),
        transforms.RandomCrop(size=100, pad_if_needed=True),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, hue=0.07),
        transforms.ToTensor()]))

dataloader = torch.utils.data.DataLoader(ants_dataset,
        batch_size=10, shuffle=True)

train_length = int(0.7 * len(ants_dataset))

test_length = len(ants_dataset) - train_length

train_dataset, test_dataset = torch.utils.data.random_split(ants_dataset, (train_length, test_length))

dataloader_train = torch.utils.data.DataLoader(train_dataset,
        batch_size=10, shuffle=True)

for batch_idx, (data, rotations) in enumerate(dataloader_train):
    print(rotations)

When I try to loop over the dataloader I get the following error:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-82-f629e71651de> in <module>()
      1 dataloader_train=torch.utils.data.DataLoader(train_dataset,
      2         batch_size=10, shuffle=True)
----> 3 for batch_idx, (data,rotations) in enumerate(dataloader_train):
      4     print(rotations)

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    312         if self.num_workers == 0:  # same-process loading
    313             indices = next(self.sample_iter)  # may raise StopIteration
--> 314             batch = self.collate_fn([self.dataset[i] for i in indices])
    315             if self.pin_memory:
    316                 batch = pin_memory_batch(batch)

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py in <listcomp>(.0)
    312         if self.num_workers == 0:  # same-process loading
    313             indices = next(self.sample_iter)  # may raise StopIteration
--> 314             batch = self.collate_fn([self.dataset[i] for i in indices])
    315             if self.pin_memory:
    316                 batch = pin_memory_batch(batch)

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataset.py in __getitem__(self, idx)
    101 
    102     def __getitem__(self, idx):
--> 103         return self.dataset[self.indices[idx]]
    104 
    105     def __len__(self):

<ipython-input-44-01f6586a3276> in __getitem__(self, idx)
     20         #import ipdb; ipdb.set_trace()
     21         img_name = os.path.join(self.root_dir,
---> 22                                 self.rotations.iloc[idx, 0])
     23         image = plt.imread(img_name,format='RGB')
     24         rotation = self.rotations.iloc[idx, 1].astype('float')

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             # we by definition only have the 0th axis

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   2011     def _getitem_tuple(self, tup):
   2012 
-> 2013         self._has_valid_tuple(tup)
   2014         try:
   2015             return self._getitem_lowerdim(tup)

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    220                 raise IndexingError('Too many indexers')
    221             try:
--> 222                 self._validate_key(k, i)
    223             except ValueError:
    224                 raise ValueError("Location based indexing can only have "

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1965             l = len(self.obj._get_axis(axis))
   1966 
-> 1967             if len(arr) and (arr.max() >= l or arr.min() < -l):
   1968                 raise IndexError("positional indexers are out-of-bounds")
   1969         else:

TypeError: len() of unsized object 

The error points to pandas and the self.rotations.iloc[idx, 0] call in particular.
Could you print out idx and use the last idx that triggers this error to call the function manually, i.e.:

ants_dataset.rotations.iloc[idx, 0]
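
For example, a quick print at the top of __getitem__ should reveal the failing index (just a debugging sketch):

def __getitem__(self, idx):
    print(idx, type(idx))  # debug: what index type does the DataLoader pass in?
    # ... rest of __getitem__ unchanged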

I found what the issue is: torch.utils.data.random_split() returns idx as a torch.Tensor rather than an int. I assume this is a bug?

I’m not sure if that’s the issue here, as it seems your Dataset was successfully split.
Could you try to get a single sample after the splitting using:

train_dataset[0]
test_dataset[0]

Also, were you successful in getting the idx before the crash occurs?

I get exactly the same error when I try that. I put an ipdb.set_trace() in the custom dataset class to inspect the incoming idx. See the image attached.

As a workaround, could you try using img_name = os.path.join(self.root_dir, self.rotations.iloc[idx.item(), 0])?
I’ll try to dig into this issue.

I’ve tried this already and it works. But if I create a DataLoader on the original dataset, I will need a condition to check whether idx is an int or a tensor (see the sketch below). Thanks for looking into it.
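
Just a rough sketch of the check I have in mind inside my __getitem__, using torch.is_tensor to detect the wrapped index:

    def __getitem__(self, idx):
        # idx is a plain int when indexing ants_dataset directly,
        # but a 0-dim tensor when it comes through the Subset from random_split
        if torch.is_tensor(idx):
            idx = idx.item()
        # ... rest of __getitem__ unchanged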

I have the same problem; it happens in the pandas core indexing:
TypeError: object of type 'numpy.int64' has no len()
Any update on this?


@ptrblck looks like @mariosfourn was correct in saying:

torch.utils.data.random_split() returns idx as a torch.Tensor rather than an int.

As per the example in question, indexing ants_dataset works correctly, but an error is raised when accessing an index of train_dataset.

This could be resolved by adding idx = idx.item(), but then indexing ants_dataset directly with a plain int would no longer work.

A quick hack would be to have

    def __getitem__(self, idx):
        try:
            # idx arrives as a 0-dim tensor when the dataset is wrapped by random_split
            idx = idx.item()
        except AttributeError:
            # plain int index: nothing to convert
            pass
        return self.data.iloc[idx]

This way we’ll be able to access both ants_dataset and train_dataset using indices.
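
A quick usage sketch (assuming the ants dataset from above):

img, rot = ants_dataset[0]    # plain int index still works
img, rot = train_dataset[0]   # tensor index coming through the Subset works too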

As far as I remember the reason for this behavior was a potential bug in pandas.
There was also a PR which wasn’t merged due to potential performance issues.

Your workaround might work, so thanks for posting it here. :wink:

import torch
from torchvision import transforms
from torchvision.datasets import MNIST

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
dataset = MNIST(root='./data', train=True, transform=transform, download=True)
train_set, val_set = torch.utils.data.random_split(dataset, [50000, 10000])

random_split does not seem to split the dataset: the sizes of train_set and val_set returned are both 60000, which is equal to the initial dataset size.

A similar error is also reported on Stack Overflow.

Please look into the issue.
Thanks.


Your code snippet works for me (PyTorch 1.4.0.dev20191109, torchvision 0.5.0a0+28003e9) and yields:

print(len(train_set))
> 50000
print(len(val_set))
> 10000

Which versions are you using?


torch==1.3.1
torchvision==0.4.2

Hi @ptrblck, the issue is resolved. It is working for me as well. I was finding the length using:

print(len(train_set.dataset)) 

which gives the length of the parent dataset. I wanted to convert the Subset object to a Dataset object.
Is there a way to convert a Subset to a Dataset object?


Subset wraps the Dataset in order to apply the specified indices and return only the selected samples.
What is your use case for reverting it?
You can pass the Subset directly to a DataLoader, if that's the concern.

I want to fetch the entire dataset as x and y. The Dataset class has .data and .targets attributes, which serve this purpose.
A DataLoader with the batch size set to len(dataset) will be used.
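
Something like this one-big-batch approach is what I have in mind (just a sketch):

loader = torch.utils.data.DataLoader(train_set, batch_size=len(train_set), shuffle=False)
x, y = next(iter(loader))  # x and y now hold every sample of the split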

The DataLoader will use the length of the Subset, not the underlying Dataset.
If you want to fetch the underlying data to process it, or for some other use case, I would recommend doing so before splitting.
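
That said, a Subset keeps the selected indices in its .indices attribute, so you could still pull out the raw tensors for one split afterwards. A minimal sketch, assuming an MNIST-style dataset that exposes .data and .targets:

idx = train_set.indices              # indices assigned to this split by random_split
x = train_set.dataset.data[idx]      # raw images belonging to the split
y = train_set.dataset.targets[idx]   # raw labels belonging to the split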

Could you explain your use case a bit more, so that I understand why you need to access the internal data after splitting?

@ptrblck
My use case is to first divide the dataset into two different subsets. Each subset should then have a __getitem__ such that, when loading a batch, it returns pairs of samples belonging to the same class, i.e. a batch size of 4 would mean a total of 8 samples, paired by class.

Example: from the MNIST dataset, a batch would mean (1, 1), (2, 2), (7, 7) and (9, 9).

Your post on torch.utils.data.dataset.random_split resolves the issue of dividing the dataset into two subsets and using __getitem__ on the individual subsets. But can you help with a way of using the index in __getitem__ to return pairs from the same class?
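
Something along these lines is roughly what I have in mind; just a rough, untested sketch (SameClassPairs is a name I made up, and it assumes the wrapped dataset exposes .targets, as MNIST does):

import random
from collections import defaultdict
from torch.utils.data import Dataset

class SameClassPairs(Dataset):
    """Returns (sample_a, sample_b, label) where both samples share the same class."""
    def __init__(self, subset):
        self.subset = subset
        # labels of the samples that ended up in this split (assumes .targets exists)
        self.labels = [int(subset.dataset.targets[i]) for i in subset.indices]
        # group positions inside the subset by class, so a partner can be drawn
        self.by_class = defaultdict(list)
        for pos, label in enumerate(self.labels):
            self.by_class[label].append(pos)

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, idx):
        img_a, label = self.subset[idx]
        partner = random.choice(self.by_class[self.labels[idx]])  # same-class partner
        img_b, _ = self.subset[partner]
        return img_a, img_b, label

A DataLoader with batch_size=4 on top of this would then yield 4 pairs, i.e. 8 samples, per batch.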

Thanks.

@ptrblck I had the same issue as Ashima. It seems I was checking len(dataloader.dataset). However, the dimensions still don't look right. I am trying to split 200k rows into 160k of train and 40k of val.

I am not sure why I see 40k and 10k.


Hi, I would like to split my dataset into a train and a validation part, where both subsets' indices should be in the range 0 to len(train_data) and 0 to len(validation_data), respectively. Is there a method I can use for this?