Weird sampling issue - SubsetRandomSampler is not shuffling my dataset

Hi everyone! I’m working on a classification problem where I have a folder with images and the label is the folder name. I am using the following script to generate loaders, but when I iterate through either of the two loaders, I do not get random samples. Rather, it’s all samples from the first class (observed by printing the labels). The PyTorch version is 0.4.1.post2 (Ubuntu).

import numpy as np
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

def get_dataloaders(data_path, val_split, batch_size, shuffle=True):
    t = transforms.Compose([transforms.Resize((150, 150)),
                            transforms.ToTensor()])
    dataset = torchvision.datasets.ImageFolder(root=data_path, transform=t)

    dataset_size = len(dataset)
    indices = list(range(dataset_size))
    split = int(np.floor(val_split * dataset_size))
    if shuffle:
        np.random.shuffle(indices)
    train_indices, val_indices = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_indices)
    val_sampler = SubsetRandomSampler(val_indices)
    train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
    val_loader = DataLoader(dataset, batch_size=batch_size, sampler=val_sampler)
    return train_loader, val_loader

train_loader, val_loader = get_dataloaders(data_path, val_split, batch_size,shuffle=True)
images, labels = next(iter(train_loader))

Am I doing this incorrectly? Strangely, on my Windows machine with PyTorch version 0.4.1 all is fine, and I see random samples when I print the labels.

Btw, the folder structure looks like this:

|_ class1
       |_ patient1
                 |_ img1.jpg  etc
|_ class2
       |_ patient1
                  |_ img1.jpg                  
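For reference, ImageFolder treats the immediate subdirectories of root as the class labels and then scans each class folder recursively for images. A quick stdlib-only sketch (the dummy tree and file names here are hypothetical placeholders, not real image data) shows which folder names would become classes for the structure above:

```python
import os
import tempfile

# Build a hypothetical dummy tree mirroring the structure above:
# root/class1/patient1/img1.jpg, root/class2/patient1/img1.jpg
root = tempfile.mkdtemp()
for cls in ("class1", "class2"):
    os.makedirs(os.path.join(root, cls, "patient1"))
    with open(os.path.join(root, cls, "patient1", "img1.jpg"), "wb") as f:
        f.write(b"")  # placeholder file, not a real image

# ImageFolder's class discovery: the immediate subdirectories of root, sorted
classes = sorted(d for d in os.listdir(root)
                 if os.path.isdir(os.path.join(root, d)))
print(classes)  # ['class1', 'class2']
```

With this layout, the patient subfolders do not affect the labels; only the top-level folders under root do.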

Just to make sure that this is not due to an old bug, can you upgrade to PyTorch 1.0.1 to see if this problem persists?

I just updated to torch 1.0.1.post2 and this is still occurring. Could you maybe try to recreate this error on your end?

It’s kind of impossible to recreate this because the labels you return depend on the DataLoader – not sure how you implemented this.

The DataLoader I use is the standard one from PyTorch (from torch.utils.data import DataLoader).
You can give this a try with some dummy data organised in the folder structure that I mentioned. The output I see is the same class in each batch of images as I iterate with next(iter(train_loader)); if you see the same class label with my dataloader script on some other dataset, then it is the same issue.

I see. I thought you were using a custom one.

Seems to work for me though:


Thanks so much for your efforts!
This is really strange - my setup is Ubuntu 16.04, Python 3.5, torch 1.0.1.post2. Could you share your setup details? I’ve tried this on 3 other different machines with the same results x_x

Also using Ubuntu, but 18.04 and Python 3.7. I am also using torch 1.0.1.post2

Okay - I’m in the process of creating some new environments, so I will update you in a few. Could you lastly share the full call to the get_dataloaders function? It seems you are not using the val_split argument, which is the whole point of me using SubsetRandomSampler.

I tried it on both my laptop (screenshot from above) and on my Ubuntu machine (which has the cuda version of PyTorch). It seems to work in both cases. Regarding the val_loader, just enabled it:

Oh, maybe the issue is with your dataset naming. Can you try to use class_1 instead of class1, etc.?

my class names are actually AD, MCI and Normal

yeah, even if I change my class names to A, B, … it doesn’t seem to make a difference. Weird. If you cannot resolve this issue, how about going about this in a more classic way by defining the class labels you want in a CSV file associated with the file names? E.g., something like this:
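The example referenced here didn’t survive in the thread, so here is a minimal sketch of the idea, assuming a hypothetical labels.csv with filename,label rows. A map-style dataset only needs __len__ and __getitem__, so DataLoader can consume this class directly; the default loader below is a stand-in for something like PIL.Image.open:

```python
import csv

class CSVImageDataset:
    """Map-style dataset whose labels come from a CSV of filename,label rows.

    Hypothetical sketch: only __len__ and __getitem__ are required for a
    map-style dataset, so torch.utils.data.DataLoader can wrap it as-is.
    """
    def __init__(self, csv_path, loader=None, transform=None):
        with open(csv_path, newline="") as f:
            self.samples = [(fname, label) for fname, label in csv.reader(f)]
        # Map string labels to integer class indices, sorted for determinism
        self.classes = sorted({label for _, label in self.samples})
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
        self.loader = loader or (lambda path: path)  # e.g. PIL.Image.open
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        fname, label = self.samples[idx]
        img = self.loader(fname)
        if self.transform is not None:
            img = self.transform(img)
        return img, self.class_to_idx[label]
```

With a real image loader and the same transforms, this would drop into the DataLoader/SubsetRandomSampler code above unchanged, since the labels no longer depend on the folder layout at all.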

hmm, this is super annoying! I think I will have to resort to making my own Dataset class like your example, but since ImageFolder was built for this, I was hoping to get away with the inbuilt functionality.

My folder structure had an extra folder above the classes, and so the entire dataset had only one label.
I’ll be killing myself now - thanks a bunch @rasbt

Haha, glad to hear that this was solved :) That means it’s an easy fix and not a bug, which is great!

np.random.shuffle is applied to the indices before the split, so the training and validation sets are built from randomly shuffled samples.
The SubsetRandomSampler passed to each DataLoader then makes sure that the samples within each batch are drawn in random order.
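The splitting logic can be seen in isolation with plain NumPy; the seed here is just an illustrative assumption for reproducibility:

```python
import numpy as np

dataset_size = 10
val_split = 0.2
indices = list(range(dataset_size))
split = int(np.floor(val_split * dataset_size))  # 2 validation samples

np.random.seed(0)           # illustrative seed, for reproducibility
np.random.shuffle(indices)  # shuffle BEFORE splitting, as described above
train_indices, val_indices = indices[split:], indices[:split]

# The two index sets are disjoint and together cover the whole dataset
assert set(train_indices) | set(val_indices) == set(range(dataset_size))
assert len(val_indices) == split
```

Without the shuffle, indices stay in class order (ImageFolder sorts samples by class), which is exactly why the validation slice would otherwise come entirely from the first class.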


Oh. Got it, thanks! :)