Weird sampling issue - SubsetRandomSampler is not shuffling my dataset

Hi everyone! I’m working on a classification problem where I have a folder with images and the label is the folder name. I am using the following script to generate loaders, but when I iterate through either of the two loaders, I do not get random samples. Rather, it’s all samples from the first class (observed by printing the labels). The PyTorch version is 0.4.1.post2 (Ubuntu).

import numpy as np
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

def get_dataloaders(data_path, val_split, batch_size, shuffle=True):
    t = transforms.Compose([transforms.Resize((150, 150)),
                            transforms.ToTensor()])
    dataset = torchvision.datasets.ImageFolder(root=data_path, transform=t)

    dataset_size = len(dataset)
    indices = list(range(dataset_size))
    split = int(np.floor(val_split * dataset_size))
    if shuffle:
        np.random.shuffle(indices)
    train_indices, val_indices = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_indices)
    val_sampler = SubsetRandomSampler(val_indices)
    train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
    val_loader = DataLoader(dataset, batch_size=batch_size, sampler=val_sampler)
    return train_loader, val_loader

train_loader, val_loader = get_dataloaders(data_path, val_split, batch_size,shuffle=True)
images, labels = next(iter(train_loader))

Am I doing this incorrectly? Strangely, on my Windows machine with PyTorch version 0.4.1 all is fine, and I see random samples when I print the labels.

Btw, the folder structure looks like this:

|_ class1
       |_ patient1
                 |_ img1.jpg  etc
|_ class2
       |_ patient1
                  |_ img1.jpg                  
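For reference, ImageFolder treats the immediate subdirectories of root as the class labels and then scans each class folder recursively for images. A quick stdlib-only sketch (the dummy tree and file names here are hypothetical placeholders, not real image data) shows which folder names would become classes for the structure above:

```python
import os
import tempfile

# Build a hypothetical dummy tree mirroring the structure above:
# root/class1/patient1/img1.jpg, root/class2/patient1/img1.jpg
root = tempfile.mkdtemp()
for cls in ("class1", "class2"):
    os.makedirs(os.path.join(root, cls, "patient1"))
    with open(os.path.join(root, cls, "patient1", "img1.jpg"), "wb") as f:
        f.write(b"")  # placeholder file, not a real image

# ImageFolder's class discovery: the immediate subdirectories of root, sorted
classes = sorted(d for d in os.listdir(root)
                 if os.path.isdir(os.path.join(root, d)))
print(classes)  # ['class1', 'class2']
```

With this layout, the patient subfolders do not affect the labels; only the top-level folders under root do.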

Just to make sure that this is not due to an old bug, can you upgrade to PyTorch 1.0.1 to see if this problem persists?

I just updated to torch 1.0.1.post2 and this is still occurring. Could you maybe try to recreate this error on your end?

It’s kind of impossible to recreate this because the labels you return depend on the DataLoader – not sure how you implemented this.

The DataLoader I use is the standard one from PyTorch (from torch.utils.data import DataLoader).
You can give this a try with some dummy data organised in the folder structure that I mentioned. The output I see is the same class in each batch of images as I iterate with next(iter(train_loader)); if you see the same class label with my dataloader script on some other dataset, then it is the same issue.

I see. I thought you were using a custom one.

Seems to work for me though:


Thanks so much for your efforts!
This is really strange - my setup is Ubuntu 16.04, Python 3.5, torch 1.0.1.post2. Could you share your setup details? I’ve tried this on 3 other different machines with the same results x_x

Also using Ubuntu, but 18.04 and Python 3.7. I am also using torch 1.0.1.post2

Okay - I’m in the process of creating some new environments, so I will update you in a few. Could you lastly share the full call to the get_dataloaders function? It seems you are not using the val_split argument, which is the whole point of me using SubsetRandomSampler.

I tried it on both my laptop (screenshot from above) and on my Ubuntu machine (which has the cuda version of PyTorch). It seems to work in both cases. Regarding the val_loader, just enabled it:

Oh, maybe the issue is with your dataset naming. Can you try to use class_1 instead of class1, etc.?

my class names are actually AD, MCI and Normal

yeah, even if I change my class names to A, B, … it doesn’t seem to make a difference. Weird. If you cannot resolve this issue, how about going about this in a more classic way by defining the class labels you want in a CSV file associated with the file names? E.g., something like this:
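The example referenced here didn’t survive in the thread, so here is a minimal sketch of the idea, assuming a hypothetical labels.csv with filename,label rows. A map-style dataset only needs __len__ and __getitem__, so DataLoader can consume this class directly; the default loader below is a stand-in for something like PIL.Image.open:

```python
import csv

class CSVImageDataset:
    """Map-style dataset whose labels come from a CSV of filename,label rows.

    Hypothetical sketch: only __len__ and __getitem__ are required for a
    map-style dataset, so torch.utils.data.DataLoader can wrap it as-is.
    """
    def __init__(self, csv_path, loader=None, transform=None):
        with open(csv_path, newline="") as f:
            self.samples = [(fname, label) for fname, label in csv.reader(f)]
        # Map string labels to integer class indices, sorted for determinism
        self.classes = sorted({label for _, label in self.samples})
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
        self.loader = loader or (lambda path: path)  # e.g. PIL.Image.open
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        fname, label = self.samples[idx]
        img = self.loader(fname)
        if self.transform is not None:
            img = self.transform(img)
        return img, self.class_to_idx[label]
```

With a real image loader and the same transforms, this would drop into the DataLoader/SubsetRandomSampler code above unchanged, since the labels no longer depend on the folder layout at all.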

hmm, this is super annoying! I think I will have to resort to making my own Dataset class like your example, but since ImageFolder was built for this, I was hoping to get away with the inbuilt functionality.

My folder structure had an extra folder above the classes, and so the entire dataset had only one label.
I’ll be killing myself now - thanks a bunch @rasbt

Haha, glad to hear that this was solved :) That means it’s an easy fix and not a bug, which is great!

np.random.shuffle is applied to the indices before the split, so the training and validation sets are built from randomly shuffled samples.
The SubsetRandomSampler passed to each DataLoader then makes sure that the samples within each batch are drawn in random order.
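The splitting logic can be seen in isolation with plain NumPy; the seed here is just an illustrative assumption for reproducibility:

```python
import numpy as np

dataset_size = 10
val_split = 0.2
indices = list(range(dataset_size))
split = int(np.floor(val_split * dataset_size))  # 2 validation samples

np.random.seed(0)           # illustrative seed, for reproducibility
np.random.shuffle(indices)  # shuffle BEFORE splitting, as described above
train_indices, val_indices = indices[split:], indices[:split]

# The two index sets are disjoint and together cover the whole dataset
assert set(train_indices) | set(val_indices) == set(range(dataset_size))
assert len(val_indices) == split
```

Without the shuffle, indices stay in class order (ImageFolder sorts samples by class), which is exactly why the validation slice would otherwise come entirely from the first class.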


Oh. Got it, thanks! :)