Dataset with dirty and clean data separate


I’m training an autoencoder network to remove “dirt” from images. For this I have two folders: dirty and clean. Currently I load the data like this:

dirty_data = torchvision.datasets.ImageFolder(root='data/dirty', transform=transform)
clean_data = torchvision.datasets.ImageFolder(root='data/clean', transform=transform)

train_dirty_loader = torch.utils.data.DataLoader(dirty_data, batch_size=BATCH_SIZE, num_workers=0, shuffle=False)
train_clean_loader = torch.utils.data.DataLoader(clean_data, batch_size=BATCH_SIZE, num_workers=0, shuffle=False)

The dirty and clean folders contain images with the same names (the same images, with and without “dirt”). Having two separate loaders does work, but it comes with a number of issues:

  • It’s ugly
  • I can’t use shuffle=True, since this would put train_dirty_loader and train_clean_loader out of “sync” with each other (training currently depends on the clean and dirty images arriving in the same order).
  • I can’t split the dataset using the random_split function, for the same reason as above.

How should I solve this?

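You could create your own Dataset class where each item contains both the clean and the dirty image. A single DataLoader over that dataset keeps the pairs together, so shuffle=True and random_split both work.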
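
A minimal sketch of such a paired dataset might look like the following. The class name PairedImageDataset, the IMG_EXTS set, and the matching-by-relative-path rule are assumptions made for illustration, not something from your code; it also assumes the images can be opened with PIL and that both folders share the same subfolder layout.

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

# Hypothetical set of extensions treated as images; adjust to your data.
IMG_EXTS = {".png", ".jpg", ".jpeg", ".bmp"}


class PairedImageDataset(Dataset):
    """Returns (dirty, clean) image pairs matched by relative file path."""

    def __init__(self, dirty_root, clean_root, transform=None):
        self.dirty_root = Path(dirty_root)
        self.clean_root = Path(clean_root)
        self.transform = transform
        # Relative paths of all image files under dirty_root, sorted so the
        # index-to-file mapping is stable between runs.
        rel_paths = sorted(
            p.relative_to(self.dirty_root)
            for p in self.dirty_root.rglob("*")
            if p.is_file() and p.suffix.lower() in IMG_EXTS
        )
        # Keep only files that have a same-named counterpart in clean_root.
        self.rel_paths = [p for p in rel_paths if (self.clean_root / p).is_file()]

    def __len__(self):
        return len(self.rel_paths)

    def __getitem__(self, index):
        rel = self.rel_paths[index]
        dirty = Image.open(self.dirty_root / rel).convert("RGB")
        clean = Image.open(self.clean_root / rel).convert("RGB")
        if self.transform is not None:
            # Assumes a deterministic transform (e.g. Resize + ToTensor).
            # Random augmentations would be sampled separately per image
            # here and would break the pixel alignment of the pair.
            dirty = self.transform(dirty)
            clean = self.transform(clean)
        return dirty, clean
```

With this, something like PairedImageDataset('data/dirty', 'data/clean', transform=transform) can be passed to torch.utils.data.random_split and then to a single DataLoader with shuffle=True. Each batch then unpacks as dirty, clean, with the two tensors in matching order.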