Will ConcatDataset make sure that images from different datasets don't mix when using DataLoader?

So, let’s say I grab images from two folders, concatenate the two datasets, and pass the result to a DataLoader. Dataset1 has 100 images and Dataset2 has 100 images as well. My batch size is 7. Ideally, I want 14 full batches drawn from Dataset1 and 14 full batches drawn from Dataset2, so 28 batches in total.

That is, the first 14 of the 28 batches would contain images from Dataset1 only and the next 14 would contain images from Dataset2 only. Will ConcatDataset make sure that is the case?

It does not work like that.
Each dataset has a length.
ConcatDataset will produce a single dataset of length len1 + len2.
The shuffling is done by the DataLoader, not by ConcatDataset: with shuffle=True it draws random indices over the combined range, and if an index belongs to dataset1 it calls dataset1.__getitem__, and so on.

You can expect that, statistically speaking, each batch is roughly half and half, since the indices come from a random permutation and len1 = len2, but nothing guarantees it.
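
To see this concretely, here is a minimal sketch (TensorDataset tensors stand in for the two image folders; illustrative only) showing that a shuffled DataLoader over a ConcatDataset produces mixed batches:

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

dataset1 = TensorDataset(torch.zeros(100, 3, 32, 32))  # stand-in for folder 1
dataset2 = TensorDataset(torch.ones(100, 3, 32, 32))   # stand-in for folder 2

combined = ConcatDataset([dataset1, dataset2])          # len(combined) == 200
loader = DataLoader(combined, batch_size=7, shuffle=True)

batch, = next(iter(loader))
# Every image is all zeros (dataset1) or all ones (dataset2); a typical
# batch mixes both, e.g. tensor([0., 1., 1., 0., 1., 0., 0.])
print(batch.mean(dim=(1, 2, 3)))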

Thank you, that really explains the bad behavior of my neural net. So, the obvious follow-up question is: is there a way to get my desired behavior? I was thinking of concatenating DataLoaders instead, since the DataLoader for each Dataset would contain images from that particular dataset only, batch-wise. After concatenating, I wouldn’t have to worry about a mix-up. Is anything like that possible in PyTorch?
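
What I have in mind is roughly this sketch, just chaining two ordinary DataLoaders with plain-Python itertools.chain (not a PyTorch feature; dataset1 and dataset2 are the two datasets from above):

from itertools import chain

from torch.utils.data import DataLoader

loader1 = DataLoader(dataset1, batch_size=7, shuffle=True)
loader2 = DataLoader(dataset2, batch_size=7, shuffle=True)

# All batches from Dataset1 first, then all batches from Dataset2
for batch in chain(loader1, loader2):
    ...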

Not that I know of as a built-in, but you can create a new dataset class which contains both datasets as objects.
Then return the two samples stacked along a new dimension and reshape them inside the training loop, something like:

class BigDataset(Dataset):
    def __init__(self, dataset1, dataset2):
        # Both datasets must be index-aligned
        assert len(dataset1) == len(dataset2)
        self.dataset1 = dataset1
        self.dataset2 = dataset2

    def __len__(self):
        return len(self.dataset1)

    def __getitem__(self, idx):
        # One sample = the pair of images stacked along a new leading dim
        return torch.stack([self.dataset1[idx], self.dataset2[idx]])


for batch in dataloader:
    # (batch, 2, C, H, W) -> (2 * batch, C, H, W)
    batch = batch.view(-1, *batch.shape[2:])

Wait, why should the length of both datasets be the same? Also, I hope you don’t mind, but this code is very hard to follow. Could you make it a bit more formal so I can get a good idea of what you are trying to do?

Well, you said dataset1 has 100 images and dataset2 has 100 images xd

import torch
from torch.utils.data import DataLoader, Dataset


class ImgDataset(Dataset):
    def __init__(self, *args):
        # Placeholder: build the list of image paths, transforms, etc.
        self.samples = do_your_stuff(*args)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Placeholder: load and return the idx-th sample,
        # e.g. a tuple (image, label) of tensors
        return load_img(self.samples[idx])


class PairDataset(Dataset):
    def __init__(self, dataset1, dataset2):
        self.dataset1 = dataset1
        self.dataset2 = dataset2

    def __len__(self):
        # Choose a criterion; here we assume len1 == len2
        return len(self.dataset1)

    def __getitem__(self, idx):
        inputs1, inputs2 = self.dataset1[idx], self.dataset2[idx]

        # Stack each pair of tensors returned by the two datasets along a
        # new leading dim: e.g. two (C, H, W) images become (2, C, H, W)
        return [torch.stack([x, y], dim=0) for x, y in zip(inputs1, inputs2)]


def train():
    dataset = PairDataset(ImgDataset(*args1), ImgDataset(*args2))
    dataloader = DataLoader(dataset, batch_size=7)

    for tensors in dataloader:
        # Convert (batch, 2, ...) into (2 * batch, ...): fold the pairs into the batch
        tensors = [x.view(-1, *x.shape[2:]) for x in tensors]
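
To make the reshape concrete, with the thread's batch size of 7 and an arbitrary 3×224×224 image size:

x = torch.randn(7, 2, 3, 224, 224)  # one batch of stacked image pairs
flat = x.view(-1, *x.shape[2:])
print(flat.shape)                   # torch.Size([14, 3, 224, 224])

Note that the flattened batch interleaves the two datasets (pair members alternate) rather than putting all Dataset1 images first.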

If dataset1 and dataset2 have different lengths, you will have to fix that somehow, for example by repeating samples; a sketch of that idea follows.
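
One way to do the repeating, as a sketch (RepeatedDataset is a hypothetical helper, not part of PyTorch; it pads a dataset up to a target length by wrapping indices around):

class RepeatedDataset(Dataset):
    def __init__(self, dataset, target_len):
        self.dataset = dataset
        self.target_len = target_len

    def __len__(self):
        return self.target_len

    def __getitem__(self, idx):
        # Indices past the real length wrap around, repeating samples
        return self.dataset[idx % len(self.dataset)]

The shorter dataset would then be wrapped before pairing, e.g. PairDataset(dataset1, RepeatedDataset(dataset2, len(dataset1))).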

If you need to shuffle, you can create a fake list of indices on top of the real ones, like:

from random import shuffle


class ImgDataset(Dataset):
    def __init__(self, *args):
        self.samples = do_your_stuff(*args)
        # Pre-shuffled permutation of the real indices
        self.indices = list(range(len(self)))
        shuffle(self.indices)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Map the incoming index to its shuffled counterpart
        real_idx = self.indices[idx]
        return load_img(self.samples[real_idx])
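
The internal indices matter because PairDataset reads both datasets at the same idx: a DataLoader with shuffle=True would shuffle the pairs but keep them aligned, while two independently pre-shuffled index lists decouple the datasets from each other. To reshuffle between epochs, a small (hypothetical) helper would do:

def reshuffle(dataset):
    # Draw a fresh permutation at the start of every epoch
    shuffle(dataset.indices)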