Multiple Datasets

I have created a dataset class and dataloader for one of my datasets:

import numpy as np
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class Visual_DataSet(Dataset):
    def __init__(self, csv_path, root_dir):
        self.to_tensor = transforms.ToTensor()
        self.data_info = pd.read_csv(csv_path)
        self.root_dir = root_dir
        # first column: image file names, third column: labels
        self.image_arr = np.asarray(self.data_info.iloc[:, 0])
        self.label_arr = np.asarray(self.data_info.iloc[:, 2])
        self.data_len = len(self.data_info.index)

    def __getitem__(self, index):
        single_image_name = self.image_arr[index]
        img_as_img = Image.open(self.root_dir + single_image_name)
        img_as_tensor = self.to_tensor(img_as_img)
        single_image_label = self.label_arr[index]
        return (img_as_tensor, single_image_label)

    def __len__(self):
        return self.data_len

if __name__ == "__main__":
    vs_trainset = \
        Visual_DataSet(csv_path='images/visible/train_batch/trainset_batch.csv',
                       root_dir='images/visible/train_batch/')
    vs_testset = \
        Visual_DataSet(csv_path='images/visible/test_batch/trainset_batch.csv',
                       root_dir='images/visible/test_batch/')
    trainloader = torch.utils.data.DataLoader(dataset=vs_trainset,
                                              batch_size=4,
                                              shuffle=True)
    testloader = torch.utils.data.DataLoader(dataset=vs_testset,
                                             batch_size=4,
                                             shuffle=False)

I want to add another dataset (I am trying to fuse the datasets), so I have created a second class for my other dataset in the same manner as the first one. Is there any way to put the two datasets together in a single DataLoader? Thanks in advance.

You could provide all Datasets as a sequence to ConcatDataset to create a single dataset, which you can then pass to the DataLoader.

Thanks for the quick reply. I tried that previously:

import bisect

class ConcatDataset(Dataset):

    def __init__(self, datasets):
        super(ConcatDataset, self).__init__()
        assert len(datasets) > 0, 'datasets should not be an empty iterable'
        self.datasets = list(datasets)
        self.cumulative_sizes = self.cumsum(self.datasets)

    @staticmethod
    def cumsum(sequence):
        # running total of the dataset lengths, e.g. [10, 25, 40]
        r, s = [], 0
        for e in sequence:
            l = len(e)
            r.append(s + l)
            s += l
        return r

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # find which dataset idx falls into, then offset into that dataset
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if dataset_idx == 0:
            sample_idx = idx
        else:
            sample_idx = idx - self.cumulative_sizes[dataset_idx - 1]
        return self.datasets[dataset_idx][sample_idx]

# this call raises the error below
fused_data = torch.utils.data.ConcatDataset(vs_trainset, th_trainset)

But I kept getting the error:

TypeError: __init__() takes 2 positional arguments but 3 were given

Can you see any obvious error that I may have overlooked?

Pass the Datasets as a list or tuple and it should work.

It still gives me the same error:

fused_trainset = torch.utils.data.ConcatDataset([vs_trainset], [th_trainset])
TypeError: __init__() takes 2 positional arguments but 3 were given

Sorry for not being clear enough. You should pass one list containing all Datasets:

fused_trainset = torch.utils.data.ConcatDataset([vs_trainset, th_trainset])

Thank you so very much. It works now.

Hi ptrblck~

I met a similar but more complicated scenario: I have two datasets A and B, where batch_size_A is 30 and batch_size_B is 60. ConcatDataset can only handle the case where A and B use the same batch size. So is there an official way to sample from multiple imbalanced datasets?

If you want to use different batch sizes for the datasets, you could either use different DataLoaders or probably create a custom sampler, which would use the predefined sample indices given your batch sizes.
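The two-DataLoader approach could look like this minimal sketch, where toy `TensorDataset`s named `dataset_a` and `dataset_b` (hypothetical names and sizes, just for illustration) stand in for A and B; each loader keeps its own batch size, and `zip` draws one batch from each loader per step:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for datasets A and B (names and sizes are made up).
dataset_a = TensorDataset(torch.randn(300, 3), torch.zeros(300, dtype=torch.long))
dataset_b = TensorDataset(torch.randn(600, 3), torch.ones(600, dtype=torch.long))

# One DataLoader per dataset, each with its own batch size.
loader_a = DataLoader(dataset_a, batch_size=30, shuffle=True)
loader_b = DataLoader(dataset_b, batch_size=60, shuffle=True)

# zip yields one batch from each loader per step, so every training step
# sees 30 samples from A and 60 samples from B.
for (xa, ya), (xb, yb) in zip(loader_a, loader_b):
    combined = torch.cat([xa, xb])  # fused 90-sample batch
    print(combined.shape)           # torch.Size([90, 3])
    break
```

Note that `zip` stops at the shorter loader, so with a 30/60 split you would usually size the datasets (or use a sampler) so both loaders yield the same number of batches per epoch.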

I have built a custom Dataset class for loading COCO instance segmentation datasets. ConcatDataset is not available when inheriting from the abstract class Dataset, but I need to merge multiple datasets. Any help would be appreciated.

My custom dataset is based on this example.