Multiple Datasets

I have created a dataset class and dataloader for one of my datasets:

import numpy as np
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class Visual_DataSet(Dataset):
    def __init__(self, csv_path, root_dir):
        self.to_tensor = transforms.ToTensor()
        self.data_info = pd.read_csv(csv_path)
        self.root_dir = root_dir
        # first column: image file names, third column: labels
        self.image_arr = np.asarray(self.data_info.iloc[:, 0])
        self.label_arr = np.asarray(self.data_info.iloc[:, 2])
        self.data_len = len(self.data_info.index)

    def __getitem__(self, index):
        single_image_name = self.image_arr[index]
        img_as_img = Image.open(self.root_dir + single_image_name)
        img_as_tensor = self.to_tensor(img_as_img)
        single_image_label = self.label_arr[index]
        return (img_as_tensor, single_image_label)

    def __len__(self):
        return self.data_len

if __name__ == "__main__":
    vs_trainset = \
        Visual_DataSet(csv_path='images/visible/train_batch/trainset_batch.csv',
                       root_dir='images/visible/train_batch/')
    vs_testset = \
        Visual_DataSet(csv_path='images/visible/test_batch/trainset_batch.csv',
                       root_dir='images/visible/test_batch/')
    trainloader = torch.utils.data.DataLoader(dataset=vs_trainset,
                                              batch_size=4,
                                              shuffle=True)
    testloader = torch.utils.data.DataLoader(dataset=vs_testset,
                                             batch_size=4,
                                             shuffle=False)

I want to add another dataset (I am trying to fuse the datasets), so I have created a second class for my other dataset in the same manner as the first one. Is there any way to put the two datasets together in a single DataLoader? Thanks in advance.

You could provide all Datasets as a sequence to ConcatDataset to create a single dataset, which you can then pass to the DataLoader.

Thanks for the quick reply. I tried that previously:

import bisect

class ConcatDataset(Dataset):

    def __init__(self, datasets):
        super(ConcatDataset, self).__init__()
        assert len(datasets) > 0, 'datasets should not be an empty iterable'
        self.datasets = list(datasets)
        self.cumulative_sizes = self.cumsum(self.datasets)

    @staticmethod
    def cumsum(sequence):
        # running total of the dataset lengths, e.g. [10, 25, 40]
        r, s = [], 0
        for e in sequence:
            l = len(e)
            r.append(s + l)
            s += l
        return r

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # find which dataset idx falls into, then offset into that dataset
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if dataset_idx == 0:
            sample_idx = idx
        else:
            sample_idx = idx - self.cumulative_sizes[dataset_idx - 1]
        return self.datasets[dataset_idx][sample_idx]

# this call raises the error below
fused_data = torch.utils.data.ConcatDataset(vs_trainset, th_trainset)

But I kept getting the error:

TypeError: __init__() takes 2 positional arguments but 3 were given

Can you see any obvious error that I may have overlooked?

Pass the Datasets as a list or tuple and it should work.

It still gives me the same error:

fused_trainset = torch.utils.data.ConcatDataset([vs_trainset], [th_trainset])
TypeError: __init__() takes 2 positional arguments but 3 were given

Sorry for not being clear enough. You should pass one list containing all Datasets:

fused_trainset = torch.utils.data.ConcatDataset([vs_trainset, th_trainset])

Thank you so very much. It works now.

Hi ptrblck~

I met a similar but more complicated scenario: I have two datasets A and B, where batch_size_A is 30 and batch_size_B is 60. ConcatDataset can only handle the case where A and B use the same batch size. So is there an official way to sample from multiple imbalanced datasets?

If you want to use different batch sizes for the datasets, you could either use different DataLoaders or probably create a custom sampler, which would use the predefined sample indices given your batch sizes.
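The two-DataLoader approach could look like this minimal sketch, where toy `TensorDataset`s named `dataset_a` and `dataset_b` (hypothetical names and sizes, just for illustration) stand in for A and B; each loader keeps its own batch size, and `zip` draws one batch from each loader per step:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for datasets A and B (names and sizes are made up).
dataset_a = TensorDataset(torch.randn(300, 3), torch.zeros(300, dtype=torch.long))
dataset_b = TensorDataset(torch.randn(600, 3), torch.ones(600, dtype=torch.long))

# One DataLoader per dataset, each with its own batch size.
loader_a = DataLoader(dataset_a, batch_size=30, shuffle=True)
loader_b = DataLoader(dataset_b, batch_size=60, shuffle=True)

# zip yields one batch from each loader per step, so every training step
# sees 30 samples from A and 60 samples from B.
for (xa, ya), (xb, yb) in zip(loader_a, loader_b):
    combined = torch.cat([xa, xb])  # fused 90-sample batch
    print(combined.shape)           # torch.Size([90, 3])
    break
```

Note that `zip` stops at the shorter loader, so with a 30/60 split you would usually size the datasets (or use a sampler) so both loaders yield the same number of batches per epoch.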

I have built a custom Dataset class for loading COCO instance segmentation datasets. ConcatDataset is not available when inheriting from the abstract class Dataset, but I need to merge multiple datasets. Any help would be appreciated.

My custom dataset is based on this example.