How does ConcatDataset work?

zimmer550 · November 5, 2019, 3:21pm

Hello. This is my CustomDataSet class:

import os

import natsort
from PIL import Image
from torch.utils.data import Dataset

class CustomDataSet(Dataset):
    def __init__(self, main_dir, transform):
        self.main_dir = main_dir
        self.transform = transform
        all_imgs = os.listdir(main_dir)
        # Filter into a new list; removing items from a list while iterating over it skips entries.
        self.total_imgs = [
            file_name for file_name in natsort.natsorted(all_imgs)
            if '.txt' not in file_name and file_name != 'semantic'
        ]

    def __len__(self):
        return len(self.total_imgs)

    def __getitem__(self, idx):
        img_loc = os.path.join(self.main_dir, self.total_imgs[idx])
        image = Image.open(img_loc).convert("RGB")
        tensor_image = self.transform(image)
        return tensor_image

Here is how I create a list of datasets:

    all_datasets = []
    while folder_counter < num_train_folders:
        #some code to get path_to_imgs which is the location of the image folder
        train_dataset = CustomDataSet(path_to_imgs, transform)
        all_datasets.append(train_dataset)
        folder_counter += 1

Then I concat my datasets and create the dataloader and do the training:

final_dataset = torch.utils.data.ConcatDataset(all_datasets)
train_loader = data.DataLoader(final_dataset,
                               batch_size=batch_size,
                               shuffle=False,
                               num_workers=0,
                               pin_memory=True,
                               drop_last=True)

So, is the order of my data preserved? During training, will I go through each folder in the exact order in which the concatenation was done and grab all the images sequentially? For example:

I grab 150 images from folder 1, 100 images from folder 2 and 70 images from folder 3. I concatenate the three datasets. During training I do:

for idx, input_seq in enumerate(train_loader):
    #code to train

So, will the dataloader go through folder 1 and grab all the images there sequentially, then go to folder 2 and do the same, and finally go to folder 3 and do the same as well? I tried reading the code for ConcatDataset, but I can’t tell whether the order of my data will be preserved or not.

ptrblck · November 6, 2019, 4:53am

Yes, the order should be preserved as shown in this simple example using TensorDatasets:

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

datasets = []
for i in range(3):
    datasets.append(TensorDataset(torch.arange(i*10, (i+1)*10)))

dataset = ConcatDataset(datasets)
loader = DataLoader(
    dataset,
    shuffle=False,
    num_workers=0,
    batch_size=2
)

for data in loader:
    print(data)
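
To see why the order is preserved, here is a rough sketch of how a global index maps onto the individual datasets. It assumes ConcatDataset stores running totals in its cumulative_sizes attribute and looks indices up with bisect, which is how the versions I have checked behave. The locate helper is hypothetical and only there for illustration; the sizes mirror the 150/100/70 folder example from the question.

```python
import bisect

import torch
from torch.utils.data import ConcatDataset, TensorDataset

# Three datasets of sizes 150, 100 and 70; each sample stores its dataset id.
sizes = [150, 100, 70]
datasets = [TensorDataset(torch.full((n,), i)) for i, n in enumerate(sizes)]
dataset = ConcatDataset(datasets)

# Running totals of the dataset lengths, in concatenation order.
print(dataset.cumulative_sizes)


def locate(concat, idx):
    """Hypothetical helper: map a global index to (dataset index, local index)."""
    dataset_idx = bisect.bisect_right(concat.cumulative_sizes, idx)
    start = 0 if dataset_idx == 0 else concat.cumulative_sizes[dataset_idx - 1]
    return dataset_idx, idx - start


# Indices on both sides of each dataset boundary.
for idx in (0, 149, 150, 249, 250, 319):
    print(idx, locate(dataset, idx), dataset[idx])
```

With shuffle=False the DataLoader asks for indices 0, 1, 2, ... in order, so the first 150 samples come from the first dataset, the next 100 from the second, and the last 70 from the third.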

mohammed_guermal · July 7, 2020, 9:56am

What if I want to grab data from different folders, but afterwards I want it concatenated and shuffled rather than kept in the same order?
My data is in different folders, so I load each of them separately, but afterwards I want everything shuffled together.

ptrblck · July 8, 2020, 2:56am

You could use shuffle=True when creating the DataLoader, which will shuffle the passed ConcatDataset.
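
A minimal sketch of that, reusing the TensorDataset setup from my earlier example (the batch size of 5 is an arbitrary choice):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Three datasets holding 0-9, 10-19 and 20-29.
datasets = [TensorDataset(torch.arange(i * 10, (i + 1) * 10)) for i in range(3)]
dataset = ConcatDataset(datasets)

# shuffle=True draws a new random permutation over all 30 samples each epoch.
loader = DataLoader(dataset, shuffle=True, batch_size=5)

for epoch in range(2):
    seen = []
    for (batch,) in loader:
        seen.extend(batch.tolist())
    # Each sample appears once per epoch, mixed across the three datasets.
    print(epoch, seen)
```

The shuffling happens over the indices of the concatenated dataset, so a single batch can contain samples from any of the original datasets.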

nisharaichur · October 22, 2020, 9:53am

list_1 = [1,2,3,4,5]
list_2 = [6,7,8,9,10]
list_3 = [22,23,24,25,26,27]
dataset_list = [list_1, list_2, list_3]

dataset_loader = DataLoader(dataset_list, shuffle=True, batch_size=3)
for i in range(30):
    for x in dataset_loader:
        print(x)

My question is: why does each batch (size 3) contain the same data, only shuffled within itself?
My use case is that I need to shuffle the entire dataset after concatenation, so that in each epoch I get different batches drawn from across all the datasets [list_1, list_2, list_3].

ptrblck · October 22, 2020, 10:44am

Passing nested lists to the DataLoader might have this kind of side effect, so I would recommend creating tensors, passing them to a TensorDataset, and passing that dataset to the DataLoader, which should then properly index and shuffle the data.
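
Note that len(dataset_list) is 3 in your snippet, so the DataLoader sees only three samples (the three inner lists), and a batch of size 3 always contains all of them. A rough sketch of the TensorDataset approach, assuming you want every number to be its own sample:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

list_1 = [1, 2, 3, 4, 5]
list_2 = [6, 7, 8, 9, 10]
list_3 = [22, 23, 24, 25, 26, 27]

# Flatten the lists into one tensor so that each number is a separate sample.
data = torch.tensor(list_1 + list_2 + list_3)
dataset = TensorDataset(data)
print(len(dataset))  # 16 samples instead of 3 nested lists

dataset_loader = DataLoader(dataset, shuffle=True, batch_size=3)
for epoch in range(2):
    for (x,) in dataset_loader:
        # Batches now mix values from all three lists and change every epoch.
        print(epoch, x)
```

If the lists need to stay separate datasets, wrapping each one in its own TensorDataset and combining them with ConcatDataset should give the same shuffling behavior.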

nisharaichur · October 22, 2020, 11:06am

Sure, thanks.

adwaykanhere · May 12, 2022, 8:01pm

Hi,
I’m trying to use ConcatDataset to concatenate the training and test sets of the CIFAR10 dataset.
For my application, I also need to combine the dataset.targets values of each set, but from my understanding I cannot get combined targets from ConcatDataset.
Please suggest how to concatenate the datasets so that I can get the targets as well.

ptrblck · May 17, 2022, 4:30am

Double post from here.
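
For readers landing in this thread, one possible approach is to build the combined targets yourself from the wrapped datasets. This is only a sketch: ToyDataset is a hypothetical stand-in for CIFAR10 so the snippet runs without a download, and it assumes each dataset exposes its labels as a list in .targets. ConcatDataset keeps the original datasets in its .datasets attribute, in concatenation order.

```python
import torch
from torch.utils.data import ConcatDataset, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for a dataset exposing `.targets`."""

    def __init__(self, data, targets):
        self.data = data
        self.targets = targets  # plain Python list, one label per sample

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]


train_set = ToyDataset(torch.randn(6, 3), [0, 1, 2, 0, 1, 2])
test_set = ToyDataset(torch.randn(4, 3), [2, 2, 1, 0])
combined = ConcatDataset([train_set, test_set])

# Chain the per-dataset targets in the same order ConcatDataset indexes them.
combined_targets = [t for ds in combined.datasets for t in ds.targets]
print(combined_targets)
```

Because the targets are chained in the same order as the samples, combined_targets[i] lines up with the label returned by combined[i].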

Brando_Miranda · September 26, 2022, 7:07pm

Hmm, but with real datasets each one will have its own data transforms, e.g. its own mean and std for normalizing the images. How do you deal with that? I’d assume this is handled at the level of each dataset, by giving it the right transforms when instantiating the PyTorch dataset object.

I think this answers my question: Concat image datasets with different size and number of channels. It confirms that my suggestion was right.
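
A small sketch of what I mean, with made-up statistics. NormalizedDataset is a hypothetical class that stands in for any dataset taking its own transform at construction time:

```python
import torch
from torch.utils.data import ConcatDataset, Dataset


class NormalizedDataset(Dataset):
    """Hypothetical dataset that normalizes samples with its own mean and std."""

    def __init__(self, images, mean, std):
        self.images = images
        self.mean = mean
        self.std = std

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # The normalization is applied per dataset, before concatenation.
        return (self.images[idx] - self.mean) / self.std


# Two sources with different (made-up) statistics.
set_a = NormalizedDataset(torch.full((4, 3, 2, 2), 0.5), mean=0.5, std=0.25)
set_b = NormalizedDataset(torch.full((3, 3, 2, 2), 0.2), mean=0.1, std=0.1)
combined = ConcatDataset([set_a, set_b])

# Index 0 comes from set_a and index 4 from set_b, each with its own stats.
print(combined[0].mean().item(), combined[4].mean().item())
```

ConcatDataset only forwards the index to the right dataset, so whatever transform each dataset applies in its own __getitem__ is kept.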

Brando_Miranda · September 27, 2022, 1:26am

How did you solve your target labels issue, @adwaykanhere?

This is key: Does Concatenate Datasets preserve class labels and indices - #3 by ptrblck

The easiest solution for what I want is this one, which uses learn2learn’s union of datasets: Does Concatenate Datasets preserve class labels and indices - #12 by Brando_Miranda