How does ConcatDataset work?

Nikolaos_Sintoris · May 4, 2021, 9:19am

Hello everyone.

I have 3 folders (for example A, B, C) and then every folder has 3 subfolders with the same name (for example sub1, sub2, sub3). Every subfolder contains images and basically every subfolder represents a different class.

I want to create a training data set with these 3 folders and I found ConcatDataset that might work.

So first of all I created 3 different train datasets using ImageFolder, and then I created a list with these 3 datasets (let's say this list is named all_datasets).
Then, to create my final training dataset, I used ConcatDataset.

Here is the code:

A_dataset = torchvision.datasets.ImageFolder(root = A_directory , transform = transform)
B_dataset = torchvision.datasets.ImageFolder(root = B_directory , transform = transform)
C_dataset = torchvision.datasets.ImageFolder(root = C_directory , transform = transform)

all_datasets = []
all_datasets.append(A_dataset)
all_datasets.append(B_dataset)
all_datasets.append(C_dataset)

final_training_dataset = torch.utils.data.ConcatDataset(all_datasets)

Can anyone explain to me what the format of final_training_dataset is?
I am afraid that it will confuse my classes and thus, I will not have the right labels.
Is there a problem or is everything fine?

Thank you everyone.

ariG23498 · May 4, 2021, 10:31am

Hey @Nikolaos_Sintoris
I tried with a simple implementation of the code you have provided.

The ConcatDataset has a .datasets attribute that lists all the datasets that are concatenated. This helps in the demarcation of all the labels. You would need to iterate over the datasets in the ConcatDataset, each of which has its own set of labels.

Hope this helps you.

Code to reproduce

$ mkdir A B C
$ mkdir A/sub_1 A/sub_2 A/sub_3
$ mkdir B/sub_1 B/sub_2 B/sub_3
$ mkdir C/sub_1 C/sub_2 C/sub_3

Creating random images in each sub folder

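import os
import matplotlib.pyplot as plt
import numpy as np

for folder in ["A", "B", "C"]:
    for sub_folder in os.listdir(folder):
        for i in range(10):
            # 20x20 array of random values, saved as a small PNG
            img = np.random.random((20,20))
            plt.imsave(arr=img, fname=f"{folder}/{sub_folder}/img_{i}.png")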

Using the code snippet provided by you

import torch
import torchvision

A_dataset = torchvision.datasets.ImageFolder(root = "A" , transform = torchvision.transforms.ToTensor())
B_dataset = torchvision.datasets.ImageFolder(root = "B" , transform = torchvision.transforms.ToTensor())
C_dataset = torchvision.datasets.ImageFolder(root = "C" , transform = torchvision.transforms.ToTensor())

all_datasets = []
all_datasets.append(A_dataset)
all_datasets.append(B_dataset)
all_datasets.append(C_dataset)

final_training_dataset = torch.utils.data.ConcatDataset(all_datasets)

The sanity check

for dataset in final_training_dataset.datasets:
    print(dataset.root)
    print(dataset.classes)

Result:

A
['sub_1', 'sub_2', 'sub_3']
B
['sub_1', 'sub_2', 'sub_3']
C
['sub_1', 'sub_2', 'sub_3']
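
To make the format of the concatenated dataset more concrete, here is a small sketch. It swaps ImageFolder for in-memory TensorDataset objects (an assumption, so that it runs without any image files on disk), and make_fake_dataset is a made-up helper. It suggests that ConcatDataset simply chains the datasets end to end and returns each (image, label) pair unchanged.

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

def make_fake_dataset(num_per_class=2, num_classes=3):
    # Hypothetical stand-in for ImageFolder: random "images" with labels 0..num_classes-1.
    labels = torch.arange(num_classes).repeat_interleave(num_per_class)
    images = torch.rand(len(labels), 1, 20, 20)
    return TensorDataset(images, labels)

a, b, c = make_fake_dataset(), make_fake_dataset(), make_fake_dataset()
combined = ConcatDataset([a, b, c])

total = len(combined)                    # sum of the three dataset lengths
boundaries = combined.cumulative_sizes   # running totals marking where each dataset ends

# Index 6 falls just past the end of a (6 samples), so it is the first sample of b.
image, label = combined[6]
first_b_image, first_b_label = b[0]
print(total, boundaries, int(label))
```

The label that comes back is whatever the underlying dataset assigned, so ConcatDataset itself does not renumber or merge classes.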

Nikolaos_Sintoris · May 4, 2021, 10:49am

Ok, so if I create a dataloader from the final_training_dataset I am ok.

trainloader = torch.utils.data.DataLoader(final_training_dataset, batch_size=1, shuffle=True)

for image, label in trainloader:
    pass  # process my images

Am I right?

ariG23498 · May 4, 2021, 12:37pm

With the trainloader, the labels would only be 0, 1 and 2. This could be a problem: you would not be able to tell which top-level folder (A, B or C) an image comes from.
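
The point about the labels can be checked by collecting everything a DataLoader yields. The sketch below again uses in-memory TensorDataset stand-ins rather than ImageFolder (an assumption, so that it runs without images on disk), and fake_folder is a made-up helper.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def fake_folder(num_per_class=2, num_classes=3):
    # Hypothetical stand-in for one ImageFolder root with labels 0..num_classes-1.
    labels = torch.arange(num_classes).repeat_interleave(num_per_class)
    return TensorDataset(torch.rand(len(labels), 1, 20, 20), labels)

combined = ConcatDataset([fake_folder() for _ in range(3)])
loader = DataLoader(combined, batch_size=4, shuffle=True)

seen_labels = set()
num_samples = 0
for images, labels in loader:
    seen_labels.update(labels.tolist())   # labels is a 1-D tensor, one entry per sample
    num_samples += images.shape[0]

print(sorted(seen_labels), num_samples)
```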

To restate the problem: you have three folders with three subfolders in each of them. This means that you have 3*3=9 individual classes in total.

I would suggest the following:

Make sure the classes of every <>_dataset are distinct.

for ind, c in enumerate(A_dataset.classes):
    A_dataset.classes[ind] = f"A_{c}"

for ind, c in enumerate(B_dataset.classes):
    B_dataset.classes[ind] = f"B_{c}"

for ind, c in enumerate(C_dataset.classes):
    C_dataset.classes[ind] = f"C_{c}"
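
Note that renaming the entries of .classes only changes the class names; the integer labels returned by each dataset are still 0, 1 and 2. If nine distinct integer labels are really needed, one option is to shift each dataset's labels by an offset. A minimal sketch, where OffsetLabels is a hypothetical wrapper (not part of PyTorch) and TensorDataset stands in for ImageFolder:

```python
import torch
from torch.utils.data import ConcatDataset, Dataset, TensorDataset

class OffsetLabels(Dataset):
    """Hypothetical wrapper: returns (image, label + offset) from the wrapped dataset."""

    def __init__(self, dataset, offset):
        self.dataset = dataset
        self.offset = offset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        image, label = self.dataset[idx]
        return image, int(label) + self.offset

def fake_folder(num_per_class=2, num_classes=3):
    # Hypothetical stand-in for one ImageFolder root with labels 0..num_classes-1.
    labels = torch.arange(num_classes).repeat_interleave(num_per_class)
    return TensorDataset(torch.rand(len(labels), 1, 20, 20), labels)

num_classes = 3
# Dataset 0 keeps labels 0-2, dataset 1 gets 3-5, dataset 2 gets 6-8.
parts = [OffsetLabels(fake_folder(), offset=i * num_classes) for i in range(3)]
combined = ConcatDataset(parts)

all_labels = sorted({combined[i][1] for i in range(len(combined))})
print(all_labels)
```

With a real ImageFolder, the same shift could instead be passed through its target_transform argument, for example target_transform=lambda y: y + 3 for folder B.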

ConcatDataset preserves the order in which the datasets are provided; the samples are only drawn in random order if the DataLoader is created with shuffle=True. If you keep track of the number of samples in each dataset, you can tell which dataset a given index comes from.
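
Keeping track of the counts can be sketched in plain Python. The helper below, locate, is hypothetical; it takes running totals in the same form as ConcatDataset.cumulative_sizes and maps a flat index back to the source dataset and the position inside it. The sizes assume 30 images per folder, as in the reproduction above.

```python
from bisect import bisect_right

def locate(cumulative_sizes, idx):
    """Map a flat index to (dataset_index, index_within_that_dataset)."""
    if not 0 <= idx < cumulative_sizes[-1]:
        raise IndexError(f"index {idx} out of range")
    # First dataset whose running total is strictly greater than idx.
    dataset_idx = bisect_right(cumulative_sizes, idx)
    start = cumulative_sizes[dataset_idx - 1] if dataset_idx > 0 else 0
    return dataset_idx, idx - start

# Running totals for three datasets of 30 images each.
sizes = [30, 60, 90]
print(locate(sizes, 0))    # first image of the first dataset
print(locate(sizes, 30))   # first image of the second dataset
print(locate(sizes, 89))   # last image of the third dataset
```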

A minimal implementation can be seen here in this GitHub gist.

If the gist does not load, please use nbviewer

Nikolaos_Sintoris · May 4, 2021, 3:50pm

Yes, but fortunately I only have 3 classes. I mean that my subfolders represent the same 3 classes in every folder, so it is not a problem that I will only have the labels 0, 1 and 2. I just wanted to make sure that in folder A, subfolder1 has label 0, subfolder2 has label 1 and subfolder3 has label 2, and that the same holds for the other folders (for example, in folder B, subfolder1 has label 0, subfolder2 has label 1, etc.).
So in the end I do not have a problem with the trainloader, right? It will not confuse the labels. In every folder, subfolder1 will always have label 0, right?

ariG23498 · May 4, 2021, 5:09pm

Yes. Then there should not be any problem.
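
This works because ImageFolder assigns labels by sorting the subfolder names of each root, so identical subfolder names give identical labels. A small sketch of that rule follows; expected_class_to_idx is a hypothetical helper that mimics it without needing torchvision, and the folders are created in a temporary directory purely for illustration.

```python
import os
import tempfile

def expected_class_to_idx(root):
    # Mimics ImageFolder's rule: sorted subdirectory names get indices 0, 1, 2, ...
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: i for i, name in enumerate(classes)}

with tempfile.TemporaryDirectory() as tmp:
    mappings = {}
    for folder in ["A", "B", "C"]:
        # Create the subfolders out of order on purpose; sorting makes the order irrelevant.
        for sub in ["sub_3", "sub_1", "sub_2"]:
            os.makedirs(os.path.join(tmp, folder, sub))
        mappings[folder] = expected_class_to_idx(os.path.join(tmp, folder))

same = mappings["A"] == mappings["B"] == mappings["C"]
print(mappings["A"], same)
```

With the real datasets, comparing A_dataset.class_to_idx, B_dataset.class_to_idx and C_dataset.class_to_idx gives the same check.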

Nikolaos_Sintoris · May 6, 2021, 7:26am

Thanks for your help!