DataLoader on two datasets

We are writing code, based on the tutorial, to read two different datasets, so we will have:

train_set1, test_set1,  
train_set2, test_set2

We want to investigate each one separately, and both of them together in a third experiment.
What is the best way to do this?


I’d say the best way to deal with that would be to create two Dataset classes if the datasets are structured differently, and to re-use a single Dataset class if they are structured similarly (e.g., in the typical train & test case).

Say I have downloaded the CelebA dataset. I would first make a text file with the file paths of the training samples and their labels, and a text file with the test samples and their labels:

a) ‘celeba_gender_attr_train.txt’
b) ‘celeba_gender_attr_test.txt’

A file would look like this:

            ClassLabel
000001.jpg  0
000002.jpg  0
000003.jpg  1
...

Then I would create a Dataset class where the “info” text file is an instantiation argument, e.g.:

import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class CelebaDataset(Dataset):
    """Custom Dataset for loading CelebA face images"""

    def __init__(self, txt_path, img_dir, transform=None):
        # the "info" file is whitespace-separated: image name, then class label
        df = pd.read_csv(txt_path, sep=r"\s+", index_col=0)
        self.img_dir = img_dir
        self.txt_path = txt_path
        self.img_names = df.index.values
        self.y = df['ClassLabel'].values
        self.transform = transform

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.img_dir,
                                      self.img_names[index]))

        if self.transform is not None:
            img = self.transform(img)

        label = self.y[index]
        return img, label

    def __len__(self):
        return self.y.shape[0]

Then I might add a custom transform:

from torchvision import transforms

custom_transform = transforms.Compose([transforms.Grayscale(),
                                       transforms.ToTensor()])

And finally I would create two DataLoaders from the Dataset class:

from torch.utils.data import DataLoader

train_dataset = CelebaDataset(txt_path='celeba_gender_attr_train.txt',
                              img_dir='img_align_celeba/',
                              transform=custom_transform)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=128,
                          shuffle=True,
                          num_workers=4)

and

test_dataset = CelebaDataset(txt_path='celeba_gender_attr_test.txt',
                             img_dir='img_align_celeba/',
                             transform=custom_transform)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=128,
                         shuffle=False,  # no need to shuffle the test set
                         num_workers=4)

Then, during training, you could do something like

for epoch in range(num_epochs):
    for batch_idx, (features, labels) in enumerate(train_loader):
        # train model on the training dataset

for batch_idx, (features, labels) in enumerate(test_loader):
    # evaluate model on the test dataset

The train/test split is just an example. You could do the same thing for multiple training datasets, and so forth.

E.g.,

for epoch in range(num_epochs):
    for batch_idx, (features, labels) in enumerate(train_loader_1):
        # train model on the training dataset #1
    for batch_idx, (features, labels) in enumerate(train_loader_2):
        # train model on the training dataset #2

for batch_idx, (features, labels) in enumerate(test_loader):
    # evaluate model on the test dataset

Thanks for the nice and detailed explanation. I prefer to create two Dataset classes, and I do like the answer (and the idea) you are suggesting. Hence, the testing should be something like:

for batch_idx, (features, labels) in enumerate(test_loader_1):
    # evaluate model on test dataset #1
    # test_loss += criterion(...)
for batch_idx, (features, labels) in enumerate(test_loader_2):
    # evaluate model on test dataset #2
    # test_loss += criterion(...)

Hey, you could do this:
After making the two Dataset classes, make a third “fusion” class that contains the two dataset classes.
Its __getitem__ method randomly or alternately calls __getitem__ of the two constituent dataset classes; a minimal sketch is shown below.

See this link for an example of how you can do that. You can also balance how many images are loaded from each dataset in that class.

This is probably just a bit neater to do! :)
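
Purely for illustration, a minimal sketch of such a fusion class (the name FusionDataset and the strict-alternation strategy are my own assumptions, not code from the linked repo):

from torch.utils.data import Dataset

class FusionDataset(Dataset):
    """Sketch: interleaves samples from two constituent datasets."""

    def __init__(self, dataset_a, dataset_b):
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b

    def __getitem__(self, index):
        # alternate between the two datasets, wrapping around with modulo;
        # strict alternation repeats the smaller dataset and may not cover
        # all of the larger one in a single pass
        if index % 2 == 0:
            return self.dataset_a[(index // 2) % len(self.dataset_a)]
        return self.dataset_b[(index // 2) % len(self.dataset_b)]

    def __len__(self):
        return len(self.dataset_a) + len(self.dataset_b)

You could then wrap FusionDataset(train_set1, train_set2) in a single DataLoader.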


Very nice.
I opted to do it this way, and it worked like a charm. In fact, one can fuse more than two datasets this way.

Hi @Naman-ntc! I would like to see your GitHub code, but the link is dead. Can you post the code here or update the link?

FYI, just to update this topic in case someone else is looking for a good answer: there is now a ConcatDataset in PyTorch that does pretty much what the author was looking for.
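
For example, reusing the OP’s train_set1/train_set2 names, a minimal sketch could be:

from torch.utils.data import ConcatDataset, DataLoader

# both datasets must return items of the same structure (e.g., (image, label) pairs)
combined_train_set = ConcatDataset([train_set1, train_set2])
combined_loader = DataLoader(combined_train_set, batch_size=128, shuffle=True)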


Hi @achaiah. ConcatDataset will work for the OP, but what if we have two datasets that do not have the same number of samples? I’ve tried augmenting my smaller dataset and using ConcatDataset to combine the two into one DataLoader, but I’m not convinced it’s doing what it’s supposed to. Do you have a workaround to create two DataLoaders and use them for training?

ConcatDataset works fine even when the two datasets have different numbers of samples.

But the unbalanced nature of the dataset created by ConcatDataset will not lead to balanced training, right? I know this is a dataset issue, but I want to be able to oversample the underrepresented dataset and then use ConcatDataset, or find a better way to use two datasets.

I would rather concatenate them as they are and then sample according to the smaller size (or sample randomly, or whatever), which is probably better than augmenting.
The alternative is to use a weighted loss:

If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
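
As a hypothetical sketch of that second option (the weight values below are placeholders; you would derive them from your actual class counts):

import torch
import torch.nn as nn

# assume class 0 is over-represented and class 1 under-represented
class_weights = torch.tensor([0.3, 0.7])
criterion = nn.CrossEntropyLoss(weight=class_weights)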

Thanks @Deeply! Oversampling for me is done through augmentation using the transforms class. Can you elaborate on what you mean by “randomly sample”? Do you know of a code example of how to randomly sample based on the smaller dataset’s size?

Have a look at torch.utils.data.RandomSampler(data_source, replacement=False, num_samples=None) in the RandomSampler docs.
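
A small sketch of how that could be used to draw only as many samples from the larger dataset as the smaller one contains (large_set and small_set are placeholder names):

from torch.utils.data import DataLoader, RandomSampler

# randomly draw a subset of the larger dataset, matching the smaller dataset's size
sampler = RandomSampler(large_set, replacement=True, num_samples=len(small_set))
large_loader = DataLoader(large_set, batch_size=128, sampler=sampler)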


@nabsabs I’ve seen this linked somewhere on the forums before and tbh I don’t know about the quality of the code, but you could try this imbalanced-dataset-sampler.
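
For reference, a sketch of the same balancing idea using the built-in WeightedRandomSampler (large_set and small_set are placeholder names), giving every sample a weight inversely proportional to the size of its source dataset:

import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

combined = ConcatDataset([large_set, small_set])

# per-sample weights: samples from the smaller dataset get drawn more often,
# so batches are roughly balanced between the two sources
weights = torch.cat([
    torch.full((len(large_set),), 1.0 / len(large_set)),
    torch.full((len(small_set),), 1.0 / len(small_set)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=128, sampler=sampler)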

Hi, check the updated link here.
Although I don’t know much about ConcatDataset, feel free to weigh your options.

Hi all, it seems I can’t use ConcatDataset for my problem, as “train_set1” and “train_set2” have different sizes.
Do you have any idea how to deal with this problem?

I’ve used ConcatDataset on datasets of different sizes. For example, if you’d like to concatenate training and validation datasets, you can do this:

train_set = torch.utils.data.ConcatDataset([train_set, validate_set])

You’ll need to make sure that both datasets return the same kind of items; otherwise, you’ll have a problem.

Hi Deeply, thank you for your reply!
The problem I ran into is this:

If I use a DataLoader afterwards like this:

train_set = torch.utils.data.ConcatDataset([train_set1, train_set2])
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

for i, (features, labels) in enumerate(train_loader):
    train_features = features
    train_labels = labels

there will be an error:
“RuntimeError: stack expects each tensor to be equal size”.
With the default DataLoader collate function, all the samples in train_set need to have the same size.

If I modify my collate_fn according to this post: How to create a dataloader with variable-size input, the batch of training samples becomes a large list, while the input of the CNN model can only be a Tensor.
So I get another error:
“TypeError: conv1d(): argument 'input' (position 1) must be Tensor, not list”

Do you know how to solve this problem?