DataLoader on two datasets

We are writing code, based on the tutorial, to read two different datasets, so we will have:

train_set1, test_set1,  
train_set2, test_set2

We want to investigate each one separately, and both of them together in a third experiment.
What is the best way to do this?


I’d say the best way to deal with that would be to create two Dataset classes if the datasets are structured differently, and to re-use a single Dataset class if they are structured similarly (e.g., in the typical train & test case).

Say I have downloaded the CelebA dataset. I would first make a text file with the file paths of the training samples and their labels, and a text file with the test samples and their labels:

a) ‘celeba_gender_attr_train.txt’
b) ‘celeba_gender_attr_test.txt’

A file would look like this:

            ClassLabel
000001.jpg  0
000002.jpg  0
000003.jpg  1
...

Then I would create a Dataset class where the “info” text file is an instantiation argument, e.g.:

import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class CelebaDataset(Dataset):
    """Custom Dataset for loading CelebA face images"""

    def __init__(self, txt_path, img_dir, transform=None):
        # the "info" file is whitespace-separated: image name, then class label
        df = pd.read_csv(txt_path, sep=r"\s+", index_col=0)
        self.img_dir = img_dir
        self.txt_path = txt_path
        self.img_names = df.index.values
        self.y = df['ClassLabel'].values
        self.transform = transform

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.img_dir,
                                      self.img_names[index]))

        if self.transform is not None:
            img = self.transform(img)

        label = self.y[index]
        return img, label

    def __len__(self):
        return self.y.shape[0]

Then I might add a custom transform:

from torchvision import transforms

custom_transform = transforms.Compose([transforms.Grayscale(),
                                       transforms.ToTensor()])

And finally I would create two DataLoaders from the Dataset class:

from torch.utils.data import DataLoader

train_dataset = CelebaDataset(txt_path='celeba_gender_attr_train.txt',
                              img_dir='img_align_celeba/',
                              transform=custom_transform)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=128,
                          shuffle=True,
                          num_workers=4)

and

test_dataset = CelebaDataset(txt_path='celeba_gender_attr_test.txt',
                             img_dir='img_align_celeba/',
                             transform=custom_transform)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=128,
                         shuffle=False,  # no need to shuffle the test set
                         num_workers=4)

Then, during training, you could do something like

for epoch in range(num_epochs):
    for batch_idx, (features, labels) in enumerate(train_loader):
        # train model on the training dataset

for batch_idx, (features, labels) in enumerate(test_loader):
    # evaluate model on the test dataset

The train/test split is just an example. You could do the same thing for multiple training datasets, and so forth.

E.g.,

for epoch in range(num_epochs):
    for batch_idx, (features, labels) in enumerate(train_loader_1):
        # train model on the training dataset #1
    for batch_idx, (features, labels) in enumerate(train_loader_2):
        # train model on the training dataset #2

for batch_idx, (features, labels) in enumerate(test_loader):
    # evaluate model on the test dataset

Thanks for the nice and detailed explanation. I prefer to create two Dataset classes, and I do like the answer (and the idea) you are suggesting. Hence, the testing should be something like:

for batch_idx, (features, labels) in enumerate(test_loader_1):
    # evaluate model on test dataset #1
    # test_loss += criterion(...)
for batch_idx, (features, labels) in enumerate(test_loader_2):
    # evaluate model on test dataset #2
    # test_loss += criterion(...)

Hey, you could do this:
After making the two Dataset classes, make a third “fusion” class that contains the two dataset classes.
Its __getitem__ method randomly or alternately calls __getitem__ of the two constituent dataset classes; a minimal sketch is shown below.

See this link for an example of how you can do that. You can also balance how many images are loaded from each dataset in that class.

This is probably just a bit neater to do! :)
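
Purely for illustration, a minimal sketch of such a fusion class (the name FusionDataset and the strict-alternation strategy are my own assumptions, not code from the linked repo):

from torch.utils.data import Dataset

class FusionDataset(Dataset):
    """Sketch: interleaves samples from two constituent datasets."""

    def __init__(self, dataset_a, dataset_b):
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b

    def __getitem__(self, index):
        # alternate between the two datasets, wrapping around with modulo;
        # strict alternation repeats the smaller dataset and may not cover
        # all of the larger one in a single pass
        if index % 2 == 0:
            return self.dataset_a[(index // 2) % len(self.dataset_a)]
        return self.dataset_b[(index // 2) % len(self.dataset_b)]

    def __len__(self):
        return len(self.dataset_a) + len(self.dataset_b)

You could then wrap FusionDataset(train_set1, train_set2) in a single DataLoader.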


Very nice.
I opted to do it this way, and it worked like a charm. In fact, one can fuse more than two datasets this way.

Hi @Naman-ntc! I would like to see your GitHub code, but the link is dead. Can you post the code here or update the link?

FYI, just to update this topic in case someone else is looking for a good answer: there is now a ConcatDataset in PyTorch that does pretty much what the author was looking for.
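
For example, reusing the OP’s train_set1/train_set2 names, a minimal sketch could be:

from torch.utils.data import ConcatDataset, DataLoader

# both datasets must return items of the same structure (e.g., (image, label) pairs)
combined_train_set = ConcatDataset([train_set1, train_set2])
combined_loader = DataLoader(combined_train_set, batch_size=128, shuffle=True)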


Hi @achaiah. ConcatDataset will work for the OP, but what if we have two datasets that do not have the same number of samples? I’ve tried augmenting my smaller dataset and using ConcatDataset to combine the two into one DataLoader, but I’m not convinced it’s doing what it’s supposed to. Do you have a workaround to create two DataLoaders and use them for training?

ConcatDataset works fine even when the two datasets have different numbers of samples.

But the unbalanced nature of the dataset created by ConcatDataset will not lead to balanced training, right? I know this is a dataset issue, but I want to be able to oversample the underrepresented dataset and then use ConcatDataset, or find a better way to use two datasets.

I would rather concatenate them as they are and then sample according to the smaller size (or sample randomly, or whatever), which is probably better than augmenting.
The alternative is to use a weighted loss:

If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
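
As a hypothetical sketch of that second option (the weight values below are placeholders; you would derive them from your actual class counts):

import torch
import torch.nn as nn

# assume class 0 is over-represented and class 1 under-represented
class_weights = torch.tensor([0.3, 0.7])
criterion = nn.CrossEntropyLoss(weight=class_weights)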

Thanks @Deeply! Oversampling for me is done through augmentation using the transforms class. Can you elaborate on what you mean by “randomly sample”? Do you know of a code example of how to randomly sample based on the smaller dataset’s size?

Have a look at torch.utils.data.RandomSampler(data_source, replacement=False, num_samples=None) in the RandomSampler docs.
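
A small sketch of how that could be used to draw only as many samples from the larger dataset as the smaller one contains (large_set and small_set are placeholder names):

from torch.utils.data import DataLoader, RandomSampler

# randomly draw a subset of the larger dataset, matching the smaller dataset's size
sampler = RandomSampler(large_set, replacement=True, num_samples=len(small_set))
large_loader = DataLoader(large_set, batch_size=128, sampler=sampler)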


@nabsabs I’ve seen this linked somewhere on the forums before and tbh I don’t know about the quality of the code, but you could try this imbalanced-dataset-sampler.
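
For reference, a sketch of the same balancing idea using the built-in WeightedRandomSampler (large_set and small_set are placeholder names), giving every sample a weight inversely proportional to the size of its source dataset:

import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

combined = ConcatDataset([large_set, small_set])

# per-sample weights: samples from the smaller dataset get drawn more often,
# so batches are roughly balanced between the two sources
weights = torch.cat([
    torch.full((len(large_set),), 1.0 / len(large_set)),
    torch.full((len(small_set),), 1.0 / len(small_set)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=128, sampler=sampler)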

Hi, check the updated link here.
Although I don’t know much about ConcatDataset, feel free to weigh your options.

Hi all, it seems I can’t use ConcatDataset for my problem, as “train_set1” and “train_set2” have different sizes.
Do you have any idea how to deal with this problem?

I’ve used ConcatDataset on datasets of different sizes. For example, if you’d like to concatenate training and validation datasets, you can do this:

train_set = torch.utils.data.ConcatDataset([train_set, validate_set])

You’ll need to make sure that both datasets return the same kind of items; otherwise, you’ll have a problem.

Hi Deeply, thank you for your reply!
The problem I ran into is this:

If I use a DataLoader afterwards like this:

train_set = torch.utils.data.ConcatDataset([train_set1, train_set2])
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

for i, (features, labels) in enumerate(train_loader):
    train_features = features
    train_labels = labels

there will be an error:
“RuntimeError: stack expects each tensor to be equal size”.
With the default DataLoader collate function, all the samples in train_set need to have the same size.

If I modify my collate_fn according to this post: How to create a dataloader with variable-size input, the batch of training samples becomes a large list, while the input of the CNN model can only be a Tensor.
So I get another error:
“TypeError: conv1d(): argument 'input' (position 1) must be Tensor, not list”

Do you know how to solve this problem?