Train simultaneously on two datasets

If we want to combine two imbalanced datasets and get balanced samples, I think we could use ConcatDataset and pass a WeightedRandomSampler to the DataLoader:

dataset1 = custom_dataset1()
dataset2 = custom_dataset2()
concat_dataset = torch.utils.data.ConcatDataset([dataset1, dataset2])
dataloader = torch.utils.data.DataLoader(concat_dataset, batch_size=bs, sampler=weighted_sampler)
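
A minimal sketch of how the `weighted_sampler` above could be built, assuming the goal is for each source dataset to contribute roughly half of the drawn samples. The two `TensorDataset` objects and their sizes (100 and 1000) are placeholders standing in for `custom_dataset1()` and `custom_dataset2()`; the label column is only there so the origin of each sample can be counted.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Placeholder datasets (assumption): label 0 marks dataset1, label 1 marks dataset2.
dataset1 = TensorDataset(torch.randn(100, 3), torch.zeros(100, dtype=torch.long))
dataset2 = TensorDataset(torch.randn(1000, 3), torch.ones(1000, dtype=torch.long))
concat_dataset = ConcatDataset([dataset1, dataset2])

# One weight per sample, in ConcatDataset order: 1 / len(source dataset),
# so each dataset carries the same total sampling probability.
weights = torch.cat([
    torch.full((len(d),), 1.0 / len(d)) for d in (dataset1, dataset2)
])
weighted_sampler = WeightedRandomSampler(
    weights, num_samples=len(concat_dataset), replacement=True
)

# A sampler cannot be combined with shuffle=True, so shuffle is left unset.
dataloader = DataLoader(concat_dataset, batch_size=64, sampler=weighted_sampler)

# Count how many drawn samples come from dataset1.
from_dataset1 = 0
total = 0
for _, source in dataloader:
    from_dataset1 += int((source == 0).sum())
    total += source.numel()
fraction_from_dataset1 = from_dataset1 / total
print(f"fraction from dataset1: {fraction_from_dataset1:.2f}")
```

Because sampling is done with replacement, items of the smaller dataset are repeated within an epoch, so this balances the batches but does not add new information.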

I am looking for an answer to this as well. Do you have any idea about it? Thank you for your help.

Thanks a lot. This really helped me with training my CycleGAN network. :)

Maybe we can solve this by:

class ConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        return tuple(d[i % len(d)] for d in self.datasets)

    def __len__(self):
        return max(len(d) for d in self.datasets)

train_loader = torch.utils.data.DataLoader(
             ConcatDataset(
                 datasets.ImageFolder(traindir_A),
                 datasets.ImageFolder(traindir_B)
             ),
             batch_size=args.batch_size, shuffle=True,
             num_workers=args.workers, pin_memory=True)

for i, (input, target) in enumerate(train_loader):
    ... 

@GloryDream

Question #1: When I try this, it loops through the shorter dataset in the group. So if dataset A has 100 images and dataset B has 1000, and I call ConcatDataset(dataset_A, dataset_B)[100], I'll get the tuple (dataset_A[0], dataset_B[100]). Does this make sense when putting this into a loader for training? Won't I overfit on the smaller dataset?

Question #2: Now we don’t just have (input, target), we have ((input_1, target_1), (input_2, target_2)).

How do I train when the loader gives me a list of lists like this? Do I select randomly from the first list for my input? Or is this where weighted sampling comes in?
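
On Question #2, a minimal sketch of one way to consume such a batch, assuming both datasets yield `(input, target)` pairs with compatible shapes. `PairedDataset` is a renamed copy of the wrapper posted above, and the tensor datasets are placeholders. The default collate function keeps the two streams separate, so a batch can be unpacked into two `(inputs, targets)` pairs and then either concatenated, as here, or passed to separate losses.

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset

class PairedDataset(Dataset):
    """Returns one sample from every wrapped dataset, cycling the shorter ones."""

    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        return tuple(d[i % len(d)] for d in self.datasets)

    def __len__(self):
        return max(len(d) for d in self.datasets)

# Placeholder data (assumption): 100 samples in A, 1000 samples in B.
dataset_A = TensorDataset(torch.randn(100, 3), torch.zeros(100, dtype=torch.long))
dataset_B = TensorDataset(torch.randn(1000, 3), torch.ones(1000, dtype=torch.long))
loader = DataLoader(PairedDataset(dataset_A, dataset_B), batch_size=8, shuffle=True)

# Each batch holds one (inputs, targets) pair per wrapped dataset.
(input_1, target_1), (input_2, target_2) = next(iter(loader))

# Option shown here: merge both streams into a single batch of 16 samples.
inputs = torch.cat([input_1, input_2])
targets = torch.cat([target_1, target_2])
print(inputs.shape, targets.shape)
```

On Question #1: with this wrapper, every sample of the 100-item dataset is visited ten times during one pass over the 1000-item dataset, which may matter if overfitting on the smaller set is a concern.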


I also have the same question. Please let me know the best way to solve this problem. I don't think we can use weighted random sampling here. If we can, please let me know how to do it.


Hello, I'm facing a similar problem and none of the solutions above fit. I'm running semi-supervised experiments and I'd like each batch to contain, say, n observations from the labelled data set and m observations from the unlabelled data set. Of course each of these goes through a different objective function, but the losses are added together before making an optimization step. Thus I would really need a loader that samples from two different data sets at a time. Does anyone know an ingenious way to do so?
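
A minimal sketch of one way to get n labelled and m unlabelled observations per step, assuming it is acceptable to use two separate `DataLoader`s and iterate over them in lockstep. The datasets, the linear model, and the entropy term used as the unsupervised objective are all placeholders chosen for illustration, not a recommendation.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Placeholder data (assumption): 40 labelled and 200 unlabelled observations.
labelled = TensorDataset(torch.randn(40, 5), torch.randint(0, 2, (40,)))
unlabelled = TensorDataset(torch.randn(200, 5))

n, m = 4, 16  # observations per step from each data set
labelled_loader = DataLoader(labelled, batch_size=n, shuffle=True, drop_last=True)
unlabelled_loader = DataLoader(unlabelled, batch_size=m, shuffle=True, drop_last=True)

def cycle(loader):
    """Restart the loader whenever it runs out (it reshuffles on each pass)."""
    while True:
        for batch in loader:
            yield batch

model = torch.nn.Linear(5, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

steps = 0
# zip stops when the unlabelled loader is exhausted; the labelled one is cycled.
for (x_l, y_l), (x_u,) in zip(cycle(labelled_loader), unlabelled_loader):
    supervised_loss = F.cross_entropy(model(x_l), y_l)
    # Placeholder unsupervised objective: mean prediction entropy.
    log_p = F.log_softmax(model(x_u), dim=1)
    unsupervised_loss = -(log_p.exp() * log_p).sum(dim=1).mean()
    loss = supervised_loss + unsupervised_loss  # losses added before the step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    steps += 1
```

Here one epoch is defined by the unlabelled data set; swapping the two arguments of `zip` (and cycling the other loader) would define it by the labelled one instead.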

class BalancedConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets
        self.max_len = max(len(d) for d in self.datasets)
        self.min_len = min(len(d) for d in self.datasets)

    def __getitem__(self, i):
        return tuple(d[i % len(d)] for d in self.datasets)

    def masks_collate(self, batch):
        # Only handles (image, mask) samples
        images, masks = [], []
        for item in batch:
            for sample in item:
                images.append(sample[0])
                masks.append(sample[1])
        images = torch.stack(images)
        masks = torch.stack(masks)
        return images, masks

    def __len__(self):
        return self.max_len

The second element of each sample can be either a mask or a label.

Hi @apaszke, when I use this function it converts my dataset, which is made up of tensors, into lists. Is there a solution for this?

Any luck finding a solution, @MarkovChain? Currently I pass multiple datasets to CycleConcatDataset and then define a dataloader on it with a single batch size. This essentially batches all the datasets together and cycles through the shorter ones until the longest dataset finishes.

In my use case (semi-supervised learning and domain adaptation) I would like to keep the parameter updates as balanced as possible. This cycling method is a bit unfair, as samples from the shorter datasets are repeated and therefore update the parameters more often.

I think one way to help my particular use case is to somehow use different batch sizes for each dataset.

from torch.utils import data

class CycleConcatDataset(data.Dataset):
    '''Dataset wrapping multiple train datasets
    Parameters
    ----------
    *datasets : sequence of torch.utils.data.Dataset
        Datasets to be concatenated and cycled
    '''
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        result = []
        for dataset in self.datasets:
            cycled_i = i % len(dataset)
            result.append(dataset[cycled_i])

        return tuple(result)

    def __len__(self):
        return max(len(d) for d in self.datasets)
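
A minimal sketch of the "different batch sizes for each dataset" idea, assuming one `DataLoader` per dataset instead of a single wrapper. `proportional_batch_sizes` is a hypothetical helper and the two tensor datasets are placeholders. Each batch size is chosen in proportion to the dataset length, so all loaders finish after about the same number of steps and no sample has to be repeated.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

def proportional_batch_sizes(lengths, steps_per_epoch):
    """Batch size per dataset so each is covered in about steps_per_epoch batches."""
    return [math.ceil(n / steps_per_epoch) for n in lengths]

# Placeholder datasets (assumption): 100 and 1000 items.
small = TensorDataset(torch.arange(100))
large = TensorDataset(torch.arange(1000))

batch_sizes = proportional_batch_sizes([len(small), len(large)], steps_per_epoch=10)
loaders = [
    DataLoader(ds, batch_size=bs, shuffle=True)
    for ds, bs in zip((small, large), batch_sizes)
]

# Iterate in lockstep and record which items were visited.
seen_small, seen_large = [], []
for (x_small,), (x_large,) in zip(*loaders):
    seen_small.extend(x_small.tolist())
    seen_large.extend(x_large.tolist())

print(batch_sizes, [len(dl) for dl in loaders])
```

With these sizes every item is visited exactly once per epoch, at the cost of a smaller per-step batch for the shorter dataset. When the lengths do not divide evenly, the loaders can differ in their number of batches, and `zip` then stops at the shortest one.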

If you are looking to use multiple dataloaders at the same time, this should work:


class cat_dataloaders():
    """Class to concatenate multiple dataloaders"""

    def __init__(self, dataloaders):
        self.dataloaders = dataloaders

    def __iter__(self):
        self.loader_iter = []
        for data_loader in self.dataloaders:
            self.loader_iter.append(iter(data_loader))
        return self

    def __next__(self):
        out = []
        for data_iter in self.loader_iter:
            out.append(next(data_iter)) # may raise StopIteration
        return tuple(out)

Here is a quick example:

import torch
from torch.utils.data import Dataset, DataLoader

class DEBUG_dataset(Dataset):
    def __init__(self, alpha):
        self.d = (torch.arange(20) + 1) * alpha

    def __len__(self):
        return self.d.shape[0]

    def __getitem__(self, index):
        return self.d[index]

train_dl1 = DataLoader(DEBUG_dataset(10), batch_size=4, num_workers=0, shuffle=True)
train_dl2 = DataLoader(DEBUG_dataset(1), batch_size=4, num_workers=0, shuffle=True)
tmp = cat_dataloaders([train_dl1, train_dl2])
for x in tmp:
    print(x)

The output is (the exact order varies between runs because shuffle=True):

(tensor([140, 160, 130,  90]), tensor([ 5, 10,  8,  9]))
(tensor([120,  30, 170,  70]), tensor([15, 17, 18,  7]))
(tensor([180,  50, 190,  80]), tensor([ 6, 14,  3,  2]))
(tensor([ 10,  40, 150, 100]), tensor([11, 13,  4,  1]))
(tensor([ 60, 200, 110,  20]), tensor([19, 12, 20, 16]))
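
One behaviour worth noting: `__next__` lets the first `StopIteration` propagate, so iteration ends as soon as the shortest loader is exhausted, much like the built-in `zip`. A small sketch illustrating this with plain `zip`, using placeholder tensor datasets of 20 and 12 items:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder loaders (assumption): 20 items vs. 12 items, batch size 4.
long_dl = DataLoader(TensorDataset(torch.arange(20)), batch_size=4, shuffle=True)
short_dl = DataLoader(TensorDataset(torch.arange(12)), batch_size=4, shuffle=True)

# Pairing the loaders yields only as many batches as the shorter one has.
paired_batches = list(zip(long_dl, short_dl))
print(len(long_dl), len(short_dl), len(paired_batches))  # prints: 5 3 3
```

In the example above both loaders have 5 batches, so nothing is dropped there.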

Bro, thanks for saving my time lol.

import numpy as np


def cycle(iterable):
    while True:
        for x in iterable:
            yield x


class MultiTaskDataloader(object):
    def __init__(self, tau=1.0, **dataloaders):
        self.dataloaders = dataloaders

        Z = sum(pow(v, tau) for v in self.dataloader_sizes.values())
        self.tasknames, self.sampling_weights = zip(*((k, pow(v, tau) / Z) for k, v in self.dataloader_sizes.items()))
        self.dataiters = {k: cycle(v) for k, v in dataloaders.items()}

    @property
    def dataloader_sizes(self):
        if not hasattr(self, '_dataloader_sizes'):
            self._dataloader_sizes = {k: len(v) for k, v in self.dataloaders.items()}
        return self._dataloader_sizes

    def __len__(self):
        return sum(v for k, v in self.dataloader_sizes.items())

    def __iter__(self):
        for i in range(len(self)):
            taskname = np.random.choice(self.tasknames, p=self.sampling_weights)
            dataiter = self.dataiters[taskname]
            batch = next(dataiter)

            # Assumes every dataloader yields dict batches.
            batch['task'] = taskname

            yield batch
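
To make the role of `tau` concrete, here is a small worked sketch of the weight formula used in `__init__`: task k is drawn with probability size_k ** tau / sum_j(size_j ** tau), where the size is the number of batches in that task's dataloader. `sampling_weights` is a hypothetical standalone helper and the task names and sizes are made up.

```python
def sampling_weights(sizes, tau=1.0):
    """Return {task: size**tau / Z}, mirroring the formula in MultiTaskDataloader."""
    z = sum(n ** tau for n in sizes.values())
    return {name: (n ** tau) / z for name, n in sizes.items()}

# Made-up sizes (assumption): number of batches per task.
sizes = {"task_a": 100, "task_b": 900}

proportional = sampling_weights(sizes, tau=1.0)  # proportional to size
tempered = sampling_weights(sizes, tau=0.5)      # square-root tempering
uniform = sampling_weights(sizes, tau=0.0)       # every task equally likely

for name, weights in [("tau=1.0", proportional), ("tau=0.5", tempered), ("tau=0.0", uniform)]:
    print(name, {k: round(v, 3) for k, v in weights.items()})
```

So `tau=1` samples tasks in proportion to their size, `tau=0` samples them uniformly, and values in between upweight the smaller tasks to a lesser degree.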

Hi,
could you show how to define a distributed sampler for the MultiTaskDataloader that @AlongWY wrote? This is for training a model across multiple TPU cores, where the data needs to be distributed over the cores. Thanks a lot in advance.
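
One possible approach, sketched under assumptions and not tested on TPUs: since MultiTaskDataloader only wraps ordinary dataloaders, each underlying `DataLoader` could be given its own `torch.utils.data.distributed.DistributedSampler`, so that every process sees a different shard of every task. `make_task_loader` is a hypothetical helper, and the `rank` and `world_size` values are hard-coded placeholders; in a real job they would come from the distributed runtime, which is not shown here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_task_loader(dataset, batch_size, rank, world_size):
    """Build a loader that only iterates over this process's shard of the dataset."""
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=0
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# Placeholder dataset (assumption): 10 items, split over 2 processes.
dataset = TensorDataset(torch.arange(10))
world_size = 2

# Simulate both ranks in one process to inspect the shards.
indices_per_rank = []
for rank in range(world_size):
    loader = make_task_loader(dataset, batch_size=5, rank=rank, world_size=world_size)
    loader.sampler.set_epoch(0)  # call once per epoch so the shuffle changes
    shard = sorted(x for (batch,) in loader for x in batch.tolist())
    indices_per_rank.append(shard)

print(indices_per_rank)
```

The resulting loaders could then be passed to MultiTaskDataloader as keyword arguments. If all processes need to pick the same task at each step, the NumPy random seed would also have to match across processes.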

Hi there,
could you provide an example for the case where the datasets are map-style rather than iterable? How would the sampling be done then? Thanks.

Hi, I found a much easier solution and wanted to share it here:

dataset_3 = torch.utils.data.ConcatDataset((dataset_1, dataset_2))
Each of the datasets is of type torch.utils.data.dataset.Dataset.

This command concatenated both datasets, and I later prepared a data loader from the result:
len(dataset_1) == 200
len(dataset_2) == 300
len(dataset_3) == 500
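
A minimal runnable sketch of the same idea, with placeholder tensor datasets of 200 and 300 items standing in for `dataset_1` and `dataset_2`. It also shows how indices of the concatenated dataset map back to the two sources.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder datasets (assumption): zeros mark dataset_1, ones mark dataset_2.
dataset_1 = TensorDataset(torch.zeros(200, 2))
dataset_2 = TensorDataset(torch.ones(300, 2))
dataset_3 = ConcatDataset((dataset_1, dataset_2))

print(len(dataset_1), len(dataset_2), len(dataset_3))  # prints: 200 300 500

# Indices 0..199 come from dataset_1, indices 200..499 from dataset_2.
last_of_first = dataset_3[199][0]
first_of_second = dataset_3[200][0]

# A regular loader over the concatenation; batches mix both sources when shuffled.
loader = DataLoader(dataset_3, batch_size=50, shuffle=True)
num_batches = len(loader)
```

Unlike the tuple-returning wrappers earlier in the thread, this yields one sample per index, so with plain shuffling the larger dataset contributes more samples per batch on average.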


Thank you, it really helps.

I'm getting:

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Did you forget to add the if-clause protection as explained in the error message?
If so, did its usage fix the error?
Here is a small example:

import torch

def main():
    for i, data in enumerate(dataloader):
        # do something here
        pass

if __name__ == '__main__':
    main()

Yes, I have added it as well. Now I'm getting a different error, an OSError:
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\test\Anaconda3\envs\py36\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.