Train simultaneously on two datasets

Hi,

I am trying to concatenate two datasets in such a way that the combined dataset can also return the path of each sample.

Hi,

I wrote a simple demo for you. It just uses tensor_data; you can modify it to meet your needs.

import torch

class custom_dataset1(torch.utils.data.Dataset):
    def __init__(self):
        super(custom_dataset1, self).__init__()
        self.tensor_data = torch.tensor([1., 2., 3., 4., 5.])
    def __getitem__(self, index):
        # Return the value together with its index (a stand-in for a path).
        return self.tensor_data[index], index
    def __len__(self):
        return len(self.tensor_data)

class custom_dataset2(torch.utils.data.Dataset):
    def __init__(self):
        super(custom_dataset2, self).__init__()
        self.tensor_data = torch.tensor([6., 7., 8., 9., 10.])
    def __getitem__(self, index):
        return self.tensor_data[index], index
    def __len__(self):
        return len(self.tensor_data)

dataset1 = custom_dataset1()
dataset2 = custom_dataset2()
concat_dataset = torch.utils.data.ConcatDataset([dataset1, dataset2])
value, index = next(iter(concat_dataset))
print(value, index)

You can change the index into a path and then use the corresponding loss function.
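If the goal is to return a file path rather than an index, a minimal sketch of such a dataset could look like the following (PathDataset and the list of paths are hypothetical, not part of the demo above):

import torch
from PIL import Image

class PathDataset(torch.utils.data.Dataset):
    """Hypothetical dataset returning (image, path) pairs."""
    def __init__(self, paths, transform=None):
        self.paths = paths              # list of image file paths (assumption)
        self.transform = transform

    def __getitem__(self, index):
        path = self.paths[index]
        image = Image.open(path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        return image, path              # the default collate returns paths as a list of strings

    def __len__(self):
        return len(self.paths)

Two such datasets can then be wrapped in ConcatDataset exactly as above.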

If we want to combine two imbalanced datasets and get balanced samples, I think we could use ConcatDataset and pass a WeightedRandomSampler to the DataLoader:

dataset1 = custom_dataset1()
dataset2 = custom_dataset2()
concat_dataset = torch.utils.data.ConcatDataset([dataset1, dataset2])
dataloader = torch.utils.data.DataLoader(concat_dataset, batch_size=bs, sampler=weighted_sampler)
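As a rough sketch (continuing the snippet above; bs is a placeholder batch size), weighted_sampler would need to be defined before the DataLoader call, for example so that both datasets are drawn from with roughly equal probability:

# One weight per sample, inversely proportional to the size of its source dataset,
# so both datasets contribute roughly equally per epoch.
weights = [1.0 / len(dataset1)] * len(dataset1) + [1.0 / len(dataset2)] * len(dataset2)
weighted_sampler = torch.utils.data.WeightedRandomSampler(
    weights, num_samples=len(concat_dataset), replacement=True)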

I am also looking for an answer to this. Do you have any idea about it? Thank you for your help.

Thanks a lot. Really helped me with training my CycleGAN network. 🙂

Maybe we can solve this by:

class ConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        return tuple(d[i % len(d)] for d in self.datasets)

    def __len__(self):
        return max(len(d) for d in self.datasets)

train_loader = torch.utils.data.DataLoader(
    ConcatDataset(
        datasets.ImageFolder(traindir_A),
        datasets.ImageFolder(traindir_B)
    ),
    batch_size=args.batch_size, shuffle=True,
    num_workers=args.workers, pin_memory=True)

for i, (input, target) in enumerate(train_loader):
    ... 

@GloryDream

Question #1: When I try this, it loops through the shorter dataset in the group. So if dataset A has 100 images and dataset B has 1000 images and I call ConcatDataset(dataset_A, dataset_B)[100], I’ll get a tuple whose contents are (dataset_A[0], dataset_B[100]). Does this make sense when putting this into a loader for training? Won’t I overfit on the smaller dataset?

Question #2: Now we don’t just have (input, target), we have ((input_1, target_1), (input_2, target_2)).

How do I train when the loader gives me a list of lists like this? Do I select randomly from the first list for my input? Or is this where weighted sampling comes in?
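For what it's worth, a minimal sketch of a training loop over such nested batches (model_A, model_B, criterion and optimizer are placeholder names, and simply summing the two losses is only one option):

for (input_1, target_1), (input_2, target_2) in train_loader:
    optimizer.zero_grad()
    loss_1 = criterion(model_A(input_1), target_1)
    loss_2 = criterion(model_B(input_2), target_2)
    loss = loss_1 + loss_2        # combine the objectives however your setup requires
    loss.backward()
    optimizer.step()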

I also have the same question. Please let me know the best way to solve this problem. I don't think we can use weighted random sampling here; if we can, please let me know how to do it.

Hello, I'm facing a similar problem and none of the solutions above fit. I'm running semi-supervised experiments and I'd like each batch to contain, say, n observations from the labelled dataset and m observations from the unlabelled dataset. Each of these goes through a different objective function, but the losses are added together before taking an optimization step. So I really need a loader that samples from two different datasets at a time. Does anyone know an ingenious way to do this?
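One possible sketch (not taken from this thread) is to keep two separate DataLoaders, one with batch size n for the labelled set and one with batch size m for the unlabelled set, and cycle the shorter one; all names below are placeholders:

from itertools import cycle

from torch.utils.data import DataLoader

labelled_loader = DataLoader(labelled_dataset, batch_size=n, shuffle=True)
unlabelled_loader = DataLoader(unlabelled_dataset, batch_size=m, shuffle=True)

# Iterate over the (typically larger) unlabelled loader and cycle the labelled one,
# so every optimization step sees n labelled and m unlabelled observations.
# Note that itertools.cycle caches the first pass, so the labelled batches repeat
# in the same order after one epoch.
for (x_l, y_l), x_u in zip(cycle(labelled_loader), unlabelled_loader):
    loss = supervised_criterion(model(x_l), y_l) + unsupervised_criterion(model(x_u))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()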

class BalancedConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets
        self.max_len = max(len(d) for d in self.datasets)
        self.min_len = min(len(d) for d in self.datasets)

    def __getitem__(self, i):
        return tuple(d[i % len(d)] for d in self.datasets)

    def masks_collate(self, batch):
        # Assumes each dataset item is an (image, mask) pair; stacks images and masks
        # from every wrapped dataset into two batched tensors.
        images, masks = [], []
        for item in range(len(batch)):
            for c_dataset in range(len(batch[item])):
                images.append(batch[item][c_dataset][0])
                masks.append(batch[item][c_dataset][1])
        images = torch.stack(images)
        masks = torch.stack(masks)
        return images, masks

    def __len__(self):
        return self.max_len

The second element could be masks or labels.
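A minimal usage sketch of the collate function above (dataset_A, dataset_B and the DataLoader arguments are placeholders):

balanced_dataset = BalancedConcatDataset(dataset_A, dataset_B)
loader = torch.utils.data.DataLoader(
    balanced_dataset,
    batch_size=8,
    shuffle=True,
    collate_fn=balanced_dataset.masks_collate)   # stacks images and masks into two tensors

for images, masks in loader:
    ...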

Hi @apaszke, when I use this function it turns my dataset, which is composed of tensors, into lists. Is there a solution for this?

Any luck on a solution @MarkovChain? Currently I pass multiple datasets to CycleConcatDataset and then define a dataloader on it with a single batch size. This essentially will batch all the datasets and will cycle through the shorter ones until the longest dataset finishes.

In my use case (semi-supervised learning and domain adaptation) I would like to keep the parameter updates as balanced as possible. This cycling method is a bit unfair, as samples from the shorter datasets update the parameters more often.

I think one way to help my particular use case is to somehow use different batch sizes for each dataset.

from torch.utils import data


class CycleConcatDataset(data.Dataset):
    '''Dataset wrapping multiple train datasets
    Parameters
    ----------
    *datasets : sequence of torch.utils.data.Dataset
        Datasets to be concatenated and cycled
    '''
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        result = []
        for dataset in self.datasets:
            cycled_i = i % len(dataset)
            result.append(dataset[cycled_i])

        return tuple(result)

    def __len__(self):
        return max(len(d) for d in self.datasets)
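
For reference, a minimal sketch of the usage described above, with placeholder dataset names; each batch then contains one collated (input, target) entry per wrapped dataset:

concat_dataset = CycleConcatDataset(dataset_A, dataset_B)
loader = torch.utils.data.DataLoader(
    concat_dataset, batch_size=32, shuffle=True, num_workers=4)

for (input_a, target_a), (input_b, target_b) in loader:
    ...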

If you are looking to use multiple dataloaders at the same time, this should work:


class cat_dataloaders():
    """Class to concatenate multiple dataloaders"""

    def __init__(self, dataloaders):
        self.dataloaders = dataloaders

    def __iter__(self):
        self.loader_iter = []
        for data_loader in self.dataloaders:
            self.loader_iter.append(iter(data_loader))
        return self

    def __next__(self):
        out = []
        for data_iter in self.loader_iter:
            out.append(next(data_iter)) # may raise StopIteration
        return tuple(out)

Here is a quick example

import torch
from torch.utils.data import Dataset, DataLoader

class DEBUG_dataset(Dataset):
    def __init__(self, alpha):
        self.d = (torch.arange(20) + 1) * alpha
    def __len__(self):
        return self.d.shape[0]
    def __getitem__(self, index):
        return self.d[index]

train_dl1 = DataLoader(DEBUG_dataset(10), batch_size=4, num_workers=0, shuffle=True)
train_dl2 = DataLoader(DEBUG_dataset(1), batch_size=4, num_workers=0, shuffle=True)
tmp = cat_dataloaders([train_dl1, train_dl2])
for x in tmp:
    print(x)

The output is:

(tensor([140, 160, 130,  90]), tensor([ 5, 10,  8,  9]))
(tensor([120,  30, 170,  70]), tensor([15, 17, 18,  7]))
(tensor([180,  50, 190,  80]), tensor([ 6, 14,  3,  2]))
(tensor([ 10,  40, 150, 100]), tensor([11, 13,  4,  1]))
(tensor([ 60, 200, 110,  20]), tensor([19, 12, 20, 16]))

Bro, thanks for saving my time lol.

import numpy as np


def cycle(iterable):
    while True:
        for x in iterable:
            yield x


class MultiTaskDataloader(object):
    def __init__(self, tau=1.0, **dataloaders):
        self.dataloaders = dataloaders

        Z = sum(pow(v, tau) for v in self.dataloader_sizes.values())
        self.tasknames, self.sampling_weights = zip(*((k, pow(v, tau) / Z) for k, v in self.dataloader_sizes.items()))
        self.dataiters = {k: cycle(v) for k, v in dataloaders.items()}

    @property
    def dataloader_sizes(self):
        if not hasattr(self, '_dataloader_sizes'):
            self._dataloader_sizes = {k: len(v) for k, v in self.dataloaders.items()}
        return self._dataloader_sizes

    def __len__(self):
        return sum(v for k, v in self.dataloader_sizes.items())

    def __iter__(self):
        for i in range(len(self)):
            taskname = np.random.choice(self.tasknames, p=self.sampling_weights)
            dataiter = self.dataiters[taskname]
            batch = next(dataiter)

            batch['task'] = taskname

            yield batch
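
A minimal usage sketch, assuming each task's DataLoader yields dict-style batches (the loop above assigns batch['task']); dataloader_a and dataloader_b are placeholders:

multi_loader = MultiTaskDataloader(tau=0.5, task_a=dataloader_a, task_b=dataloader_b)

for batch in multi_loader:
    taskname = batch['task']   # which task this batch was sampled from
    ...                        # dispatch to the task-specific loss / head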

Hi,
could you show how one can define a distributed sampler for the MultiTaskDataloader that @AlongWY wrote? This is basically for training a model across multiple TPU cores, where the data needs to be distributed over the cores. Thanks a lot in advance.

Hi there,
could you provide an example of how the sampling would be done if this were not an iterable dataset but a map-style one? Thanks.

Hi, I found a much easier solution and wanted to share it here:

dataset_3 = torch.utils.data.ConcatDataset((dataset_1, dataset_2))

Each of the datasets is of type torch.utils.data.dataset.Dataset.

This command helped me concatenate both datasets and later prepare a data loader from the result.

len(dataset_1) = 200
len(dataset_2) = 300
len(dataset_3) = 500
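
A short sketch of preparing a DataLoader from the concatenated dataset (the batch size here is arbitrary):

loader = torch.utils.data.DataLoader(dataset_3, batch_size=32, shuffle=True)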

Thank you, it really helps.

I'm getting:

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
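
This error typically means DataLoader workers (num_workers > 0) are being started on a platform that spawns new processes (e.g. Windows) without an entry-point guard. A minimal sketch of the usual fix, with placeholder names:

import torch

def main():
    loader = torch.utils.data.DataLoader(
        concat_dataset, batch_size=32, shuffle=True, num_workers=4)
    for batch in loader:
        ...

if __name__ == '__main__':
    main()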