Train simultaneously on two datasets

lcelona · February 21, 2017, 8:33pm

Hello,

I need to train using samples from two different datasets, so I initialize two DataLoaders:

train_loader_A = torch.utils.data.DataLoader(
             datasets.ImageFolder(traindir_A),
             batch_size=args.batch_size, shuffle=True,
             num_workers=args.workers, pin_memory=True)

train_loader_B = torch.utils.data.DataLoader(
             datasets.ImageFolder(traindir_B),
             batch_size=args.batch_size, shuffle=True,
             num_workers=args.workers, pin_memory=True)

What is the best way to draw samples from both loaders, so that I can use a single loop like this:

for i, (input, target) in enumerate(train_loader):
    ...

Thanks.

apaszke · February 21, 2017, 11:24pm

I’d recommend creating a new dataset and concatenating the images there, so the copy will be done inside the worker processes:

class ConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        return tuple(d[i] for d in self.datasets)

    def __len__(self):
        return min(len(d) for d in self.datasets)

train_loader = torch.utils.data.DataLoader(
             ConcatDataset(
                 datasets.ImageFolder(traindir_A),
                 datasets.ImageFolder(traindir_B)
             ),
             batch_size=args.batch_size, shuffle=True,
             num_workers=args.workers, pin_memory=True)

for i, (input, target) in enumerate(train_loader):
    ...

lcelona · February 22, 2017, 4:10am

Perfect! Thank you for your help.

However, __getindex__ should be __getitem__, correct?

apaszke · February 22, 2017, 9:56am

Yes, that’s a typo. Sorry. I’ve changed the code.

Ahmed_Abbas · July 19, 2017, 3:25pm

Any idea how to tell, in the merged data, whether a sample comes from datasetA or datasetB in this case? I am trying to train a GAN-like network that needs to sample from two data sources and therefore needs the source label (A or B).

Sorry to wake up an old thread.
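
With the tuple-returning class posted above, the position inside the tuple already identifies the source: the first element comes from the first dataset and the second from the second. If the datasets are instead chained end to end, one option is to attach a source id to every item. The following is a minimal sketch under that assumption; `WithSource` is a hypothetical helper name, the tensors are toy stand-ins for the image folders, and `ConcatDataset` here is the built-in `torch.utils.data.ConcatDataset`, not the class from this thread.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset, TensorDataset


class WithSource(Dataset):
    """Wrap a dataset so every item also carries a fixed source id (hypothetical helper)."""

    def __init__(self, dataset, source_id):
        self.dataset = dataset
        self.source_id = source_id

    def __getitem__(self, i):
        return self.dataset[i], self.source_id

    def __len__(self):
        return len(self.dataset)


# Toy stand-ins for the two datasets: all zeros for A, all ones for B.
dataset_a = TensorDataset(torch.zeros(6, 3))
dataset_b = TensorDataset(torch.ones(4, 3))

# Built-in ConcatDataset: chains the datasets end to end (length 6 + 4 = 10).
merged = ConcatDataset([WithSource(dataset_a, 0), WithSource(dataset_b, 1)])
loader = DataLoader(merged, batch_size=5, shuffle=True)

for (x,), source in loader:
    # source[k] is 0 if x[k] came from dataset_a and 1 if it came from dataset_b
    print(x.shape, source)
```

Because the id travels with each item, shuffling does not affect it.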

platero · September 27, 2017, 8:06am

I’ve read the code of ConcatDataset. One question: how can we control the ratio of data coming from a specific dataset?

For example, dataset A contains 100 samples and dataset B contains 10000 samples. How can we draw samples from A and B at a 1:1 ratio in each mini-batch? Does anybody have any idea? Thank you.

cjb60 · February 15, 2018, 11:27pm

To add to platero’s reply, suppose for example that datasetA contains 100 elements and datasetB contains 10000. My impression is that the data loader will (in one epoch) create shuffled indices 1…100 for datasetA and shuffled indices 1…100 for dataset B and create batches from each of those (since the len of ConcatDataset is the minimum of the lengths of both A and B). However, datasetB also has elements from 101…10000, so these will not be accessed. Am I correct in my intuition here? If so, this doesn’t seem like a reasonable solution if one dataset is way smaller than the other.
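
A small check of this intuition, using the tuple-returning class from earlier in the thread and toy datasets whose items are simply their own (0-based) indices. This is only a sketch for illustration; the list-based datasets are stand-ins, not the image folders discussed above.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ConcatDataset(Dataset):
    # Same tuple-returning class as posted earlier in the thread.
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        return tuple(d[i] for d in self.datasets)

    def __len__(self):
        return min(len(d) for d in self.datasets)


# Toy datasets: item i is just the integer i.
small = list(range(100))
large = list(range(10000))

loader = DataLoader(ConcatDataset(small, large), batch_size=10, shuffle=True)

# Collect every item of the large dataset that is actually served in one epoch.
seen_large = set()
for a, b in loader:
    seen_large.update(b.tolist())

print(len(seen_large), max(seen_large))  # 100 99
```

Only the first 100 items of the larger dataset are ever requested, because the sampler draws indices below `len()` of the combined dataset.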

mlopezantequera · March 20, 2018, 2:33pm

The current implementation does not discard data; it samples randomly from all the concatenated datasets. This means that each dataset will be sampled with probability len(dataset)/len(all_datasets).

To change this behavior, I think one could use WeightedRandomSampler, setting the appropriate weights.
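
A minimal sketch of the WeightedRandomSampler idea, assuming the built-in `torch.utils.data.ConcatDataset` and toy tensors in place of the real data. Weighting each sample by the inverse of its dataset's size gives both datasets the same total weight, so they are drawn equally often on average (not exactly 1:1 in every batch). The value `num_samples=2000` is an arbitrary choice for the example.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-ins: the second tensor marks the source (0 = A, 1 = B).
dataset_a = TensorDataset(torch.randn(100, 3), torch.zeros(100, dtype=torch.long))
dataset_b = TensorDataset(torch.randn(10000, 3), torch.ones(10000, dtype=torch.long))
merged = ConcatDataset([dataset_a, dataset_b])

# One weight per sample, in the same order as the concatenated dataset.
weights = torch.cat([
    torch.full((len(dataset_a),), 1.0 / len(dataset_a)),
    torch.full((len(dataset_b),), 1.0 / len(dataset_b)),
])
sampler = WeightedRandomSampler(weights, num_samples=2000, replacement=True)

# A sampler takes the place of shuffle=True; the two cannot be combined.
loader = DataLoader(merged, batch_size=50, sampler=sampler)

# Count how many of the drawn samples came from dataset B.
from_b = sum(int(source.sum()) for _, source in loader)
print(from_b / 2000)  # close to 0.5 on average
```

With `replacement=True`, samples from the small dataset are repeated within an epoch, which is what makes the balanced ratio possible.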

mostafaaminnaji · May 18, 2018, 10:15pm

Hi Adam,
I want to train my models simultaneously on two datasets, and I want both loaders to pick samples in the same shuffled order with shuffle=True. However, targets1 and targets2 do not match. For example:


train_dl1 = torch.utils.data.DataLoader(train_ds1, batch_size=16, 
                                       shuffle=True, num_workers=8)
train_dl2 = torch.utils.data.DataLoader(train_ds2, batch_size=16, 
                                       shuffle=True, num_workers=8)
inputs1, targets1 = next(iter(train_dl1))
inputs2, targets2 = next(iter(train_dl2))

targets1
tensor([ 1,  1,  0,  1,  0,  0,  1,  1])

targets2
tensor([ 0,  0,  0,  0,  0,  0,  0,  1])

I want targets1 and targets2 to come out in the same order, with the same shuffle applied to both. Do you have any idea how to do this? Can you help me?
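
Two independent loaders with shuffle=True each draw their own permutation, so their orders differ. One way to get a shared order is to pair the datasets by index and shuffle once, along the lines of the class suggested earlier in the thread. The sketch below assumes both datasets have the same length; `PairedDataset` is a hypothetical name and the tensors are toy data whose targets equal the sample index.

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset


class PairedDataset(Dataset):
    """Return item i of both datasets together (hypothetical helper)."""

    def __init__(self, ds1, ds2):
        assert len(ds1) == len(ds2), "pairing by index needs equal lengths"
        self.ds1, self.ds2 = ds1, ds2

    def __getitem__(self, i):
        return self.ds1[i], self.ds2[i]

    def __len__(self):
        return len(self.ds1)


# Toy data: both datasets use the sample index as the target.
n = 32
train_ds1 = TensorDataset(torch.randn(n, 4), torch.arange(n))
train_ds2 = TensorDataset(torch.randn(n, 4), torch.arange(n))

train_dl = DataLoader(PairedDataset(train_ds1, train_ds2), batch_size=16, shuffle=True)

for (inputs1, targets1), (inputs2, targets2) in train_dl:
    # The same shuffled indices are applied to both datasets.
    print(torch.equal(targets1, targets2))  # True
```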

Arshiya_Aggarwal · May 31, 2018, 5:59am

Hey! This really helps, but what if I have a different sampler for each of the two data loaders? How do I train on them simultaneously then?
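
If each dataset needs its own sampler, one option is to keep two separate loaders and iterate over them together with `zip`. This is a sketch with toy tensors and two arbitrarily chosen samplers; note that `zip` stops as soon as the shorter loader is exhausted.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset

# Toy stand-ins for the two datasets.
dataset_a = TensorDataset(torch.randn(40, 3), torch.zeros(40, dtype=torch.long))
dataset_b = TensorDataset(torch.randn(100, 3), torch.ones(100, dtype=torch.long))

# Each loader keeps its own sampler.
loader_a = DataLoader(dataset_a, batch_size=8, sampler=RandomSampler(dataset_a))
loader_b = DataLoader(dataset_b, batch_size=8, sampler=SequentialSampler(dataset_b))

steps = 0
# loader_a has 5 batches and loader_b has 13, so this loop runs 5 times.
for (x_a, y_a), (x_b, y_b) in zip(loader_a, loader_b):
    x = torch.cat([x_a, x_b])  # one combined batch of 16 samples
    y = torch.cat([y_a, y_b])
    steps += 1

print(steps, x.shape)  # 5 torch.Size([16, 3])
```

Whether the two batches are concatenated, as here, or fed to the model separately depends on the training objective.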

kli-nlpr · July 20, 2018, 1:19pm

I have the same question. Do you have any ideas? Thank you.

PRAVEEN_KUMAR · July 30, 2018, 10:17pm

import os

import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms
from torch.utils.data import Dataset


class ConcatDataset(Dataset):
    def __init__(self, *datasets):
        self.datasets = datasets
        self.data_files = os.listdir('../data/cat1/cat05')

    def __getitem__(self, i):
        # One sample from every dataset, all taken at the same index
        return tuple(d[i] for d in self.datasets)

    def __len__(self):
        # Length of the shortest dataset; longer ones are truncated
        return min(len(d) for d in self.datasets)


# def loadImgs(des_dir="…/data/", img_size=128, batchSize=4):

dataset1 = dset.ImageFolder(root="…/data/cat1/",
                            transform=transforms.Compose([
                                transforms.Resize(128),
                                transforms.ToTensor(),
                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                            ]))
dataset2 = dset.ImageFolder(root="…/data/cat2/",
                            transform=transforms.Compose([
                                transforms.Resize(128),
                                transforms.ToTensor(),
                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                            ]))
#
# print(dataset)
dataloader = torch.utils.data.DataLoader(
    ConcatDataset(dataset1, dataset2),
    batch_size=4,  # how many samples per batch to load
    shuffle=True)

dataloadercat1 = torch.utils.data.DataLoader(
    dataset1,
    batch_size=4,  # how many samples per batch to load
    shuffle=True)

# inputs1, targets1 = next(iter(dataset1))
# inputs2, targets2 = next(iter(dataset2))

for epoch in range(1):
    # Here inputs and labels are each an (images, targets) pair, one per dataset
    for i, (inputs, labels) in enumerate(dataloader):
        print("Batch all ", i, labels)
    for i, (inputs, labels) in enumerate(dataloadercat1):
        print("Batch cat ", i)
print(ConcatDataset(dataset1, dataset2).datasets)

PRAVEEN_KUMAR · July 30, 2018, 10:32pm

After applying ConcatDataset to my dataset, only as many images as the smaller subfolder contains are used when forming batches, so some images of the other category are left out. Do you have any idea how to solve this problem?

PRAVEEN_KUMAR · July 30, 2018, 10:38pm

When I applied this approach to my dataset, only as many samples as the smaller dataset contains were used when forming batches.
For example, my data folder has two subfolders, cat1 and cat2, with 70 images in cat1 and 30 images in cat2. The DataLoader uses only 30 images from each when forming batches, so 40 images of cat1 are never used. Is there a solution that also uses the remaining images?
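
One possible way to use every image of the larger dataset is to take the maximum length instead of the minimum and wrap the index around for the shorter datasets. The sketch below is a hypothetical variant of the class from this thread (`CyclingConcatDataset` is not an existing PyTorch class) and uses toy datasets whose items are their own indices.

```python
from torch.utils.data import DataLoader, Dataset


class CyclingConcatDataset(Dataset):
    """Tuple-returning dataset that wraps around the shorter datasets (hypothetical variant)."""

    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        # i % len(d) restarts a shorter dataset once it runs out.
        return tuple(d[i % len(d)] for d in self.datasets)

    def __len__(self):
        return max(len(d) for d in self.datasets)


# Toy datasets standing in for 70 "cat1" images and 30 "cat2" images.
cat1 = list(range(70))
cat2 = list(range(30))

loader = DataLoader(CyclingConcatDataset(cat1, cat2), batch_size=4, shuffle=True)

# Record which items of each dataset are served during one epoch.
seen1, seen2 = set(), set()
for a, b in loader:
    seen1.update(a.tolist())
    seen2.update(b.tolist())

print(len(seen1), len(seen2))  # 70 30
```

The trade-off is that items of the smaller dataset are repeated two or three times per epoch.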

Haris_Cheong · September 10, 2018, 8:38am

Hi there, I have managed to use two datasets by creating a custom dataset that takes in two root directories:

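import glob
import os

import numpy as np
import PIL.Image
from skimage import io  # assumption: io in the original post is skimage.io
from torch.utils.data import Dataset


class dataset_maker(Dataset):
    def __init__(self, root_dir1, root_dir2, transform=None):
        self.root_dir1 = root_dir1
        self.root_dir2 = root_dir2
        # Sorted so that index i refers to the same file on every run
        self.filelist1 = sorted(glob.glob(os.path.join(root_dir1, '*.png')))
        self.filelist2 = sorted(glob.glob(os.path.join(root_dir2, '*.png')))
        self.transform = transform

    def __len__(self):
        # Only as many pairs as the smaller directory provides
        return min(len(self.filelist1), len(self.filelist2))

    def __getitem__(self, idx):
        # Rescale 16-bit pixel values (0..65535) to the 8-bit range (0..255)
        sample1 = io.imread(self.filelist1[idx]) / 65535 * 255
        sample2 = io.imread(self.filelist2[idx]) / 65535 * 255
        sample1 = np.uint8(sample1)
        sample2 = np.uint8(sample2)
        sample1 = PIL.Image.fromarray(sample1)
        sample2 = PIL.Image.fromarray(sample2)
        if self.transform:
            sample1 = self.transform(sample1)
            sample2 = self.transform(sample2)
        return sample1, sample2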

Then, make a DataLoader from the combined dataset:

dataloader = DataLoader(combined_dataset, batch_size=3, shuffle=True, num_workers=4)

Finally, I get the data in the training loop with this for loop:

for epoch in range(10):
    running_loss=0.0
    
    #get the data
    for batch_num, (hq_batch,Lq_batch) in enumerate(dataloader):
        print(batch_num, hq_batch.shape, Lq_batch.shape)

The output is stated below:

0 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
1 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
2 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
3 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
4 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
5 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
6 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
7 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
8 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
9 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
10 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
11 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])
12 torch.Size([3, 3, 256, 256]) torch.Size([3, 3, 256, 256])

Hope this solves the problem!

Sumesh_Uploader · September 14, 2018, 3:01pm

How should this combined_dataset be constructed? Should it be an instance of ConcatDataset? I got this error when I tried to build the combined dataset using dset.ImageFolder: "TypeError: expected str, bytes or os.PathLike object, not ConcatDataset"

miladiouss · September 14, 2018, 9:53pm

I’m not sure if the for loop is set up correctly, since __getitem__ is returning ((data points for A, labels for A), (data points for B, labels for B)).
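
To illustrate the point, here is a sketch of a loop that unpacks the nested structure explicitly. It uses the tuple-returning class from earlier in the thread with toy tensors instead of image folders; the variable names are chosen for the example.

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset


class ConcatDataset(Dataset):
    # Same tuple-returning class as posted earlier in the thread.
    def __init__(self, *datasets):
        self.datasets = datasets

    def __getitem__(self, i):
        return tuple(d[i] for d in self.datasets)

    def __len__(self):
        return min(len(d) for d in self.datasets)


# Toy (image, label) datasets: labels are 0 for A and 1 for B.
dataset_a = TensorDataset(torch.randn(12, 3, 8, 8), torch.zeros(12, dtype=torch.long))
dataset_b = TensorDataset(torch.randn(20, 3, 8, 8), torch.ones(20, dtype=torch.long))
train_loader = DataLoader(ConcatDataset(dataset_a, dataset_b), batch_size=4, shuffle=True)

# Each batch holds one (inputs, targets) pair per dataset.
for i, ((input_a, target_a), (input_b, target_b)) in enumerate(train_loader):
    print(i, input_a.shape, target_a.shape, input_b.shape, target_b.shape)
```

With a flat `(input, target)` unpacking, `input` would hold the whole pair from A and `target` the whole pair from B.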

Vij · February 27, 2019, 6:35pm

How do you train your model?
You passed batch_size=3 to the DataLoader, but each iteration gives you two batches of 3 samples. How do you use them to train the model?
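
How the two batches are used depends on the task, which the post above does not state. One plausible reading of the names hq_batch and Lq_batch is that one batch is the network input and the other is the target. The sketch below shows a training step under that assumption only; the model, loss, optimizer, and data are placeholders, not the original poster's setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for a dataset that yields (hq, lq) image pairs.
hq = torch.rand(12, 3, 16, 16)
lq = hq + 0.1 * torch.randn_like(hq)
dataloader = DataLoader(TensorDataset(hq, lq), batch_size=3, shuffle=True)

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

for batch_num, (hq_batch, lq_batch) in enumerate(dataloader):
    optimizer.zero_grad()
    prediction = model(lq_batch)            # assumed: low-quality batch is the input
    loss = criterion(prediction, hq_batch)  # assumed: high-quality batch is the target
    loss.backward()
    optimizer.step()
```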

Vij · February 27, 2019, 6:49pm

Did you find a solution for this, i.e. when the samplers are not of the same size?
Another question: can we use the concat dataset for more than 2 samplers?

vishalthengane · March 19, 2019, 3:00am

Hi There,

Suppose I have two training datasets of different sizes and I want to train a network on both of them simultaneously. Is that possible? I also need to keep track of which dataset each image comes from, in order to compute the loss after each iteration with the equation:

where
L0 and L1 are the lengths of the datasets and Lambda is a balancing constant.

Thank you.