Can a DataLoader be used for two datasets in PyTorch?

How can I use the DataLoader to load two different datasets, like this:


train_set_1 = DataCustom(path=path, train=True)
train_set_2 = DataCustom(path=path_2, train=True)
train_loader = torch.utils.data.DataLoader(dataset=(train_set_1, train_set_2),
                                           batch_size=args.batch_size,
                                           pin_memory=True,
                                           shuffle=True,
                                           )

test_set_1 = DataCustom(path=path, train=False)
test_set_2 = DataCustom(path=path_2, train=False)
test_loader = torch.utils.data.DataLoader(dataset=(test_set_1, test_set_2),
                                          batch_size=args.batch_size,
                                          pin_memory=True,
                                          shuffle=False,
                                          )

instead of writing two separate DataLoaders, like this:

# Load the first dataset
train_set = DataCustom(path=path, train=True)
train_loader = torch.utils.data.DataLoader(dataset=train_set,
                                           batch_size=args.batch_size,
                                           pin_memory=True,
                                           shuffle=True,
                                           )
test_set = DataCustom(path=path, train=False)
test_loader = torch.utils.data.DataLoader(dataset=test_set,
                                          batch_size=args.batch_size,
                                          pin_memory=True,
                                          shuffle=False,
                                          )

# Load the second dataset
train_set_2 = DataCustom(path=path_2, train=True)
train_loader_2 = torch.utils.data.DataLoader(dataset=train_set_2,
                                             batch_size=args.batch_size,
                                             pin_memory=True,
                                             shuffle=True,
                                             )

test_set_2 = DataCustom(path=path_2, train=False)
test_loader_2 = torch.utils.data.DataLoader(dataset=test_set_2,
                                            batch_size=args.batch_size,
                                            pin_memory=True,
                                            shuffle=False,
                                            )

Thanks in advance (@ptrblck, special thanks to you dude :smile:)

ConcatDataset is probably what you want.
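For example, a minimal sketch that chains both training sets into a single dataset, reusing the hypothetical DataCustom / path / args names from your snippets:

from torch.utils.data import ConcatDataset, DataLoader

# ConcatDataset simply chains the samples of both datasets,
# so one DataLoader can draw batches from either of them.
train_set_1 = DataCustom(path=path, train=True)
train_set_2 = DataCustom(path=path_2, train=True)

train_loader = DataLoader(dataset=ConcatDataset([train_set_1, train_set_2]),
                          batch_size=args.batch_size,
                          pin_memory=True,
                          shuffle=True,
                          )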

PS: I’m not a fan of tagging, as this might discourage others from posting an answer. :wink:

Yeah, you are right. It's just that you are the one who has answered all my questions so far, so I thought it would be the same. That's all.

I tried to do something like this:


def train(epoch):
    for batch_idx, (data, target), (data_2, target_2) in enumerate(train_loader, train_loader_2):
        if use_cuda:
            data, target = data.cuda(), target.cuda()
            data_2, target_2 = data_2.cuda(), target_2.cuda()

        data, target = Variable(data), Variable(target)
        data_2, target_2 = Variable(data_2), Variable(target_2)

        optimizer.zero_grad()

        data = data.float()
        data_2 = data_2.float()
        output = model(data, data_2)

        prec1, = accuracy(output.data, target.data)
        loss = criterion(output, torch.max(target, 1)[0])
        loss.backward()
        optimizer.step()

Since both datasets have the same targets, the accuracy will remain the same. I kept using two DataLoaders, one for each dataset, and I'm wondering if there is any way to iterate over them together like this?

If you are using shuffle=True, as seems to be the case, the target tensors will most likely be different.

I misunderstood your use case, as I thought you would like to extend the dataset and call the additional samples in a sequential way.
Could you explain the correspondence between both datasets, i.e. is train_set[0] corresponding to train_set_2[0]?
If so, I would suggest implementing a custom Dataset and returning the pair for a single index.
Here is a small example:



from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, dataset1, dataset2):
        self.dataset1 = dataset1
        self.dataset2 = dataset2

    def __getitem__(self, index):
        # Fetch the corresponding sample from each dataset
        x1, y1 = self.dataset1[index]
        x2, y2 = self.dataset2[index]

        # Both samples are expected to share the same target
        if (y1 != y2).any():
            raise RuntimeError('ERROR! Target mismatch')

        return x1, x2, y1

    def __len__(self):
        return len(self.dataset1)  # Assuming both have same length
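
To use it, you would wrap both datasets and pass the combined one to a single DataLoader. A minimal usage sketch, reusing the hypothetical train_set / train_set_2 and model from your code above:

paired_set = MyDataset(train_set, train_set_2)
train_loader = torch.utils.data.DataLoader(dataset=paired_set,
                                           batch_size=args.batch_size,
                                           pin_memory=True,
                                           shuffle=True,
                                           )

# Each batch now yields both inputs plus their shared target
for batch_idx, (data, data_2, target) in enumerate(train_loader):
    output = model(data.float(), data_2.float())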

How can I use enumerate for the two datasets?

You would zip both loaders and wrap the unpacking in additional parentheses:

for idx, ((data1, target1), (data2, target2)) in enumerate(zip(loader1, loader2)):
    print(data1.shape)
    print(data2.shape)
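
Note that zip stops at the shorter of the two loaders, so if the datasets have different lengths, the remaining batches of the longer one will be skipped.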

Thanks a lot, that's what I was looking for, a thousand thumbs up :+1: