Use two datasets simultaneously and feed them into different pathways of the same model

kl_divergence · July 21, 2018, 9:31am

Here is dataset_manager, in case you need it:

https://gist.github.com/aijournal/eb107cd03ebe08169c69ad7a847c0916

dataset_manager.py

from __future__ import print_function, absolute_import
import os
import glob
import re
import sys
import urllib
import tarfile
import zipfile
import os.path as osp
from scipy.io import loadmat

(Preview truncated; the full file is in the gist linked above.)

kl_divergence · July 22, 2018, 7:29am

@ptrblck Could you please tell me where I am going wrong? I followed your steps and am passing dataset (snippet above) as an argument. I am fairly new to managing multiple datasets at once, and I could not find any existing implementations for this.

ptrblck · July 22, 2018, 8:24am

What kind of error message do you get?
In your code snippet you are not deriving from Dataset, and apparently you use the global dataset without passing it in.
Do you get a valid sample calling dataset[0]?

kl_divergence · July 22, 2018, 2:22pm

dataset.train gives me a list (here’s a sample)

[('data/dukemtmc-reid/DukeMTMC-reID/bounding_box_train/1174_c5_f0065370.jpg',
  599,
  4),
 ('data/dukemtmc-reid/DukeMTMC-reID/bounding_box_train/0296_c2_f0103512.jpg',
  203,
  1),

in the order image path, pid, camid.
When I run this:

class MyDatasetA(dataset.train):
    
    def __init__(self):
        self.data = dataset.train
    
    def __getitem__(self,index):
        return self.data[index]
    
    def __len__(self):
        return len(self.data)

It gives the following error:
TypeError: list() takes at most 1 argument (3 given). What does your method expect? What is Dataset in your approach?

class MyDatasetA(Dataset):
    def __init__(self):
        self.data = torch.randn(640, 3, 224, 224)

I’m trying to figure out how to adapt what I’ve been using to your approach.

ptrblck · July 22, 2018, 3:51pm

You have to implement the loading logic for your dataset in __getitem__ so that you can return a data and target sample.
In my simple example I just created random image data, which is sliced in __getitem__.
In your case, you would have to get the image path etc. and load the image as you probably have done in your old Dataset.
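
For what it's worth, the TypeError reported above seems to come from the class statement itself rather than from anything in PyTorch: dataset.train is a list instance, and when an instance is used as a base class, Python ends up calling its type (list) with the class name, the bases, and the namespace. A minimal sketch, assuming a made-up two-item list with placeholder file names, reproduces it:

```python
# Hypothetical stand-in for dataset.train: a list of (path, pid, camid) tuples.
samples = [("a.jpg", 599, 4), ("b.jpg", 203, 1)]

error = None
try:
    # A list *instance* used as a base class: Python ends up calling
    # list("Broken", (samples,), namespace), i.e. list() with 3 arguments.
    class Broken(samples):
        pass
except TypeError as exc:
    error = exc

# The exact wording of the message depends on the Python version.
print(type(error).__name__, error)
```

Deriving from torch.utils.data.Dataset and keeping the list as an attribute avoids this.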

ptrblck · July 22, 2018, 9:32pm

Basically your code should look like this:

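from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms.functional as TF

class MyDatasetA(Dataset):  # derive from Dataset, not from the list dataset.train

    def __init__(self, data):
        # data is the list of (image_path, pid, camid) tuples, e.g. dataset.train
        self.data = data

    def __getitem__(self, index):
        image_path, pid, camid = self.data[index]
        # Load image
        image = Image.open(image_path)
        # Process image
        # ....
        # Convert image to tensor
        x = TF.to_tensor(image)
        # Process and transform pid and camid
        # ....
        return x, pid, camid

    def __len__(self):
        return len(self.data)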

I’m not sure, though, how you want to use pid and camid.

kl_divergence · July 23, 2018, 4:40pm

I get what you’re saying, but if you have a look at dataset_manager, I am doing a lot of preprocessing there and I am unable to integrate it with __getitem__. I am dealing with three datasets, so __getitem__ would differ for each of them, and an altogether different approach seems to be required. Is there any other way out?

pids are the target labels, which I’m passing to my loss function; camids are used to create a mask for sampling at a later stage:

output_fc = "fc1"
for batch,(imgs,pids,camids) in enumerate(trainloader):
    clf_outputs = model(imgs.cuda())
    fc1 = clf_outputs[output_fc]
    cross_entropy_loss = cross_entropy(fc1,pids)

dataset = dataset_manager.init_img_dataset(
    root='data',name=dataset_name,split_id=split_id,
    cuhk03_labeled=cuhk03_labeled,cuhk03_classic_split=cuhk03_classic_split,
)

My trainloader looks like:

trainloader = DataLoader(
    ImageDataset(dataset.train,transform=tfms_train),
    sampler = RandomIdentitySampler(dataset.train,num_instances=num_instances),
    batch_size = 32 ,num_workers=workers,
    pin_memory=pin_memory,drop_last=True,
)

Dataset_manager here. It is big (separate for three datasets, since their formats differ).

tfms_train = tfms.Compose([
    tfms.Random2DTranslation(height,width),
    tfms.RandomHorizontalFlip(),
    tfms.ToTensor(),
    tfms.Normalize(mean=[0.485,0.456,0.406],std=[0.229,0.224,0.225]),
])

class ImageDataset(Dataset):
    def __init__(self,dataset,transform=None):
        self.dataset = dataset
        self.transform = transform
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self,index):
        img_path,pid,camid = self.dataset[index]
        img = read_image(img_path)
        if self.transform is not None:
            img = self.transform(img)
        return img,pid,camid

class RandomIdentitySampler(Sampler):
    """
    Randomly sample N identities, then for each identity,
    randomly sample K instances, therefore batch size is N*K.
    Args:
        data_source (Dataset): dataset to sample from.
        num_instances (int): number of instances per identity.
    """
    def __init__(self, data_source, num_instances=4):
        self.data_source = data_source
        self.num_instances = num_instances
        self.index_dic = defaultdict(list)
        for index, (_, pid, _) in enumerate(data_source):
            self.index_dic[pid].append(index)
        self.pids = list(self.index_dic.keys())
        self.num_identities = len(self.pids)

    def __iter__(self):
        indices = torch.randperm(self.num_identities)
        ret = []
        for i in indices:
            pid = self.pids[i]
            t = self.index_dic[pid]
            replace = len(t) < self.num_instances
            t = np.random.choice(t, size=self.num_instances, replace=replace)
            ret.extend(t)

        return iter(ret)

    def __len__(self):
        return self.num_identities * self.num_instances
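
To make the index order this sampler produces easier to see, here is a torch-free sketch of the same grouping logic. It is a stand-in rather than the class above: it swaps torch.randperm and np.random.choice for Python's random module, and the function name, the seed argument, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def identity_grouped_indices(data_source, num_instances=4, seed=0):
    """Return indices so that each identity contributes num_instances in a row."""
    rng = random.Random(seed)
    index_dic = defaultdict(list)
    for index, (_, pid, _) in enumerate(data_source):
        index_dic[pid].append(index)

    pids = list(index_dic)
    rng.shuffle(pids)  # random identity order (stands in for torch.randperm)

    ret = []
    for pid in pids:
        candidates = index_dic[pid]
        if len(candidates) >= num_instances:
            # Enough images for this identity: sample without replacement.
            ret.extend(rng.sample(candidates, num_instances))
        else:
            # Too few images: sample with replacement, so repeats are allowed.
            ret.extend(rng.choices(candidates, k=num_instances))
    return ret


# Made-up (image_path, pid, camid) entries; pid 3 has only one image.
toy = [("img0.jpg", 7, 1), ("img1.jpg", 7, 2), ("img2.jpg", 9, 1),
       ("img3.jpg", 9, 2), ("img4.jpg", 9, 3), ("img5.jpg", 3, 1)]

indices = identity_grouped_indices(toy, num_instances=2)
# pids of each consecutive pair of indices; every pair shares one identity
groups = [[toy[i][1] for i in indices[k:k + 2]] for k in range(0, len(indices), 2)]
```

Because the DataLoader cuts the sampler's output into consecutive chunks of batch_size, a batch size of 32 with, say, num_instances = 4 (the thread does not state the value used) would give 8 identities per batch.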

Later for testing purposes, I use:

queryloader = DataLoader(
    ImageDataset(dataset.query,transform=tfms_test),
    batch_size=test_batch,shuffle=False,num_workers=workers,
    pin_memory=pin_memory,drop_last=False,
)

galleryloader = DataLoader(
    ImageDataset(dataset.gallery,transform=tfms_test),
    batch_size=test_batch,shuffle=False,num_workers=workers,
    pin_memory=pin_memory,drop_last=False, 
)

Can I somehow combine two trainloaders (one for datasetA and one for datasetB) into one, using this or some closely related approach? @ptrblck

ptrblck · July 24, 2018, 8:44pm

I’ve skimmed your code and tried to understand the complications.

I assume your current code works.
If so, would this work for you:

class ConcatImageDataset(Dataset):
    def __init__(self,datasetA, datasetB, transformA=None, transformB=None):
        self.datasetA = datasetA
        self.datasetB = datasetB
        self.transformA = transformA
        self.transformB = transformB
    
    def __len__(self):
        return len(self.datasetA)  # both datasets should have the same size
    
    def __getitem__(self,index):
        img_pathA, pidA, camidA = self.datasetA[index]
        imgA = read_image(img_pathA)
        if self.transformA is not None:
            imgA = self.transformA(imgA)
        
        img_pathB, pidB, camidB = self.datasetB[index]
        imgB = read_image(img_pathB)
        if self.transformB is not None:
            imgB = self.transformB(imgB)
            
        return imgA, imgB, pidA, pidB, camidA, camidB
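
If the two datasets differ in length, or each needs its own sampler (such as RandomIdentitySampler), another option is to keep two separate DataLoaders and draw one batch from each per step. The helper below is a hypothetical sketch, not part of PyTorch: paired_batches is a made-up name, plain lists stand in for the loaders, and loader_b is assumed to be non-empty.

```python
def paired_batches(loader_a, loader_b):
    """Yield (batch_a, batch_b) pairs for one full pass over loader_a.

    loader_b is restarted whenever it runs out, so it may be shorter
    than loader_a. Assumes loader_b yields at least one batch.
    """
    iter_b = iter(loader_b)
    for batch_a in loader_a:
        try:
            batch_b = next(iter_b)
        except StopIteration:
            # Start a fresh pass over loader_b (a shuffling DataLoader
            # would draw a new random order at this point).
            iter_b = iter(loader_b)
            batch_b = next(iter_b)
        yield batch_a, batch_b


# Stand-ins for two DataLoaders of different lengths (hypothetical data).
loader_a = ["a0", "a1", "a2", "a3", "a4"]
loader_b = ["b0", "b1"]

pairs = list(paired_batches(loader_a, loader_b))
```

In a training loop this would read roughly as `for (imgsA, pidsA, camidsA), (imgsB, pidsB, camidsB) in paired_batches(trainloaderA, trainloaderB):`, with each batch fed to its own pathway of the model; trainloaderA and trainloaderB are assumed names for two loaders built like the trainloader earlier in the thread.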