Use two datasets simultaneously and feed into different pathways of same model

kl_divergence · July 18, 2018, 4:27pm

My model is like this:

class hybrid_cnn(nn.Module):
    def __init__(self,**kwargs):
        super(hybrid_cnn,self).__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        self.base = nn.Sequential(*list(resnet.children())[:-2])
        # ... (rest of __init__ and the start of forward() are not shown in the post)
        return clf_outputs

I have two pathways, augmented1 and augmented2, each a set of convolutional layers. I want to feed two datasets (Dataset A and Dataset B) to my base model, whose output is then passed on to these two pathways. This has to happen within a batch: with a batch size of 64, 32 samples from Dataset A should go through pathway1 (base model -> pathway1) and the other 32 samples from Dataset B through pathway2 (base model -> pathway2). The base model is common to both pathways. I also want the batches from both datasets to be picked in the same order when shuffle=True, since I will be extracting feature maps from both pathways.

kl_divergence · July 20, 2018, 2:40pm

@ptrblck could you please provide some solution?

aplassard · July 20, 2018, 2:43pm

You could set up a separate forward function for each dataset. For instance:

def forwardA(self, x):
    x = self.base(x)
    clf_output = getattr(self, "fc0")(x)
    return clf_output

def forwardB(self, x):
    x = self.base(x)
    clf_output = getattr(self, "fc1")(x)
    return clf_output

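A minimal sketch of how two such forward functions could be used in one training step. It assumes a tiny convolutional stand-in for the ResNet base and two linear heads; `TwoHeadNet`, the layer sizes, and the dummy half-batches are hypothetical and only illustrate the idea.

```python
import torch
import torch.nn as nn


class TwoHeadNet(nn.Module):
    """Hypothetical stand-in: a shared base followed by one head per dataset."""

    def __init__(self):
        super().__init__()
        # Tiny shared base instead of the ResNet-50 trunk, to keep the sketch fast.
        self.base = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc0 = nn.Linear(8, 2)  # head for dataset A
        self.fc1 = nn.Linear(8, 2)  # head for dataset B

    def forwardA(self, x):
        return self.fc0(self.base(x))

    def forwardB(self, x):
        return self.fc1(self.base(x))


model = TwoHeadNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy half-batches of 32 samples each, standing in for dataset A and dataset B.
xa, ya = torch.randn(32, 3, 16, 16), torch.randint(0, 2, (32,))
xb, yb = torch.randn(32, 3, 16, 16), torch.randint(0, 2, (32,))

out_a = model.forwardA(xa)
out_b = model.forwardB(xb)

# Summing the two losses lets a single backward pass update the shared base
# with gradients from both pathways.
loss = criterion(out_a, ya) + criterion(out_b, yb)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

One thing to keep in mind with this design: calling `model.forwardA(...)` directly bypasses `model.__call__`, so forward hooks registered on the top-level module would not fire.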

ptrblck · July 20, 2018, 6:36pm

Let me summarize your use case; please correct me if I’m wrong.

You have two datasets and would like to get a single batch of 64 samples consisting of 32 samples of datasetA and 32 of datasetB.

The first part should go through augmented1, the second through augmented2. Both should use the base model.

Could you explain the last sentence regarding the same order with shuffle=True?

kl_divergence · July 20, 2018, 6:52pm

You’re absolutely right. What I mean by the last part is that when the samples from both datasets are shuffled (the data loader will do that), I don’t want them shuffled separately. They should be shuffled together in one go, since the images come from different sources and the order of samples across sources matters. I will be extracting feature maps afterwards.

aplassard · July 20, 2018, 6:54pm

Assuming you maintain some index of which samples are A and which are B, you can simply index into the outputs and calculate the loss on each output only with respect to the right population.
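A small sketch of that indexing idea, assuming a boolean mask that marks which samples in a mixed batch come from dataset A. The feature size, the two linear heads, and the mask values are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical mixed batch of 8 feature vectors; is_a marks samples from dataset A.
features = torch.randn(8, 16)
labels = torch.randint(0, 2, (8,))
is_a = torch.tensor([True, False, True, True, False, False, True, False])

head_a = nn.Linear(16, 2)
head_b = nn.Linear(16, 2)
criterion = nn.CrossEntropyLoss()

# Compute both outputs for the whole batch ...
out_a_all = head_a(features)
out_b_all = head_b(features)

# ... then index into each output so every loss only sees its own population.
loss_a = criterion(out_a_all[is_a], labels[is_a])
loss_b = criterion(out_b_all[~is_a], labels[~is_a])
loss = loss_a + loss_b
loss.backward()
```

Running every sample through both heads wastes some compute; indexing the features before the heads (`head_a(features[is_a])`) should give the same losses with less work.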

ptrblck · July 20, 2018, 7:39pm

OK, I think I’ve understood it.
I assume both datasets have the same length.
Here is a small example. I’ve modified your model to work with dummy data:


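import torch
import torch.nn as nn
import torchvision
from torch.utils.data import Dataset, DataLoader


class hybrid_cnn(nn.Module):
    def __init__(self, **kwargs):
        super(hybrid_cnn, self).__init__()
        resnet = torchvision.models.resnet50(pretrained=False)
        # Drop the average pooling and fc layers, keep the convolutional trunk
        self.base = nn.Sequential(*list(resnet.children())[:-2])

        # 2048 channels * 7 * 7 positions = 100352 features for 224x224 inputs
        setattr(self, "fc0", nn.Linear(100352, 2))
        setattr(self, "fc1", nn.Linear(100352, 2))

    def forward(self, x):
        x = self.base(x)
        clf_outputs = {}
        num_fcs = 2
        x = x.view(x.size(0), -1)
        # The batch is interleaved (A0, B0, A1, B1, ...): even rows come from
        # datasetA, odd rows from datasetB
        xs = [x[::2], x[1::2]]
        for i in range(num_fcs):
            clf_outputs["fc%d" % i] = getattr(self, "fc%d" % i)(xs[i])

        return clf_outputs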


class MyDatasetA(Dataset):
    def __init__(self):
        self.data = torch.randn(640, 3, 224, 224)
        
    def __getitem__(self, index):
        return self.data[index]
    
    def __len__(self):
        return len(self.data)

    
class MyDatasetB(Dataset):
    def __init__(self):
        self.data = torch.randn(640, 3, 224, 224)
        
    def __getitem__(self, index):
        return self.data[index]
    
    def __len__(self):
        return len(self.data)


class MyDatasetC(Dataset):
    def __init__(self):
        self.datasetA = MyDatasetA()
        self.datasetB = MyDatasetB()
        
    def __getitem__(self, index):
        dataA = self.datasetA[index].unsqueeze(0)
        dataB = self.datasetB[index].unsqueeze(0)
        
        data = torch.cat((dataA, dataB), 0)
        return data
    
    def __len__(self):
        return len(self.datasetA)
    
dataset = MyDatasetC()
x = dataset[0]
x.shape

loader = DataLoader(
    dataset,
    batch_size=64,  # 64 pairs -> 128 images per batch; use 32 for 32 + 32 = 64 images
    shuffle=True,
    num_workers=1
)

# Your training routine (just one iteration)
loader_iter = iter(loader)
x = next(loader_iter)  # shape: [batch_size, 2, 3, 224, 224]
x = x.view(-1, 3, 224, 224)  # pairs are flattened in interleaved order: A, B, A, B, ...
model = hybrid_cnn()
output = model(x)

Let me know if this works for you.

EDIT: I think the datasets are currently interleaved. Let me check it real quick.
EDIT2: Should work now.
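The interleaving mentioned in the EDIT can be checked on tiny tensors. This is only an illustrative sketch: the shapes are shrunk, and the values 0 and 1 are used to mark which hypothetical dataset a sample came from.

```python
import torch

# Two hypothetical "datasets" of 4 tiny samples each; the values encode the
# source (0 for A, 1 for B) so the ordering is easy to inspect.
data_a = torch.zeros(4, 1, 2, 2)
data_b = torch.ones(4, 1, 2, 2)

# What a loader yields for a paired dataset: [batch, 2, C, H, W].
batch = torch.stack((data_a, data_b), dim=1)

# Flattening the pair dimension interleaves the sources: A0, B0, A1, B1, ...
flat = batch.view(-1, 1, 2, 2)
order = flat[:, 0, 0, 0].tolist()  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# Strided slicing recovers the two halves.
from_a = flat[::2]
from_b = flat[1::2]
```

So after `x.view(-1, 3, 224, 224)`, even positions hold samples from the first dataset and odd positions hold samples from the second, which is why the model slices with `x[::2]` and `x[1::2]`.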

kl_divergence · July 21, 2018, 4:58am

Thanks for helping! I had a quick question. Here’s what I’ve implemented:

__img_factory = {
    'market1501': Market1501,
    'cuhk03': CUHK03,
}

def init_img_dataset(name, **kwargs):
    if name not in __img_factory.keys():
        raise KeyError("Invalid dataset, got '{}', but expected to be one of {}".format(name, __img_factory.keys()))
    return __img_factory[name](**kwargs)

class ImageDataset(Dataset):
    def __init__(self,dataset,transform=None):
        self.dataset = dataset
        self.transform = transform
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self,index):
        img_path,pid,camid = self.dataset[index]
        img = read_image(img_path)
        if self.transform is not None:
            img = self.transform(img)
        return img,pid,camid

dataset = dataset_manager.init_img_dataset(
    root='data',name=dataset_name
)

And my trainloader looks like this:

trainloader = DataLoader(
    ImageDataset(dataset.train,transform=tfms_train),
    sampler = RandomIdentitySampler(dataset.train,num_instances=num_instances),
    batch_size = train_batch,num_workers=workers,
    pin_memory=pin_memory,drop_last=True,
)

How can I fit these into self.data in the __init__ method?

ptrblck · July 21, 2018, 6:43am

You could also create them outside your DatasetC class, just pass the instance into __init__, and assign it to the member.
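A rough sketch of that suggestion, assuming both datasets already exist as objects that support indexing and `len()`. `PairedDataset` is a hypothetical name, and the `TensorDataset` instances are stand-ins for real datasets that return an image and an id per sample.

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset


class PairedDataset(Dataset):
    """Hypothetical wrapper: returns the index-th sample of two existing datasets."""

    def __init__(self, dataset_a, dataset_b):
        # The datasets are built elsewhere and only stored here.
        self.dataset_a = dataset_a
        self.dataset_b = dataset_b

    def __getitem__(self, index):
        return self.dataset_a[index], self.dataset_b[index]

    def __len__(self):
        # Both datasets are assumed to have equal length; min() guards a mismatch.
        return min(len(self.dataset_a), len(self.dataset_b))


# Stand-ins for real datasets: each sample is (image, sample_id).
ids = torch.arange(100)
dataset_a = TensorDataset(torch.randn(100, 3, 8, 8), ids)
dataset_b = TensorDataset(torch.randn(100, 3, 8, 8), ids)

# One shuffled index stream drives both datasets, so pairs stay aligned.
loader = DataLoader(PairedDataset(dataset_a, dataset_b), batch_size=32, shuffle=True)

(img_a, id_a), (img_b, id_b) = next(iter(loader))
```

Because the loader shuffles indices of the wrapper rather than of each dataset separately, `id_a` and `id_b` contain the same ids in the same order within every batch.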

kl_divergence · July 21, 2018, 6:46am

I didn’t get you. You used a Tensor in self.data, but my dataset is an object which, when wrapped in ImageDataset, gives an iterator. How can I make it usable as self.data in datasetA and datasetB?

ptrblck · July 21, 2018, 6:56am

Instead of the random tensor I used in datasetA and datasetB, you should use valid data like in your ImageDataset.

From your code snippet it looks like you just have one dataset and the RandomIdentitySampler somehow samples the two batches?

kl_divergence · July 21, 2018, 7:00am

It randomly samples N identities, then for each identity
randomly samples K instances, so the batch size is N*K.
Here is the snippet:

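from collections import defaultdict

import numpy as np
import torch
from torch.utils.data.sampler import Sampler


class RandomIdentitySampler(Sampler):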
    """
    Args:
        data_source (Dataset): dataset to sample from.
        num_instances (int): number of instances per identity.
    """
    def __init__(self, data_source, num_instances=4):
        self.data_source = data_source
        self.num_instances = num_instances
        self.index_dic = defaultdict(list)
        for index, (_, pid, _) in enumerate(data_source):
            self.index_dic[pid].append(index)
        self.pids = list(self.index_dic.keys())
        self.num_identities = len(self.pids)

    def __iter__(self):
        indices = torch.randperm(self.num_identities)
        ret = []
        for i in indices:
            pid = self.pids[i]
            t = self.index_dic[pid]
            # Sample with replacement only if the identity has too few instances
            replace = len(t) < self.num_instances
            t = np.random.choice(t, size=self.num_instances, replace=replace)
            ret.extend(t)
        return iter(ret)

    def __len__(self):
        return self.num_identities * self.num_instances

ptrblck · July 21, 2018, 7:17am

Ok, so is your code working already? I posted another approach using two separate Datasets while you are apparently sampling the two classes from one.
If your code is not working properly, could you try to adapt it to mine?

kl_divergence · July 21, 2018, 7:19am

I want to adapt to yours only. Mine doesn’t work the way I described. While adapting to yours, I wanted to know how I can feed my datasets into DatasetA and DatasetB, since mine have a different form. I am not sure how to add both datasets in the __init__ of datasetA and datasetB. I just wanted you to know how I was performing training earlier; I want to transition to yours completely.

ptrblck · July 21, 2018, 7:49am

Ok, got it.
It seems that your datasets are somehow loaded using __img_factory. You just need to pass the name to your dataset_manager and it will create the appropriate dataset for the passed class?
If so, can you create two separate datasets for your two classes?

kl_divergence · July 21, 2018, 9:00am

That’s where I’m stuck: using those two classes (datasetA and datasetB) with dataloaders.
Earlier I was iterating like this:

for batch,(imgs,pids,camids) in enumerate(trainloader):

Now I want to feed half of the batch to augmented1 (datasetA) and the other half to augmented2 (datasetB). Your approach (self.data) expects a Tensor; instead I want to use your approach with the form I have, i.e.

dataset_ = dataset_manager.init_img_dataset(
    root='data',name=dataset_name # Can be datasetA or datasetB
)

So, following your approach, I want to implement this in class datasetA and class datasetB,
so that it takes the form self.data = dataset_. How can I do that?

ptrblck · July 21, 2018, 9:22am

Just try to assign your dataset to self.data.
My classes currently use tensors, but you basically need a class which returns tensors when indexed. Your Dataset should be just fine.

kl_divergence · July 21, 2018, 9:24am

I already tried this; this is what I get:

TypeError: __init__() takes from 1 to 2 positional arguments but 4 were given

ptrblck · July 21, 2018, 9:28am

Are you passing your datasets as arguments?
Have you modified your __init__?

kl_divergence · July 21, 2018, 9:29am

Yeah, here it is :

dataset = dataset_manager.init_img_dataset(
    root='data',name=dataset_name
)

class MyDatasetA(dataset):
    
    def __init__(self):
        self.data = dataset
    
    def __getitem__(self,index):
        return self.data[index]
    
    def __len__(self):
        return len(self.data)
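
One plausible reading of the TypeError above: `class MyDatasetA(dataset)` uses the lowercase `dataset` instance as a base class instead of `torch.utils.data.Dataset`. Python then tries to build the class by calling the instance's type with three extra arguments (class name, bases, namespace), which would match "takes from 1 to 2 positional arguments but 4 were given". The sketch below reproduces this with a hypothetical `FakeReIDDataset` stand-in and shows a wrapper that receives the data through `__init__` instead. The stand-in and its sample tuples are assumptions, not the real Market1501 or CUHK03 classes.

```python
class FakeReIDDataset:
    """Hypothetical stand-in for the object returned by init_img_dataset."""

    def __init__(self, root="data"):
        self.root = root
        self.train = [("img_0.jpg", 0, 0), ("img_1.jpg", 1, 0)]


dataset = FakeReIDDataset()

# Using an *instance* as a base class makes Python call
# type(dataset)(name, bases, namespace), i.e. FakeReIDDataset.__init__ with
# three extra positional arguments.
try:
    class Broken(dataset):
        pass
except TypeError as err:
    error_message = str(err)


class MyDatasetA:
    """Plain map-style wrapper; real code would subclass torch.utils.data.Dataset."""

    def __init__(self, data):
        # The data is passed in as an argument, not used as a base class.
        self.data = data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


dataset_a = MyDatasetA(dataset.train)
```

With this shape, `MyDatasetA(dataset.train)` can in turn be wrapped by something like the thread's `ImageDataset` so that indexing returns image tensors.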