Dependent Batches in DataSet

Hi,
I’m trying to figure out how I should implement data loading where the examples in one batch depend on each other. The exact idea is the following:

  1. Define the number of classes per batch (num_class) and the number of examples per class (num_exp_per_class). The batch size is num_class x num_exp_per_class.
  2. In each iteration take num_class random classes, then for each class take num_exp_per_class examples.

All the examples I have seen so far use "RandomSampler" or "SequentialSampler".
Both of these samplers only shuffle the indices, but I want something more. My idea is the following:

  1. Write a sampler that generates indices using num_class and num_exp_per_class, i.e. it only generates the indices for a single batch. For that it only needs the labels.
  2. Use DataLoaderIter to sample the data, but call the sampler again after each batch. (Theoretically I could use the sampler to generate all indices at once, but I’m planning to add something like hard-negative mining, and then it will be better to select the indices after each iteration.) A rough sketch of what I mean is below.
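
Roughly, I imagine a function like the one below that I would call before every batch to get its indices. This is only a sketch: the name sample_batch_idxes and its arguments are placeholders, and class_to_idxes would be a dict built once from the labels. The open question is how to plug it into DataLoaderIter:

import torch

def sample_batch_idxes(class_to_idxes, num_class, num_exp_per_class):
    '''Sketch: return the dataset indices for one batch.

    class_to_idxes maps each label to the list of dataset indices
    that have that label.
    '''
    classes = list(class_to_idxes.keys())
    batch = []
    # pick num_class random classes for this batch
    for c in torch.randperm(len(classes))[:num_class].tolist():
        idxes = class_to_idxes[classes[c]]
        # pick num_exp_per_class random examples of that class
        for i in torch.randperm(len(idxes))[:num_exp_per_class].tolist():
            batch.append(idxes[i])
    return batch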

What do you think about this strategy?

Hey, sorry for the late reply.

I didn’t really understand your solution; a sampler is not a function, but an iterable (you can think of it as a stream of indices). I think that one way to implement what you want would be to:

  1. Use a vanilla DataLoader with batch_size=num_class.
  2. When your dataset is asked for the i-th sample, look up its class, sample num_exp_per_class - 1 other examples from that class, concatenate them into a “sub-batch” and return that (a rough sketch of this is below the list).
  3. The data loader will then return num_class x num_exp_per_class x <feature dims> elements.
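
Roughly like this, just as a sketch. I’m assuming the dataset keeps a list of (tensor, label) pairs, a label -> indices dict, and that every class has at least num_exp_per_class examples; the SubBatchDataset / samples / label_to_idxes names are made up for the example:

import random
import torch
import torch.utils.data

class SubBatchDataset(torch.utils.data.Dataset):
    '''Sketch: every item is a small "sub-batch" of examples from one class.'''

    def __init__(self, samples, label_to_idxes, num_exp_per_class):
        self.samples = samples                  # list of (tensor, label) pairs
        self.label_to_idxes = label_to_idxes    # dict: label -> list of indices
        self.num_exp_per_class = num_exp_per_class

    def __getitem__(self, i):
        _, label = self.samples[i]
        # take the requested example plus num_exp_per_class - 1 others of the same class
        others = [j for j in self.label_to_idxes[label] if j != i]
        picked = [i] + random.sample(others, self.num_exp_per_class - 1)
        imgs = torch.stack([self.samples[j][0] for j in picked], 0)
        targets = torch.LongTensor([label] * self.num_exp_per_class)
        return imgs, targets

    def __len__(self):
        return len(self.samples)

A vanilla DataLoader with batch_size=num_class would then give you batches of shape [num_class x num_exp_per_class x <feature dims>].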

Thanks for the reply.

Your idea is much simpler than mine. I implemented it and it works well. The dataset returns a tensor of size [num_example_per_class x channels x width x height].
The only downside is that the DataLoader then returns a tensor of size [num_class_per_batch x num_example_per_class x channels x width x height] (it stacks the sub-batches instead of concatenating them), so I changed torch.stack to torch.cat in collate_fn. Now it works well.
I also added a new sampler, because with 10 classes and batch_size=3 I would only get 4 batches per epoch. The new sampler has a num_iters parameter which controls how many iterations make up one epoch.
What do you think about my implementation?

Here is my code:

import collections.abc
from collections import defaultdict

import torch
import torch.utils.data
from torchvision import datasets, transforms

def collate_cat(batch):
    '''concatenate sub-batches along first dimension'''
    if torch.is_tensor(batch[0]):
        return torch.cat(batch,0)
    elif isinstance(batch[0], collections.abc.Iterable):
        # if each batch element is not a tensor, then it should be a tuple
        # of tensors; in that case we collate each element in the tuple
        transposed = zip(*batch)
        return [collate_cat(samples) for samples in transposed]

    raise TypeError(("batch must contain tensors or iterables of tensors; found {}"
                     .format(type(batch[0]))))

class RandomSampler(object):
    """Repeats random permutations of the dataset indices.

    Arguments:
        data_source (Dataset): dataset to sample from
        num_iters (int): how many full permutations make up one epoch
    """

    def __init__(self, data_source, num_iters):
        self.num_samples = len(data_source)
        self.num_iters   = num_iters

    def __iter__(self):
        # concatenate num_iters random permutations of all indices
        # (as plain ints, so they can be used directly as dataset indices)
        idxes = torch.cat([torch.randperm(self.num_samples) for _ in range(self.num_iters)], 0)
        return iter(idxes.tolist())

    def __len__(self):
        # total number of indices yielded per epoch
        return self.num_iters * self.num_samples
    
    
class TripletDataset(datasets.ImageFolder):
    def __init__(self, root, transform=None, target_transform=None, num_iters=1000,
                 class_per_batch=10, example_per_class=6):
        super(TripletDataset, self).__init__(root, transform, target_transform)
        self.class_per_batch   = class_per_batch
        self.example_per_class = example_per_class

        # dictionary mapping each class to the indices of its examples
        self.class_dict = defaultdict(list)
        for idx, (path, target) in enumerate(self.imgs):
            self.class_dict[target].append(idx)
               
        
    def __getitem__(self, index):
        # `index` selects a class; return a sub-batch of examples from that class
        class_data  = self.class_dict[index]
        num_example = min(len(class_data), self.example_per_class)
        shuffle     = torch.randperm(len(class_data))
        idxes       = shuffle[:num_example]
        list_imgs    = []
        list_targets = []
        for idx in idxes:
            # self.imgs stores (path, class index) pairs
            path, target = self.imgs[class_data[idx]]
            img = self.loader(path)
            if self.transform is not None:
                img = self.transform(img)
            if self.target_transform is not None:
                target = self.target_transform(target)
            list_imgs.append(img)
            list_targets.append(target)

        # sub-batch: [num_example x channels x height x width] plus its targets
        return torch.stack(list_imgs, 0), torch.LongTensor(list_targets)

    def __len__(self):
        # one "sample" per class
        return len(self.class_dict)



data = TripletDataset("MNIST/train/", transforms.Compose([transforms.ToTensor()]))
train_loader = torch.utils.data.DataLoader(
        data,
        batch_size=3,
        sampler=RandomSampler(data, 100 * 3),   # shuffling is handled by the sampler
        num_workers=0, pin_memory=True,
        collate_fn=collate_cat)

for img, target in train_loader:
    print(img.size(), target.size())

I think that you don’t need to provide a custom collate function. You can just call .view on the output tensors to collapse the extra dimension.
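
For example (assuming the default collate_fn and that every class contributes the same number of examples, so img has size [num_class x num_exp_per_class x channels x height x width] and target has size [num_class x num_exp_per_class]):

for img, target in train_loader:
    # collapse the first two dimensions instead of using a custom collate_fn
    img    = img.view(-1, *img.size()[2:])
    target = target.view(-1)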

About the iterator, it would be slightly more efficient to create the permutations lazily, only once they’re needed, rather than all at once when the iterator is instantiated, but as long as your dataset isn’t huge it probably doesn’t matter too much.
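
Something like this would work as a lazier version (a sketch, replacing the __iter__ method in your RandomSampler above):

    def __iter__(self):
        # build one permutation at a time instead of concatenating them all up front
        for _ in range(self.num_iters):
            for idx in torch.randperm(self.num_samples).tolist():
                yield idx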

I didn’t read the example very carefully but it looks good. :thumbsup:
