DataLoader as a list - shuffle implicit pairs

Is there a way to handle the DataLoader as a list? The idea is that I want to shuffle implicit pairs of images, without setting shuffle to True.

Basically, I have, for example, 10 scenes, each containing let's say 100 frames, so they are represented inside the directory as
'1_1.png', '1_2.png', '1_3.png', ..., '2_1.png', '2_2.png', '2_3.png', ..., '3_1.png', '3_2.png', '3_3.png', ..., '10_1.png', '10_2.png', '10_3.png'

I don't want complete shuffling of the data; what I simply want is to shuffle while keeping pairs together, so they are represented in the DataLoader as
[ '1_3.png', '1_4.png', '2_2.png', '2_3.png', '10_1.png', '10_2.png', '1_2.png', '1_3.png', ...]
and so on

Please have a look at this question I have already asked on Stack Overflow concerning shuffling an array of implicit pairs; it should make clear what I mean.

As an example, if this is the input list

L = [['1_1'],['1_2'],['1_3'],['1_4'],['1_5'],['1_6'],['2_1'],['2_2'],['2_3'],['2_4'],['2_5'],['2_6'],['3_1'],['3_2'],['3_3'],['3_4'],['3_5'],['3_6']]

then this is the output

[['1_2'], ['1_3'], ['2_1'], ['2_2'], ['2_4'], ['2_5'],
 ['2_2'], ['2_3'], ['1_3'], ['1_4'], ['3_4'], ['3_5'],
 ['3_3'], ['3_4'], ['3_2'], ['3_3'], ['1_6'], ['2_1'],
 ['2_5'], ['2_6'], ['2_6'], ['3_1'], ['1_4'], ['1_5'],
 ['1_1'], ['1_2'], ['2_3'], ['2_4'], ['1_5'], ['1_6'],
 ['3_1'], ['3_2'], ['3_5'], ['3_6']]
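
For reference, here is a minimal plain-Python sketch of this kind of pair shuffling, assuming overlapping sliding-window pairs (as the output above suggests; boundary pairs such as ['1_6'], ['2_1'] are kept, matching that output):

import random

L = [['1_1'], ['1_2'], ['1_3'], ['1_4'], ['1_5'], ['1_6'],
     ['2_1'], ['2_2'], ['2_3'], ['2_4'], ['2_5'], ['2_6'],
     ['3_1'], ['3_2'], ['3_3'], ['3_4'], ['3_5'], ['3_6']]

# Overlapping pairs: a sliding window of size 2 with step 1.
pairs = [L[i:i + 2] for i in range(len(L) - 1)]
# Shuffle the pairs as units, then flatten back to single items.
random.shuffle(pairs)
shuffled = [item for pair in pairs for item in pair]
print(shuffled)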

I want to achieve this for a DataLoader

The main idea is that I want to train my network on sequential frames. It doesn't have to be the complete sequence, but at each step I need at least two consecutive frames.


If you sort your sequences so that you’ll have adjacent frames, what should a batch contain?
E.g. if you specify batch_size=3, you would get a batch containing frames [1_1], [1_2], [3_4].
Would this be OK?
What if you get [1_1], [1_2], [1_2]? Would this still be alright?

Or should the batch only contain sequential frames of one sequence, e.g. [1_3], [1_4], [1_5]?

@ptrblck It is OK for me to have batch_size=1, because I already save the previous batch before I process the next one. Due to memory consumption I cannot increase the batch size, so I save the previous tensor manually. The most important thing for me is that every 2nd batch is a frame consecutive to the previous one,
so I can have [1_1], [1_2], [3_3], [3_4], [4_8], [4_9] ...

OK, I see.
You could create a Sampler and shuffle the paired indices.
We would have to take care of invalid indices, e.g. between sequences ([1_4], [2_1]), which should not be possible.
I’ve created a small example:


import torch
from torch.utils.data import Dataset, DataLoader


class MySampler(torch.utils.data.Sampler):
    def __init__(self, data_source, invalid_idx):
        self.data_source = data_source
        self.invalid_idx = invalid_idx

    def __iter__(self):
        # Create overlapping index pairs: (0, 1), (1, 2), (2, 3), ...
        indices = torch.arange(len(self.data_source))
        paired_indices = indices.unfold(0, 2, 1)
        # Drop pairs that would cross a sequence boundary.
        paired_indices = torch.stack(
            [paired_indices[i] for i in range(len(paired_indices))
                if i not in self.invalid_idx])
        # Shuffle the pairs as units, then flatten back to single indices.
        paired_indices = paired_indices[torch.randperm(len(paired_indices))]
        indices = paired_indices.view(-1)

        return iter(indices.tolist())

    def __len__(self):
        # Each valid pair yields two samples.
        return 2 * (len(self.data_source) - 1 - len(self.invalid_idx))


class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __getitem__(self, index):
        x = self.data[index]
        return x
    
    def __len__(self):
        return len(self.data)


# Toy data: three "sequences" (11-13, 21-23, 31-33).
data = torch.tensor([11, 12, 13, 21, 22, 23, 31, 32, 33], dtype=torch.float)
# Pair indices 2 -> (13, 21) and 5 -> (23, 31) cross a sequence boundary.
invalid_idx = torch.tensor([2, 5])
dataset = MyDataset(data)
sampler = MySampler(dataset, invalid_idx)
loader = DataLoader(
    dataset,
    batch_size=1,
    sampler=sampler
)

for x in loader:
    print(x)

Using this approach you would have to make sure the data is sorted and calculate the invalid indices (the sequence changes) beforehand.

Let me know if that works for you.


@ptrblck
What I have is that the data is read from a folder like this:

class ImageFolder(data.Dataset):

    def __init__(self, root, transform=None, return_paths=False,
                 loader=default_loader):
        imgs = sorted(make_dataset(root))
        if len(imgs) == 0:
            raise RuntimeError("Found 0 images in: " + root + "\n"
                               "Supported image extensions are: " +
                               ",".join(IMG_EXTENSIONS))

        self.root = root
        self.imgs = imgs
        self.transform = transform
        self.return_paths = return_paths
        self.loader = loader

    def __getitem__(self, index):
        path = self.imgs[index]
        img = self.loader(path)
        if self.transform is not None:
            img = self.transform(img)
        if self.return_paths:
            return img, path
        else:
            return img

    def __len__(self):
        return len(self.imgs)

def get_data_loader_folder(input_folder, batch_size, train, new_size=None,
                           height=256, width=256, num_workers=4, crop=True):
    transform_list = [transforms.ToTensor(),
                      transforms.Normalize((0.5, 0.5, 0.5),
                                           (0.5, 0.5, 0.5))]
    transform_list = [transforms.RandomCrop((height, width))] + transform_list if crop else transform_list
    transform_list = [transforms.Resize(new_size)] + transform_list if new_size is not None else transform_list
    transform_list = [transforms.RandomHorizontalFlip()] + transform_list if train else transform_list
    transform = transforms.Compose(transform_list)
    dataset = ImageFolder(input_folder, transform=transform)
    loader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=train, drop_last=True, num_workers=num_workers)
    return loader

So where exactly can I put the sampler in this case?
Also, another question: I believe the output in this case would be a pair per batch, right? I think this would already cause a memory error in my case. They don't exactly have to be pairs within one batch; what I want is that they are ordered in the manner of shuffled pairs. Concerning the invalid indices, I already have a threshold calculation that tells me whether the current frame (image) is from the same sequence as the previous image or not, or I could even check every 2nd batch, so I think this is easily doable.

The problem with the invalid indices idea is that not all sequences have the same number of frames, so theoretically I don't know exactly which indices are valid, and there are 20K images in total, so I can't really check this manually.

My example returns one sample at a time instead of a pair.
One exemplary output would be:

tensor([32.])
tensor([33.])
tensor([21.])
tensor([22.])
tensor([12.])
tensor([13.])
tensor([11.])
tensor([12.])
tensor([22.])
tensor([23.])
tensor([31.])
tensor([32.])

For this approach the data would need to be loaded individually rather than as pairs.
Would this be possible?
Regarding the invalid indices, could you calculate them automatically using your file names and a regular expression?
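
Something along these lines might work (just a sketch; invalid_pair_indices is a hypothetical helper, and it assumes file names like '1_3.png' with the sequence id before the underscore):

import os
import re

def invalid_pair_indices(paths):
    # Extract the sequence id (the part before the underscore).
    seq_ids = [re.match(r'(\d+)_', os.path.basename(p)).group(1)
               for p in paths]
    # Pair i covers frames (i, i + 1); it is invalid when the two
    # frames belong to different sequences.
    return [i for i in range(len(seq_ids) - 1)
            if seq_ids[i] != seq_ids[i + 1]]

# e.g.: invalid_idx = torch.tensor(invalid_pair_indices(dataset.imgs))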

OK, this output is fine for me.
The problem with the indices is that I am using two datasets, and the number of sequences is not the same in each of them.
But we can exclude this option; this would still be OK with me.

I guess the easiest way to do it is by using torch.utils.data.SubsetRandomSampler(indices).

You just need to input the indices for the sample pairs, whether they are in the same DataLoader or even in two or more DataLoaders, and off you go.

Still, you will have to build the DataLoader if you do not have one.
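
For reference, basic usage would look something like this (dataset is assumed to exist already; note that SubsetRandomSampler permutes the given indices individually, so on its own it would not keep pairs adjacent):

from torch.utils.data import DataLoader, SubsetRandomSampler

# Draws the given indices in a random order, one at a time.
sampler = SubsetRandomSampler(indices=list(range(len(dataset))))
loader = DataLoader(dataset, batch_size=1, sampler=sampler)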

No, I didn't mean that the images are in two different DataLoaders; I mean that I am training on two different datasets (it's a GAN), so for every domain I have some sequences of frames, but the sequences are not equal across the two datasets. But anyway, as I said before, I have a threshold that tells me whether the current batch is related to the previous frame (i.e. from the same sequence) or not.

So I think what @ptrblck suggested can do the work, but without the invalid indices, because I don't think I need them.

The invalid indices are just used to avoid situations like:

1_1,
1_2, # ok, because sequential

2_3,
2_4, # ok, because sequential

3_6,
4_1, # not ok, because from different sequence (3 vs. 4)

If that’s alright, you could skip the invalid indices, but as I said, the DataLoader will return “sequential” samples from different sequences.

@ptrblck Yes, this is what I want: sequential samples from different sequences (so, shuffled).

Can you guide me on how to add the sampler to my code example?

But the last two samples should be invalid, right?
If so, we would need the invalid indices calculated from your dataset.
If not, just remove the invalid index filter in my sampler:

paired_indices = torch.stack(
    [paired_indices[i] for i in range(len(paired_indices))
        if i not in self.invalid_idx])

and pass the sampler to your DataLoader:

loader = DataLoader(
    dataset,
    batch_size=1,
    sampler=sampler
)
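
Putting it together with your get_data_loader_folder, it could look something like this (a sketch; invalid_pair_indices is the hypothetical helper from above, and shuffle=train has to be dropped, since DataLoader does not allow both shuffle=True and a custom sampler):

def get_data_loader_folder(input_folder, batch_size, train, new_size=None,
                           height=256, width=256, num_workers=4, crop=True):
    # ... build transform_list / transform exactly as before ...
    dataset = ImageFolder(input_folder, transform=transform)
    # Sequence boundaries computed from the sorted file names.
    invalid_idx = torch.tensor(invalid_pair_indices(dataset.imgs))
    sampler = MySampler(dataset, invalid_idx)
    # batch_size is fixed to 1 here, as discussed above; the sampler
    # replaces shuffle=train (they are mutually exclusive).
    loader = DataLoader(dataset=dataset, batch_size=1, sampler=sampler,
                        drop_last=True, num_workers=num_workers)
    return loader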