Hello. I’m quite new here, so a kind explanation would be really appreciated.
I have a dataset that consists of two smaller datasets.
Set A : A_1, A_2, … , A_n
Set B : B_1, B_2, … , B_m
Here, m is larger than n.
I’m trying to use torch.utils.data.DataLoader() to build mini-batches from the data above.
What I want is for each mini-batch to be one pair, like this:
(A_1, B_1)
(A_2, B_2)
…
(A_n, B_n)
(A_1, B_n+1)
…
(A_whatever, B_m)
When I run it, the order of A is perfect, but the order of B comes out random.
The only options I set are the mini-batch size and the number of worker threads.
(And of course, these shouldn’t affect my problem, right?)
I did not pass any other options such as sampler or shuffle.
How can I maintain the original order of both data sets?
How are you using DataLoader exactly?
I’d say your options are either:
- Create a new dataset class that will load both A and B
or
- Concatenate your datasets and make a data sampler class that will sequentially load from A and B.
Option 1 would be more hard-coded to your specific problem, and I think option 2 could give you more flexibility (or you could do some mixture of the above). Here’s a potential way you could do it:
import itertools
import torch

class SequentialSamplerAugdata(torch.utils.data.Sampler):
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # data_source is the ConcatDataset built below;
        # cumulative_sizes[0] is the length of the first dataset (A)
        n = self.data_source.cumulative_sizes[0]
        m = len(self.data_source) - n  # length of B
        # cycle through A's indices so they wrap around when B is longer
        seq_A = itertools.islice(itertools.cycle(range(n)), m)
        seq_B = range(n, n + m)
        # interleave: A_0, B_0, A_1, B_1, ...
        return iter(itertools.chain.from_iterable(zip(seq_A, seq_B)))

    def __len__(self):
        return 2 * (len(self.data_source) - self.data_source.cumulative_sizes[0])
both_sets = torch.utils.data.ConcatDataset((data_A, data_B))
both_sets_loader = torch.utils.data.DataLoader(
    both_sets,
    batch_size=2,  # each batch holds one A item and one B item
    sampler=SequentialSamplerAugdata(both_sets),
    shuffle=False,
)
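For completeness, option 1 could be as simple as a wrapper dataset that does the pairing itself. Here's a minimal sketch under the same assumptions (B at least as long as A); the PairedDataset name and the toy tensors are just illustrative:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PairedDataset(Dataset):
    """Pairs item i of B with item i % len(A) of A, so A wraps
    around when B is longer (assumes len(B) >= len(A))."""
    def __init__(self, data_A, data_B):
        self.data_A = data_A
        self.data_B = data_B

    def __len__(self):
        return len(self.data_B)

    def __getitem__(self, i):
        return self.data_A[i % len(self.data_A)], self.data_B[i]

# toy example: plain tensors work as datasets here
data_A = torch.arange(3)        # stands in for A_1..A_3
data_B = torch.arange(10, 15)   # stands in for B_1..B_5
paired = PairedDataset(data_A, data_B)
loader = DataLoader(paired, batch_size=1, shuffle=False)
for a, b in loader:
    print(a, b)  # pairs come out as (A_1,B_1), (A_2,B_2), (A_3,B_3), (A_1,B_4), (A_2,B_5)
```

This keeps the ordering logic in the dataset itself, so the default sequential sampler is enough and no custom Sampler is needed.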