Hello. I’m quite new here, so a kind explanation would be really appreciated.
I have a dataset that consists of two smaller datasets.
Set A : A_1, A_2, … , A_n
Set B : B_1, B_2, … , B_m
Here, m is larger than n.
I’m trying to use torch.utils.data.DataLoader() to build mini-batches from the data above.
What I want is for each mini-batch to be one pair, like this:
(A_1, B_1)
(A_2, B_2)
…
(A_n, B_n)
(A_1, B_n+1)
…
(A_whatever, B_m)
When I run it, the order of A is perfect, but the order of B comes out random.
The only options I set are the mini-batch size and the number of worker threads.
(And of course, these shouldn’t affect my problem, right?)
I did not pass any other options such as sampler or shuffle.
How can I maintain the original order of both data sets?
How are you using DataLoader exactly?
I’d say your options are either:
- Create a new dataset class that will load both A and B
or
- Concatenate your datasets and make a data sampler class that will sequentially load from A and B.
Option 1 would be more hard-coded to your specific problem, and I think option 2 could give you more flexibility (or you could do some mixture of the above). Here’s a potential way you could do it:
import itertools
import torch

class SequentialSamplerAugdata(torch.utils.data.Sampler):
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # data_source is the ConcatDataset built below;
        # cumulative_sizes[0] is the length of the first dataset (A)
        n = self.data_source.cumulative_sizes[0]
        m = len(self.data_source) - n  # length of B
        # cycle through A's indices so they wrap around when B is longer
        seq_A = itertools.islice(itertools.cycle(range(n)), m)
        seq_B = range(n, n + m)
        # interleave: A_0, B_0, A_1, B_1, ...
        return iter(itertools.chain.from_iterable(zip(seq_A, seq_B)))

    def __len__(self):
        return 2 * (len(self.data_source) - self.data_source.cumulative_sizes[0])
both_sets = torch.utils.data.ConcatDataset((data_A, data_B))
both_sets_loader = torch.utils.data.DataLoader(
    both_sets,
    batch_size=2,  # each batch holds one A item and one B item
    sampler=SequentialSamplerAugdata(both_sets),
    shuffle=False,
)
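For completeness, option 1 could be as simple as a wrapper dataset that does the pairing itself. Here's a minimal sketch under the same assumptions (B at least as long as A); the PairedDataset name and the toy tensors are just illustrative:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PairedDataset(Dataset):
    """Pairs item i of B with item i % len(A) of A, so A wraps
    around when B is longer (assumes len(B) >= len(A))."""
    def __init__(self, data_A, data_B):
        self.data_A = data_A
        self.data_B = data_B

    def __len__(self):
        return len(self.data_B)

    def __getitem__(self, i):
        return self.data_A[i % len(self.data_A)], self.data_B[i]

# toy example: plain tensors work as datasets here
data_A = torch.arange(3)        # stands in for A_1..A_3
data_B = torch.arange(10, 15)   # stands in for B_1..B_5
paired = PairedDataset(data_A, data_B)
loader = DataLoader(paired, batch_size=1, shuffle=False)
for a, b in loader:
    print(a, b)  # pairs come out as (A_1,B_1), (A_2,B_2), (A_3,B_3), (A_1,B_4), (A_2,B_5)
```

This keeps the ordering logic in the dataset itself, so the default sequential sampler is enough and no custom Sampler is needed.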