How to remember the order of the data coming from a dataloader?

PytorchBeginner · March 16, 2020, 4:23pm

The title may be confusing. In each epoch, I need to iterate over the same dataloader twice (each pass wrapped in tqdm). Let me explain the situation in detail.

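# First pass (body of a method): compute code vectors for the whole training set
all_scale1_code_vectors=[]
for data in tqdm(self.dataloader['train'],leave=False,total=len(self.dataloader['train'])):
    self.input=data[0].to(self.device)
    scale1_code_vectors=AFunction(self.input)
    # detach() replaces the deprecated .data before moving the result to the CPU
    all_scale1_code_vectors.append(scale1_code_vectors.detach().cpu())

all_scale1_code_vectors=torch.cat(all_scale1_code_vectors,dim=0).to(self.device)
return all_scale1_code_vectors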

AllTargetDistribution=SoftAssign(all_scale1_code_vectors)

As indicated above, I use self.dataloader['train'] to generate all_scale1_code_vectors. Then I need to generate code vectors batch by batch, like this:

for ind, data in enumerate(tqdm(...)):
    self.input=data[0].to(self.device)
    batch_code_vectors=AFunction(self.input)
    batchTargetDistribution=SoftAssign(batch_code_vectors)
    # rows of the full-dataset result that belong to batch number ind
    batchTargetCorresponding=AllTargetDistribution[batchsize*ind:batchsize*(ind+1)]
    loss=criterion(batchTargetDistribution,batchTargetCorresponding)

That is, I need to guarantee that the inputs produced in the second pass are the same, and arrive in the same order, as in the first pass. Currently I just set shuffle=False when creating the dataloader, but this is not a good idea (it may lead to overfitting). So when shuffle is set to True, how can I iterate over dataloader['train'] twice in one epoch and guarantee that the batches correspond?

ayalaa2 · March 16, 2020, 5:59pm

I would suggest the use of a Sampler. This is an optional argument when initializing a DataLoader.

In general, it’s expected to yield sample indices to the dataloader. A very straightforward example would be something like this:

from torch.utils.data import Sampler

class CustomSampler(Sampler):
    def __init__(self, data_size):
        # all dataset indices, in their natural order
        self.samples = list(range(data_size))

    def __len__(self):
        return len(self.samples)

    def __iter__(self):
        for sample in self.samples:
            yield sample

Here it simply generates a list of all possible indices and yields them one at a time, in order. What you could do is randomly shuffle the list once at initialization and use the same sampler whenever you wish to preserve the order. Here would be an example:

import random
from torch.utils.data import Sampler

class CustomSampler(Sampler):
    def __init__(self, data_size):
        self.samples = list(range(data_size))
        # shuffle once; every later iteration reuses this order
        random.shuffle(self.samples)

    def __len__(self):
        return len(self.samples)

    def __iter__(self):
        for sample in self.samples:
            yield sample

Notice that the shuffling is done upon initialization. So if you create two dataloaders using the same instance of CustomSampler (or iterate over one such dataloader twice), the order of the samples will be the same. I would recommend reading the docs on torch.utils.data for more info.
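To make the idea concrete, here is a minimal sketch of how a shared sampler could be used for the two passes per epoch. The class name ReshufflableSampler, its reshuffle() method, the seed argument, and the toy TensorDataset are all hypothetical and only for illustration; the sketch also assumes that calling reshuffle() once at the start of each epoch is an acceptable way to keep the randomness between epochs.

```python
import random

import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class ReshufflableSampler(Sampler):
    """Hypothetical sampler: the order stays fixed until reshuffle() is called."""

    def __init__(self, data_size, seed=0):
        self.samples = list(range(data_size))
        self.rng = random.Random(seed)  # private RNG, global random state untouched
        self.reshuffle()

    def reshuffle(self):
        # Draw a new permutation; intended to be called once per epoch.
        self.rng.shuffle(self.samples)

    def __len__(self):
        return len(self.samples)

    def __iter__(self):
        return iter(self.samples)


# Toy dataset whose values equal their own indices, so the order is visible.
dataset = TensorDataset(torch.arange(10))
sampler = ReshufflableSampler(len(dataset))
# shuffle is left at its default (False); DataLoader rejects sampler with shuffle=True.
loader = DataLoader(dataset, batch_size=4, sampler=sampler)


def one_pass(loader):
    # Collect each batch's values as plain Python lists.
    return [batch[0].tolist() for batch in loader]


first_pass = one_pass(loader)    # e.g. computing all code vectors
second_pass = one_pass(loader)   # batch-wise training pass, same order
sampler.reshuffle()              # new order for the next epoch
next_epoch = one_pass(loader)
```

Because both passes within an epoch see identical batches, slicing the full-dataset result by batchsize*ind, as in the question, should line up with the batches of the second pass.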