Dataloader length inconsistent

yuyanruan179 · April 22, 2022, 3:04pm

Hi there, I am stuck on something that looks very abnormal. I am training an image classifier with the following training method:

def train(self, current_epoch=0, is_init=False, fold=0):
    self.logger.info('Current fold: %d' %(fold))

    self.model.train()
    if (not is_init):
        self.scheduler.step()
    self.logger.info('Current epoch learning rate: %e' %(self.optimizer.param_groups[0]['lr']))

    running_loss = 0.0
    running_acc = 0.0
    dataloader = self.dataloader[str(fold)]
    optimizer = torch.optim.Adam(self.model.parameters(), lr=0.0001, weight_decay=0.0001)
    self.logger.info('Len of train loader: '+str(len(dataloader['train'])))
    for i, sample in enumerate(dataloader['train'], 1):
        self.logger.info("i: {}".format(i))
        sample = self.prepare(sample)
        images, labels = sample['img'], sample['label']
        optimizer.zero_grad()
        outputs = self.model(images)
        loss = self.loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()

I found the output log very strange.

As shown in the picture above, the first epoch only gets 2 batches, while the second gets far more than that: epoch 2 gets 12 batches and epoch 3 gets 15.
Shouldn’t every epoch get the same number of batches? Why is it not 129 each time, given that len(dataloader) is 129?

You may find the entire project here: Project

Sincerely thank you in advance. So frustrating…

ptrblck · April 23, 2022, 12:40am

Could you try to come up with a standalone, minimal, executable code snippet showing this behavior, please (without the need to download the actual dataset and all files)?
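
For reference, a minimal sketch of what such a standalone snippet could look like. It is not taken from the linked project: the synthetic tensors, their shapes, the batch size, and the helper name count_batches are all assumptions chosen only so that len(loader) comes out as 129. It counts the batches the loader actually yields over three epochs and prints them next to len(loader). If this version stays consistent while the real training loop does not, the difference is more likely in the custom dataset or the surrounding training code than in DataLoader itself.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def count_batches(loader):
    """Iterate the loader once and return how many batches it actually yields."""
    return sum(1 for _ in loader)


# Synthetic stand-in for the real images and labels (sizes are assumptions):
# 516 samples with batch_size=4 gives 129 batches, matching the reported length.
images = torch.randn(516, 3, 8, 8)
labels = torch.randint(0, 10, (516,))
loader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=True)

expected = len(loader)
# One full pass per "epoch"; each pass creates a fresh iterator over the loader.
counts = [count_batches(loader) for _ in range(3)]

print("len(loader):", expected)
print("batches per epoch:", counts)
```

Swapping TensorDataset for the project's own dataset class, while keeping everything else unchanged, would be one way to narrow down where the batch count starts to diverge.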