for epoch in range(num_epochs):
    batch = 0
    for _ in range(num_epoch_repeats):
        for data in self.train_data_loader:
            losses = self.train_step(data, global_step=step_id)
            # ...
I do not understand the necessity of this loop, "for _ in range(num_epoch_repeats):". Couldn't you just increase num_epochs instead?
Are there any other reasons for this kind of implementation?
You should look at how the authors use the outer loop. Most likely they perform some periodic actions there and changed the epoch numbering to match (e.g. a validation epoch every 10 training epochs, with num_epochs counting the validation epochs). Not a very good idea overall, IMHO.
Yeah, I checked. It seems the learning rate is updated (self.lr_scheduler.step()) once every self.num_epoch_repeats passes over the data. But this would be the same as counting plain epochs and calling self.lr_scheduler.step() every required number of epochs, right? (Rough sketch of what I mean below, after the code.)
for epoch in range(self.num_epochs):
    self.writer.add_scalar(
        "lr", self.optim.param_groups[0]["lr"], global_step=step_id
    )
    batch = 0
    # note: `batch` is not reset between the num_epoch_repeats passes below
    for _ in range(self.num_epoch_repeats):
        for data in self.train_data_loader:
            losses = self.train_step(data, global_step=step_id)
            loss_str = fmt_loss_str(losses)
            # optimizer step every accu_grad batches (gradient accumulation),
            # or when the batch counter hits num_total_batches - 1
            if (
                batch == self.num_total_batches - 1
                or batch % self.accu_grad == self.accu_grad - 1
            ):
                self.optim.step()
                self.optim.zero_grad()
            self.post_batch(epoch, batch)
            step_id += 1
            batch += 1
            progress.update(1)
    # scheduler stepped once per outer epoch, i.e. once every num_epoch_repeats passes
    if self.lr_scheduler is not None:
        self.lr_scheduler.step()
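Something like this is what I had in mind, just as a rough sketch reusing the names from the snippet above (not the authors' code; it leaves out logging, post_batch and gradient accumulation):

# count plain passes over the data and step the scheduler every
# num_epoch_repeats passes instead of once per renamed "epoch"
for epoch in range(self.num_epochs * self.num_epoch_repeats):
    for data in self.train_data_loader:
        losses = self.train_step(data, global_step=step_id)
        step_id += 1
    if self.lr_scheduler is not None and (epoch + 1) % self.num_epoch_repeats == 0:
        self.lr_scheduler.step()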
It seems they also do it for gradient accumulation, as it interacts a bit poorly with the data loader when the dataset_size : batch_size * accu_grad ratio is low: because batch is only reset at the start of the outer epoch, accumulation groups can span several passes over the data instead of being cut short at the end of every small pass.
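For illustration only, a toy counter that mimics just the step condition from the snippet (made-up numbers, not code from the repo):

# Count how many batches each optimizer step accumulates when the dataset is small.
def step_sizes(batches_per_pass, accu_grad, repeats, reset_each_pass):
    """Return the number of accumulated batches behind each optimizer step."""
    sizes = []
    acc = 0
    batch = 0
    # total batches before the "last batch" branch fires
    total = batches_per_pass if reset_each_pass else batches_per_pass * repeats
    for _ in range(repeats):
        if reset_each_pass:
            batch = 0
        for _ in range(batches_per_pass):
            acc += 1
            if batch == total - 1 or batch % accu_grad == accu_grad - 1:
                sizes.append(acc)
                acc = 0
            batch += 1
    return sizes

# 10 batches per pass, accumulate over 4 batches, 3 passes inside one "epoch":
print(step_sizes(10, 4, 3, reset_each_pass=True))   # [4, 4, 2, 4, 4, 2, 4, 4, 2]
print(step_sizes(10, 4, 3, reset_each_pass=False))  # [4, 4, 4, 4, 4, 4, 4, 2]

With the counter reset every pass, the last accumulation group of each pass is truncated; counting across the repeats (as in the snippet) truncates only once, at the end of the outer epoch.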