Optimizer.zero_grad() in Gradient Accumulation

Hi everyone!
I’m new to gradient accumulation. Should I call zero_grad() at the start of each epoch?

        # optimizer.zero_grad()   <----------------- Should I call this at the beginning of every epoch?
        for step in range(total_steps):
            indices = self.dataloader_dict[phase].dataset.get_train_indices()        
            new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
            self.dataloader_dict[phase].batch_sampler.sampler = new_sampler

            images, captions = next(iter(self.dataloader_dict[phase]))

            images = images.to(self.device)
            captions = captions.to(self.device)

            with torch.set_grad_enabled(phase == 'train'):

                features = encoder(images)
                features = features.to(self.device)
                outputs = decoder(features, captions)

                loss = self.criterion(outputs.view(-1, vocab_size), captions.view(-1))

                if phase == 'train':
                    torch.nn.utils.clip_grad_norm_(decoder.parameters(), 1.0)
                    if (step+1)%self.grad_acumulation_step == 0:
                        # writing weights and grads to tensorboard's histogram
                        for name, weight in decoder.named_parameters():
                            self.tb.add_histogram(name, weight, step)
                            #self.tb.add_histogram(f'{name}.grad', weight.grad, step)
                elif step%(total_steps//5)==0:
                    examples = self.add_examples(captions, outputs, phase)
                    self.tb.add_text(f'{phase}:ground_truth/predictions', examples, step)

            running_loss += loss.item() * features.size(0)
            bleu4 = self.compute_metric(captions, outputs)

In general, your code for gradient accumulation should look like this:

accumulations = 2
epochs = 20
for epoch in range(epochs):
    for idx, (data, target) in enumerate(train_loader):
        outputs = network(data)
        loss = criterion(outputs, target)
        # Divide the loss by accumulations so the summed gradients have the
        # same scale as they would for one large batch.
        loss = loss / accumulations

        # Here we calculate gradients.
        # Normally we would call optimizer.step() right after this, but not in this case.
        # The main idea is to call .backward() accumulations times. Because we do
        # not call optimizer.zero_grad() between those calls, each .backward()
        # adds to the gradients already stored on the parameters.
        loss.backward()

        # Only when (idx + 1) is divisible by accumulations do we call
        # optimizer.step(), followed by optimizer.zero_grad().
        if (idx + 1) % accumulations == 0:
            optimizer.step()
            optimizer.zero_grad() # here we call zero grad

So you call zero_grad() right after calling step() on the optimizer, not once per epoch.
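If you want to convince yourself that this works, here is a minimal self-contained sketch. The toy model, random data and names like `micro_batches` are made up for illustration and are not from the code in the question. It assumes a loss with mean reduction and equally sized micro-batches, and checks that the accumulated gradient matches the gradient of one full batch.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (hypothetical): a linear model and random regression data.
model = nn.Linear(4, 1)
criterion = nn.MSELoss()  # mean reduction
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(8, 4)
target = torch.randn(8, 1)
accumulations = 2
micro_batches = list(zip(data.chunk(accumulations), target.chunk(accumulations)))

# Reference: gradient of the loss on the full batch of 8 samples.
optimizer.zero_grad()
criterion(model(data), target).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulated: two micro-batches of 4 samples, each loss divided by 2.
optimizer.zero_grad()
for idx, (x, y) in enumerate(micro_batches):
    loss = criterion(model(x), y) / accumulations
    loss.backward()  # adds to the gradients already stored in .grad
    if (idx + 1) % accumulations == 0:
        accumulated_grad = model.weight.grad.clone()  # copy before stepping
        optimizer.step()
        optimizer.zero_grad()  # reset only after the optimizer step

grads_match = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6)
print(grads_match)
```

Calling zero_grad() once before the loop, as done here, is harmless. What matters is that it is not called between the .backward() calls of one accumulation window.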

Hope that helps.


Thanks! I’d like to try it both with and without dividing the loss by the number of accumulation steps and compare the difference :)

Yeah, skipping the division can sometimes give you better convergence because the accumulated gradients are larger (about accumulations times larger, which with plain SGD acts like a higher learning rate). But if the model is fairly big, there’s a higher chance of running into NaNs. You can choose whichever works best for your use case.
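To see how much larger the gradients get without the division, here is a small sketch under the same kind of assumptions as before: a hypothetical toy model, random data and a mean-reduced loss. The helper `accumulated_grad` is made up for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (hypothetical), not the encoder/decoder from the question.
model = nn.Linear(4, 1)
criterion = nn.MSELoss()
data, target = torch.randn(8, 4), torch.randn(8, 1)
accumulations = 4


def accumulated_grad(divide):
    """Accumulate gradients over all micro-batches, with or without scaling."""
    model.zero_grad()
    for x, y in zip(data.chunk(accumulations), target.chunk(accumulations)):
        loss = criterion(model(x), y)
        if divide:
            loss = loss / accumulations
        loss.backward()
    return model.weight.grad.clone()


scaled = accumulated_grad(divide=True)
unscaled = accumulated_grad(divide=False)

# Ratio of gradient norms; it should come out very close to `accumulations`.
ratio = (unscaled.norm() / scaled.norm()).item()
print(ratio)
```

So with plain SGD, not dividing behaves roughly like multiplying the learning rate by accumulations, which fits both the faster convergence and the higher NaN risk mentioned above.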