How to save the gradient after each batch (or epoch)?

I have an MLP model and I want to save the gradient after each iteration and average the saved gradients at the end. How can I do that?

model:

import torch
import torch.nn.functional as func

class MyModel(torch.nn.Module):

    def __init__(self, layers_size, input_size=784, num_classes=10):
        super().__init__()

        # create input layer
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(input_size, layers_size[0])])

        # stack hidden layers
        self.layers.extend(
            [torch.nn.Linear(layers_size[i - 1], layers_size[i])
             for i in range(1, len(layers_size))])

        # add output layer
        self.layers.append(torch.nn.Linear(layers_size[-1], num_classes))

    def forward(self, x):
        # iterate over all the layers
        for i, layer in enumerate(self.layers):
            if i == len(self.layers) - 1:
                # softmax over the class dimension for the last layer
                x = func.softmax(layer(x), dim=1)
            else:
                x = func.relu(layer(x))
        return x

Here is what I would like to do:

for epoch in epochs:
    for batch in batches:
        loss = compute_loss(model(batch))
        loss.backward()      # computes the gradients
        save(gradients)

average(gradients)

Thanks in advance.

Each backward() call accumulates the gradients in the .grad attribute of the parameters.
You could thus let the gradients accumulate in your data loop (without zeroing them between iterations) and calculate the average afterwards by iterating over all parameters and dividing each .grad by the number of steps.
Alternatively, you could use torch.autograd.grad and accumulate the gradients manually.
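
For the second option, here is a minimal sketch of accumulating gradients manually with torch.autograd.grad. The tiny Linear model, the random batches, and the names grad_sums and avg_grads are placeholders I made up for illustration, not anything from the original code.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the real model, loss, and data.
model = torch.nn.Linear(4, 2)
criterion = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(3)]

params = list(model.parameters())
# One running sum per parameter, kept outside of the .grad attributes.
grad_sums = [torch.zeros_like(p) for p in params]

for inputs, targets in batches:
    loss = criterion(model(inputs), targets)
    # Returns a tuple with one gradient per parameter and leaves p.grad untouched.
    grads = torch.autograd.grad(loss, params)
    for total, g in zip(grad_sums, grads):
        total += g

# Average over the number of steps (batches).
avg_grads = [total / len(batches) for total in grad_sums]
```

Since torch.autograd.grad does not write to .grad, this keeps the bookkeeping separate from whatever an optimizer would read. No parameter update happens in this sketch.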

Thank you @ptrblck,

Here is what I did:

grad_bank = {}
model = MyModel(layers_size)
avg_counter = 0

for epoch in epochs:
    for inputs, targets in batches:
        opt.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()

        for idx, param in enumerate(model.parameters()):
            key = f"layer_{idx}"
            # start each running sum at zero to avoid a KeyError on the first batch
            grad_bank[key] = grad_bank.get(key, 0) + param.grad.data
            avg_counter += 1

        opt.step()

for key in grad_bank:
    grad_bank[key] = grad_bank[key] / avg_counter

Is this right? Also, how do I use the autograd.grad method?

Thanks for the help.

The loop looks correct. I would recommend not using the .data attribute and, if necessary, wrapping the code in a with torch.no_grad() block.

Thanks, @ptrblck,

Will .data cause any problems? Is there something I should know?

Thanks!

Yes, using the .data attribute is not recommended, as it might yield unwanted side effects.
Autograd won’t be able to track this operation and thus cannot raise a proper error if your manipulation is incorrect (e.g. if you change the underlying data while the computation graph still uses the original tensors).
If you don’t want to track this operation, wrap it in the no_grad() guard.
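
To make the side effect concrete, here is a small sketch (a toy tensor, not the model from this thread). It relies on sigmoid() keeping its output around for the backward pass, so overwriting that output corrupts the gradient. The variable names are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# Case 1: modify the output through .data.
# Autograd does not see the change, so backward() runs and returns a wrong result.
out = x.sigmoid()
out.data.zero_()
out.sum().backward()
grad_via_data = x.grad.clone()  # silently wrong: all zeros

# Case 2: the same in-place change inside no_grad().
# The change is registered, so backward() can detect the stale tensor.
x.grad = None
out = x.sigmoid()
with torch.no_grad():
    out.zero_()
try:
    out.sum().backward()
    raised = False
except RuntimeError:
    raised = True  # autograd reports the in-place modification
```

In the first case the gradient comes back as zeros without any warning; in the second case backward() raises a RuntimeError, which is the "proper error" mentioned above.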

If we use a loss function whose reduction attribute is 'mean', shouldn’t avg_counter be outside the batch loop?
Also, I don’t understand why the counter is inside the parameters() loop. Why should we divide each gradient by the number of layers in a neural network?
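
One possible way to arrange the counter, assuming the goal is the average over the number of steps as suggested earlier in the thread: increment it once per batch, outside the parameters() loop, so that every parameter's sum is divided by the same step count. The model, optimizer, data, and the name num_steps below are hypothetical stand-ins.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the model, optimizer, and data from the thread.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()  # reduction='mean' by default
batches = [(torch.randn(5, 4), torch.randint(0, 3, (5,))) for _ in range(4)]
num_epochs = 2

# One running sum per parameter, keyed by parameter name.
grad_bank = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
num_steps = 0

for epoch in range(num_epochs):
    for inputs, targets in batches:
        opt.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()

        # Accumulate without .data; no_grad() keeps this out of the graph.
        with torch.no_grad():
            for name, p in model.named_parameters():
                grad_bank[name] += p.grad
        num_steps += 1  # once per batch, not once per parameter

        opt.step()

avg_grads = {name: total / num_steps for name, total in grad_bank.items()}
```

With the counter inside the parameters() loop, the divisor would be the step count multiplied by the number of parameter tensors, which scales every average down by that extra factor.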

@ptrblck I have a similar question: is the average of the per-batch gradients a good representation of the model parameters? In my case, I would like to use the gradient of one model as a reference for further computation in another model. If I store the gradient after every backward() call and average the stored gradients at the end, does this represent the gradient of the entire model? (Is it similar to the gradient I would get had I passed the entire dataset in one batch?)

No, as the gradient does not represent the parameters; it is what the optimizer uses to compute the updates to the parameters.

It depends on whether you update the parameters after each backward() call.
If so, the average of the gradients will not represent the gradient calculated using the entire dataset, as the parameters were updated between each step.
Also, if your model contains e.g. batchnorm layers, the normalization will differ in training mode, since the batch stats are used and these will be different for the entire dataset vs. small batches.
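
A rough way to check this on a toy problem is sketched below. It assumes a mean-reduced loss, equal-sized batches, and a model without batchnorm; the helper names flat_grad and averaged_batch_grad are made up for this example.

```python
import copy
import torch

torch.manual_seed(0)

model = torch.nn.Linear(4, 1)   # hypothetical toy model, no batchnorm
criterion = torch.nn.MSELoss()  # reduction='mean'
data, target = torch.randn(12, 4), torch.randn(12, 1)
batches = list(zip(data.split(4), target.split(4)))  # three equal-sized batches

def flat_grad(m):
    """Concatenate all parameter gradients of m into one vector."""
    return torch.cat([p.grad.reshape(-1) for p in m.parameters()])

# Reference: gradient of the loss over the full dataset in a single batch.
full_model = copy.deepcopy(model)
criterion(full_model(data), target).backward()
full = flat_grad(full_model)

def averaged_batch_grad(m, update):
    """Average the per-batch gradients, optionally stepping the optimizer in between."""
    opt = torch.optim.SGD(m.parameters(), lr=0.1)
    total = torch.zeros_like(full)
    for x, y in batches:
        opt.zero_grad()
        criterion(m(x), y).backward()
        total += flat_grad(m)
        if update:
            opt.step()
    return total / len(batches)

frozen = averaged_batch_grad(copy.deepcopy(model), update=False)
updated = averaged_batch_grad(copy.deepcopy(model), update=True)
```

Under these assumptions, frozen should agree with full up to floating-point error, while updated should not, because each batch gradient was evaluated at different parameter values.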

Thank you for the quick reply @ptrblck

I am trying to store the gradients of the entire model. The code is given below:

  for step, batch in enumerate(train_dataloader): 
    outputs = model(**batch)
    loss = outputs.loss
    loss = loss / args.gradient_accumulation_steps
    accelerator.backward(loss)
    progress_bar.update(1)
    progress_bar.set_postfix(loss=round(loss.item(), 3))
    del outputs
    gc.collect()
    torch.cuda.empty_cache()
    
    if (step+1) % args.gradient_accumulation_steps == 0 or (step+1) == len(train_dataloader):
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
  reference_gradient = [p.grad.view(-1) if p.grad is not None else torch.zeros(p.numel())
                        for n, p in model.named_parameters()]
  reference_gradient = torch.cat(reference_gradient)

My intention is to store the model parameters of the entire model and use them for further calculations in another model. But I have 2 questions here:

  1. Here the reference_gradient variable is always 0. I understand that this happens because optimizer.zero_grad() is called after every gradient_accumulation_steps steps and sets all the gradients to 0.
  2. How can I store the model parameters of the entire model?

  1. Either the .grad attribute is None because the gradients were never calculated or, more likely, you are storing the reference gradients after calling optimizer.zero_grad(), which explicitly zeroes out the gradients.

  2. You could store the state_dict of the model.
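
For the first point, a minimal sketch of taking the snapshot before the gradients are zeroed. It uses a plain PyTorch loop with a hypothetical toy model and random data instead of the accelerator setup above, so treat it as an outline only.

```python
import torch

torch.manual_seed(0)

model = torch.nn.Linear(4, 2)  # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(4)]
accumulation_steps = 2

reference_gradient = None

for step, (inputs, targets) in enumerate(batches):
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()  # gradients accumulate in p.grad across the inner steps

    if (step + 1) % accumulation_steps == 0 or (step + 1) == len(batches):
        # Snapshot the accumulated gradients *before* they are zeroed out.
        # torch.cat allocates a new tensor, so the copy survives zero_grad().
        reference_gradient = torch.cat([
            p.grad.detach().reshape(-1) if p.grad is not None
            else torch.zeros(p.numel(), device=p.device)
            for p in model.parameters()
        ])
        optimizer.step()
        optimizer.zero_grad()
```

Here reference_gradient holds the gradients of the last accumulation window. Collecting one snapshot per window in a list would work the same way.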

I tried storing the state_dict of the model @ptrblck

torch.save(unwrapped_model.state_dict(), "test.pt")

However, after loading the model and calculating the reference gradient, all tensors are set to 0:

import torch
model = torch.load("test.pt")
reference_gradient = [ p.grad.view(-1) if p.grad is not None else torch.zeros(p.numel()) for n, p in model.named_parameters()]
reference_gradient = torch.cat(reference_gradient)

Output: tensor([0., 0., 0., ..., 0., 0., 0.])

Could you please correct me? I might be missing something.

The state_dict contains all registered parameters and buffers, but not the gradients.
If you want to store the gradients, your previous approach of creating e.g. a list or dict and storing the gradients there should work. Just make sure you are not zeroing them out before storing them.
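
As a small illustration, the sketch below collects the gradients in a dict keyed by parameter name and saves that dict with torch.save. The toy model is hypothetical, and an in-memory buffer is used here only to keep the example self-contained; a file path such as "grads.pt" (a made-up name) would work the same way.

```python
import io
import torch

torch.manual_seed(0)

model = torch.nn.Linear(4, 2)  # hypothetical stand-in model
model(torch.randn(8, 4)).sum().backward()

# The state_dict holds parameter and buffer values only, no gradients.
state = model.state_dict()
has_grads_in_state = any(t.grad is not None for t in state.values())

# Collect the gradients explicitly, keyed by parameter name.
grads = {name: p.grad.detach().clone()
         for name, p in model.named_parameters() if p.grad is not None}

# Save and reload the gradient dict.
buffer = io.BytesIO()
torch.save(grads, buffer)
buffer.seek(0)
loaded = torch.load(buffer)
```

The loaded dict can then be flattened with torch.cat in the same way as reference_gradient above.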