I have an MLP model and I want to save the gradients after each iteration and average them at the end. How can I do that?

model:

import torch
import torch.nn.functional as func


class MyModel(torch.nn.Module):
    def __init__(self, layers_size, input_size=784, num_classes=10):
        super(MyModel, self).__init__()
        # create input layer
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(input_size, layers_size[0])])
        # stack hidden layers
        self.layers.extend([torch.nn.Linear(layers_size[i - 1], layers_size[i])
                            for i in range(1, len(layers_size))])
        # add output layer
        self.layers.append(torch.nn.Linear(layers_size[-1], num_classes))

    def forward(self, x):
        # iterate over all the layers
        for i, layer in enumerate(self.layers):
            if i == len(self.layers) - 1:
                x = func.softmax(layer(x), dim=1)  # softmax for last layer
            else:
                x = func.relu(layer(x))
        return x

Here is what I would like to do:

for epoch in epochs:
    for batch in batches:
        model.forward(batch)
        model.backward()  # computes the gradients
        save(gradients)
average(gradients)

Each backward() call will accumulate the gradients in the .grad attribute of the parameters.
You could thus accumulate the gradients in your data loop (i.e. without calling zero_grad() in between) and calculate the average afterwards by iterating over all parameters and dividing each .grad by the number of steps.
Alternatively, you could use torch.autograd.grad and accumulate the gradients manually.
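
As a rough sketch of the second option, the loop below calls torch.autograd.grad once per batch and sums the returned gradients by hand. The small linear model, the random data and the names grad_sums and num_steps are placeholders for illustration and are not from the original post. No optimizer step is taken here, so every gradient is evaluated at the same parameter values.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the real model, loss and data.
model = torch.nn.Linear(4, 2)
criterion = torch.nn.MSELoss()
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]

params = list(model.parameters())
grad_sums = [torch.zeros_like(p) for p in params]
num_steps = 0

for inputs, targets in batches:
    loss = criterion(model(inputs), targets)
    # autograd.grad returns the gradients as a tuple and leaves .grad untouched.
    grads = torch.autograd.grad(loss, params)
    for total, g in zip(grad_sums, grads):
        total += g
    num_steps += 1

# One averaged gradient tensor per parameter.
avg_grads = [total / num_steps for total in grad_sums]
```

Because the gradients are returned instead of being written to .grad, nothing needs to be zeroed between batches in this variant.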

grad_bank = {}
model = MyModel(layers_size)  # layers_size: list of hidden layer sizes
avg_counter = 0
for epoch in epochs:
    for inputs, targets in batches:
        opt.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        for idx, param in enumerate(model.parameters()):
            key = f"layer_{idx}"
            # start from 0 on the first step so the key always exists
            grad_bank[key] = grad_bank.get(key, 0) + param.grad.data
            avg_counter += 1
        opt.step()
for key in grad_bank:
    grad_bank[key] = grad_bank[key] / avg_counter

Is this right? Also, how do I use the autograd.grad method?

Yes, the usage of the .data attribute is not recommended, as it might yield unwanted side effects.
Autograd won't be able to track this operation and will thus not be able to raise a proper error if your manipulation is incorrect (e.g. by changing the underlying data while the computation graph used the original tensors).
If you don't want to track this operation, wrap it in a torch.no_grad() guard.
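
A minimal sketch of the same accumulation without .data, assuming a placeholder linear model, SGD optimizer and random data (none of which are from the original post). The running sums are updated inside torch.no_grad(), and the hypothetical num_steps counter is incremented once per batch.

```python
import torch

torch.manual_seed(0)

# Placeholder model, optimizer and data, for illustration only.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.MSELoss()
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]

grad_bank = {}
num_steps = 0

for inputs, targets in batches:
    opt.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in grad_bank:
                grad_bank[name] += param.grad
            else:
                # clone so that later changes to .grad cannot alter the stored copy
                grad_bank[name] = param.grad.clone()
    num_steps += 1  # one count per batch, not per parameter
    opt.step()

avg_grads = {name: total / num_steps for name, total in grad_bank.items()}
```

Since opt.step() runs after every batch, the averaged values mix gradients taken at different parameter values.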

If we use a loss function whose reduction attribute is equal to 'mean', shouldn't avg_counter be outside the batch loop?
Also, I don't understand why the counter is inside the parameters() loop. Why should we divide each gradient by the number of layers in the case of a neural network?
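
To make the second point concrete, here is a small arithmetic sketch in plain Python with made-up numbers (no real model is involved). It assumes two scalar parameters and three batches. Counting inside the parameter loop makes the divisor equal to batches times parameters, so each average comes out too small by a factor equal to the number of parameters.

```python
# Made-up per-batch gradients for two scalar "parameters" (illustrative values).
batch_grads = [
    {"w": 1.0, "b": 4.0},
    {"w": 2.0, "b": 5.0},
    {"w": 3.0, "b": 6.0},
]

sums = {"w": 0.0, "b": 0.0}
counter_in_param_loop = 0
counter_per_batch = 0

for grads in batch_grads:
    for name, g in grads.items():
        sums[name] += g
        counter_in_param_loop += 1  # grows by the number of parameters per batch
    counter_per_batch += 1          # grows by one per batch

wrong_avg = {k: v / counter_in_param_loop for k, v in sums.items()}
right_avg = {k: v / counter_per_batch for k, v in sums.items()}

print(counter_in_param_loop, counter_per_batch)  # 6 3
print(wrong_avg)  # {'w': 1.0, 'b': 2.5}
print(right_avg)  # {'w': 2.0, 'b': 5.0}
```

On the first point: with reduction='mean', each batch gradient is already averaged over the samples in that batch, so dividing the sum by the number of batches gives the mean over equally sized batches.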

@ptrblck I have a similar question: is the average of the per-batch gradients a good representation of the model parameters? In my case, I would like to use the gradient of one model as a reference for further computation in another model. So if I store the gradient after every backward() and average it at the end, does this represent the gradient of the entire model? (Is it similar to the gradient I would get had I passed the entire dataset in one batch?)

No, as the gradient does not represent the parameters but the updates performed by the optimizer on the parameters.

It depends on whether you update the parameters after each backward() call.
If so, the average of the gradients will not represent the gradient calculated using the entire dataset, as the parameters were updated between the steps.
Also, if your model contains e.g. batchnorm layers, the normalization will differ in training mode, since the batch stats are used and these differ between the entire dataset and small batches.
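
The first point can be checked on a toy problem. The sketch below uses plain Python and a hypothetical one-parameter model y = w * x with a mean squared error loss, with made-up data. With fixed parameters and equally sized batches, the mean of the batch gradients matches the full-dataset gradient; once an SGD step is taken between batches, it no longer does.

```python
def mean_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data, split into two equally sized batches.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
batches = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
w0, lr = 0.0, 0.01

full_grad = mean_grad(w0, xs, ys)

# Case 1: the parameter stays fixed while the batch gradients are collected.
fixed = [mean_grad(w0, bx, by) for bx, by in batches]
avg_fixed = sum(fixed) / len(fixed)

# Case 2: an SGD step is taken after every batch.
w = w0
updated = []
for bx, by in batches:
    g = mean_grad(w, bx, by)
    updated.append(g)
    w -= lr * g
avg_updated = sum(updated) / len(updated)

print(full_grad, avg_fixed, avg_updated)
```

The batchnorm caveat is not covered by this sketch, since the toy model has no normalization layers.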

I am trying to store the gradients of the entire model. The code is given below:

for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    loss = loss / args.gradient_accumulation_steps
    accelerator.backward(loss)
    progress_bar.update(1)
    progress_bar.set_postfix(loss=round(loss.item(), 3))
    del outputs
    gc.collect()
    torch.cuda.empty_cache()
    if (step+1) % args.gradient_accumulation_steps == 0 or (step+1) == len(train_dataloader):
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        reference_gradient = [p.grad.view(-1) if p.grad is not None else torch.zeros(p.numel())
                              for n, p in model.named_parameters()]
        reference_gradient = torch.cat(reference_gradient)

My intention is to store the model parameters of the entire model and use them for further calculation in another model. But I have two questions here:

Here the reference_gradient variable always returns 0. I understand that this happens because optimizer.zero_grad() is called after every gradient_accumulation_steps steps, which sets all the gradients to 0.

How can I store the model parameters of the entire model?

It seems the .grad attribute is either None because the gradients were never calculated or, more likely, you are trying to store the reference gradients after calling optimizer.zero_grad(), which explicitly zeroes out the gradients.
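
A minimal sketch of the ordering issue, using a plain PyTorch optimizer and a placeholder model instead of the accelerate setup above (both are assumptions for illustration, as is the accumulation_steps value). The flattened gradient is copied right before optimizer.zero_grad(). set_to_none=True is passed explicitly so that the state after zeroing does not depend on the default of the installed PyTorch version.

```python
import torch

torch.manual_seed(0)

# Placeholder model, optimizer and data, for illustration only.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(4)]
accumulation_steps = 2  # hypothetical value

stored = []  # one flattened gradient vector per optimizer step
for step, (inputs, targets) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0 or (step + 1) == len(batches):
        optimizer.step()
        # Copy the gradients BEFORE they are cleared; torch.cat allocates a new tensor.
        flat = torch.cat([
            p.grad.detach().reshape(-1) if p.grad is not None
            else torch.zeros(p.numel(), device=p.device)
            for p in model.parameters()
        ])
        stored.append(flat)
        optimizer.zero_grad(set_to_none=True)

# After zeroing with set_to_none=True, the .grad attributes are None again.
grads_after_zeroing = [p.grad for p in model.parameters()]
```

Reading .grad at the end of the loop, as in grads_after_zeroing, therefore yields nothing useful, while the copies in stored keep the accumulated values.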

However, after loading the model and calculating the reference gradient, all of its entries are 0:

import torch
model = torch.load("test.pt")
reference_gradient = [p.grad.view(-1) if p.grad is not None else torch.zeros(p.numel()) for n, p in model.named_parameters()]
reference_gradient = torch.cat(reference_gradient)

Output: tensor([0., 0., 0., ..., 0., 0., 0.])

Could you please correct me? I might be missing something.

The state_dict will contain all registered parameters and buffers, but not the gradients.
If you want to store the gradients, your previous approach of creating e.g. a list or dict and storing the gradients there should work. Just make sure you are not zeroing them out before storing them.
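
A short sketch of this, assuming a placeholder linear model and an in-memory buffer in place of a real file (both are illustrative choices). It collects the gradients in a separate dict, saves that dict with torch.save, and loads it back.

```python
import io
import torch

torch.manual_seed(0)

model = torch.nn.Linear(4, 2)  # placeholder model
loss = model(torch.randn(8, 4)).sum()
loss.backward()

# The state_dict holds parameter and buffer values only, no gradients.
state = model.state_dict()

# Collect the gradients in their own dict before anything zeroes them.
grads = {name: p.grad.detach().clone()
         for name, p in model.named_parameters() if p.grad is not None}

buffer = io.BytesIO()  # in-memory stand-in for a file such as "grads.pt"
torch.save(grads, buffer)
buffer.seek(0)
loaded = torch.load(buffer)
```

Saving the gradients next to the checkpoint in this way keeps them available after the model itself has been reloaded.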