I have an MLP model and I want to save the gradients after each iteration and average them at the end. How can I do that?

model:

import torch
import torch.nn.functional as func

class MyModel(torch.nn.Module):
    def __init__(self, layers_size, input_size=784, num_classes=10):
        super(MyModel, self).__init__()
        # create input layer
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(input_size, layers_size[0])])
        # stack hidden layers
        self.layers.extend([torch.nn.Linear(layers_size[i - 1],
            layers_size[i]) for i in range(1, len(layers_size))])
        # add output layer
        self.layers.append(torch.nn.Linear(layers_size[-1], num_classes))

    def forward(self, x):
        # iterate over all the layers
        for i, layer in enumerate(self.layers):
            if i == len(self.layers) - 1:
                x = func.softmax(layer(x), dim=1)  # softmax for last layer
            else:
                x = func.relu(layer(x))
        return x

Here is what I would like to do:

for epoch in epochs:
    for batch in batches:
        model.forward(batch)
        compute_gradients
        save(gradients)
        model.backward()
average(gradients)

Each backward() call will accumulate the gradients in the .grad attribute of the parameters.
You could thus accumulate the gradients in your data loop and calculate the average afterwards by iterating over all parameters and dividing each .grad by the number of steps.
Alternatively, you could use torch.autograd.grad and accumulate the gradients manually.
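
As a rough illustration of the first approach, here is a minimal sketch. The tiny Linear model, the MSELoss criterion and the random batches are hypothetical stand-ins, not the model from the question. It assumes the gradients are never zeroed inside the loop and no optimizer step is taken, so every backward() call adds into the same .grad buffers.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the real model, loss and data loader.
model = torch.nn.Linear(4, 2)
criterion = torch.nn.MSELoss()
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]

model.zero_grad()  # start from clean .grad buffers
num_steps = 0
for inputs, targets in batches:
    loss = criterion(model(inputs), targets)
    loss.backward()  # adds this batch's gradients into param.grad
    num_steps += 1

# Divide the accumulated sums by the number of backward() calls.
avg_grads = {name: param.grad / num_steps
             for name, param in model.named_parameters()}
```

If you call zero_grad() and step() in every iteration (as in a normal training loop), the .grad buffers are reset each time, so the sums would have to be kept in a separate container instead.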

grad_bank = {}
model = MyModel(layers_size)
avg_counter = 0
for epoch in epochs:
    for inputs, targets in batches:
        opt.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        for idx, param in enumerate(model.parameters()):
            key = f"layer_{idx}"
            # the first batch creates the entry, later batches add to it
            grad_bank[key] = grad_bank.get(key, 0) + param.grad.data
            avg_counter += 1
        opt.step()
for key in grad_bank:
    grad_bank[key] = grad_bank[key] / avg_counter

Is this right? Also, how do I use the autograd.grad method?

Yes, the usage of the .data attribute is not recommended, as it might yield unwanted side effects.
Autograd won't be able to track this operation and will thus not be able to raise a proper error if your manipulation is incorrect (e.g. by changing the underlying data while the computation graph used the original tensors).
If you don't want to track this operation, wrap it in the no_grad() guard.
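
For the autograd.grad question, something along these lines should work. This is only a sketch under assumptions: the small Linear model, MSELoss and random batches are hypothetical placeholders, and the running sums are updated inside a no_grad() block instead of going through .data.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the real model, loss and data loader.
model = torch.nn.Linear(4, 2)
criterion = torch.nn.MSELoss()
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]

params = list(model.parameters())
grad_sums = [torch.zeros_like(p) for p in params]
num_steps = 0

for inputs, targets in batches:
    loss = criterion(model(inputs), targets)
    # Returns one gradient tensor per parameter and leaves param.grad untouched.
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():  # the accumulation itself is not tracked by autograd
        for total, g in zip(grad_sums, grads):
            total += g
    num_steps += 1  # counted once per batch, not once per parameter

avg_grads = [total / num_steps for total in grad_sums]
```

Note that autograd.grad does not fill param.grad, so an optimizer step would not see these gradients unless you copy them into .grad yourself.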

In the case where we use a loss function whose reduction attribute is equal to 'mean', shouldn't avg_counter be outside the batch loop?
Also, I don't understand why the counter is inside the parameters() loop. Why should we divide each gradient by the number of layers in the case of a neural network?