How to save the gradient after each batch (or epoch)?

I have an MLP model and I want to save the gradients after each iteration and average them at the end. How can I do that?


import torch
import torch.nn.functional as func

class MyModel(torch.nn.Module):

    def __init__(self, layers_size, input_size=784, num_classes=10):

        super(MyModel, self).__init__()

        # create input layer
        self.layers = torch.nn.ModuleList(
                    [torch.nn.Linear(input_size, layers_size[0])])
        # stack hidden layers
        self.layers.extend([torch.nn.Linear(layers_size[i - 1],
                        layers_size[i]) for i in range(1, len(layers_size))])
        # add output layer
        self.layers.append(torch.nn.Linear(layers_size[-1], num_classes))

    def forward(self, x):
        # iterate over all the layers
        for i, layer in enumerate(self.layers):
            if i == len(self.layers) - 1:
                # softmax for last layer (dim=1 assumes input of shape [batch, features])
                x = func.softmax(layer(x), dim=1)
            else:
                x = func.relu(layer(x))  # ReLU for all other layers
        return x

Here is what I would like to do:

for epoch in epochs:
    for batch in batches:
        # save the gradient after this batch


Thanks in advance.

Each backward() call will accumulate the gradients in the .grad attribute of the parameters.
You could thus accumulate the gradients in your data loop and calculate the average afterwards by iterating all parameters and dividing the .grads by the number of steps.
Alternatively you could also use the autograd.grad method and manually accumulate the gradients.
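
A minimal sketch of both options, assuming a small stand-in model, random data, and hypothetical names such as `grad_sum` (none of these come from the thread):

```python
import torch

torch.manual_seed(0)

# Stand-in model and data; assumptions for illustration only.
model = torch.nn.Linear(4, 2)
criterion = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(3)]

# Option 1: let backward() accumulate into .grad, then divide once.
model.zero_grad()
for inputs, targets in batches:
    loss = criterion(model(inputs), targets)
    loss.backward()  # adds to param.grad because zero_grad() is not called in between
avg_from_backward = {name: p.grad.clone() / len(batches)
                     for name, p in model.named_parameters()}

# Option 2: torch.autograd.grad returns the gradients without touching .grad.
names = [name for name, _ in model.named_parameters()]
params = list(model.parameters())
grad_sum = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
for inputs, targets in batches:
    loss = criterion(model(inputs), targets)
    grads = torch.autograd.grad(loss, params)  # tuple, one tensor per parameter
    for name, g in zip(names, grads):
        grad_sum[name] += g
avg_from_autograd = {name: g / len(batches) for name, g in grad_sum.items()}
```

Note that option 1 only works as written if nothing resets .grad between steps. In a typical training loop that calls optimizer.zero_grad() every iteration, a separate buffer (as in option 2) is probably the safer choice.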

Thank you @ptrblck,

Here is what I did now:

grad_bank = {}
model = MyModel()
avg_counter = 0

for epoch in epochs:
    for inputs, targets in batches:
        outputs = model.batch(inputs)
        loss = criterion(outputs, targets)

        for idx, param in enumerate(model.parameters()):
            grad_bank[f"layer_{idx}"] +=
            avg_counter += 1


for key in grad_bank:
    grad_bank[key] = grad_bank[key] / avg_counter

Is it right? Also, how do I use the autograd.grad method?

Thanks for the help.

The loop looks correct. I would recommend not using the .data attribute and, if necessary, wrapping the code in a with torch.no_grad() block.

Thanks, @ptrblck,

Will .data create some problem? Is there something I should know?


Yes, the usage of the .data attribute is not recommended, as it might yield unwanted side effects.
Autograd won’t be able to track this operation and will thus not be able to raise a proper error, if your manipulation is incorrect (e.g. by changing the underlying data while the computation graph used the original tensors).
If you don’t want to track this operation, wrap it in the no_grad() guard.
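
As a rough sketch of what that could look like, assuming a single stand-in parameter `w` and a hypothetical `grad_bank` buffer (neither is from the code above):

```python
import torch

# Stand-in parameter; an assumption for illustration only.
w = torch.nn.Parameter(torch.ones(3))
loss = (w * 2).sum()
loss.backward()  # w.grad is now a tensor filled with 2.0

grad_bank = torch.zeros_like(w)

# Accumulate the gradient without recording the operation in the graph.
with torch.no_grad():
    grad_bank += w.grad

# detach() is another way to get a tensor that autograd does not track.
snapshot = w.grad.detach().clone()
```

Tensors stored in .grad normally do not require gradients themselves, so the guard is mostly a precaution here. It matters more when the accumulated values could otherwise end up in the computation graph.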


If we use a loss function whose reduction attribute is set to 'mean', shouldn’t avg_counter be outside the batch loop?
Also, I don’t understand why the counter is inside the parameters() loop. Why should we divide each gradient by the number of layers in a neural network?
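
One possible reading, sketched under the assumption of a stand-in linear model, random data, and a hypothetical `num_steps` counter: with reduction='mean' each batch gradient is already averaged over the samples in that batch, so incrementing the counter once per batch, outside the parameters() loop, gives the average over batches.

```python
import torch

torch.manual_seed(0)

# Stand-in model and data; assumptions for illustration only.
model = torch.nn.Linear(4, 2)
criterion = torch.nn.CrossEntropyLoss(reduction="mean")
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(3)]

grad_bank = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
num_steps = 0

for inputs, targets in batches:
    model.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            grad_bank[name] += p.grad
    num_steps += 1  # once per batch, outside the parameter loop

avg_grads = {name: g / num_steps for name, g in grad_bank.items()}
```

If the counter were incremented inside the parameters() loop instead, it would end up as the number of batches times the number of parameter tensors, and every averaged gradient would be too small by that second factor.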