# In PyTorch, what is stored for backward and why?

I am building a machine learning framework and need to understand what is stored for the backward pass, so I checked out `torch.autograd.graph.saved_tensors_hooks` for some insights. Here is the code that I ran:

```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self, module, name=""):
        super().__init__()
        self.module = module
        self.name = name

    def forward(self, *args, **kwargs):
        def pack_hook(tensors):
            print("in forward hook of", self.name, tensors.shape)
            return tensors

        def unpack_hook(tensors):
            print("in backward hook of", self.name, tensors.shape)
            return tensors

        # Register the hooks so autograd reports every tensor it saves
        # for backward while running the wrapped module.
        with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
            rst = self.module(*args, **kwargs)
        return rst

net = nn.Sequential(
    MyModule(nn.Linear(3, 5), "m1"),
    MyModule(nn.Linear(5, 7), "m2"),
    MyModule(nn.Linear(7, 9), "m3"),
)

x = torch.randn(2, 3)
x = net(x)
```

And here is the result:

```
in forward hook of m1 torch.Size([2, 3])
in forward hook of m2 torch.Size([5, 7])
in forward hook of m2 torch.Size([2, 5])
in forward hook of m3 torch.Size([7, 9])
in forward hook of m3 torch.Size([2, 7])
```

After inspecting the tensors with shapes [5, 7] and [7, 9], I found that they are actually the transposed weights of m2 and m3 respectively.
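
One way to verify this is to compare each saved tensor against the wrapped layer's transposed weight inside `pack_hook`, for example (a rough sketch, replacing `pack_hook` in the module above):

```python
def pack_hook(tensors):
    w_t = self.module.weight.t()
    # Matching shape and values strongly suggests the saved tensor
    # is the layer's weight, stored as a transposed view.
    if tensors.shape == w_t.shape and torch.equal(tensors, w_t):
        print(self.name, "saved its transposed weight")
    return tensors
```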
What confuses me is: why was the weight of m1, whose transposed version would have shape [3, 5], not captured by `saved_tensors_hooks`?
Many thanks!

The first `weight` was not stored because the input `x` does not require gradients, so there is no need to save the first linear layer's `weight` for the `dgrad` (gradient with respect to the input) calculation.
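
Concretely, for a linear layer `y = x Wᵀ + b`, the standard backward formulas show which saved tensor each gradient needs (textbook math, not something specific to this code):

```latex
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\, W
\quad \text{(dgrad: needs the weight } W\text{)},
\qquad
\frac{\partial L}{\partial W} = \Big(\frac{\partial L}{\partial y}\Big)^{\top} x
\quad \text{(wgrad: needs the input } x\text{)}
```

Every layer's input is saved because every `weight` requires gradients, and the weights of m2 and m3 are saved (in transposed form, since the forward computes `x @ weight.t()`) because their inputs are outputs of earlier layers and therefore require gradients. m1's input `x` is a leaf that does not require gradients, so its `weight` can be skipped.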
Use `x = torch.randn(2, 3, requires_grad=True)` and the `weight` of the first layer will be stored as well.
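
As a quick check (reusing `net` from the question; the extra line is what I would expect to see based on the pattern above, not a captured log):

```python
x = torch.randn(2, 3, requires_grad=True)
y = net(x)
# pack_hook should now also fire for m1's transposed weight, i.e. an
# additional line like: in forward hook of m1 torch.Size([3, 5])
```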
