In PyTorch, what is stored for backward and why?

I am building a machine learning framework and need to understand what is stored for backward, so I looked into torch.autograd.graph.saved_tensors_hooks for some insight. Here is the code that I ran:

import torch.nn as nn
import torch


class MyModule(nn.Module):
    def __init__(self, module, name=""):
        super().__init__()
        self.module = module
        self.name = name

    def forward(self, *args, **kwargs):
        def pack_hook(tensors):
            print("in forward hook of", self.name, tensors.shape)
            return tensors

        def unpack_hook(tensors):
            print("in backward hook of", self.name, tensors.shape)
            return tensors

        with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
            rst = self.module(*args, **kwargs)
            return rst


net = nn.Sequential(
    MyModule(nn.Linear(3, 5), "m1"),
    MyModule(nn.Linear(5, 7), "m2"),
    MyModule(nn.Linear(7, 9), "m3")
)


x = torch.randn(2, 3)
x = net(x)

And here is the result:

in forward hook of m1 torch.Size([2, 3])
in forward hook of m2 torch.Size([5, 7])
in forward hook of m2 torch.Size([2, 5])
in forward hook of m3 torch.Size([7, 9])
in forward hook of m3 torch.Size([2, 7])

After inspecting the tensors with shapes [5, 7] and [7, 9], I found that they are actually the transposed weights of the corresponding linear layers.
I am confused: why was the weight of m1, whose transposed version would have shape [3, 5], not captured by saved_tensors_hooks?
Many thanks!

The first weight was not stored because the input x does not require gradients, so there is no need to save the weight of the first linear layer for the dgrad calculation (the gradient with respect to the layer's input).
Use x = torch.randn(2, 3, requires_grad=True) and the weight of the first layer will be stored as well.
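
In general, for a linear layer y = x @ W.t() + b, backward needs x to compute the weight gradient and W to compute the input gradient. That matches what the hooks report: each layer saves its input (because its weight requires grad), and saves its weight only when its input requires grad. A minimal sketch of the suggested check, reusing the net defined above (the exact shape of the saved weight tensor can differ across PyTorch versions, since which operand the underlying matmul saves is an implementation detail):

# With requires_grad=True on the input, m1 now also needs its weight for
# backward, so an additional "in forward hook of m1" line should appear
# (with a [3, 5] or [5, 3] shape, depending on the PyTorch version).
x = torch.randn(2, 3, requires_grad=True)
y = net(x)
y.sum().backward()  # running backward triggers the unpack hooks as well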
