Compute Neural Network Gradients

Hi! After reading the linked material, I have a question.

According to useful identities (5) and (6) in Chapter 2 of the PDF, the formulas are defined as follows:

(5) Matrix times column vector with respect to the matrix

(6) Row vector times matrix with respect to the matrix

(5) z = Wx, \delta = \frac{\partial J}{\partial z}: \quad \frac{\partial J}{\partial W} = \delta^\top x^\top

(6) z = xW, \delta = \frac{\partial J}{\partial z}: \quad \frac{\partial J}{\partial W} = x^\top \delta
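For context, the element-wise reading I have in mind (my own restatement, not something stated in the PDF) is

z = Wx \;\Rightarrow\; z_i = \sum_k W_{ik} x_k \;\Rightarrow\; \frac{\partial J}{\partial W_{ij}} = \sum_k \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j,

so the whole gradient should be an outer product of \delta and x.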

I was wondering whether these identities actually hold, so I implemented them with PyTorch:

import random
import numpy as np
import torch


def set_seed(seed: int = 42):
    """Seed fixer (random, numpy, torch)
    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if using multi-GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    
    
set_seed()

# (5) column vector case
x = torch.randn(5, 1).requires_grad_()
W = torch.randn(3, 5).requires_grad_()

y = W @ x

print(torch.isclose(
    torch.ones_like(y) @ x.T,  # delta * x^T
    torch.autograd.grad(y, W, torch.ones_like(y), retain_graph=True)[0]
).all())
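
# For reference (just a sketch): the same column-vector check via a scalar
# loss J = z.sum(), so that delta = dJ/dz is all ones.
x_c = torch.randn(5, 1)
W_c = torch.randn(3, 5, requires_grad=True)
z = W_c @ x_c
z.sum().backward()                              # J = sum(z)
delta = torch.ones_like(z)                      # (3, 1)
print(torch.allclose(W_c.grad, delta @ x_c.T))  # delta x^T, shape (3, 5)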

# (6) row vector case
x = torch.randn(1, 2).requires_grad_()
f = torch.nn.Linear(2, 3, bias=False).requires_grad_()

y = f(x)
torch.autograd.backward(y, torch.ones_like(y), retain_graph=True)

assert torch.isclose(
    x.T @ torch.ones_like(y),  # x^T * delta
    # nn.Linear stores its weight transposed (out_features, in_features), hence the .T
    torch.autograd.grad(y, f.weight, torch.ones_like(y), retain_graph=True)[0].T
).all()
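
# Also a sketch: the row-vector case with a plain (in, out) weight, so that
# z = x @ W literally matches identity (6) without nn.Linear's transposed storage.
x_r = torch.randn(1, 2)
W_r = torch.randn(2, 3, requires_grad=True)
z = x_r @ W_r
z.backward(torch.ones_like(z))
delta = torch.ones_like(z)                      # (1, 3)
print(torch.allclose(W_r.grad, x_r.T @ delta))  # x^T delta, shape (2, 3)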

In my results, the row vector case matched the PDF, but the column vector case did not.

And even when I look at it purely as a formula, I don't understand it.

In (5), \delta means \frac{\partial J}{\partial z}, which should be a column vector. To get the outer product we need a column vector times a row vector, so it seems strange to take the transpose of \delta.
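To make my confusion concrete, here is a small shape sketch reusing the sizes from the column-vector example above (with delta taken to be all ones, as in the code):

delta = torch.ones(3, 1)            # same shape as z = W @ x
x_col = torch.randn(5, 1)
print((delta @ x_col.T).shape)      # torch.Size([3, 5]) -- matches W's shape
# delta.T @ x_col.T would be (1, 3) @ (1, 5), which does not even multiply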

Is the formula wrong? Or is the implementation wrong?