Sorry, maybe it’s a stupid question and it’s already late, but how is the gradient computed for a matrix multiplied by itself, from the calculus standpoint? I mean, if we have an m x m matrix and multiply it by itself, just like in self-attention in transformers, isn’t that already a non-linearity without an activation function?

Hi Sergey!

It’s just regular calculus. Let’s say that `A` is an `m x m` matrix. Then `(A @ A)[i, j] = sum_k (A[i, k] * A[k, j])`. To compute the full Jacobian of `A @ A` with respect to `A`, you need to evaluate the derivatives `d (A @ A)[i, j] / d A[k, l]` for all values of the indices `i`, `j`, `k`, and `l`.

You have a simple sum of products, so doing so is straightforward, although it gets a bit fussy keeping track of the indices.
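If you want to see it concretely, `torch.autograd.functional.jacobian` will give you all of those derivatives at once, and you can compare them with what the sum-of-products rule predicts. A minimal sketch (the `einsum` expression just spells out the result of differentiating the sum of products by hand):

```
import torch

# Ask autograd for the full Jacobian of A @ A with respect to A.
m = 3
A = torch.randn(m, m, dtype=torch.float64)

# jac has shape (m, m, m, m), indexed as jac[i, j, k, l] = d (A @ A)[i, j] / d A[k, l]
jac = torch.autograd.functional.jacobian(lambda X: X @ X, A)

# Differentiating the sum of products by hand gives
#     d (A @ A)[i, j] / d A[k, l] = delta(i, k) * A[l, j] + A[i, k] * delta(j, l),
# one term from each factor of A.
eye = torch.eye(m, dtype=torch.float64)
manual = torch.einsum('ik,lj->ijkl', eye, A) + torch.einsum('ik,jl->ijkl', A, eye)

print(torch.allclose(jac, manual))   # True
```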

Yes, it is non-linear, and you don’t have a separate activation function. But that’s okay – there’s nothing problematic with autograd computing gradients of non-linear expressions:

```
import torch

x = torch.tensor([2.0], requires_grad=True)
(x**3).backward()
```

works just fine.
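The same goes for the matrix case. A minimal sketch (the loss is reduced with `.sum()` only so that `.backward()` can be called on a scalar):

```
import torch

# Backpropagate through A @ A itself.
m = 4
A = torch.randn(m, m, dtype=torch.float64, requires_grad=True)

(A @ A).sum().backward()

# For this particular scalar, the gradient works out analytically to
#     grad[k, l] = sum_j A[l, j] + sum_i A[i, k]
# (one term from each factor of A), which we can check numerically.
expected = A.sum(dim=1).unsqueeze(0) + A.sum(dim=0).unsqueeze(1)
print(torch.allclose(A.grad, expected))   # True
```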

Best.

K. Frank
