Sorry, maybe it’s a stupid question and it’s already late, but how is the gradient computed for a matrix multiplied by itself, from the calculus standpoint? I mean, if we have an m x m matrix and multiply it by itself, just like in self-attention in transformers, isn’t that already a non-linearity without an activation function?
Hi Sergey!
It’s just regular calculus. Let’s say that A is an m x m matrix. Then
(A @ A)[i, j] = sum_k (A[i, k] * A[k, j]).
To compute the full Jacobian of A @ A with respect to A, you need to evaluate the
derivatives d (A @ A)[i, j] / d A[k, l] for all values of the indices i, j, k, and l.
You have a simple sum of products, so doing so is straightforward, although
it gets a bit fussy keeping track of the indices.
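As a sketch of that index bookkeeping (not part of the original derivation): differentiating the sum of products gives
d (A @ A)[i, j] / d A[p, q] = delta(i, p) * A[q, j] + A[i, p] * delta(q, j),
and you can check this formula against the Jacobian that autograd computes:

```python
import torch

torch.manual_seed(0)
m = 3
A = torch.randn(m, m)

# autograd's Jacobian of A @ A, shape (m, m, m, m); entry [i, j, p, q]
# is d (A @ A)[i, j] / d A[p, q]
J = torch.autograd.functional.jacobian(lambda X: X @ X, A)

# the hand-derived Jacobian:
#   delta(i, p) * A[q, j]  +  A[i, p] * delta(q, j)
eye = torch.eye(m)
J_manual = (torch.einsum('ip,qj->ijpq', eye, A)
            + torch.einsum('ip,qj->ijpq', A, eye))

print(torch.allclose(J, J_manual))  # True
```

The `einsum` calls just place the two delta-times-A terms into the right slots of the four-index Jacobian.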
Yes, it is non-linear, and you don’t have a separate activation function. But
that’s okay – there’s nothing problematic with autograd computing gradients
of non-linear expressions:
import torch

x = torch.tensor([2.0], requires_grad=True)
(x**3).backward()
works just fine.
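To make that concrete (my own check, not from the original post), the gradient autograd produces agrees with the calculus result d/dx x**3 = 3 x**2, which is 12 at x = 2:

```python
import torch

# gradient of the non-linear expression x**3 at x = 2
x = torch.tensor([2.0], requires_grad=True)
(x**3).backward()
print(x.grad)  # tensor([12.])
```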
Best.
K. Frank