Matmul in transformers

Sorry, maybe it’s a stupid question and it’s already late, but how is the grad computed, from the calculus standpoint, for a matrix multiplied by itself? I mean, if we have an m x m matrix and multiply it by itself, just like in self-attention in transformers, isn’t that already a non-linearity without an activation function?

Hi Sergey!

It’s just regular calculus. Let’s say that A is an m x m matrix. Then
(A @ A)[i, j] = sum_k (A[i, k] * A[k, j]). To compute the full
Jacobian of A @ A with respect to A, you need to evaluate the derivatives
d (A @ A)[i, j] / d A[k, l] for all values of the indices i, j, k, and l.
You have a simple sum of products, so doing so is straightforward, although
it gets a bit fussy keeping track of the indices.
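
If you want to see this concretely, here is a quick sketch (assuming a reasonably
recent PyTorch, which provides torch.autograd.functional.jacobian) that checks the
index bookkeeping numerically against the derivative of the sum of products:

import torch

m = 3
A = torch.randn(m, m)

# J[i, j, k, l] = d (A @ A)[i, j] / d A[k, l], computed by autograd
J = torch.autograd.functional.jacobian(lambda M: M @ M, A)

# differentiating sum_k (A[i, k] * A[k, j]) term by term gives
# d (A @ A)[i, j] / d A[k, l] = delta(i, k) * A[l, j] + A[i, k] * delta(j, l)
I = torch.eye(m)
J_analytic = torch.einsum('ik,lj->ijkl', I, A) + torch.einsum('ik,jl->ijkl', A, I)

print(torch.allclose(J, J_analytic))   # prints True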

Yes, it is non-linear, and you don’t have a separate activation function. But
that’s okay – there’s nothing problematic with autograd computing gradients
of non-linear expressions:

import torch

x = torch.tensor([2.0], requires_grad=True)
(x**3).backward()   # gradient of a non-linear (cubic) expression

works just fine.
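
And the same goes for the matrix case you asked about: autograd backpropagates
through A @ A like any other operation. A minimal sketch, again assuming PyTorch:

import torch

A = torch.randn(4, 4, requires_grad=True)

# a scalar built from the non-linear (quadratic-in-A) expression A @ A
loss = (A @ A).sum()
loss.backward()

print(A.grad.shape)   # torch.Size([4, 4])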

Best.

K. Frank
