How is backpropagation applied in transformers? It is my understanding that in an encoder block there are two different fully connected NNs (linear projections):
- One is for the query embedding
- And the other is for the key embedding (there is also a third one for the value embedding, but it is not involved in the similarity step I'm asking about)
Those two NNs output projected versions of the query and key embeddings. Those two vectors are then multiplied (using the dot product) to compute a similarity score between them. My question is:
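To make concrete what I mean, here is a minimal NumPy sketch of that forward step (the names `W_q`, `W_k` and the dimensions are my own; the scaling by the square root of the dimension follows the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                        # 4 tokens, embedding dimension 8

X = rng.standard_normal((n, d))    # token embeddings
W_q = rng.standard_normal((d, d))  # query projection ("NN" 1)
W_k = rng.standard_normal((d, d))  # key projection   ("NN" 2)

Q = X @ W_q                        # projected queries
K = X @ W_k                        # projected keys
S = Q @ K.T / np.sqrt(d)           # pairwise similarity scores, shape (4, 4)

# a row-wise softmax turns the scores into attention weights
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
```

So `S[i, j]` is the dot-product similarity between the query of token `i` and the key of token `j`.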
How does the backpropagation algorithm unfold in this case? Do both the query NN and the key NN have the same loss backpropagated through them? And if the same loss is backpropagated, then why use two different neural networks in the first place?
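Here is how I picture the gradients splitting, using a toy scalar loss of my own choosing (the sum of all similarity scores) just to trace the chain rule. Even though a single loss value flows back, the two projections receive different gradients because they sit at different positions in the computation graph:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 5
X = rng.standard_normal((n, d))
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))

def loss(W_q, W_k):
    # toy scalar loss: sum of every query-key similarity score
    Q, K = X @ W_q, X @ W_k
    return (Q @ K.T / np.sqrt(d)).sum()

# hand-derived backprop for L = sum(S), S = Q K^T / sqrt(d):
# dL/dS is a matrix of ones, scaled by 1/sqrt(d)
Q, K = X @ W_q, X @ W_k
dS = np.ones((n, n)) / np.sqrt(d)
dW_q = X.T @ (dS @ K)      # gradient reaching the query NN (depends on K)
dW_k = X.T @ (dS.T @ Q)    # gradient reaching the key NN (depends on Q)

# finite-difference check on one entry of W_q
eps = 1e-6
E = np.zeros((d, d)); E[0, 0] = eps
num_q = (loss(W_q + E, W_k) - loss(W_q - E, W_k)) / (2 * eps)
```

Note that `dW_q` involves the keys `K` while `dW_k` involves the queries `Q`, so the two matrices are updated differently even under one shared loss, which is part of what I'd like confirmed.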