How is backpropagation applied in transformers? It is my understanding that in an encoder block there are two different fully connected NNs (linear projections):
- One is for the query embedding
- And the other is for the key embedding (there is also a third one for the value embedding, but it is not involved in the similarity step I'm asking about)
Those two NNs output projected versions of the query and key embeddings. Those two vectors are then multiplied (using the dot product) to compute a similarity score between them. My question is:
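To make concrete what I mean, here is a minimal NumPy sketch of that forward step (the names `W_q`, `W_k` and the dimensions are my own; the scaling by the square root of the dimension follows the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                        # 4 tokens, embedding dimension 8

X = rng.standard_normal((n, d))    # token embeddings
W_q = rng.standard_normal((d, d))  # query projection ("NN" 1)
W_k = rng.standard_normal((d, d))  # key projection   ("NN" 2)

Q = X @ W_q                        # projected queries
K = X @ W_k                        # projected keys
S = Q @ K.T / np.sqrt(d)           # pairwise similarity scores, shape (4, 4)

# a row-wise softmax turns the scores into attention weights
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
```

So `S[i, j]` is the dot-product similarity between the query of token `i` and the key of token `j`.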
How does the backpropagation algorithm unfold in this case? Do both the query NN and the key NN have the same loss backpropagated through them? And if the same loss is backpropagated, then why use two different neural networks in the first place?
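Here is how I picture the gradients splitting, using a toy scalar loss of my own choosing (the sum of all similarity scores) just to trace the chain rule. Even though a single loss value flows back, the two projections receive different gradients because they sit at different positions in the computation graph:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 5
X = rng.standard_normal((n, d))
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))

def loss(W_q, W_k):
    # toy scalar loss: sum of every query-key similarity score
    Q, K = X @ W_q, X @ W_k
    return (Q @ K.T / np.sqrt(d)).sum()

# hand-derived backprop for L = sum(S), S = Q K^T / sqrt(d):
# dL/dS is a matrix of ones, scaled by 1/sqrt(d)
Q, K = X @ W_q, X @ W_k
dS = np.ones((n, n)) / np.sqrt(d)
dW_q = X.T @ (dS @ K)      # gradient reaching the query NN (depends on K)
dW_k = X.T @ (dS.T @ Q)    # gradient reaching the key NN (depends on Q)

# finite-difference check on one entry of W_q
eps = 1e-6
E = np.zeros((d, d)); E[0, 0] = eps
num_q = (loss(W_q + E, W_k) - loss(W_q - E, W_k)) / (2 * eps)
```

Note that `dW_q` involves the keys `K` while `dW_k` involves the queries `Q`, so the two matrices are updated differently even under one shared loss, which is part of what I'd like confirmed.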