Difference in backpropagation depending on whether weighted attention is used with a cosine similarity loss

I have a model for continual learning with triplets. Here is how the model is trained, followed by my question.

  1. During training, features are extracted from the backbone network for the triplet (anchor, positive, negative), and each feature is multiplied by a weight.
  2. These weights are computed individually by a model A that takes the triplet features as input.
  3. From the weighted triplet features, the cosine similarity is computed for the anchor-positive and anchor-negative pairs.
  4. The computed cosine similarities are converted into refined similarity scores by a model B, and the model is trained with a triplet margin loss on these two scores.

Only model A and model B receive gradients during this process; everything else is frozen. A minimal sketch of the setup is shown below.
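
For reference, here is a minimal sketch of the setup described above. The module names, shapes, and the margin value are placeholders I made up for illustration, not my actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen backbone (placeholder); only model_a and model_b are trained.
backbone = nn.Linear(512, 256)
for p in backbone.parameters():
    p.requires_grad = False

model_a = nn.Sequential(nn.Linear(256, 1), nn.Softplus())  # one positive weight per feature vector
model_b = nn.Linear(1, 1)                                  # refines a similarity score

def training_step(anchor, positive, negative, margin=0.2):
    # 1. Extract features for the triplet with the frozen backbone.
    fa, fp, fn_ = backbone(anchor), backbone(positive), backbone(negative)

    # 2. Model A computes a weight for each feature, which is multiplied in.
    fa = fa * model_a(fa)
    fp = fp * model_a(fp)
    fn_ = fn_ * model_a(fn_)

    # 3. Cosine similarity for the anchor-positive and anchor-negative pairs.
    sim_ap = F.cosine_similarity(fa, fp, dim=-1)
    sim_an = F.cosine_similarity(fa, fn_, dim=-1)

    # 4. Model B refines the two similarities; triplet-style margin loss on the scores.
    score_ap = model_b(sim_ap.unsqueeze(-1)).squeeze(-1)
    score_an = model_b(sim_an.unsqueeze(-1)).squeeze(-1)
    return F.relu(score_an - score_ap + margin).mean()
```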

My question is about step 3. The cosine similarity calculation includes an L2 normalization of each feature, so I confirmed that even if each feature is multiplied by its weight, the normalization undoes that multiplication. (Cosine similarity depends only on the angle between the two vectors, not on their scale.)
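
This is the kind of check I mean; assuming the weight from model A is a single positive scalar per feature vector, the scaling cancels in the forward pass and the gradient of the cosine similarity with respect to that scalar is (numerically) zero:

```python
import torch
import torch.nn.functional as F

a = torch.randn(256)
p = torch.randn(256)
w = torch.tensor(3.7, requires_grad=True)  # positive scalar weight, as if produced by model A

sim_plain = F.cosine_similarity(a, p, dim=0)
sim_weighted = F.cosine_similarity(w * a, p, dim=0)

print(torch.allclose(sim_plain, sim_weighted, atol=1e-6))  # True: the scaling cancels

sim_weighted.backward()
print(w.grad)  # ~0: no useful gradient reaches the scalar weight through the cosine
```

Note that this cancellation only holds for a positive scalar weight per vector; if model A instead outputs an elementwise weight vector, the multiplication changes the direction of the feature and is not undone by the L2 normalization.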

So I expected that multiplying each feature by its weight would have no effect on learning, because the cosine similarity cancels it. However, contrary to my expectation, computing the weights through model A and multiplying them into the features during training gave better performance at evaluation time than leaving this step out.

When I checked in more detail, how much model B learned changed depending on whether the weight multiplication through model A was present. My current assumption is that the weight multiplication from model A is meaningless in the forward pass during training, but that in the backward pass it adds more terms to the chain rule, which steers the learning of model B more specifically.
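
A sketch of the kind of comparison I mean: compute the loss once with and once without the weight multiplication, and compare the gradients that end up in model B (placeholder modules and random features for illustration, not my real code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for features coming out of the frozen backbone.
feat_a = torch.randn(8, 256)
feat_p = torch.randn(8, 256)
feat_n = torch.randn(8, 256)

model_a = nn.Sequential(nn.Linear(256, 1), nn.Softplus())
model_b = nn.Linear(1, 1)

def triplet_loss(use_weights, margin=0.2):
    fa, fp, fn_ = feat_a, feat_p, feat_n
    if use_weights:
        fa = fa * model_a(fa)
        fp = fp * model_a(fp)
        fn_ = fn_ * model_a(fn_)
    sim_ap = F.cosine_similarity(fa, fp, dim=-1)
    sim_an = F.cosine_similarity(fa, fn_, dim=-1)
    score_ap = model_b(sim_ap.unsqueeze(-1)).squeeze(-1)
    score_an = model_b(sim_an.unsqueeze(-1)).squeeze(-1)
    return F.relu(score_an - score_ap + margin).mean()

for use_weights in (False, True):
    model_b.zero_grad()
    triplet_loss(use_weights).backward()
    print(use_weights, model_b.weight.grad.item(), model_b.bias.grad.item())
```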

Can anyone tell me where the actual performance improvement comes from?