Is there a way to calculate cosine similarity between all combinations of embeddings?

If I have two tensors of shape (N, D), where D is the size of the embeddings, is there a simple, efficient
way to compute the tensor of shape (N, N) that contains the similarity between every pair of the N embeddings?

(Updated to correct the result shape)

There’s no way to do it directly, but here’s a relatively low-effort way to do this (just use cosine similarity instead of a distance):
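For example, something along these lines (an untested sketch; a and b are stand-in (N, D) tensors, and broadcasting does the pairing):

import torch
import torch.nn.functional as F

a = torch.randn(5, 16)  # stand-in (N, D) embeddings
b = torch.randn(5, 16)

# Broadcast (N, 1, D) against (1, N, D) so cosine_similarity is
# evaluated for every pair of rows, producing an (N, N) result.
sim = F.cosine_similarity(a.unsqueeze(1), b.unsqueeze(0), dim=2)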

Shouldn’t the similarities for any pair of the N embeddings be of shape (N, N)? Where does the last “D” come from?

Btw, I have read that if you have embeddings A and B and normalize them so that the norm of each embedding equals 1, then matmul(A, B.t()) should be the cosine similarity for each pair of embeddings?
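Something like this sketch, I mean (A and B are just random stand-ins):

import torch
import torch.nn.functional as F

A = torch.randn(4, 8)  # (N, D) stand-in embeddings
B = torch.randn(6, 8)  # (M, D)

# L2-normalize each row; the dot product of unit vectors is their
# cosine similarity, so a single matmul yields all (N, M) pairs.
sims = F.normalize(A, dim=1) @ F.normalize(B, dim=1).t()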

Yes, sorry, that shape was wrong; I’ve corrected it.

Also, now that I think of it, I think you are right: one could probably calculate this
in a way that is specific to cosine similarity.

I guess what I was really interested in is whether there is an abstract operation where you have two tensors and you get a result tensor by applying a function of two arguments to all pairs of values, where the values are taken along some dimension of those tensors.

So, to try this again more generally: let’s say you have two tensors of shapes (N, D) and (M, D) and a function that maps a pair of D-dimensional vectors to a vector of shape (K,). I want an operation that produces an (N, M, K) tensor by applying the function to all pairs of D-slices along the first axes.

This looks like such an obvious operation that I just feel it MUST exist! :slight_smile:
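For what it’s worth, something close to it can be assembled with torch.func.vmap (assuming PyTorch 2.x; pairwise_apply is just a made-up helper name, not a built-in):

import torch
from torch.func import vmap

def pairwise_apply(f, a, b):
    # a: (N, D), b: (M, D); f maps a pair of (D,) vectors to a (K,) tensor.
    # The inner vmap maps f over the rows of b for a fixed row of a; the
    # outer vmap maps that over the rows of a, giving an (N, M, K) result.
    return vmap(lambda x: vmap(lambda y: f(x, y))(b))(a)

a = torch.randn(4, 8)
b = torch.randn(5, 8)
out = pairwise_apply(lambda x, y: torch.cosine_similarity(x, y, dim=0).reshape(1), a, b)
print(out.shape)  # torch.Size([4, 5, 1])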

Definitely not the efficient solution you’re looking for, but this seems to come up first on Google, so I’ll write down my hacky solution for others. For me it was easy and fast enough because my num_embedding was < 10.

Just change the first line (to grab your own embeddings) and print(final).

import torch

# The first line pulls the embedding weight matrix out of the model;
# swap EMBEDDING_NAME for the name of your embedding layer.
embeddings = list(model.EMBEDDING_NAME.parameters())[0].detach().cpu()

# Build the (N, N) similarity matrix one pair at a time.
final = []
for i in range(len(embeddings)):
    ans = []
    for j in range(len(embeddings)):
        similarity = torch.cosine_similarity(embeddings[i].view(1, -1),
                                             embeddings[j].view(1, -1)).item()
        ans.append(similarity)
    final.append(ans)

And I used a heatmap to display the result:

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure()
sns.heatmap(final, annot=True)
display(fig)  # display() assumes a Jupyter/IPython notebook