Cosine similarity on two huge vectors (RAM error)

Hi, I’m training a semantic-search model on news titles and texts -> is_related(bool)

I have converted the titles and texts into vectors using a transformer and want to calculate their similarity.

Since the torch tensors are so huge (142565x1024 each), the code below gives me an error saying “allocated more memory than is available”.

util.pytorch_cos_sim(embeddings1, embeddings2)

The code above computes the matrix product of the two tensors (A·B^T), but I only need the diagonal of the result, since I only want the similarity between each title and its matching text (cos_sim[i][i]). A 142565 x 142565 float32 matrix alone would take roughly 81 GB.
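To see the shapes involved, here is a tiny toy example (N = 5 standing in for 142565):

```python
import torch
from sentence_transformers import util

N, dim = 5, 1024
embeddings1 = torch.randn(N, dim)  # title embeddings
embeddings2 = torch.randn(N, dim)  # text embeddings

full = util.pytorch_cos_sim(embeddings1, embeddings2)
print(full.shape)        # torch.Size([5, 5]) -> N x N at full scale
print(torch.diag(full))  # the N values I actually need (cos_sim[i][i])
```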

Is there any way to calculate this without using lots of RAM?

In 142565x1024, I understand that 1024 is the embedding dim. What is 142565?

If it doesn’t fit in memory, try lazy loading (say, load just one pair at a time and compute the similarity score). This will slow down the computation (unless you parallelize it), but it saves RAM. A sketch follows below.
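A minimal sketch of the idea, assuming the embeddings were saved to disk as .npy files (the file names are hypothetical). Memory-mapping means only the rows you actually touch get read into RAM:

```python
import numpy as np

# Hypothetical files holding the (142565, 1024) embeddings.
emb1 = np.load("title_embeddings.npy", mmap_mode="r")  # memory-mapped, not loaded
emb2 = np.load("text_embeddings.npy", mmap_mode="r")

scores = np.empty(len(emb1), dtype=np.float32)
for i in range(len(emb1)):
    a, b = np.asarray(emb1[i]), np.asarray(emb2[i])  # load one pair at a time
    scores[i] = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```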

thanks :slight_smile: The dimension is (142565 x 1024): there are 142565 sentences, each with embedding_dim 1024.

Makes sense.
Don’t do the full matrix multiplication; instead, use lazy loading with a for loop (or yield). This gives up vectorization (and with it the huge memory consumption) but increases the computation time.