F.cosine_similarity in a for loop causes OOM

Why does calling F.cosine_similarity in a for loop cause the GPU to run out of memory (OOM)? The first iteration uses 18000 of GPU memory, but the 200th iteration uses 60000. Why is this? Please help. Thank you.

Hey!

That is a great question: it is not obvious why memory would jump by the 200th iteration.
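One way to narrow this down is to log allocated memory on every iteration while making sure nothing from earlier iterations is kept alive. The sketch below is only an illustration, not the original poster's code: the function name `run_loop` and the tensor sizes are hypothetical, and it assumes a CUDA or CPU device. On torch_npu an equivalent memory query would be needed in place of `torch.cuda.memory_allocated`.

```python
import torch
import torch.nn.functional as F


def run_loop(num_iters=5, n=64, dim=32, device=None):
    """Call F.cosine_similarity repeatedly and record one float per iteration.

    Hypothetical helper for illustration; sizes are arbitrary.
    """
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, dim, device=device)
    b = torch.randn(n, dim, device=device)

    history = []
    # no_grad: no autograd graph is built, so none can accumulate across iterations
    with torch.no_grad():
        for i in range(num_iters):
            sim = F.cosine_similarity(a, b, dim=1)
            # .item() stores a plain Python float instead of a device tensor
            history.append(sim.mean().item())
            if device == "cuda":
                mib = torch.cuda.memory_allocated() // 2**20
                print(f"iter {i}: {mib} MiB allocated")
    return history
```

If allocated memory stays flat in a loop like this but still grows in the real code, the growth probably comes from something the real loop retains (for example, tensors appended to a list with their graph attached) or from the backend itself.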

cc @FFFrog, maybe this is related to torch_npu?

@albanD Thank you for the mention.

Yes, this may be a bug in torch_npu. I will look into it.

Hey @XiXiRuPan, would you mind filing an issue in this repo? This is a torch_npu bug rather than a PyTorch bug. I will find the root cause and post it in reply to your question.