pairwise_distance memory allocation / how it works

I have an issue with the function pairwise_distance from torch.nn.functional.
Unfortunately, I couldn't find any hint in the documentation or in the source code.
I would like to know how much memory it uses to compute the distance.

I have a fairly large tensor A of shape [5056512 × 381], which occupies around 7 GB on the GPU. When I call F.pairwise_distance between one row of A (A[i]) and the whole of A, my GPU runs out of memory and I don't know why.
Some notes:
I call torch.cuda.empty_cache() right before the distance computation, and at that point the GPU is almost free: 0.94 GB allocated and 3 GB reserved, out of 24 GB total.
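
For reference, here is a minimal sketch of the call pattern, scaled down so that it runs on any machine. The helper name row_to_all_distances and the small tensor shape are placeholders for illustration, not my actual code; the row is kept 2-D via A[i:i+1] so that it broadcasts against A.

```python
import torch
import torch.nn.functional as F

def row_to_all_distances(A, i):
    """Distance between row i of A and every row of A (hypothetical helper)."""
    # A[i:i+1] has shape (1, D) and broadcasts against A with shape (N, D).
    return F.pairwise_distance(A[i:i + 1], A)

# Small stand-in for the real tensor of shape [5056512, 381].
device = "cuda" if torch.cuda.is_available() else "cpu"
A_small = torch.randn(1000, 381, device=device)

torch.cuda.empty_cache()  # safe to call even when CUDA has not been initialized
d = row_to_all_distances(A_small, 0)
print(d.shape)  # one distance per row of A_small
```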

The error is the following:
CUDA out of memory. Tried to allocate 7.18 GiB (size of tensor A) but 15.30 GiB was already allocated.
It seems that the call allocates memory the size of A three times! Any hint?
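
As a sanity check on the numbers in the error message, here is the arithmetic, assuming A is stored as float32 (4 bytes per element; the dtype is my assumption). The result matches the 7.18 GiB request, and the 0.94 baseline plus two buffers the size of A comes close to the 15.30 GiB reported as already allocated. That would be consistent with full-size intermediate tensors being created during the call, though I have not confirmed this.

```python
# Size of A, assuming float32 (4 bytes per element).
rows, cols = 5056512, 381
bytes_per_element = 4
GIB = 2 ** 30

size_of_A_gib = rows * cols * bytes_per_element / GIB
print(f"A: {size_of_A_gib:.2f} GiB")  # A: 7.18 GiB

# Baseline allocation reported before the call (treated here as GiB),
# plus two buffers the size of A.
baseline_gib = 0.94
two_copies_gib = baseline_gib + 2 * size_of_A_gib
print(f"baseline + 2 x A: {two_copies_gib:.2f} GiB")
```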
(Since this computation is really slow on the CPU, I would like to keep it on the GPU and make it as fast as possible.)
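
One workaround I am considering is to process A in row chunks, so that any temporary buffer is only as large as one chunk. This is a rough sketch, not a tuned solution: chunked_row_distances and chunk_rows are names I made up, the chunk size is arbitrary, and I am assuming gradients are not needed.

```python
import torch
import torch.nn.functional as F

def chunked_row_distances(A, i, chunk_rows=262144):
    """Distances from row i of A to all rows, computed chunk by chunk."""
    query = A[i:i + 1]  # shape (1, D), broadcasts against each chunk
    out = torch.empty(A.shape[0], dtype=A.dtype, device=A.device)
    with torch.no_grad():  # assumption: no gradients required
        for start in range(0, A.shape[0], chunk_rows):
            stop = min(start + chunk_rows, A.shape[0])
            # A[start:stop] is a view, so slicing itself copies nothing.
            out[start:stop] = F.pairwise_distance(query, A[start:stop])
    return out

# Small check that the chunked result agrees with the single call.
A_small = torch.randn(1000, 16)
full = F.pairwise_distance(A_small[0:1], A_small)
chunked = chunked_row_distances(A_small, 0, chunk_rows=128)
print(torch.allclose(full, chunked))
```

With this layout, peak extra memory should scale with chunk_rows rather than with the number of rows in A, at the cost of more kernel launches.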
Thank you very much, every suggestion is appreciated!