What is the most efficient way to get the closest pair of a set of vectors?

flyaway · December 20, 2019, 5:50pm

I have a matrix M in size of n*d, where n is the number of vectors and d is the dimension of each vector. Now, I want to find the closest pair of this n vectors. Be note, I only need the closest pair.
I have two ways of doing this:

Use the PairwiseDistance and argmin function. However, this method uses too much memory. Half of the returned matrix is useless.
functional.pdist works well, it does not double computing. However, this function returns a condensed vector. I do not know how to recover the (i,j) index from this condensed vector.

The input matrix M changes and need to recompute the closest pair in every iteration. So, I want the whole process to be as efficient as possible.

Any suggestions about solving this efficiently?

flyaway · December 20, 2019, 8:37pm

I solved this by recovering the (i,j) index from the k index in the functional.pdist.
The code:

def get_index(k, n):
    p = n-1
    kk = k
    j = 1
    i = 0
    while kk >= p:
        kk = kk-p
        p = p-1
        j += 1
        i += 1
    return i, j+kk

KFrank · December 20, 2019, 10:13pm

Hi Fly!

In practice, for reasonable values of n and d, I expect that
pdist will do about as well as you can.

However, the so-called computational cost of your pdist
solution is a suboptimal O(d * n^2). If you search on
closest-pair problem, you will see that your problem can
be solved in O(d * n * log n) time. However, these
“faster” algorithms don’t vectorize naturally, so you give
up much of the benefit of using a gpu (and even pipelines
in a cpu). Thus your “naive” pdist solution may well run
faster (even as it requires more operations).

Good luck.

K. Frank