Cdist for a large number of pairs

Hi,

This is a very generic question: I need to calculate the pairwise distance matrix for a large vector (roughly 150,000 entries), which of course already runs out of RAM just preallocating the result, and the calculation time will probably be high too.

Is there a native way in pytorch to calculate it, like, “by batch”, or similar? (ignoring for now the fact that the variable holding the resulting matrix will also be too big)

Thanks in advance!

Hi Ghostv!

As you note, you won’t be able to fit the final result in any but the largest of gpu memories.
Assuming that you store the resulting distances as four-byte floats, the full 150,000 x 150,000
matrix that cdist() returns would take about 90 GBytes (150000**2 * 4 bytes), or about 45 GBytes
if you only keep the unique pairs, just for itself.

If you’re performing such a computation many times over, it could become overly costly in
terms of time, but computing it just once should be well within reach of a gpu.

I’m not aware of any “batch” version of torch.cdist() that is built into pytorch. But it is unlikely
that any such scheme would give you any computation-time speed-up (short of distributing the
batches across multiple gpus), as one assumes that torch.cdist() is already well optimized.

Depending on what you are going to do with the result, you could avoid memory constraints
by batching the computation by hand.

As a hypothetical, if all you need is, say, the pair of points that are closest to one another, you
could, for example, break your vector up into 150 chunks, each with 1000 elements. You would
perform about 150**2 / 2 cdist() computations – quite manageable – on pairs of length-1000
vectors – also quite manageable. Store the minimum-distance pair of points from the current
cdist() computation, release the memory occupied by the cdist() result, and go on to the
next pair of chunks. You can either keep track of the minimum-distance point pair as you go
along, or store them all and extract the minimum-distance point pair after you’ve completed all
of the cdist() computations.
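To make that concrete, here is a minimal sketch of the hand-batched closest-pair search. It is purely illustrative: closest_pair_chunked, points, and chunk_size are made-up names, not anything built into pytorch.

```python
import torch

# Illustrative sketch: find the closest pair of points without ever
# materializing the full (n, n) distance matrix. Only one (chunk, chunk)
# block of distances is alive at any time.
def closest_pair_chunked(points: torch.Tensor, chunk_size: int = 1000):
    n = points.shape[0]
    best_dist = float("inf")
    best_pair = (None, None)
    for i in range(0, n, chunk_size):
        a = points[i:i + chunk_size]
        # distances are symmetric, so only chunk pairs with j >= i are needed
        for j in range(i, n, chunk_size):
            b = points[j:j + chunk_size]
            d = torch.cdist(a, b)                 # (len(a), len(b)) block
            if i == j:
                d.fill_diagonal_(float("inf"))    # ignore zero self-distances
            val = d.min()
            if val < best_dist:
                best_dist = val.item()
                r, c = divmod(d.argmin().item(), d.shape[1])
                best_pair = (i + r, j + c)        # indices into the original tensor
            # d goes out of scope here, so its memory can be reused
    return best_dist, best_pair
```

For 150,000 points and a chunk size of 1000 this corresponds to the roughly 150**2 / 2 cdist() calls described above, while never holding more than a single 1000 x 1000 block of distances.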

This way you would avoid the memory limitations, but you have to “batch” whatever it is that
you do with the result of the cdist() computation, as well as the cdist() computation itself.

Best.

K. Frank


Thanks for the answer, mate! And yeah, I was trying to avoid that approach in favour of a more standard logic, but I guess I’ll have to do it another way.

What I’m trying to do is to find, from a point cloud (an (n, 3) array), all the points that have at least one other point within a defined distance, and then either build a concave hull out of all of them, or cluster them first and then hull them. It’s basically a DBSCAN that I was trying to optimize.

For a small number of points, a cdist + where + a loop clusters them perfectly and almost instantly, but for large datasets I can’t get through the distance calculation.
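For reference, the small-scale version is roughly the following (a simplified sketch of the cdist + where + loop idea; eps and the names are just placeholders):

```python
import torch

# Simplified sketch of the small-scale approach: build a radius adjacency
# from the full distance matrix, then grow clusters from it (a crude
# DBSCAN-like grouping). Only feasible while the (n, n) matrix fits in memory.
def cluster_small(points: torch.Tensor, eps: float) -> torch.Tensor:
    d = torch.cdist(points, points)       # (n, n) pairwise distances
    adj = d < eps                         # boolean "within eps" adjacency
    adj.fill_diagonal_(False)
    n = points.shape[0]
    labels = torch.full((n,), -1, dtype=torch.long)
    current = 0
    for seed in range(n):
        if labels[seed] != -1 or not adj[seed].any():
            continue                      # already labeled, or isolated point
        frontier = torch.tensor([seed])
        while frontier.numel() > 0:
            labels[frontier] = current
            # unlabeled points adjacent to anything in the current frontier
            nbrs = adj[frontier].any(dim=0) & (labels == -1)
            frontier = nbrs.nonzero(as_tuple=True)[0]
        current += 1
    return labels                         # -1 marks points with no neighbor in range
```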

What would be a good approach to achieve this?

Thanks in advance!

Not sure if the following helps you.

Since you mentioned a 3D point cloud, maybe Open3D functions could fit your use case?

E.g. remove_radius_outliers(nb_points: int, search_radius: float)

compute_convex_hull(joggle_inputs: bool = False)
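A rough sketch of how those two calls might be chained (untested; the snippet below uses the legacy open3d.geometry API, where the method is spelled remove_radius_outlier, while the remove_radius_outliers / search_radius signature quoted above belongs to the newer tensor API, so please check the docs for your version):

```python
import numpy as np
import open3d as o3d

# Rough, unverified sketch: drop points that have no other point within
# `radius`, then take the convex hull of what remains.
pts = np.random.rand(150_000, 3)          # stand-in for the real (n, 3) cloud
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(pts)

# keep only points with at least nb_points neighbors inside `radius`
filtered, kept_idx = pcd.remove_radius_outlier(nb_points=2, radius=0.01)

hull_mesh, hull_point_idx = filtered.compute_convex_hull()
```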

Hi Ghostv!

First, a general comment: With modern hardware and things like pytorch, keeping the gpu
and / or cpu floating-point pipelines full can be more important for reducing computation
time than reducing the number of floating-point operations.

With that in mind, let me start with some suggestions that likely won’t be helpful for you.

Rather than compute the distance of each point from every other point (and then select for
points that have at least one nearby neighbor), you could sequentially compare a given
point with the other points, but stop if and when you find a point that is nearby. If many
points have many nearby neighbors this will “short-circuit” many of your floating-point
operations. (If most points have only a few nearby neighbors, it won’t do you much good.)
Getting fancier, let’s say that one unit is your nearness threshold. If point A is ten units from
point B, and point B is three units from point C, then points A and C can be no closer than
seven units from one another (by the triangle inequality), so you know that they don’t meet your nearness threshold,
even without computing the distance.

The problem is that such schemes use a fair amount of if-then programming and will
interrupt the flow of your floating-point pipelines.
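Just to show what I mean by short-circuiting (a deliberately naive, loop-heavy sketch; the data-dependent break is exactly the branching that starves the pipelines):

```python
import torch

# Illustrative only: per-point early-exit neighbor check. The inner loop
# stops as soon as one neighbor within eps is found, which saves
# floating-point work but relies on data-dependent branching.
def has_near_neighbor(points: torch.Tensor, eps: float) -> torch.Tensor:
    n = points.shape[0]
    flags = torch.zeros(n, dtype=torch.bool)
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            if torch.dist(points[i], points[j]) < eps:
                flags[i] = True
                break                     # short-circuit: one neighbor is enough
    return flags
```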

Now let’s look at a scheme that could work (depending on the details of how the points
in your point cloud are distributed).

Sort (descending) the order of the n dimension of your [n, 3] tensor by the x coordinates
of the points. (By this I mean that cloud_tensor[8, 0] is the x coordinate of the index-8
point in the cloud.) Think of the sorted cloud_tensor as consisting of 150 chunks of 1000
points each. If the last (and therefore the smallest) x value in chunk 1 is, say, 2 greater than
the first (and therefore largest) x in chunk 3, you only have to compute cdist() for chunk 1
with chunks 1 and 2. (Of course this will only be helpful if your points are mostly spread out
relative to your nearness threshold. If most points are nearby most other points, then this logic
would have you computing cdist() for most chunks with most other chunks, offering little
savings.)

In some sense, such a scheme is applying the if-then code to how batches are related to one
another and then running the floating-point code on batches that are large enough to keep the
pipelines full.
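As a concrete (and hypothetical) sketch of that scheme, applied to the “which points have at least one neighbor within eps” question; the chunk size, eps, and the function name are placeholders:

```python
import torch

# Sketch of the sort-by-x chunking scheme: after sorting on x (descending),
# a chunk only has to be compared against chunks whose largest x is within
# eps of its smallest x. Distant chunk pairs are skipped wholesale, without
# any per-point branching inside the cdist() calls.
def near_flags_sorted(cloud: torch.Tensor, eps: float, chunk: int = 1000) -> torch.Tensor:
    order = torch.argsort(cloud[:, 0], descending=True)
    pts = cloud[order]                                   # x-sorted copy
    n = pts.shape[0]
    flags = torch.zeros(n, dtype=torch.bool)
    for a in range(0, n, chunk):
        xa_min = pts[min(a + chunk, n) - 1, 0]           # smallest x in chunk a
        for b in range(a, n, chunk):                     # symmetric, so b >= a suffices
            if pts[b, 0] < xa_min - eps:
                break        # this chunk (and every later one) is too far away in x alone
            d = torch.cdist(pts[a:a + chunk], pts[b:b + chunk])
            if a == b:
                d.fill_diagonal_(float("inf"))           # ignore self-distances
            near = d < eps
            flags[a:a + chunk] |= near.any(dim=1)
            flags[b:b + chunk] |= near.any(dim=0)
    out = torch.empty_like(flags)
    out[order] = flags                                   # undo the sort
    return out
```

Whether this saves much depends, as above, on how spread out the cloud is along x relative to eps.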

Best.

K. Frank

I already thought of something like that, but yeah, at the end of the day it comes down to knowing in advance how to pre-cluster the data, and whether the data is already pre-clustered at that distance.

I’m gonna take a look at Open3D, but I was hoping for a “pure torch” solution (I always try to abstract the problems as much as I can).