Is All Pairs Shortest Path with PyTorch / CUDA possible?

I don’t know enough about the implementation details to know whether this is a good idea, but would the following be possible / feasible / efficient?

Imagine you have a big graph G = (V, E), and you want to find the shortest path from a to b for all a, b in V. Since this is embarrassingly parallel and splits into a huge number of independent computations, is there a known way to use PyTorch/CUDA to find all these paths?

In my opinion, it is possible, but not particularly feasible or efficient.

Textbook algorithms for all-pairs shortest paths use either a dynamic-programming approach or a specialized form of matrix multiplication: https://en.wikipedia.org/wiki/Floyd–Warshall_algorithm

In both cases, you work with matrices/tensors, so it is doable with PyTorch (see the sketch below).
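
For illustration, here is a minimal Floyd–Warshall sketch using plain PyTorch broadcasting. The graph and the device selection are just a toy example I made up, not a tuned implementation:

```python
import torch

def floyd_warshall(adj: torch.Tensor) -> torch.Tensor:
    """Floyd–Warshall on a dense adjacency matrix.

    adj[i, j] holds the weight of edge i -> j, inf where there is no edge,
    and 0 on the diagonal. Runs on whatever device the tensor lives on.
    """
    dist = adj.clone()
    n = dist.shape[0]
    for k in range(n):
        # Relax all (i, j) pairs through intermediate vertex k in one batched op.
        dist = torch.minimum(dist, dist[:, k].unsqueeze(1) + dist[k, :].unsqueeze(0))
    return dist

# Toy example: 4-node directed graph, on the GPU if one is available.
inf = float("inf")
adj = torch.tensor([
    [0.0, 3.0, inf, 7.0],
    [8.0, 0.0, 2.0, inf],
    [5.0, inf, 0.0, 1.0],
    [2.0, inf, inf, 0.0],
], device="cuda" if torch.cuda.is_available() else "cpu")

print(floyd_warshall(adj))
```

Each of the n relaxation steps is a single O(n²) tensor operation, so the GPU only helps within a step; the outer loop over k stays sequential.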

But you don’t need gradients or computation graphs to solve this problem, which makes PyTorch a less natural and less efficient fit.

A one-stop library for graph algorithms is NetworkX:
https://networkx.github.io/documentation/stable/reference/algorithms/shortest_paths.html
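
A minimal usage sketch (the graph and edge weights here are made up for illustration):

```python
import networkx as nx

# Build a small weighted graph.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 3.0),
    ("b", "c", 2.0),
    ("a", "c", 7.0),
    ("c", "d", 1.0),
])

# All-pairs shortest path lengths (Dijkstra from every source).
lengths = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
print(lengths["a"]["d"])  # 3 + 2 + 1 = 6

# Or recover the actual paths.
paths = dict(nx.all_pairs_dijkstra_path(G, weight="weight"))
print(paths["a"]["d"])  # ['a', 'b', 'c', 'd']
```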

If your graphs don’t fit in memory, take a look at Spark GraphX: https://spark.apache.org/graphx/. It is specifically designed to distribute graphs across machines and to support Pregel-style (message-passing) algorithms.


I know this is an old question, but it shows up as one of the first results on Google, so I will correct the existing answer.

Computing APSP on the GPU is possible, feasible, and very efficient, and it is a well-researched topic. There are dedicated algorithms for this, most prominently the R-Kleene all-pairs shortest-paths algorithm; see [1] and [2]. It uses a parallelized, in-place version of fast matrix-multiplication routines, rendering the APSP problem as a set of recursive matrix-matrix multiplications over a tropical closed semiring. The basic version requires keeping the full distance matrix in memory, but it is extremely fast, orders of magnitude faster than typical CPU-based methods. In [3] and [4] the authors show that if you use condensed adjacency lists as the graph representation together with a BFS-style algorithm, you can scale further, at the cost of random memory access and overall slower computation. There are also parallelized variants of the Floyd-Warshall algorithm that work well on GPUs if implemented carefully.
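
To make the semiring idea concrete, here is a minimal PyTorch sketch of (min, +) matrix multiplication with repeated squaring. This is my own illustration of the general technique, not an implementation of R-Kleene or of any of the papers below; each product materializes an O(n³) intermediate tensor, so it only scales to modest graph sizes:

```python
import torch

def min_plus_matmul(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """(min, +) product over the tropical semiring: C[i, j] = min_k A[i, k] + B[k, j]."""
    # Broadcast to shape (n, n, n) indexed [i, k, j], then reduce over k.
    return (A.unsqueeze(2) + B.unsqueeze(0)).amin(dim=1)

def apsp_min_plus(adj: torch.Tensor) -> torch.Tensor:
    """Repeated squaring: after t squarings, dist covers paths of up to 2**t edges."""
    n = adj.shape[0]
    dist = adj.clone()
    steps = 1
    while steps < n - 1:
        dist = min_plus_matmul(dist, dist)
        steps *= 2
    return dist

# Toy 3-node directed graph, on the GPU if one is available.
inf = float("inf")
adj = torch.tensor([[0.0, 3.0, inf],
                    [inf, 0.0, 2.0],
                    [1.0, inf, 0.0]],
                   device="cuda" if torch.cuda.is_available() else "cpu")
print(apsp_min_plus(adj))
```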

As far as I’m aware, none of those papers have publicly available PyTorch implementations. There are, however, CUDA-based implementations on GitHub using several of these approaches.

[1] D’Alberto, Paolo, and Alexandru Nicolau. “R-Kleene: A high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks.” Algorithmica 47 (2007): 203-213.
[2] Buluç, Aydın, John R. Gilbert, and Ceren Budak. “Solving path problems on the GPU.” Parallel Computing 36.5-6 (2010): 241-253.
[3] Czech, Wojciech, and David A. Yuen. “Efficient graph comparison and visualization using GPU.” 2011 14th IEEE International Conference on Computational Science and Engineering. IEEE, 2011.
[4] Harish, Pawan, and P. J. Narayanan. “Accelerating large graph algorithms on the GPU using CUDA.” International Conference on High Performance Computing. Berlin, Heidelberg: Springer, 2007.
