How Efficient Is Sparse Matrix Computation on GPU?

Hi,

PyTorch supports a few sparse matrix operations such as spmm. In principle, sparsity reduces the complexity of matrix computation, so on CPU the sparse routines are generally faster than their dense counterparts. On GPUs, however, these sparse operations are difficult to parallelize efficiently. In particular, the post Backprop Through Sparse Tensor Is Not Memory Efficient? already shows that their memory usage can be as large as that of the dense implementation.
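For concreteness, this is the kind of spmm call I have in mind (a minimal sketch using the COO layout; the shapes and values are just placeholders):

```python
import torch

# A small sparse matrix in COO format: indices are (row, col) pairs.
i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(i, v, (2, 3))

d = torch.randn(3, 4)
out = torch.sparse.mm(s, d)  # sparse @ dense -> dense result of shape (2, 4)
```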

So I am wondering about their performance on GPU (both speed and memory), and whether I should use the sparse implementations when I encounter sparse matrices. I can provide details about my use case (matrix size, sparsity level, etc.) if that helps. I would also appreciate pointers to any empirical studies comparing the dense and sparse functions in PyTorch. Thanks.
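In case it helps clarify the question, this is roughly the comparison I have in mind (a rough sketch with made-up size and density; it assumes a CUDA device is available):

```python
import time
import torch

n, density = 4096, 0.01  # placeholder size and sparsity level
device = "cuda"

# Random 0/1 matrix with ~1% nonzeros, plus its sparse COO copy.
dense = (torch.rand(n, n, device=device) < density).float()
sparse = dense.to_sparse()
rhs = torch.randn(n, n, device=device)

def bench(fn, label):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    t0 = time.time()
    fn()
    torch.cuda.synchronize()
    print(f"{label}: {time.time() - t0:.4f} s, "
          f"peak mem {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")

bench(lambda: torch.mm(dense, rhs), "dense mm")
bench(lambda: torch.sparse.mm(sparse, rhs), "sparse mm")
```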