The Universal Sparse Tensor

We are excited to introduce a Beta version of the Universal Sparse Tensor (UST) in NVIDIA nvmath-python:

Besides the online documentation, we also posted two blog postings on the UST:

The UST uses a tensor format DSL (Domain Specific Language) to describe how the sparse tensor should be represented and uses type polymorphism on a small set of base operations to define the vast space of instances for these operations. Developers merely focus on the sparsity of a tensor. The implementation in nvmath-python provides (1) zero-cost interoperability with PyTorch and (2) transparent injection into PyTorch models.

Regarding (1), a PyTorch tensor can be converted to a UST tensor and back without data movement or copying. Once converted, all the UST features become available to the original PyTorch tensor, as illustrated below.

from nvmath.sparse.ust import Tensor

a = torch.eye(10, 10, dtype=torch.float64).to_sparse_csr()

u = Tensor.from_package(a)    # u is now a UST representation of a
print(u)
# ...
# format : [i, j] -> (i: <LevelFormat.DENSE>, j: <LevelFormat.COMPRESSED>)
# ...
# ... you can now use all the UST features, like printing, drawing, and
#     polymorphic matmul with planning/execution phase...

a = u.to_package()   # back in PyTorch land

However, to avoid having to deal with the UST explicitly, the implementation also provides (2) transparent injection into PyTorch using the TorchUST class, which behaves like a PyTorch tensor, but transparently use an UST implementation for the actual operations (if implemented). This is illustrated below:

import torch
from nvmath.sparse.ust.interfaces.torch_interface import TorchUST

# Construct two torch vectors.
x = (1.0 + torch.arange(32)).cuda()
y = (2.0 + torch.arange(32)).cuda()

# Perform a dot product.
z = torch.dot(x, y)

# Convert x to UST wrapped as TorchUST object
# (yes, the UST also includes dense vectors or tensors!)
x = TorchUST.from_torch(x)

# Now perform the same dot production operation.
# It transparently uses the UST implementation!
z = torch.dot(x, y)

A convenience method is provided that allows injecting the UST into all weight matrices of a model as follows.

weights = torchvision.models.get_model_weights(model_name).DEFAULT
model = torchvision.models.get_model(model_name, weights=weights)
model.to(device)
model.eval()
…
reformat_model(model, func=reformat)
…                                      
with torch.inference_mode():
    prediction = model(batch)

Here the reformatting function is written by the user and has the following form (note that we could even prune inside this function, but it is more common to use a pruning frameworks like torch.nn.utils.prune in combination with fine-tuning for accuracy before calling the reformat method to prepare inference). The UST now also allows for conversions into formats that are not supported by sparse PyTorch yet, such as diagonal forms or other novel formats. Hopefully this enables research into exploiting structure in weight matrices that were previously left unexplored.

def reformat(weight):
     # Inspect sparsity.
     nel = weight.numel()
     nnz = torch.count_nonzero(weight)
     sparsity = (1.0 - float(nnz) / float(nel)) * 100.0
     if sparsity >= sparse_threshold:
         # … pick suitable format for weight …
         return TorchUST.from_torch(weight)
    return None

Note that the UST is still in Beta, so a lot of functionality is still missing. But hopefully this is a first step towards a long objective of mine to treat sparsity as a property of tensors, and not as a tedious implementation detail. Let us know your thoughts and please send us ideas for improvements!

tensorm

Very interesting (in particular the first blog post you linked). IMO this seems like the future of index-based sparsity formats, if you manage to make it work on many operations and formats.

I tried it on the example of multiplying identity matrices:

A = TorchUST.from_torch(torch.eye(10, 10).to_sparse_csr())
B = TorchUST.from_torch(torch.eye(10, 10).to_sparse_csr())
C = A @ B

But obtained this error:

NotImplementedError: `TorchUST` mm: Only one operand can be sparse

I think it would be interesting to have the UST be closed under as many operations as possible, i.e. operation between 2 UST should be available and should return a UST (einsums in particular). Is it something planned, or is it out of scope?

Also, what’s the biggest use-case that you have in mind for this? Is it to encode the model parameters on sparse tensors, but still have dense activations (in forward) and gradients (in backward)?

Thank you! We here at NVIDIA are very enthusiastic moving towards a scalable, clean sparse ecosystem (and for who knows me, I have been personally working on that for a very long time now).

You are keen tester, going straight for the hard case! In this first beta, we only support “matmul” variations of the form C += AB where A is a TorchUST. For a case that is supported, either cuSPARSE or a lookup kernel is used, otherwise we JIT/LTO an on-the-fly generated kernel (right now using a very, very basic sparse emitter). But the setup is transparently cached, so the second time around you no longer pay the overhead of this. This is particularly useful when e.g. sparsifying the weights in a model for inference since, after the first call, things should run smoothly. But, to answer your actual question, ;-), yes we plan to implement much more than we currently have! And we appreciate any feedback on what to focus on first.

We want to provide the ecosystem that makes it as easy as possible for a model engineer to try out different sparsification schemes and storage formats. Right now, support is basic and geared towards weight sparsity. But we also plan to make activation sparsity easier to exploit (harder, since this is dynamic in nature, i.e. the sparse tensor needs to be composed prior to running a subsequent GEMM).

The current release is truly a Beta, with a lot of holes and performance gaps. But, we are throwing out the idea of the Universal Sparse Tensor to see if developers like it, and to get suggestions back on what to focus on first. So be nice to the current version, but we truly look forward to more feedback!

Thanks for the answers! I think I understand the intent much better now. I don’t have my laptop for the weekend but maybe I’ll try weights sparsification with this when I get back home.

I personally don’t work on model sparsification, but I got interested in sparsity for a different reason: when computing per-sample gradients (one gradient per element of the batch), you backpropagate a Jacobian that is block-diagonal (unless you have layers mixing the information from the elements of the batch, like BatchNorm).

So my use-case requires efficient representation and computations on block-diagonal tensors. We started working on such a sparsity format with my research partner @Pierre_Quinton. The main ideas are available here in case you’re interested: Sparse Latticed Tensors

I don’t think Sparse Latticed Tensors (SLTs) would be compatible with USTs, because we don’t map virtual indices to physical indices, but the opposite: we have an affine mapping from physical index to virtual index. This enables the representation of some very regular sparsity formats (diagonal, anti-diagonal, block-diagonal and many more). It also enables efficient computations, like einsums between SLTs that return another SLT. (This is probably why the first thing that came to my mind when trying USTs was to multiply two identity matrices). The thing is, we don’t even have an alpha version of this :smiling_face_with_tear:

Do you know of any work going in this direction?
And would you be interested in adding support for such regular sparsity formats in the long run?

I have to look at your link in more detail, but it sounds like this is exactly what the UST does. We represent formats, including block, dia, and antidia by mapping 2 dimensions to 2 or more levels. See below for examples:

  (i, j) -> (i / 2 : dense, j / 3 : compressed,
             i % 2 : dense, j % 3 : dense)         # BSR-Row(2,3), 2x3 blocks

And diagonals:

(i, j) -> (j-i : compressed, i : range)    # DIA-I
(i, j) -> (j-i : compressed, j : range)    # DIA-J
(i, j) -> (i+j : compressed, i : range)    # ANTI-DIA-I
(i, j) -> (i+j : compressed, j : range)    # ANTI-DIA-J

I think you’re right: the UST can represent everything that the SLT can. This is very good news!

To make it work for our use-case, we would need closure of USTs under einsum (i.e. einsum between two USTs returns a UST without densifying if not strictly needed) and, of course, high performance. If you’re willing to go in that direction, I would even be open for collaboration! We made SLTs specifically to make this possible: the structure is simple enough (affine mapping from physical index to virtual index) so that we hope a bit of lattice algebra solves all of our problems. But we’re still far from having solved everything.

Absolutely! Let’s meet 1:1 and discuss next steps. Please drop me an email (abik at …). Looking forward to that.