Autograd failing for sparse matmul at half precision

Hello. I am currently attempting to run an autocast forward and backward pass over a mostly sparse model (each sparse weight is used only in torch.sparse.mm() calls). However, if I store the weights in COO format, matrix multiplication simply isn't implemented at half precision, and with CSR the backward fails with an unhelpful error:

RuntimeError: sampled_addmm: Expected mat1 and mat2 to have the same dtype, but got Half and Float

Is there anything I can do to utilize autocast with this setup? Note that full-precision sparse training and half-precision dense training both work.

A simple reproduction of the printed error is below.

import torch

# Sparse CSR weight: random matrix with most entries zeroed out, converted to CSR
weight = torch.rand(5, 4, device="cuda", dtype=torch.float32)
weight = (weight * (weight < 0.2)).to_sparse_csr().requires_grad_(True)

# Dense full-precision input
inp = torch.rand(4, 3, device="cuda", dtype=torch.float32, requires_grad=False)

with torch.autocast("cuda", dtype=torch.float16, enabled=True):
    loss = torch.sparse.mm(weight, inp).sum()

loss.backward()

With autocast's enabled=False, the snippet works fine. If the weight is stored in COO format instead, it already fails on the matmul in the forward pass.
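The only stopgap I can think of is to locally disable autocast around the sparse matmul so that it always runs in float32, which of course defeats the purpose for the sparse layers. A rough sketch of that idea (I have only tried it on the toy example above):

import torch

# Same weight and input as in the repro above
weight = torch.rand(5, 4, device="cuda", dtype=torch.float32)
weight = (weight * (weight < 0.2)).to_sparse_csr().requires_grad_(True)
inp = torch.rand(4, 3, device="cuda", dtype=torch.float32, requires_grad=False)

with torch.autocast("cuda", dtype=torch.float16, enabled=True):
    # ... dense half-precision ops would go here ...
    with torch.autocast("cuda", enabled=False):
        # Run the sparse matmul with autocast disabled; cast any half-precision
        # activations back to float32 before handing them to torch.sparse.mm().
        out = torch.sparse.mm(weight, inp.float())
    loss = out.sum()

loss.backward()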

cc @ptrblck

Sparse CSR support is experimental, so this issue might be expected. Could you create a GitHub issue for this error, assuming you are still seeing it in the latest nightly release, please?

The error does still occur in the latest nightly release - sure thing, I'll open an issue.