Manually calculate the gradient of a sparse matrix

Hi,
I’m trying to calculate a gradient w.r.t. a sparse matrix. It seems like PyTorch’s autograd doesn’t support getting the gradient for a sparse matrix, so I want to calculate it manually if that’s possible.

The forward function is softmax(A*AXW), where A is a sparse matrix, and I want to calculate the gradient w.r.t. A.

thanks!

I think it’d be easier to read if you wrote it out as PyTorch code; I cannot quite make out what A, X, and W are, and what exact expression you have inside the softmax.

In general, for a matrix multiplication Y = A @ B followed by some scalar-valued function f(Y), the gradients are df/dA = (df/dY) @ B.t() and df/dB = A.t() @ (df/dY). In PyTorch, df/dY is sometimes called grad_out for the matrix multiplication, and there you see backpropagation at work…
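For illustration, a quick check of those formulas with small dense tensors, using f = Y.sum() so that df/dY is all ones:

import torch

A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(4, 5, requires_grad=True)
Y = A @ B
Y.sum().backward()

grad_out = torch.ones_like(Y)  # df/dY for f = Y.sum()
assert torch.allclose(A.grad, grad_out @ B.t())
assert torch.allclose(B.grad, A.t() @ grad_out)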

Best regards

Thomas

Sorry, I think I was being unclear.
The actual code goes like this:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Module, Parameter


class GCN(nn.Module):
    def __init__(self, nfeat, nhid1, nhid2, nhid3, nclass, dropout, with_relu=True):
        super(GCN, self).__init__()

        self.gc1 = GraphConvolution(nfeat, nhid1)
        self.gc2 = GraphConvolution(nhid1, nclass)
        self.dropout = dropout
        self.with_relu = with_relu

    def forward(self, x, adj):
        axw, w0 = self.gc1(x, adj)        # A @ X @ W0
        aaxw0w1, w1 = self.gc2(axw, adj)  # A @ (A @ X @ W0) @ W1
        return F.log_softmax(aaxw0w1, dim=1)

and

class GraphConvolution(Module):
    """
    Simple GCN layer, similar to https://arxiv.org/abs/1609.02907
    """

    def __init__(self, in_features, out_features, bias=False):
        super(GraphConvolution, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.FloatTensor(in_features, out_features))
        if bias:
            self.bias = Parameter(torch.FloatTensor(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input, adj):
        support = torch.mm(input, self.weight)  # X @ W (dense)
        output = torch.spmm(adj, support)       # A @ (X @ W), sparse-dense matmul
        return output, self.weight

    def __repr__(self):
        return self.__class__.__name__ + ' (' \
               + str(self.in_features) + ' -> ' \
               + str(self.out_features) + ')'

In my neural network, X is the input, adj is used in both the hidden layer and the output layer, and I want to get the gradient w.r.t. adj.

Thank you!

Oh, it seems that torch.spmm is the confusing part.
With torch.sparse.mm, you can get gradients all right.

a = torch.randn(2, 3).to_sparse().requires_grad_(True)
b = torch.randn(2, 3, requires_grad=True)
torch.sparse.mm(a, b.t()).sum().backward()
# a.grad is a sparse tensor with the same indices as a

Best regards

Thomas

Yes, this works!

However, when I tried it with a large sparse matrix, it takes a very long time to compute the gradient. Is there a way to work around this problem?

thanks.

What’s large here?

The problem is likely (and ha, now the formula is useful after all) that the computation df/dA = (df/dY) @ B.t() is done entirely with dense matrices, and you don’t, in general, get the same sparsity pattern in the gradient. What PyTorch does under the hood is compute the dense derivative (which is large) and then apply the sparsity pattern.
You could try to implement your own sparse_mm with a custom backward, using scatter_add from the third-party torch_scatter package or some such.

Note that the sparse derivative “enforces” the sparsity pattern (this is often desired, even though there may be gradients outside the mask).
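For illustration, here is a minimal sketch of such a custom function (the name MaskedSpMM and the COO indices/values interface are my own choices for this example, and it uses plain advanced indexing rather than torch_scatter). The point is that the backward only computes the gradient entries at A’s nonzero positions instead of materializing the full dense df/dA:

import torch

class MaskedSpMM(torch.autograd.Function):
    # sparse (COO) @ dense matmul whose backward only computes
    # gradient entries at the sparsity pattern of the sparse input
    @staticmethod
    def forward(ctx, indices, values, shape, dense):
        a = torch.sparse_coo_tensor(indices, values, shape)
        ctx.save_for_backward(indices, values, dense)
        ctx.shape = shape
        return torch.sparse.mm(a, dense)

    @staticmethod
    def backward(ctx, grad_out):
        indices, values, dense = ctx.saved_tensors
        row, col = indices
        # df/dA = grad_out @ dense.t(); only the entries (row, col) that
        # exist in A are needed: sum_k grad_out[row, k] * dense[col, k]
        grad_values = (grad_out[row] * dense[col]).sum(dim=-1)
        # df/dB = A.t() @ grad_out (build the transpose by swapping the indices)
        a_t = torch.sparse_coo_tensor(indices.flip(0), values,
                                      (ctx.shape[1], ctx.shape[0]))
        grad_dense = torch.sparse.mm(a_t, grad_out)
        return None, grad_values, None, grad_dense

# usage: the gradient lives only on the nnz values, no (n, n) dense matrix is built
indices = torch.tensor([[0, 1, 2], [1, 2, 0]])
values = torch.randn(3, requires_grad=True)
dense = torch.randn(3, 4, requires_grad=True)
MaskedSpMM.apply(indices, values, (3, 3), dense).sum().backward()
print(values.grad.shape)  # torch.Size([3])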

Best regards

Thomas

By large, I meant a sparse matrix with shape (100k, 100k).
I’ll try to implement my own function as you explained.

thank you!

Have you implemented your own function for this?

Hi, did you mean that only the non-zeros of the original sparse matrix have a gradient in the dense matrix df/dA = (df/dY) @ B.t()? I mean, does the dense matrix df/dA have the same sparsity pattern as A? Thanks

Mathematically, it does not. To the mathematical gradient, the “missing” entries in the sparsity pattern are just things that happen to be zero. The computation

a = torch.sparse_coo_tensor(torch.tensor([[1, 2], [0, 1]]), torch.randn(2), size=(3, 3), requires_grad=True)
b = torch.randn(2, 3, requires_grad=True)
torch.sparse.mm(a, b.t()).sum().backward()

gives you an a.grad with the same sparsity pattern as a, so it essentially computes the gradient of a masked tensor (because mathematically, the gradient of a * mask w.r.t. a is the mask).
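To make the “mask” point concrete, you can compare the sparse gradient with the full dense gradient (this comparison is just for illustration):

a = torch.sparse_coo_tensor(torch.tensor([[1, 2], [0, 1]]), torch.randn(2),
                            size=(3, 3), requires_grad=True)
b = torch.randn(2, 3, requires_grad=True)
torch.sparse.mm(a, b.t()).sum().backward()

# the full dense gradient would be grad_out @ b = ones(3, 2) @ b,
# which is generally nonzero everywhere...
dense_grad = torch.ones(3, 2) @ b.detach()
# ...but a.grad only keeps the entries at a's indices, positions (1, 0) and (2, 1)
print(dense_grad)
print(a.grad.to_dense())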

Best regards

Thomas

Thank you for the reply!
I’m writing my own spmm (sparse-dense matrix multiplication) kernel, following what ge-spmm did: https://github.com/hgyhungry/ge-spmm/blob/master/pytorch-custom/op.py

From their code (SPMMFunction, from line 8), I found that they only compute the gradient of the dense matrix in the backward. Do you mean df/dA is also required, i.e. that grad_out @ b.t() should be returned in the backward (line 36) as well?