Manually calculate the gradient of a sparse matrix

I’m trying to calculate a gradient w.r.t. a sparse matrix. It seems that PyTorch’s autograd doesn’t support gradients for sparse matrices, so I want to calculate it manually if possible.

The forward function is softmax(A*AXW). A is a sparse matrix, and I want to calculate the gradient w.r.t. A.


I think it’d be easier to read if you wrote it as PyTorch code; I cannot quite make out what A, X, and W are or what exact expression you have in the softmax.

In general, for a matrix multiplication Y = A @ B followed by some scalar-valued function f(Y), the gradients are df/dA = (df/dY) @ B.t() and df/dB = A.t() @ (df/dY). In PyTorch, df/dY is often called grad_out for the matrix multiplication, and this is where you see backpropagation at work…
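
As a quick sanity check of these formulas, here is a minimal sketch that compares them with autograd on small dense matrices. The shapes and the choice f(Y) = sum(Y ** 2) are arbitrary assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
A = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
B = torch.randn(3, 5, dtype=torch.double, requires_grad=True)

Y = A @ B
# Any scalar-valued f works; here f(Y) = sum(Y ** 2), so df/dY = 2 * Y.
f = (Y ** 2).sum()
f.backward()

# Apply the formulas by hand, outside of autograd.
grad_out = (2 * Y).detach()                  # df/dY
manual_grad_A = grad_out @ B.detach().t()    # df/dA = (df/dY) @ B.t()
manual_grad_B = A.detach().t() @ grad_out    # df/dB = A.t() @ (df/dY)

matches_A = torch.allclose(A.grad, manual_grad_A)
matches_B = torch.allclose(B.grad, manual_grad_B)
print(matches_A, matches_B)
```

Note that df/dA has the shape of A but is dense in general, whatever the sparsity of A.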

Best regards


Sorry, I think I was unclear.
The actual code goes like this:

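import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Module, Parameter


class GCN(nn.Module):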
    def __init__(self, nfeat, nhid1, nhid2, nhid3, nclass, dropout, with_relu=True):
        super(GCN, self).__init__()

        self.gc1 = GraphConvolution(nfeat, nhid1)
        self.gc2 = GraphConvolution(nhid1, nclass)
        self.dropout = dropout
        self.with_relu = with_relu

    def forward(self, x, adj):
        axw, w0 = self.gc1(x, adj)
        # Feed the first layer's output (not the raw input x) into the second layer.
        aaxw0w1, w1 = self.gc2(axw, adj)
        return F.log_softmax(aaxw0w1, dim=1)


class GraphConvolution(Module):
    """Simple GCN layer, similar to"""

    def __init__(self, in_features, out_features, bias=False):
        super(GraphConvolution, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.FloatTensor(in_features, out_features))
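        if bias:
            self.bias = Parameter(torch.FloatTensor(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        # Uniform initialization scaled by the output dimension.
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)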

    def forward(self, input, adj):
        support = torch.mm(input, self.weight)
        output = torch.spmm(adj, support)
        return output, self.weight

    def __repr__(self):
        return self.__class__.__name__ + ' (' \
               + str(self.in_features) + ' -> ' \
               + str(self.out_features) + ')'

In my neural network, X is the input, and adj is used in both the hidden layer and the output layer. I want to get the gradient w.r.t. adj.

Thank you!

Oh, it seems that torch.spmm is the confusing part.
With torch.sparse.mm, you can have gradients all right.

a = torch.randn(2, 3).to_sparse().requires_grad_(True)
b = torch.randn(2, 3, requires_grad=True)
torch.sparse.mm(a, b.t()).sum().backward()

Best regards


Yes, this works!

However, when I tried it with a large sparse matrix, computing the gradient takes a very long time. Is there a way around this problem?


What’s large here?

The problem is likely (and ha, now the formula is useful after all) that the computation df/dA = (df/dY) @ B.t() involves only dense matrices, and in general the gradient does not have the same sparsity pattern as A. What PyTorch does under the hood is compute the dense derivative (which is large) and then apply the sparsity pattern.
You could try to implement your own sparse_mm with a backward pass using scatter_add from the third-party PyTorch Scatter package or some such.

Note that the sparse derivative “enforces” the sparsity pattern (this is often desired, even though there may be nonzero gradients outside the mask).
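
One way to avoid the dense intermediate is to evaluate the gradient only at the stored entries of A. The following is a minimal sketch under stated assumptions, not PyTorch's own implementation: SparseDenseMM is a hypothetical helper name, A is passed as COO indices and values rather than as a sparse tensor, and the built-in index_add_ stands in for scatter_add from the third-party package.

```python
import torch


class SparseDenseMM(torch.autograd.Function):
    """Y = A @ B with A given as COO (indices, values). Hypothetical helper."""

    @staticmethod
    def forward(ctx, indices, values, n_rows, B):
        rows, cols = indices[0], indices[1]
        ctx.save_for_backward(indices, values, B)
        # Y[i] += A[i, j] * B[j] for every stored entry (i, j).
        Y = B.new_zeros(n_rows, B.size(1))
        Y.index_add_(0, rows, values.unsqueeze(1) * B[cols])
        return Y

    @staticmethod
    def backward(ctx, grad_out):
        indices, values, B = ctx.saved_tensors
        rows, cols = indices[0], indices[1]
        # df/dA[i, j] = <grad_out[i], B[j]>, evaluated only at stored entries.
        grad_values = (grad_out[rows] * B[cols]).sum(dim=1)
        # df/dB = A.t() @ grad_out, accumulated entry by entry.
        grad_B = torch.zeros_like(B)
        grad_B.index_add_(0, cols, values.unsqueeze(1) * grad_out[rows])
        # The integer indices and n_rows receive no gradient.
        return None, grad_values, None, grad_B


# Small check against dense autograd (toy data, chosen for illustration).
torch.manual_seed(0)
dense_A = torch.tensor([[1., 0., 2.], [0., 3., 0.]], dtype=torch.double)
indices = dense_A.nonzero().t()  # shape (2, nnz)
values = dense_A[indices[0], indices[1]].clone().requires_grad_(True)
B = torch.randn(3, 4, dtype=torch.double, requires_grad=True)

Y = SparseDenseMM.apply(indices, values, dense_A.size(0), B)
Y.sum().backward()

# Reference: dense gradient, then keep only entries in the sparsity pattern.
A_ref = dense_A.clone().requires_grad_(True)
B_ref = B.detach().clone().requires_grad_(True)
(A_ref @ B_ref).sum().backward()
expected_values = A_ref.grad[indices[0], indices[1]]
```

The backward pass here allocates an (nnz, d) intermediate instead of an (n, n) one, so its cost scales with the number of stored entries times the feature dimension. As with the built-in sparse gradient, entries outside the pattern are never computed.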

Best regards


By large, I meant a sparse matrix with shape (100k, 100k).
I’ll try to implement my own function as you explained.

thank you!

Have you implemented your own function for this?