Manually calculate the gradient of a sparse matrix

Kai_Eiji · June 20, 2020, 5:19pm

Hi,
I’m trying to calculate a gradient w.r.t a sparse matrix. It seems like pytorch’s autograd doesn’t support getting the gradient for sparse matrix so I want to calculate it manually if it’s possible.

the forward function is softmax(A*AXW). A is a sparse matrix and I want to calculate the gradient w.r.t A

thanks!

tom · June 20, 2020, 5:29pm

I think it’d be easier to read if you wrote it as a PyTorch output, I cannot quite make out what A, X, and W are and what is the exact expression you have in the softmax.

In general the result of matrix multiplication Y = A @ B and some scalar-valued function f(Y) following it has gradients df/dA = (df/dY)@ B.t() , df/dB = A.t() @ (df/dY). Around PyTorch df/dY is sometimes called grad_out for the matrix multiplication and you see backpropagation at work…

Best regards

Thomas

Kai_Eiji · June 20, 2020, 6:03pm

Sorry I think I was being unclear
the actual code goes like this

class GCN(nn.Module):
    def __init__(self, nfeat, nhid1, nhid2, nhid3, nclass, dropout, with_relu=True):
        super(GCN, self).__init__()

        self.gc1 = GraphConvolution(nfeat, nhid1)
        self.gc2 = GraphConvolution(nhid1, nclass)
        self.dropout = dropout
        self.with_relu = with_relu

    def forward(self, x, adj):
        axw , w0 = self.gc1(x, adj)
        aaxw0w1, w1 = self.gc2(x, adj)
        return F.log_softmax(aaxw0w1, dim=1)

and

class GraphConvolution(Module):
    """
    Simple GCN layer, similar to https://arxiv.org/abs/1609.02907
    """

    def __init__(self, in_features, out_features, bias=False):
        super(GraphConvolution, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.FloatTensor(in_features, out_features))
        if bias:
            self.bias = Parameter(torch.FloatTensor(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input, adj):
        support = torch.mm(input, self.weight)
        output = torch.spmm(adj, support)
         return output, self.weight

    def __repr__(self):
        return self.__class__.__name__ + ' (' \
               + str(self.in_features) + ' -> ' \
               + str(self.out_features) + ')'

In my neural network X would be the input and adj is used in the hidden layer and output layer and I want to get the gradient w.r.t the adj.

Thank you!

tom · June 20, 2020, 6:31pm

Oh, it seems that torch.spmm is confusing.
With torch.sparse.mm, you can have gradients all right.

a = torch.randn(2, 3).to_sparse().requires_grad_(True)
b = torch.randn(2, 3, requires_grad=True)
torch.sparse.mm(a,b.t()).sum().backward()

Best regards

Thomas

Kai_Eiji · June 20, 2020, 7:14pm

Yes, this works!

However when I tried it with a large sparse matrix, it is taking so much time to get the gradient. Is there a way to go around this problem?

thanks.

tom · June 20, 2020, 8:45pm

What’s large here?

The problem is likely (and ha, now the formula is useful after all) that the to computation df/dA = (df/dY)@ B.t() is in all dense matrices and you don’t, in general, have the sparseness same pattern in the gradient. What PyTorch does under the hood is to compute the dense derivative (large) and then apply the sparseness pattern.
You could try to implement your own sparse_mm with backwards using scatter_add from the 3rd party PyTorch scatter package or somesuch.

Note that the sparse derivative “enforces” the sparseness pattern (this is often desired even though the there may be gradients outside the mask).

Best regards

Thomas

Kai_Eiji · June 20, 2020, 10:47pm

By large, I ment a sparse matrix with shape(100k, 100k).
I’ll try to implement my own fuction as you explained.

thank you!

haalim · September 14, 2020, 12:50pm

Have you implemented your own function for this?

Jiuhnny · January 17, 2024, 7:30pm

Hi, did you mean only non-zeros of the original sparse matrix have the gradient of the dense matrix df/dA = (df/dY)@ B.t() ? I mean, does the dense matrix df/dA has the same sparse pattern as A? Thanks

tom · January 17, 2024, 8:06pm

Mathematically, it does not. To the mathematical gradient, the “missing” entries in the sparsity pattern are just things that happen to be zero. The computation

a = torch.sparse_coo_tensor(torch.tensor([[1, 2], [0, 1]]), torch.randn(2), size=(3, 3), requires_grad=True)
b = torch.randn(2, 3, requires_grad=True)
torch.sparse.mm(a,b.t()).sum().backward()

gives you an a.grad with the same sparsity pattern as a, so it essentially computes the gradient of a masked tensor (because mathematically, the gradient a * mask w.r.t. a is the mask).

Best regards

Thomas

Jiuhnny · January 17, 2024, 9:56pm

Thank you for the reply!
I’m writing my spmm (sparse-dense multiplication) kernel as what ge-spmm did. https://github.com/hgyhungry/ge-spmm/blob/master/pytorch-custom/op.py

From their code (SPMMFunction from Line 8), I found they only computed the gradient of the dense matrix for the backward. Do you mean dy/dA is also required? so grad_out @ b.t() should be returned in backward (Line36) as well?