I’m working on a graph neural network with the DGL library. What makes me uncertain is that at one point I calculate a pairwise attention score between adjacent nodes using a simple MLP, and create a new node in the graph if a pair has a score higher than some threshold. This graph is then used in the next step to do graph convolution. I don’t think I can write this operation in closed form, so I’m not sure whether PyTorch can differentiate it either.
I think it depends on what you want to differentiate.
If you generate a new embedding for that new node and add it to the current graph state before further processing, I think this can be nicely differentiable.
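A minimal sketch of this differentiable case, in plain PyTorch rather than DGL. It assumes the thresholding step has already picked the pairs, and `embed_mlp`, the toy edge list, and the stand-in loss are all hypothetical names used only for illustration; the DGL bookkeeping for adding the node and its edges is left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy node features, standing in for the output of earlier layers.
h = torch.randn(4, 8, requires_grad=True)
src = torch.tensor([0, 1, 2])
dst = torch.tensor([1, 2, 3])

# Assumption: the threshold step already selected pairs (0, 1) and (2, 3).
keep = torch.tensor([True, False, True])

# Hypothetical module that produces the embedding of each new node.
embed_mlp = nn.Linear(16, 8)

pair = torch.cat([h[src[keep]], h[dst[keep]]], dim=-1)  # (2, 16)
new_h = embed_mlp(pair)                                 # (2, 8)

# Append the new node embeddings to the existing node features.
h_all = torch.cat([h, new_h], dim=0)                    # (6, 8)

# Stand-in for the graph convolution and loss that would follow.
loss = h_all.pow(2).sum()
loss.backward()

# Both the embedding module and the original features receive gradients.
print(embed_mlp.weight.grad is not None)
print(h.grad.shape)
```

The gradient reaches `embed_mlp` and `h` because the new rows are ordinary tensor outputs; only the choice of which pairs to keep sits outside autograd.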
If you want gradients for things like which nodes this node should be connected to, or whether it should be added at all, it is going to be trickier, as these quantities are usually discrete.
Thank you for the quick reply. Does “tricky” in this scenario mean you can’t learn anything in the MLP? Do you have anything for me to read to possibly address this? Still huge thanks if there isn’t any, this seems like a tough question.
Off the top of my head, I could probably train the MLP separately before training the other parts of the network, using some carefully designed loss?
You cannot define gradients for discrete quantities. So “tricky” here means not possible.
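To see where the gradient is cut, here is a small sketch. The scorer `score_mlp`, the random `pair` features, and the 0.5 threshold are assumptions made for illustration, not the setup from the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

score_mlp = nn.Linear(16, 1)   # hypothetical pairwise scorer
pair = torch.randn(3, 16)      # stand-in features for 3 node pairs

score = torch.sigmoid(score_mlp(pair)).squeeze(-1)  # continuous, tracked by autograd
keep = score > 0.5                                  # boolean mask, not tracked

print(score.requires_grad)  # True
print(keep.requires_grad)   # False

# Casting the mask to float does not help: it is a constant as far as
# autograd is concerned, so nothing downstream depends on score_mlp.
out = (keep.float() * pair.sum(dim=-1)).sum()
print(out.requires_grad)    # False
```

Any loss built only from `keep` has no path back to the scorer's parameters, so the MLP gets no training signal from the add-or-not decision itself.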
A trick that can be used to get around this is to make these quantities continuous. But this is not always possible and can have a very weird interpretation depending on the application. For the connectivity here, for example, I’m not sure what it would mean to have an adjacency matrix with continuous values?
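One possible reading of a continuous adjacency, sketched below, is to keep every candidate edge and scale its message by the score, much like an attention weight. This is only an illustration under that assumption: `score_mlp`, the toy edge list, and the stand-in loss are hypothetical, and whether a soft edge is meaningful depends on the application.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

h = torch.randn(4, 8)              # toy node features
src = torch.tensor([0, 1, 2])
dst = torch.tensor([1, 2, 3])

# Hypothetical pairwise scorer.
score_mlp = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))

pair = torch.cat([h[src], h[dst]], dim=-1)   # (E, 16)
w = torch.sigmoid(score_mlp(pair))           # (E, 1), soft edge weight in (0, 1)

# Weighted sum aggregation: each message is scaled by its soft weight
# instead of being kept or dropped by a hard threshold.
messages = w * h[src]                                   # (E, 8)
agg = torch.zeros_like(h).index_add(0, dst, messages)   # (4, 8)

# Stand-in for the rest of the network and the loss.
loss = agg.pow(2).sum()
loss.backward()

# The scorer now receives gradients through the soft weights.
print(all(p.grad is not None for p in score_mlp.parameters()))
```

The same gating idea could in principle be applied to a candidate new node by always adding it and scaling its contribution by the score, at the cost of a graph that is larger than the thresholded one.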