Self attention and feature fusion over graphs

Hi

This is a repost from here, as I’m hoping someone in this forum might have some additional expertise in feature fusion.

Is it possible to perform feature fusion using a self-attention mechanism? I have data distributed over a graph, and multiple feature modalities sampled per node that I would like to “optimally” combine for a node classification problem. Let’s say that I have a graph $G=(V,E)$ with $N$ nodes. For each node, we sample $M$ different modalities, such that each node $v$ is characterized by $M$ feature vectors $\in \mathbb{R}^{m_{1}}, \mathbb{R}^{m_{2}}…\mathbb{R}^{m_{M}}$. For simplicity, assume $m_{1} = m_{2} = … = m_{M} = D$. I’m interested in “fusing” these features in a more intelligent way than simply concatenating them and passing them through a linear layer or MLP.

Assuming each node $v$ is characterized by a feature matrix $x_{i} \in \mathbb{R}^{M \times D}$ (where $M$ is the number of modalities, and $D$ the input dimension), a transformer approach over modalities would yield something like this:

\begin{align} Q = x_{i}W_{q}, \quad K = x_{i}W_{k}, \quad V = x_{i}W_{v} \end{align}

where $W_{q}, W_{k}, W_{v} \in \mathbb{R}^{D \times p}$. We then have \begin{align} Z = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{p}}, \text{axis}=1\right)V \end{align} where $Z \in \mathbb{R}^{M \times p}$ and $Z_{i}$ is the fusion of modalities with respect to modality $i$ (I think – correct me if I’m wrong, please). I could then sum over the rows of $Z$ to compute the final combination. Does this seem correct? I haven’t found many papers on using self-attention for feature fusion, so any help is much appreciated.
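For concreteness, the scheme described above could be sketched in PyTorch roughly as follows. This is a minimal illustration, not a tested fusion layer; the class name `ModalityFusion` and the projection dimension `p` are my own placeholders:

```python
import torch
import torch.nn.functional as F

class ModalityFusion(torch.nn.Module):
    """Self-attention across the M modalities of one node, then sum-pool."""

    def __init__(self, d: int, p: int):
        super().__init__()
        self.w_q = torch.nn.Linear(d, p, bias=False)
        self.w_k = torch.nn.Linear(d, p, bias=False)
        self.w_v = torch.nn.Linear(d, p, bias=False)

    def forward(self, x):  # x: (M, D), the M modality vectors of one node
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)       # each (M, p)
        attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (M, M)
        z = attn @ v       # (M, p): row i fuses all modalities w.r.t. modality i
        return z.sum(dim=0)  # (p,): summing rows is permutation invariant

M, D, P = 3, 16, 8
fused = ModalityFusion(D, P)(torch.randn(M, D))  # one fused vector of size P
```

Note that summing over the rows of $Z$ makes the final result invariant to the ordering of the modalities, since self-attention itself is permutation equivariant.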

k

So I used the link for the better maths rendering. You have two different uses of the index $i$ there: in the input it is an index over the nodes, while in the output you use it as an index over the modalities. But in self-attention one would typically expect the inputs and the outputs to match in shape, so if your inputs are N x M x D in your notation, your output would be of the same shape.

That said, attention seems like a natural thing to try in a situation like yours. If you want to go to N x D outputs, there seem to be several options:

• pool over the features after the attention,
• have a “node token” as an additional modality that works similarly to the class token in BERT / vision transformers, particularly if you stack several attention blocks as in transformers,
• go the “external attention” way of separating Q from KV and make Q independent of the modality; this would make your output independent of the modality. However, the node must then “know” which modalities to attend to.
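The second option could be sketched like this: prepend a learned token to the M modality vectors, run a standard attention block, and read out the token’s row. This is only an illustration under my own naming (`TokenFusion`); it uses `torch.nn.MultiheadAttention` for the attention block:

```python
import torch

class TokenFusion(torch.nn.Module):
    """Prepend a learned 'node token' to the modalities; read it out after attention."""

    def __init__(self, d: int, heads: int = 4):
        super().__init__()
        self.token = torch.nn.Parameter(torch.zeros(1, 1, d))
        self.attn = torch.nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):                      # x: (N, M, D)
        t = self.token.expand(x.shape[0], -1, -1)  # (N, 1, D), shared token
        h = torch.cat([t, x], dim=1)               # (N, M + 1, D)
        out, _ = self.attn(h, h, h)                # self-attention over token + modalities
        return out[:, 0]                           # (N, D): the token's row per node

N, M, D = 5, 3, 16
fused = TokenFusion(D)(torch.randn(N, M, D))   # (N, D) fused features
```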

Best regards

Thomas

Hi @tom

Thank you for the insights – to your first point, I did indeed use the $i$ index twice incorrectly, my mistake.

I think a permutation invariant function for aggregating attentions as suggested in your first point would likely be the easiest approach.

I’m not sure I understand the “external attention” approach, could you elaborate a bit here?

k

So to me, the defining bit for self-attention is that QKV all come from the same input.
Using the image from the famous Vaswani et al. “Attention Is All You Need” paper: in the decoder, you have an attention block where Q comes from the decoder while KV come from the encoder (so the “decoder looks at the encoder” – the red circle):

But in your case it may be natural to take Q just from the node, so the linear layer for Q takes N x D while the linear layers for KV take N x M x D. This may or may not work better than simply pooling over the modalities after the attention layer.
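A minimal sketch of that variant, assuming you have some separate per-node embedding to form the query from (the name `NodeQueryFusion` and the choice of an N x D node embedding are my assumptions):

```python
import torch

class NodeQueryFusion(torch.nn.Module):
    """Cross-attention: query from a per-node embedding, keys/values from modalities."""

    def __init__(self, d: int):
        super().__init__()
        self.w_q = torch.nn.Linear(d, d, bias=False)
        self.w_k = torch.nn.Linear(d, d, bias=False)
        self.w_v = torch.nn.Linear(d, d, bias=False)

    def forward(self, node, mods):             # node: (N, D), mods: (N, M, D)
        q = self.w_q(node).unsqueeze(1)        # (N, 1, D): one query per node
        k, v = self.w_k(mods), self.w_v(mods)  # (N, M, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)           # (N, D), independent of modality order

N, M, D = 5, 3, 16
fused = NodeQueryFusion(D)(torch.randn(N, D), torch.randn(N, M, D))
```

Because the query does not depend on the modality axis, the output is N x D by construction, and the attention weights tell each node which modalities it attends to.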

Best regards

Thomas