Hi,

I am fairly new to PyTorch and attention. As I understand it, MultiheadAttention maps a sequence of T vectors to an output sequence of the same length T. I would be interested in applying this operation convolutionally, i.e. sliding the MultiheadAttention over a sequence of length N > T. I understand that this would produce a (number_of_windows x T) set of embedded subsequences.

The reason I am interested in doing this is that I have a sequence of length N in which contiguous subsequences of length T < N carry some sort of relation, and I want to map the original length-N sequence to a new sequence. However, (1) it seems computationally expensive to run MultiheadAttention over the entire sequence, given that the relations only exist within the subsequences, and (2) the sequence can be of variable length. Hence my interest in using MultiheadAttention as a convolution-like operation.
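For what it's worth, here is a minimal sketch of one way this could be done: extract all sliding windows of length T with `Tensor.unfold`, then treat each window as an independent batch element in a single `nn.MultiheadAttention` call, so attention only happens within each window. The names `T`, `stride`, `embed_dim`, etc. are placeholders I chose, and this assumes `batch_first=True` (PyTorch >= 1.9) and a batch size of 1:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
N, T, stride = 100, 10, 5  # full length, window length, slide step

mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, N, embed_dim)  # (batch=1, N, embed_dim)

# Extract sliding windows of length T along the sequence dimension:
# after transpose: (1, embed_dim, N); after unfold: (1, embed_dim, num_windows, T)
windows = x.transpose(1, 2).unfold(dimension=2, size=T, step=stride)
num_windows = windows.shape[2]  # (N - T) // stride + 1 = 19 here

# Fold the windows into the batch dimension: (num_windows, T, embed_dim)
windows = windows.permute(0, 2, 3, 1).reshape(num_windows, T, embed_dim)

# One batched call; attention is confined to each length-T window
out, _ = mha(windows, windows, windows)
print(out.shape)  # torch.Size([19, 10, 16])
```

Note that the windows overlap when `stride < T`, so the output is a set of per-window embeddings rather than a single length-N sequence; you would still need to decide how to recombine overlapping windows (e.g. averaging) if you want one output per original position.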

Does anyone know of an existing implementation of this?

Thanks!