How to output different size than embed_dim with MultiheadAttention?

I am trying to use attention to filter the content of an image. For that I actually provide different tensors for Q, K and V respectively (making sure that the dimensions are compatible with the matrix multiplications). My issue is that the final output has the dimension of Q (i.e. embed_dim), when I would like it to have the dimension of V (vdim). Looking at the docs, I think this is not possible, but I feel this feature should exist somehow.
Before I start writing my own attention implementation, does anyone know how to do what I described?
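For reference, here is a minimal sketch of the kind of thing I have in mind, i.e. a plain scaled dot-product attention where the output keeps the value dimension instead of embed_dim. The class name `CrossDimAttention` and its projections are my own, not part of the `torch.nn` API:

```python
import torch
import torch.nn as nn

class CrossDimAttention(nn.Module):
    """Single-head attention whose output has vdim, not embed_dim.
    Hypothetical sketch, not a torch.nn module."""

    def __init__(self, embed_dim: int, kdim: int, vdim: int):
        super().__init__()
        # Project queries and keys into a common space of size embed_dim
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(kdim, embed_dim)
        # Values keep their own width, so the output width is vdim
        self.v_proj = nn.Linear(vdim, vdim)
        self.scale = embed_dim ** -0.5

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q: (B, Lq, embed_dim), k: (B, Lk, kdim), v: (B, Lk, vdim)
        scores = self.q_proj(q) @ self.k_proj(k).transpose(-2, -1) * self.scale
        attn = torch.softmax(scores, dim=-1)   # (B, Lq, Lk)
        return attn @ self.v_proj(v)           # (B, Lq, vdim)
```

The alternative I can see is to keep `nn.MultiheadAttention` as-is and append an `nn.Linear(embed_dim, vdim)` to map its output down to vdim, but that adds a projection I don't actually want.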