Does PyTorch's multi-head attention split the feature dimension across heads?

I read that the PyTorch implementation of multi-head attention actually splits the feature dimension by the number of heads. Would that not mean that communication between these feature slices is impossible, and that using more than one head could present a significant bottleneck?
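For reference, this is roughly the behaviour I am referring to (a minimal check, assuming a recent PyTorch version where `nn.MultiheadAttention` exposes `head_dim`):

```python
import torch.nn as nn

embed_dim, num_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim, num_heads)

# Each head only sees a slice of the feature dimension:
# head_dim = embed_dim // num_heads = 64 here.
print(mha.head_dim)  # 64

# embed_dim must be divisible by num_heads, otherwise construction fails:
# nn.MultiheadAttention(512, 7)  -> "embed_dim must be divisible by num_heads"
```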

This is the approach proposed for the Transformer in the original paper “Attention Is All You Need”. Their motivation is that the total number of parameters should not depend on the number of heads – adding heads should not grow the model unless you also increase the embedding dimension itself.
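You can check that the parameter count really is independent of the head count (a quick sketch, assuming the default `nn.MultiheadAttention` settings where the key/value dimensions equal `embed_dim`):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# The input and output projections are embed_dim x embed_dim no matter
# how many heads that dimension is later split into, so both counts match.
print(n_params(nn.MultiheadAttention(512, 1)))
print(n_params(nn.MultiheadAttention(512, 8)))  # same number
```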

I’m not quite sure what you mean by communication between these features. The purpose of multiple heads is, to some extent, to be independent, so that they can potentially pick up on different patterns/aspects of the data.
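A stripped-down sketch of the splitting itself may make that independence clearer. This leaves out the learned input/output projections that the real module applies, and just shows how the feature dimension is divided before attention is computed:

```python
import torch

batch, seq, embed_dim, num_heads = 2, 10, 512, 8
head_dim = embed_dim // num_heads

q = k = v = torch.randn(batch, seq, embed_dim)

# Split features: (batch, seq, embed_dim) -> (batch, num_heads, seq, head_dim)
def split_heads(x):
    return x.view(batch, seq, num_heads, head_dim).transpose(1, 2)

qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)

# Scaled dot-product attention runs per head, each on its own 64-dim slice;
# heads do not see each other's features at this stage.
scores = qh @ kh.transpose(-2, -1) / head_dim ** 0.5
attn = scores.softmax(dim=-1) @ vh  # (batch, num_heads, seq, head_dim)

# The heads are concatenated back to (batch, seq, embed_dim) afterwards.
out = attn.transpose(1, 2).reshape(batch, seq, embed_dim)
```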

Would that not imply that the output of the attention layer potentially can’t capture correlations between two features that land in different heads? Or at least can’t weigh their importance against each other?