Freezing a 3D backbone before nn.MultiheadAttention


I am trying to train a multimodal system that uses nn.MultiheadAttention for cross-attention between two branches. When I freeze the 3D backbone that feeds one of the branches, the model fails to learn any useful relationships; when those layers are left trainable, it does learn. Why does this happen? From the source code it looks like nn.MultiheadAttention contains its own internal input projection layers, so I would expect it to still be able to train those without relying on the frozen layers earlier in the network.
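To make the setup concrete, here is a minimal sketch of what I mean. The names `backbone_3d` and `other_branch` are hypothetical stand-ins (tiny linear layers, not my real networks); the point is that after freezing one branch, the attention module's own projection weights still receive gradients while the frozen backbone does not:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two branches (not the real models).
backbone_3d = nn.Linear(32, 64)   # stand-in for the 3D backbone
other_branch = nn.Linear(16, 64)  # stand-in for the other modality's branch

# Freeze the 3D backbone, as in my setup.
for p in backbone_3d.parameters():
    p.requires_grad = False

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x3d = torch.randn(2, 10, 32)  # (batch, seq, feat) into the frozen branch
xq = torch.randn(2, 5, 16)    # (batch, seq, feat) into the trainable branch

kv = backbone_3d(x3d)   # frozen branch: produces features, gets no updates
q = other_branch(xq)

# Cross-attention: queries from one branch, keys/values from the other.
out, _ = cross_attn(q, kv, kv)
out.sum().backward()

# The attention module's internal projections do get gradients...
print(cross_attn.in_proj_weight.grad is not None)  # True
# ...while the frozen backbone's weights do not.
print(backbone_3d.weight.grad is None)             # True
```

So gradients do reach the attention module's own parameters; my question is why the model nevertheless fails to learn useful cross-modal relationships when the backbone is frozen.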