Model Overfits One Modality and Underfits the Other: Different Convergence Speeds

I am currently working on a multimodal (text and image input) transformer model.
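To make the setup concrete, the architecture looks roughly like this simplified PyTorch sketch (not my exact code; the encoder depths, mean pooling, and fusion by concatenation are stand-ins for what I actually use):

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Simplified stand-in for my model: one transformer branch per
    modality, fused by concatenating pooled embeddings into one head."""

    def __init__(self, d_model=256, num_classes=10):
        super().__init__()
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Linear(2 * d_model, num_classes)

    def forward(self, text_emb, image_emb):
        # text_emb: (batch, seq_len, d_model); image_emb: (batch, n_patches, d_model)
        t = self.text_encoder(text_emb).mean(dim=1)    # pooled text embedding
        v = self.image_encoder(image_emb).mean(dim=1)  # pooled image embedding
        return self.head(torch.cat([t, v], dim=-1))    # late fusion + classifier
```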

Trained only on the text input, the model converges after 15 epochs.

Trained only on the image input, the model converges after 30 epochs and still performs poorly at epoch 15.

When I train on both modalities, the results are close to what the text-only model produces, so the image branch seems to contribute very little.

Any ideas on how to make the model learn faster on the image component, so that it learns equally fast on both modalities?
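One thing I was considering, to make the question concrete, is giving the image branch its own optimizer parameter group with a higher learning rate; a minimal sketch using the model from above (the learning rate values are placeholders, not tuned):

```python
import torch

model = LateFusionModel()  # from the sketch above

# Separate parameter groups so the image branch can step with a larger
# learning rate than the text branch and the shared head.
optimizer = torch.optim.AdamW([
    {"params": model.image_encoder.parameters(), "lr": 3e-4},
    {"params": model.text_encoder.parameters(), "lr": 1e-4},
    {"params": model.head.parameters(), "lr": 1e-4},
])
```

Is something like this a reasonable direction, or is there a better-established way to balance convergence speed across modalities?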