Concatenating two models with different input sizes

I have a dataset of 60 videos; from these videos I extract images and audio.
From each video I extract 30 images, so 1800 images in total. The shapes of x_train and y_train are
x_train.shape => (1800,224,224,3),
y_train.shape => (1800,5)

From the audio I extract 15 signals (arrays of numbers) per video, so 60*15 = 900 in total. The shapes of x_train and y_train are
x_train.shape => (900, 128)
y_train.shape => (900,5)

The images are fed into a fine-tuned VggFace model (model 1); its output shape is (1800, 128).
The audio is fed into a Vggish model (model 2); its output shape is (900, 128).
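To make the shapes concrete, here is the arithmetic of my data layout as a minimal numpy sketch (the zero arrays are just stand-ins for the real embeddings):

```python
import numpy as np

n_videos = 60
frames_per_video = 30   # images extracted per video
signals_per_video = 15  # audio signals extracted per video

video_embeddings = np.zeros((n_videos * frames_per_video, 128))  # VggFace output
audio_embeddings = np.zeros((n_videos * signals_per_video, 128)) # Vggish output

print(video_embeddings.shape)  # (1800, 128)
print(audio_embeddings.shape)  # (900, 128)
```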

After training the two models (1 and 2), both outputs are used as input for a third model, model3. The problem I face is that when I call fit([x_audio_train, x_video_train], y_train, …)
I got the error:

the input size should be the same.

I hope my issue is clear.

How can I fix this?
Thank you

This sounds wrong: assuming dim0 represents the batch dimension, you are increasing the number of samples in the VggFace output (1800) relative to the Vggish output (900), so the two inputs no longer line up sample-for-sample.

I don’t know how model3 is defined, but I assume it’s trying to stack the input tensors somehow.
Could you post the fit source code, as I assume it’s coming from a higher-level API?
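In the meantime, one common way to make the sample counts line up (a sketch only, not necessarily what your pipeline needs): pool the frame-level and signal-level embeddings per video, so both branches yield one 128-d vector per video and dim0 matches. The zero arrays below are hypothetical stand-ins with the shapes from your post:

```python
import numpy as np

# Stand-ins for the two branches' outputs:
# 60 videos, 30 frame embeddings and 15 audio embeddings each.
video_feats = np.zeros((1800, 128))  # VggFace output
audio_feats = np.zeros((900, 128))   # Vggish output

# Average the embeddings per video; both now have 60 samples.
video_per_clip = video_feats.reshape(60, 30, 128).mean(axis=1)
audio_per_clip = audio_feats.reshape(60, 15, 128).mean(axis=1)

# With matching dim0, the features can be concatenated per video.
fused = np.concatenate([video_per_clip, audio_per_clip], axis=1)
print(fused.shape)  # (60, 256)
```

With this alignment, y_train would also be one label per video, shape (60, 5). Mean pooling is just one choice; max pooling or repeating the audio embeddings to match the frame count are alternatives.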