How to deal with spectrograms of different sizes

I’m building a variational autoencoder, but I don’t know how to define the input shape of my network, since each spectrogram in my dataset has a different shape.

How do I deal with that? I don’t think I can simply resize the spectrograms, since resizing would distort the audio they represent.
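For reference, one common alternative to resizing is to pad or crop every spectrogram along the time axis to a fixed number of frames, which leaves the frequency bins untouched. Below is a minimal sketch of that idea, not a definitive solution. It assumes each spectrogram is a 2-D NumPy array shaped `(n_freq_bins, n_frames)` and that only the number of frames varies; the helper name `fix_num_frames`, the target of 128 frames, and the random stand-in data are all hypothetical.

```python
import numpy as np

def fix_num_frames(spec, target_frames, pad_value=0.0):
    """Pad or crop a (n_freq_bins, n_frames) spectrogram along the time axis.

    Hypothetical helper: the frequency axis is left unchanged, so no
    interpolation (and no spectral distortion) is introduced.
    """
    n_frames = spec.shape[1]
    if n_frames >= target_frames:
        # Too long (or already the right length): keep the first target_frames frames.
        return spec[:, :target_frames]
    # Too short: append constant-valued frames at the end.
    pad = target_frames - n_frames
    return np.pad(spec, ((0, 0), (0, pad)), mode="constant",
                  constant_values=pad_value)

# Random stand-in spectrograms with 128 frequency bins and varying lengths.
rng = np.random.default_rng(0)
specs = [rng.random((128, t)) for t in (90, 128, 200)]

# Every item now has the same shape, so they can be stacked into one batch.
batch = np.stack([fix_num_frames(s, 128) for s in specs])
print(batch.shape)  # (3, 128, 128)
```

With this approach the network input shape would be fixed at `(128, 128)`. Cropping discards audio beyond the target length, so long recordings may be better split into several fixed-length windows; for log-scaled spectrograms, the minimum value of the data may be a more sensible `pad_value` than 0.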