Architecture suggestions for learning an image-to-image mapping?

Hi, I'm trying to train a neural network to map inputs of shape [2, 257, 161] to outputs of shape [2, 73, 221] (STFT representations of audio data).

idea 1: an encoder maps the input to a flat tensor, which is then reshaped to the desired output shape
idea 2: an encoder maps the input to a flat tensor, a decoder (a few transposed convolutions) produces an output of shape [2, 128, 256], and output[:, :73, :221] is used as the final prediction (see the sketch after this list)
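In case it helps make the setup concrete, here is a minimal PyTorch sketch of idea 2. The class name, layer widths, kernel sizes, and latent size are placeholders I picked for illustration, not the actual model; the sketch just shows the encode-to-latent, decode-to-[2, 128, 256], then crop pattern (with a batch dimension, so the crop is over dims 2 and 3).

```python
import torch
import torch.nn as nn

class Idea2Net(nn.Module):
    """Encoder -> flat latent -> transposed-conv decoder -> crop to target shape."""

    def __init__(self, latent_dim=512):
        super().__init__()
        # Encoder: [B, 2, 257, 161] -> flat latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, stride=2, padding=1),   # -> [B, 32, 129, 81]
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # -> [B, 64, 65, 41]
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), # -> [B, 128, 33, 21]
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),                           # -> [B, 128, 4, 4]
            nn.Flatten(),                                           # -> [B, 2048]
            nn.Linear(128 * 4 * 4, latent_dim),
        )
        # Decoder: latent -> [B, 2, 128, 256], then crop to [B, 2, 73, 221]
        self.fc = nn.Linear(latent_dim, 128 * 8 * 16)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # -> [B, 64, 16, 32]
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),   # -> [B, 32, 32, 64]
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),   # -> [B, 16, 64, 128]
            nn.ReLU(),
            nn.ConvTranspose2d(16, 2, kernel_size=4, stride=2, padding=1),    # -> [B, 2, 128, 256]
        )

    def forward(self, x):
        z = self.encoder(x)
        z = self.fc(z).view(-1, 128, 8, 16)
        out = self.decoder(z)        # [B, 2, 128, 256]
        return out[:, :, :73, :221]  # crop to the target shape [B, 2, 73, 221]

if __name__ == "__main__":
    x = torch.randn(4, 2, 257, 161)  # batch of 4 STFT "images"
    y = Idea2Net()(x)
    print(y.shape)                   # torch.Size([4, 2, 73, 221])
```

Idea 1 would be the same encoder with the decoder replaced by a single `nn.Linear(latent_dim, 2 * 73 * 221)` followed by a reshape to [B, 2, 73, 221].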

Both ideas end up with roughly the same (bad) validation loss; the predictions show that the network has picked up -some- structure in the data, but the results are far from good.

Any ideas on better architectures for this use case?