Hi, I'm trying to train a neural network to map inputs of shape [2, 257, 161] to outputs of shape [2, 73, 221] (both are STFT representations of audio data).
Idea 1: an encoder maps the input to a flat tensor, and the flat tensor is reshaped to the desired output shape.
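To make the sizes in idea 1 concrete, here is a minimal sketch of the flatten-and-reshape step in plain Python. It assumes a row-major reshape (the default in most frameworks); the helper name `unflatten` is hypothetical and only used for illustration.

```python
import math

IN_SHAPE = (2, 257, 161)   # input shape from the post
OUT_SHAPE = (2, 73, 221)   # target shape from the post

# Number of values the encoder's flat output must contain.
flat_size = math.prod(OUT_SHAPE)  # 2 * 73 * 221 = 32266


def unflatten(flat, shape):
    """Reshape a flat sequence into nested lists of `shape` (row-major order)."""
    if len(shape) == 1:
        return list(flat)
    step = math.prod(shape[1:])
    return [
        unflatten(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


# Stand-in for the encoder output: just the flat indices 0..flat_size-1.
flat = list(range(flat_size))
out = unflatten(flat, OUT_SHAPE)
```

With this layout, neighbouring entries of the flat tensor end up next to each other along the last axis only, so the final layer has to learn the 2D structure of the output on its own.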
Idea 2: an encoder maps the input to a flat tensor, a decoder (a few transposed convolutions) creates an output of shape [2, 128, 256], and output[:, :73, :221] is used as the final prediction.
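The shape arithmetic behind idea 2 can be sketched as follows. This uses the standard output-size formula for a transposed convolution; the layer settings (kernel 4, stride 2, padding 1, four layers) and the 8 x 16 starting grid are assumptions chosen to reach 128 x 256, not necessarily the actual configuration.

```python
def conv_transpose_out(size, kernel, stride, padding, output_padding=0, dilation=1):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


def decoder_output_size(start_hw, num_layers, kernel=4, stride=2, padding=1):
    """Spatial size after stacking `num_layers` identical transposed convs."""
    h, w = start_hw
    for _ in range(num_layers):
        h = conv_transpose_out(h, kernel, stride, padding)
        w = conv_transpose_out(w, kernel, stride, padding)
    return h, w


START_HW = (8, 16)      # assumed size of the reshaped flat tensor (per channel)
TARGET_HW = (73, 221)   # spatial part of the desired output shape

# With kernel 4, stride 2, padding 1, every layer doubles each spatial axis.
full_hw = decoder_output_size(START_HW, num_layers=4)

# Fraction of the decoder output that survives the crop [:, :73, :221].
kept_fraction = (TARGET_HW[0] * TARGET_HW[1]) / (full_hw[0] * full_hw[1])
```

Under these assumptions roughly half of the decoder output is cropped away and never contributes to the loss.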
Both ideas result in roughly the same poor validation loss. The predictions show that the network has -some- idea of the data, but they are far from good.
Any ideas on better architectures for this use case?