Suitable DNN architecture for mapping noisy tensors to their clean versions

I have a dataset of noisy speech embeddings and clean speech embeddings. The dataset is saved as torch tensors, each with a shape of 512 x 350. I built a CNN model based on the U-Net architecture to map the noisy embeddings to their clean counterparts, but the results are not promising. Can anyone help or suggest a model that might perform better on this task? Thanks.
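
To make the setup concrete, here is a minimal sketch of one possible baseline for this kind of mapping. It is not the U-Net from the question, and there is no claim that it performs better. It assumes that 512 is the embedding (feature) dimension and 350 is the number of time frames, so each tensor is treated as a 1D sequence with 512 channels. The names `ResidualBlock1d` and `EmbeddingDenoiser`, the hidden size, the number of blocks, and the L1 loss are all illustrative choices, and random tensors stand in for the real data.

```python
import torch
import torch.nn as nn


class ResidualBlock1d(nn.Module):
    """Dilated 1D conv block with a skip connection (hypothetical helper)."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.net = nn.Sequential(
            # padding == dilation keeps the sequence length unchanged for kernel_size=3
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm1d(channels),
            nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)


class EmbeddingDenoiser(nn.Module):
    """Maps (batch, 512, 350) noisy embeddings to tensors of the same shape."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256, num_blocks: int = 4):
        super().__init__()
        self.inp = nn.Conv1d(feat_dim, hidden, kernel_size=1)
        # Dilations 1, 2, 4, 8 widen the temporal context without downsampling
        self.blocks = nn.Sequential(
            *[ResidualBlock1d(hidden, 2 ** i) for i in range(num_blocks)]
        )
        self.out = nn.Conv1d(hidden, feat_dim, kernel_size=1)

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # Predict a correction and add it to the input (residual learning)
        return noisy + self.out(self.blocks(self.inp(noisy)))


torch.manual_seed(0)
model = EmbeddingDenoiser()

# Random stand-ins for one batch of (noisy, clean) pairs; assumed layout is
# (batch, feature_dim=512, frames=350)
noisy = torch.randn(2, 512, 350)
clean = torch.randn(2, 512, 350)

denoised = model(noisy)
loss = nn.functional.l1_loss(denoised, clean)
loss.backward()
print(denoised.shape, loss.item())
```

The model predicts a correction that is added to the noisy input, so it starts close to the identity mapping and avoids the downsampling and upsampling stages of a U-Net. If the 350 axis is actually the feature dimension, the tensors would need to be transposed first.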