Trying to mimic a transformation

I’d like to create and train a network that reproduces the effect of a (complex) filter applied to sound files.
My idea is to create a dataset of input files, each containing (for example) 512 floats in [-1, +1] representing a sound, together with their output counterparts generated by applying the filter to them.
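To make the dataset idea concrete, here is a minimal sketch of how the input/output pairs could be generated. Everything in it is an assumption for illustration: `toy_filter` is a hypothetical stand-in (a simple one-pole low-pass) rather than the real filter, and `random_frame` and `make_dataset` are made-up helper names. The real inputs would be frames cut from actual sound files.

```python
import math
import random

FRAME_SIZE = 512  # samples per example, as described above


def toy_filter(frame, alpha=0.2):
    """Hypothetical stand-in for the real filter: a one-pole low-pass."""
    out = []
    prev = 0.0
    for x in frame:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


def random_frame(rng, n=FRAME_SIZE):
    """One synthetic input: a sine with random frequency and phase plus noise."""
    freq = rng.uniform(1.0, 40.0)            # cycles per frame
    phase = rng.uniform(0.0, 2.0 * math.pi)
    frame = []
    for i in range(n):
        s = 0.8 * math.sin(2.0 * math.pi * freq * i / n + phase)
        s += rng.uniform(-0.1, 0.1)          # a little noise
        frame.append(max(-1.0, min(1.0, s)))  # keep samples in [-1, +1]
    return frame


def make_dataset(num_examples, seed=0):
    """Build matching lists of input frames and filtered target frames."""
    rng = random.Random(seed)
    inputs = [random_frame(rng) for _ in range(num_examples)]
    targets = [toy_filter(frame) for frame in inputs]
    return inputs, targets


inputs, targets = make_dataset(100)
```

Each entry of `inputs` is one 512-sample frame and the entry of `targets` at the same index is its filtered version, so the two lists can be fed to a network as (input, target) pairs.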

I intend to start with an MLP. It would have 512 input neurons and 512 output neurons, with several hidden layers in between. The objective is for the network’s output to match the filtered sounds.
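Here is a rough sketch of the kind of network and training loop I have in mind, assuming PyTorch. The hidden layer sizes, the learning rate, the number of steps, and the use of Adam are arbitrary guesses on my part, and the data is a placeholder: random frames with a hypothetical two-tap averaging filter as the target, not the real filter. It uses MSE as the loss, which is the choice I am unsure about.

```python
import torch
from torch import nn

torch.manual_seed(0)

FRAME_SIZE = 512

# Placeholder data: random frames in [-1, 1] and a stand-in "filtered" target.
# Both tensors would be replaced by the real (input, filtered) pairs.
x = torch.rand(256, FRAME_SIZE) * 2.0 - 1.0
# Hypothetical filter: average each sample with its predecessor (wraps around).
y = 0.5 * (x + torch.roll(x, shifts=1, dims=1))

# MLP with 512 inputs, two hidden layers, and 512 outputs.
model = nn.Sequential(
    nn.Linear(FRAME_SIZE, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, FRAME_SIZE),  # linear output, no final activation
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for step in range(20):
    optimizer.zero_grad()
    pred = model(x)          # shape: (256, 512)
    loss = loss_fn(pred, y)  # mean squared error over all samples
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

With real data, the frames would normally be split into training and validation sets and fed in mini-batches rather than as one full batch.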

Is this a reasonable approach? How should I begin this project?
Should I use MSE loss?
Are there any useful links that show how to do this?
Thanks for your help.