I am trying to invert wav2vec 2.0, and it looks like it takes 400 samples and converts them into a single 512-dimensional vector. I'm having a hard time figuring out how to invert it. I tried a direct mapping from 512 => 400, but it doesn't give good results, even when overfitting to a handful of samples. I suspect I need to include more temporal context.
So if 400 samples (25 ms) convert to a single 512-dim vector, then 1200 samples (75 ms) will give me three 512-dim vectors.
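As a sanity check on that count, here is a small sketch of the frame arithmetic. It assumes the default wav2vec 2.0 convolutional feature encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, no padding), which is not stated above; the function names are made up for illustration. Under those assumptions each vector sees 400 samples, but consecutive vectors are only 320 samples apart, so the windows overlap.

```python
# Assumed default wav2vec 2.0 feature-encoder layout (an assumption, not from the post).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Number of output vectors for an input of num_samples (unpadded convs)."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        if length < kernel:
            return 0  # input too short to produce any frame
        length = (length - kernel) // stride + 1
    return length


def receptive_field_and_hop():
    """Samples seen by one output vector, and the spacing between vectors."""
    receptive_field, hop = 1, 1
    for kernel, stride in zip(KERNELS, STRIDES):
        receptive_field += (kernel - 1) * hop
        hop *= stride
    return receptive_field, hop


if __name__ == "__main__":
    print(receptive_field_and_hop())  # window and hop, in samples
    print(num_frames(400), num_frames(1200))
```

If these assumptions hold, the three vectors from a 1200-sample input jointly cover only the first 400 + 2 * 320 = 1040 samples, which may matter when choosing the target length for a decoder.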
How can I take those 3 vectors and convert back to 1200 samples? What architecture would you recommend?