Hello everyone,
After studying code from different authors, I’ve seen different approaches to encoding an image into a latent variable.
1 ) [N, 1, 28, 28] → Conv → [N, 32, 14, 14] → Conv → [N, 64, 7, 7] → Conv → [N, 128, 1, 1]
→ fc → [N, latent_dim]
2 ) [N, 1, 100, 100] → Conv → [N, 10, 48, 48] → Conv → [N, 20, 22, 22] → Conv → [N, 1, 14, 14]
→ flatten → fc → [N, latent_dim]
I’ve heard that the number of channels should increase while the spatial size decreases. However, in example 2) the channels drop back down to 1 at the end. In case you’re wondering, the use case for 2) is encoding a geographic map for vehicle trajectory prediction. It proved absolutely useless for my MNIST variational autoencoder, though.
How do I know when to use this kind of approach?
Also, regarding 1): are there any other (preferably better) ways to do this when the goal is to build a VAE?
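To make the question concrete, here is a minimal sketch of the "channels up, resolution down" convention from 1) applied to a 28×28 MNIST VAE encoder, assuming PyTorch. The layer sizes and latent dimension are my own illustrative choices, not taken from any particular reference implementation; the main point is the flatten followed by two linear heads for the mean and log-variance of q(z|x):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Conv encoder: channels increase as spatial size shrinks,
    then two linear heads parameterize the latent Gaussian."""

    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # [N, 32, 14, 14]
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # [N, 64, 7, 7]
            nn.ReLU(),
        )
        # Flattened feature size: 64 * 7 * 7 = 3136
        self.fc_mu = nn.Linear(64 * 7 * 7, latent_dim)
        self.fc_logvar = nn.Linear(64 * 7 * 7, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(start_dim=1)  # [N, 3136]
        return self.fc_mu(h), self.fc_logvar(h)

enc = Encoder(latent_dim=16)
mu, logvar = enc(torch.randn(4, 1, 28, 28))
```

Outputting mu and logvar separately (rather than one fc into latent_dim) is what lets you apply the reparameterization trick, z = mu + exp(0.5 * logvar) * eps, during training.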
Thanks in advance