After studying code from different authors, I’ve seen different approaches to encoding an image into a latent variable.
1) [N, 1, 28, 28] → Conv → [N, 32, 14, 14] → Conv → [N, 64, 7, 7] → Conv → [N, 128, 1, 1]
→ fc → [N, latent_dim]
2) [N, 1, 100, 100] → Conv → [N, 10, 48, 48] → Conv → [N, 20, 22, 22] → Conv → [N, 1, 14, 14]
→ flatten → fc → [N, latent_dim]
I’ve heard that the channels should increase while the width should decrease. However, in example 2) the channels drop back to 1 at the end. In case you wonder, the use case for 2) is encoding a geographic map for vehicle trajectory prediction. It proved to be absolutely useless for my MNIST variational autoencoder, though.
How do I know when to use this kind of approach?
Also regarding 1), are there any other (preferably better) ways to do this, when the goal is to build a VAE?
Thanks in advance
(I think you might need a flatten in 1, too.)
We have a short discussion of the more general number of channels and width in our book (Stevens/Antiga/Viehmann: Deep Learning with PyTorch, Section “8.3.1 Our network as an nn.Module”).
So the idea behind “channels should increase while width should decrease” is mainly “total size should decrease, but not too much”. Increasing the channels is more a way to achieve the “not too much” part in general: halving the spatial resolution (height and width, so a quarter of the pixels) while doubling the channels halves the number of elements.
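To make the arithmetic concrete, here is a small sketch that counts the elements per sample for the shapes listed under 1) in the question. The helper name `num_elements` is made up for illustration.

```python
def num_elements(channels, height, width):
    """Number of activations per sample in a [C, H, W] feature map."""
    return channels * height * width


# Per-sample shapes from approach 1) in the question (batch dimension dropped).
shapes = [(1, 28, 28), (32, 14, 14), (64, 7, 7), (128, 1, 1)]
sizes = [num_elements(*shape) for shape in shapes]

for shape, size in zip(shapes, sizes):
    print(shape, size)

# Halving height and width while doubling the channels halves the element count.
halved = num_elements(64, 7, 7) * 2 == num_elements(32, 14, 14)
print(halved)
```

Note that the first convolution actually increases the element count (1 → 32 channels outweighs the 4× spatial reduction); the steady shrinking only starts after that.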
Aside from the size of the features going into the fc layer (128 for 1 vs. 14 · 14 = 196 for 2), the key difference between 1 and 2 is that the first discards all explicitly spatial information, while the second keeps it but reduces the information “per pixel”. Which is more appropriate is likely task dependent. (And you could mix, too, if you feel like it.)
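As a rough sketch of the two variants in PyTorch: the question only gives the shapes, so the kernel sizes, strides, padding, ReLU activations, and `latent_dim = 16` below are my assumptions, picked only so that the intermediate shapes match the ones listed.

```python
import torch
from torch import nn

latent_dim = 16  # hypothetical value, not from the question

# Variant 1: spatial size collapses to 1x1, channels grow to 128.
encoder_1 = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
    nn.Conv2d(64, 128, kernel_size=7), nn.ReLU(),                      # 7 -> 1
    nn.Flatten(),                                                      # [N, 128]
    nn.Linear(128, latent_dim),
)

# Variant 2: a 14x14 spatial map survives, but with a single channel.
encoder_2 = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5, stride=2), nn.ReLU(),   # 100 -> 48
    nn.Conv2d(10, 20, kernel_size=5, stride=2), nn.ReLU(),  # 48 -> 22
    nn.Conv2d(20, 1, kernel_size=9), nn.ReLU(),             # 22 -> 14
    nn.Flatten(),                                           # [N, 196]
    nn.Linear(14 * 14, latent_dim),
)

with torch.no_grad():
    x1 = torch.zeros(4, 1, 28, 28)
    x2 = torch.zeros(4, 1, 100, 100)
    flat_1 = encoder_1[:-1](x1)  # features right before the fc layer
    flat_2 = encoder_2[:-1](x2)
    z1 = encoder_1(x1)
    z2 = encoder_2(x2)

print(flat_1.shape, flat_2.shape, z1.shape, z2.shape)
```

In variant 1 each of the 128 numbers before the fc layer summarizes the whole image; in variant 2 each of the 196 numbers still belongs to a location on a 14 × 14 grid.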
For the VAE, it might make sense to keep some spatial information, as you might want to be able to reconstruct flipped/rotated images (as a vague intuition), but the typical thing might be to keep more than one channel. I would recommend checking out the common VAE architectures. It is still good to ask “why” they did it this way, but it’d give you a starting point.
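One possible mix for MNIST-sized inputs, purely as a sketch and not a recommended or standard architecture: keep a 7 × 7 map with a few channels before the fc layers, then use the usual VAE heads for mean and log-variance with the reparameterization trick. The class name `SpatialVAEEncoder`, the 8 bottleneck channels, the kernel sizes, and `latent_dim=16` are all assumptions for illustration.

```python
import torch
from torch import nn


class SpatialVAEEncoder(nn.Module):
    """Hypothetical encoder that keeps a small multi-channel spatial map."""

    def __init__(self, latent_dim=16, bottleneck_channels=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
            # 1x1 conv: fewer channels, spatial layout unchanged.
            nn.Conv2d(64, bottleneck_channels, kernel_size=1), nn.ReLU(),
            nn.Flatten(),
        )
        flat_features = bottleneck_channels * 7 * 7
        self.fc_mu = nn.Linear(flat_features, latent_dim)
        self.fc_logvar = nn.Linear(flat_features, latent_dim)

    def forward(self, x):
        h = self.features(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)
        return z, mu, logvar


encoder = SpatialVAEEncoder()
z, mu, logvar = encoder(torch.zeros(4, 1, 28, 28))
print(z.shape, mu.shape, logvar.shape)
```

Whether this beats variant 1 on MNIST is something you would have to try; the point is only that “spatial” and “more than one channel” are not mutually exclusive.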