VAEs: Residual CNN as encoder

Hello everyone, I have a rather theoretical issue with VAEs.
I aim to maximize image-patch similarity across the scales of a single input image, and I want to implement an unsupervised VAE for this task.
My first question would be: is it “theoretically correct” to add residual connections in the VAE's CNN encoder, connecting each patch (from each scale) directly to the latent space? And could I then try to obtain the patch-similarity distribution from those latents?
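
To make the question more concrete, here is a rough sketch of what I have in mind. It is a toy version under several assumptions, not a real implementation: dense layers stand in for the convolutions, the weights are random and untrained, the decoder is omitted, and every name (`extract_patch`, `SkipEncoder`, `kl_to_standard_normal`, the patch size, the scales) is hypothetical. The "residual" part is the direct linear path `w_skip` from the flattened patch to the latent parameters, added to the output of the normal encoder path.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patch(image, scale, size=8):
    """Subsample the image with stride `scale`, then take the top-left size x size patch."""
    return image[::scale, ::scale][:size, :size]

class SkipEncoder:
    """Toy encoder: a hidden-layer path plus a direct skip path to the latent parameters."""
    def __init__(self, in_dim, hidden_dim, latent_dim):
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.w2 = rng.normal(0.0, 0.1, (hidden_dim, 2 * latent_dim))
        # Assumed skip connection: patch -> (mu, logvar) without passing through the hidden layer.
        self.w_skip = rng.normal(0.0, 0.1, (in_dim, 2 * latent_dim))
        self.latent_dim = latent_dim

    def __call__(self, patch):
        x = patch.reshape(-1)
        h = np.maximum(x @ self.w1, 0.0)        # ReLU hidden layer
        out = h @ self.w2 + x @ self.w_skip     # main path + skip path
        return out[:self.latent_dim], out[self.latent_dim:]  # mu, logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the usual VAE regularizer."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

image = rng.random((32, 32))                    # stand-in for a real input image
encoder = SkipEncoder(in_dim=64, hidden_dim=32, latent_dim=4)
scales = [1, 2, 4]

posteriors = [encoder(extract_patch(image, s)) for s in scales]
latents = [reparameterize(mu, logvar) for mu, logvar in posteriors]
kls = [kl_to_standard_normal(mu, logvar) for mu, logvar in posteriors]

# Similarity between the latent codes of the patches at scale 1 and scale 2.
similarity = cosine_similarity(latents[0], latents[1])
```

Since each patch still goes through the reparameterization step and the KL term, my understanding is that the skip path only changes how `mu` and `logvar` are computed, not the form of the objective, but that is exactly the point I would like to have confirmed.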

Thank you