VQ-Diffusion for face generation

I'm working on VQ-Diffusion and would appreciate some suggestions. The task is to use a diffusion model to generate new faces. The input images have size 3x64x64, so I want to encode them and run the diffusion in the latent space. My questions are: what spatial size and how many feature maps (channels) should the latent have when it is fed to the U-Net used for the reverse (denoising) process? And is it useful to work in the latent space at all?
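To make the size question concrete, here is a rough sketch of the shape arithmetic I have in mind. It assumes the VQ encoder shrinks each spatial side by an integer factor (2, 4, or 8 are just candidate values, not recommendations) and produces one codebook index per latent position. The function name `latent_grid` and its parameters are made up for illustration and do not come from any library.

```python
def latent_grid(height, width, downsample_factor):
    """Return (latent_h, latent_w, num_tokens) for a hypothetical VQ encoder
    that downsamples each spatial side by `downsample_factor`."""
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("image size must be divisible by the downsample factor")
    latent_h = height // downsample_factor
    latent_w = width // downsample_factor
    # One discrete codebook index per latent position.
    return latent_h, latent_w, latent_h * latent_w


# Candidate downsampling factors for a 64x64 input (assumed values).
for f in (2, 4, 8):
    h, w, n = latent_grid(64, 64, f)
    print(f"f={f}: {h}x{w} token grid, {n} tokens per image")
```

So with a factor of 4, the denoising network would see a 16x16 grid of tokens instead of 64x64 pixels. Which factor and which embedding dimension give good faces is exactly what I am unsure about.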

