Help or Suggestions Needed in Designing a CNN

I am designing a network that takes both a vector and an image as input.

I am wondering how to design such a network and its loss.

Approach 1 (not preferred): Flatten the image, concatenate it with the vector, and send the result to a linear layer. But flattening throws away the spatial structure of the image, so the network won't learn convolutional features from it.

Approach 2: Can I design some combination of convolution and linear layers at the input level itself? If so, how can I do that?
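To make Approach 2 concrete, here is a minimal sketch of one way it could look, assuming PyTorch. The vector is broadcast over the image's height and width and concatenated as extra input channels, so the first convolution sees both inputs. The class name `EarlyFusionNet`, the layer sizes, and the input shapes (3x32x32 image, 8-dimensional vector) are placeholders I made up for illustration, not something from a reference design.

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Hypothetical sketch: fuse the vector with the image at the input level."""

    def __init__(self, img_channels=3, vec_dim=8, out_dim=1):
        super().__init__()
        # The first conv takes the image channels plus one channel per vector entry.
        self.conv = nn.Sequential(
            nn.Conv2d(img_channels + vec_dim, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool to (B, 32, 1, 1)
        )
        self.head = nn.Linear(32, out_dim)

    def forward(self, image, vector):
        _, _, h, w = image.shape
        # (B, vec_dim) -> (B, vec_dim, H, W): each entry becomes a constant feature map.
        vec_maps = vector[:, :, None, None].expand(-1, -1, h, w)
        x = torch.cat([image, vec_maps], dim=1)
        x = self.conv(x).flatten(1)
        return self.head(x)

# Assumed shapes, for illustration only.
model = EarlyFusionNet()
image = torch.randn(4, 3, 32, 32)
vector = torch.randn(4, 8)
out = model(image, vector)
print(out.shape)
```

This keeps the convolutional structure intact, but the cost of the first layer grows with the vector length, so it probably only makes sense when the vector is short.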

Approach 3: Have two different networks, one convolutional network for the image and another for the latent vector, and combine their outputs at a later stage. But I fear the loss might not decrease with this setup.

Any help, opinion, or suggestion would be great.
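For Approach 3, this is a minimal sketch of what I have in mind, assuming PyTorch. A small CNN encodes the image, an MLP encodes the vector, and the two feature vectors are concatenated before a shared head. The class name `TwoBranchNet`, the layer sizes, the input shapes, and the choice of MSE as the loss (i.e. a regression target) are all assumptions for illustration. Since both branches sit in the same computation graph, a single loss on the head's output should send gradients into both of them, which the check at the end verifies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchNet(nn.Module):
    """Hypothetical sketch: separate image and vector branches, fused later."""

    def __init__(self, img_channels=3, vec_dim=8, out_dim=1):
        super().__init__()
        # Image branch: conv features pooled down to a 32-dim vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(img_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Vector branch: small MLP producing a 32-dim vector.
        self.mlp = nn.Sequential(nn.Linear(vec_dim, 32), nn.ReLU())
        # Shared head on the concatenated 64-dim features.
        self.head = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, out_dim)
        )

    def forward(self, image, vector):
        fused = torch.cat([self.cnn(image), self.mlp(vector)], dim=1)
        return self.head(fused)

torch.manual_seed(0)
model = TwoBranchNet()
image = torch.randn(4, 3, 32, 32)   # assumed image batch
vector = torch.randn(4, 8)          # assumed vector batch
target = torch.randn(4, 1)          # assumed regression target

# One loss on the fused output; backward() reaches both branches.
loss = F.mse_loss(model(image, vector), target)
loss.backward()

cnn_has_grad = all(p.grad is not None for p in model.cnn.parameters())
mlp_has_grad = all(p.grad is not None for p in model.mlp.parameters())
print(cnn_has_grad, mlp_has_grad)
```

For a classification task the MSE loss would be replaced by something like `nn.CrossEntropyLoss`; the fusion itself stays the same.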

I am referring to the post below while designing it.