Hi Filos,
it’s a bit hard to give you a general answer, as a suitable model architecture will most likely depend on the problem you are dealing with.
One way to deal with this kind of data is to use separate model “paths”, e.g. a CNN for your image input and a feed-forward model for the other features, concatenate these at some point and feed them to the last classifier layers.
Here is a simple example of such a model.