Model to predict a bounding box given two images as input

I am trying to build a model that probabilistically identifies the subregion of one image that most closely corresponds to another image.
How do I combine the features from the two images to predict the bounding box?
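
One possible way to combine the two feature sets, offered as a rough sketch rather than a known answer, is to treat the features of the smaller (query) image as a kernel and slide them over the features of the larger (search) image, as Siamese-style trackers do. The resulting similarity map can be turned into a probability map with a softmax, and its peak gives a box. Everything below is an assumption for illustration: the names `correlation_map` and `predict_box`, the `temperature` value, and the random arrays standing in for feature maps that would normally come from a shared CNN encoder. Coordinates are in feature-map cells, not pixels.

```python
import numpy as np


def correlation_map(search_feat, template_feat):
    """Score every offset of the template features inside the search features.

    search_feat:   array of shape (C, H, W)
    template_feat: array of shape (C, h, w), with h <= H and w <= W
    Returns cosine similarities of shape (H - h + 1, W - w + 1).
    """
    C, H, W = search_feat.shape
    c, h, w = template_feat.shape
    assert C == c and h <= H and w <= W
    t = template_feat.ravel()
    t_norm = np.linalg.norm(t) + 1e-8
    scores = np.empty((H - h + 1, W - w + 1))
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            window = search_feat[:, y:y + h, x:x + w].ravel()
            w_norm = np.linalg.norm(window) + 1e-8
            scores[y, x] = (window @ t) / (w_norm * t_norm)
    return scores


def predict_box(search_feat, template_feat, temperature=0.1):
    """Return the best-matching box (x, y, w, h) and a probability map over offsets."""
    scores = correlation_map(search_feat, template_feat)
    logits = scores / temperature   # hypothetical temperature, would need tuning
    logits -= logits.max()          # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()            # softmax over all candidate offsets
    y, x = np.unravel_index(np.argmax(probs), probs.shape)
    _, h, w = template_feat.shape
    return (int(x), int(y), w, h), probs


# Stand-in feature maps: in a real model both would come from the same encoder.
rng = np.random.default_rng(0)
search = rng.normal(size=(8, 16, 16))
template = search[:, 5:9, 3:10].copy()  # crop at x=3, y=5, width 7, height 4

box, probs = predict_box(search, template)
print(box)  # (3, 5, 7, 4)
```

In a trained model, the same idea is usually a convolution of the search features with the query features, and the probability map can either be used directly or passed to a small regression head that refines the four box coordinates. A simpler alternative is to resize both feature maps to the same shape, concatenate them along the channel axis, and regress the box from the concatenation, though that discards the explicit spatial matching.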