Developing a PyTorch model to determine whether one image appears in another

Hello everyone,
I have a dataset of images called “outfits,” and for each outfit image there is a list of images of the items that make up the outfit. Given two images, for example an outfit image X and an item image a, the model should predict whether the second image is part of the outfit shown in the first. If it is, the model should also predict the bounding box of that item within the outfit image. I’m new to PyTorch and have no idea how to develop such a model.
My initial attempt uses one of PyTorch’s pretrained models to obtain an embedding for each of the two images (for example X and a, or X and j), concatenates the two embeddings, and passes the result to a head that outputs the match prediction and the bounding box. However, the model never converges and I get no usable results, so I’m stuck at this stage. Any advice on how to put this idea into practice, or pointers to materials on a comparable problem or implementation, would be really appreciated. I would also welcome advice on how to train such a model, and on whether this is a difficult or a simple task for a network to learn. Thank you in advance.
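
For reference, the setup described above can be sketched roughly as follows. This is only an illustrative sketch under several assumptions, not the exact code from the attempt: `TinyEncoder` is a hypothetical stand-in for the pretrained backbone (so the snippet runs without downloading weights), the same encoder is shared by both images, boxes are assumed to be normalized to [0, 1] as (x1, y1, x2, y2), and the names `PairMatcher` and `pair_loss` are made up for illustration. The loss also assumes that the box term should only count for pairs where the item really is in the outfit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained backbone: image -> flat embedding."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # global average pool to (B, embed_dim, 1, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x).flatten(1)  # (B, embed_dim)


class PairMatcher(nn.Module):
    """Embeds both images, concatenates the embeddings, and predicts match + box."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.encoder = encoder  # assumption: one encoder shared by outfit and item
        self.trunk = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(inplace=True),
        )
        self.match_head = nn.Linear(hidden_dim, 1)  # logit: is the item in the outfit?
        self.box_head = nn.Linear(hidden_dim, 4)    # box of the item in the outfit image

    def forward(self, outfit: torch.Tensor, item: torch.Tensor):
        z = torch.cat([self.encoder(outfit), self.encoder(item)], dim=1)
        h = self.trunk(z)
        match_logit = self.match_head(h).squeeze(1)  # (B,)
        box = torch.sigmoid(self.box_head(h))        # (B, 4), squashed into [0, 1]
        return match_logit, box


def pair_loss(match_logit, box, is_match, target_box):
    """Binary cross-entropy on the match flag plus a box loss on positive pairs only."""
    cls_loss = F.binary_cross_entropy_with_logits(match_logit, is_match)
    per_pair = F.smooth_l1_loss(box, target_box, reduction="none").sum(dim=1)
    # Negative pairs have no meaningful box, so they are masked out of the box term.
    box_loss = (per_pair * is_match).sum() / is_match.sum().clamp(min=1.0)
    return cls_loss + box_loss


if __name__ == "__main__":
    model = PairMatcher(TinyEncoder())
    outfit = torch.rand(4, 3, 64, 64)  # random tensors standing in for real images
    item = torch.rand(4, 3, 64, 64)
    is_match = torch.tensor([1.0, 0.0, 1.0, 0.0])
    target_box = torch.rand(4, 4)
    match_logit, box = model(outfit, item)
    loss = pair_loss(match_logit, box, is_match, target_box)
    loss.backward()
    print(match_logit.shape, box.shape, loss.item())
```

In a real run, the stand-in encoder would be replaced by the pretrained model with its classification layer removed, and `embed_dim` would be set to that model's feature size.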