How to do metric learning for image based recommendations

Hi guys!

I am new to metric learning and trying to understand how it works at scale for building content-based recommendations via image similarity, where each image is represented as an n-dimensional feature vector (n = 128, 512, etc.).

I am looking at the In-shop Clothes Retrieval dataset, which contains a few thousand products, each with around 3-5 images.

Now, many papers recommend variants of the contrastive loss or triplet loss for training. In these papers, a class is defined at the product level.

However, if we use the product as the class, particularly with respect to this dataset, we risk pushing visually similar items apart: both triplet and contrastive losses work by maximizing the embedding distance between images of different classes while pulling the embeddings of images within a class closer and closer together.
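To make that concrete, here is a minimal NumPy sketch of both losses as usually formulated (the margin values and function names are my own illustration, not taken from any specific paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor and positive are embeddings of images from the same product (class);
    # negative is an embedding from a different product.
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # pulled together
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # pushed apart
    # Loss is zero once the negative is at least `margin` farther than the positive.
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

def contrastive_loss(x1, x2, same_class, margin=1.0):
    # same_class = 1 if x1 and x2 are images of the same product, else 0.
    d = np.linalg.norm(x1 - x2, axis=-1)
    # Same-product pairs are pulled together; different-product pairs are
    # pushed apart until they are at least `margin` away.
    return (same_class * d**2
            + (1 - same_class) * np.maximum(margin - d, 0.0)**2).mean()
```

Note that in both losses a pair of images from two *different but visually similar* products is treated exactly like any other negative pair, which is the source of my concern below.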

But the reported results are really good, which suggests that inter-class distance (between similar products) is somehow kept under control rather than blown up by the triplet or contrastive loss.

Can anyone help me understand the intuition behind setting the problem like this?

Thanks and Regards,