Is there a way to do multi-label classification with CLIP?

The concrete use case is as follows. I have the classes baby, child, teen, and adult. My idea was to use the similarity between text and image features (for the text features I used the prompt ‘there is at least one (c) in the photo’, with c being one of the 4 classes).

I went through quite a lot of examples, but I keep running into two issues: the similarity scores for a fixed class often vary a lot from image to image, and/or different classes (like baby and child) end up needing very similar thresholds. As the similarity score I use the cosine similarity multiplied by 2.5, which stretches the score into the interval [0, 1], as is done in the CLIPScore paper.
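To make the setup concrete, here is a minimal sketch of the scoring described above. It assumes the image and text embeddings have already been produced by CLIP's encoders (not shown here); the function names and the threshold values are placeholders I made up for illustration, not part of any library. Negative cosines are clipped to 0 before rescaling, which is my reading of the CLIPScore formulation.

```python
import numpy as np

CLASSES = ["baby", "child", "teen", "adult"]
PROMPT = "there is at least one {} in the photo"
W = 2.5  # rescaling factor taken from the CLIPScore paper


def build_prompts(classes=CLASSES, template=PROMPT):
    """Return one text prompt per class."""
    return [template.format(c) for c in classes]


def clip_scores(image_emb, text_embs, w=W):
    """Rescaled cosine similarity between one image and each class prompt.

    image_emb: shape (d,), text_embs: shape (n_classes, d).
    Both are assumed to come from CLIP's encoders.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_embs = np.asarray(text_embs, dtype=float)
    # L2-normalise so the dot product is the cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    cos = text_embs @ image_emb
    # clip negatives to 0, then rescale by w
    return w * np.clip(cos, 0.0, None)


def predict_labels(scores, thresholds, classes=CLASSES):
    """Multi-label decision: keep every class whose score reaches its threshold."""
    return [c for c, s, t in zip(classes, scores, thresholds) if s >= t]
```

The part I cannot get right is the `thresholds` argument: with scores that vary this much per class, no fixed values I tried separate the classes reliably.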

Given this, setting a fixed threshold per class doesn’t seem possible.

Does anyone have an idea how to approach this? I feel quite stuck on how to proceed.