Get well-calibrated confidence scores from the similarity of CLIP encodings

I am using CLIP to check the similarity between text and an image. I have a list of words (object classes) I want to check against, for example ("elephant", "tiger", "giraffe").
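To make the setup concrete, here is a minimal sketch of what I'm doing with OpenAI's `clip` package (the file name and prompt wording are just placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "photo.jpg" and the prompt template are placeholders
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(
    [f"a photo of a {w}" for w in ("elephant", "tiger", "giraffe")]
).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)

# normalise so the dot product is the cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T   # shape (1, 3)
```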

By taking the dot product of the encodings I get the similarity values. To turn them into a "confidence" I take the softmax over the outputs, and that works very well for predicting which class is in the image. But the classes may not be mutually exclusive, and in that case the softmax doesn't make sense. I tried a sigmoid instead, as is done in multi-label classification, but it gives me values all clustered around 0.55 (correct classes around 0.56 and wrong classes around 0.54), so in the example something like (0.565, 0.55, 0.62) if an elephant and a giraffe are in the picture. That makes it hard to set a threshold.
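Roughly, the two variants look like this (continuing from the snippet above; the cosine range in the comments is just the band I typically see, not a guarantee):

```python
# Single-label: scale by CLIP's learned temperature (logit_scale) and take a
# softmax, as in CLIP's zero-shot evaluation. Picks the right class reliably,
# but the scores are forced to sum to 1.
logits = model.logit_scale.exp() * similarity
probs_softmax = logits.softmax(dim=-1)

# Multi-label attempt: sigmoid over the raw cosine similarities.
# The cosines all sit in a narrow band (roughly 0.2-0.3 for me), and
# sigmoid(0.25) is about 0.56, so every class ends up near 0.55.
probs_sigmoid = torch.sigmoid(similarity)
```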

I would like to get something like (0.95, 0.05, 0.98) if an elephant and a giraffe are in the picture, i.e. a high score for both words that match.

Am I overcomplicating this, and is there a standard way to do it? Is it even possible to get such a well-calibrated confidence score?