I have my own fine-tuned CLIP model built from `convnext_tiny` and `rubert-tiny`. I want to build a Siamese network on top of it to compare pairs of objects. Each object has a name, a description, and an image.
I’m using the outputs of the BERT model like this:

```python
x = bert(input_ids=input_ids, attention_mask=attention_mask)
x = x.last_hidden_state[:, 0, :]  # take only the [CLS] token embedding
x = final_ln(x)
```
where `bert` takes the tokenizer outputs and `final_ln` is a final linear layer with a (312, 768) weight matrix, which was also trained while fitting the CLIP model.
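For completeness, here is a minimal self-contained sketch of the text path (assuming the `cointegrated/rubert-tiny` checkpoint, whose hidden size is 312, and reading `final_ln` as `nn.Linear(312, 768)`):

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# assuming the cointegrated/rubert-tiny checkpoint (hidden size 312)
tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")
bert = AutoModel.from_pretrained("cointegrated/rubert-tiny")
final_ln = nn.Linear(312, 768)  # trained jointly with the CLIP model

def encode_text(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    x = bert(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    x = x.last_hidden_state[:, 0, :]  # (B, 312) [CLS] embeddings
    return final_ln(x)                # (B, 768)
```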
For the image encoder I take `create_model` from timm with `num_classes=0` to remove the linear classification head.
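A minimal sketch of that (assuming `pretrained=True` and standard 224×224 inputs; with `num_classes=0`, `convnext_tiny` returns pooled 768-dim features):

```python
import timm
import torch

image_encoder = timm.create_model("convnext_tiny", pretrained=True, num_classes=0)

img1 = torch.randn(4, 3, 224, 224)  # dummy batch of images
image_emb1 = image_encoder(img1)    # (4, 768) pooled features, no classifier head
```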
Then I concatenate the embeddings into two big vectors:
```python
first_emb = torch.cat([image_emb1, name_emb1, desc_emb1], dim=1)   # (B, 3 * 768)
second_emb = torch.cat([image_emb2, name_emb2, desc_emb2], dim=1)  # (B, 3 * 768)
```
And then I forward both through an MLP head, which consists of dense layers, dropout, and batch norm with ReLU as the activation:
```python
out1 = head(first_emb)
out2 = head(second_emb)  # the same head weights are used for both branches
```
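For context, the shared head is built along these lines (the widths and dropout rate here are illustrative, not my exact config):

```python
import torch.nn as nn

# one shared head applied to both branches (the weight sharing makes it siamese)
head = nn.Sequential(
    nn.Linear(3 * 768, 512),  # concatenated image + name + description embeddings
    nn.BatchNorm1d(512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, 128),      # final embedding used for the pair comparison
)
```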
My questions are:
- Aren’t these operations on the BERT output, i.e. taking `last_hidden_state[:, 0, :]`, equivalent to just pooling, and therefore losing some information? (For what I mean by pooling, see the sketch after this list.)
- Is there a better way to use BERT when its output is then fed into a subsequent MLP?
- Is this a proper way to build a Siamese CLIP network?
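To make the first two questions concrete: an alternative to the `[CLS]` slice would be masked mean pooling over all tokens, e.g. this sketch:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # average the token embeddings, ignoring padded positions
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                  # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # (B, 1)
    return summed / counts
```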