Proper way to make a Siamese CLIP without losing information from the text and image encoders

I have my own fine-tuned CLIP built from convnext_tiny and rubert-tiny. I want to build a Siamese network on top of it for comparing pairs of objects. Each object has a name, a description and an image.

For the text encoder I’m using the outputs of the BERT model:

x = bert(input_ids=input_ids, attention_mask=attention_mask)
x = x.last_hidden_state[:, 0, :]  # take the [CLS] token embedding
x = final_ln(x)  # project 312 -> 768

where bert takes the tokenizer outputs and final_ln is a final linear layer of shape (312 × 768), which was also trained while fitting the CLIP model.
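For reference, the text-encoder pieces are set up roughly like this (the checkpoint name is just illustrative, and I'm assuming final_ln is a plain nn.Linear mapping the 312-dim rubert-tiny hidden size to the 768-dim shared CLIP embedding):

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")  # example rubert-tiny checkpoint
bert = AutoModel.from_pretrained("cointegrated/rubert-tiny")
final_ln = nn.Linear(312, 768)  # 312 = rubert-tiny hidden size, 768 = shared CLIP embedding size

tokens = tokenizer(["object name"], padding=True, truncation=True, return_tensors="pt")
input_ids, attention_mask = tokens["input_ids"], tokens["attention_mask"]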
For the image encoder I use timm's create_model with num_classes=0 to remove the classification head. Then I concatenate the embeddings into two big vectors:

first_emb = torch.cat([image_emb1, name_emb1, desc_emb1], dim=1)
second_emb = torch.cat([image_emb2, name_emb2, desc_emb2], dim=1)
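The per-object embeddings above come from something like this hypothetical encode_object helper (bert, tokenizer and final_ln as in the snippet above; convnext_tiny with num_classes=0 returns a 768-dim pooled feature vector):

import timm
import torch

image_encoder = timm.create_model("convnext_tiny", pretrained=True, num_classes=0)  # no classifier head

def encode_object(image, name_ids, name_mask, desc_ids, desc_mask):
    # image: (B, 3, H, W); the text inputs are tokenizer outputs for the name and the description
    img_emb = image_encoder(image)  # (B, 768)
    name_emb = final_ln(bert(input_ids=name_ids, attention_mask=name_mask).last_hidden_state[:, 0, :])
    desc_emb = final_ln(bert(input_ids=desc_ids, attention_mask=desc_mask).last_hidden_state[:, 0, :])
    return torch.cat([img_emb, name_emb, desc_emb], dim=1)  # (B, 3 * 768)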

Then I forward both through a shared MLP head, which consists of dense layers, dropout and batch norm with ReLU as the activation:

out1 = head(first_emb)
out2 = head(second_emb)
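where head is a single MLP shared by both branches, roughly like this (the layer sizes here are just an example):

head = nn.Sequential(
    nn.Linear(3 * 768, 1024),  # 3 * 768 = concatenated image + name + description embeddings
    nn.BatchNorm1d(1024),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(1024, 256),  # final embedding used for the pair comparison
)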

My questions are:

  1. Aren’t these operations on the BERT output, like taking .last_hidden_state[:, 0, :], equivalent to just pooling, and therefore losing some information?
  2. Is there a better way to use BERT’s outputs before feeding them into the subsequent MLP?
  3. Is this a proper way to build a Siamese CLIP network?