Hello,
I am trying to create embeddings first, before feeding them into the model. The reason I need to create the embeddings first is max_seq_len: I want to make use of long sequences, and I think the best way is to generate the embeddings up front and then fine-tune on those embeddings. However, I have run into issues with this, and nothing I have tried works yet.
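For context, this is the limit I am working around (just a quick check of the model's setting; model_name is the same checkpoint I load below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(model_name)
print(model.max_seq_length)  # 128 for my model; anything longer gets truncated during encoding and training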
I am using a Sentence-BERT model with [query, positive, negative] triplets.
import numpy as np
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer

model = SentenceTransformer(model_name)

# Build the query text from the template fields
query = (qry + ' ' + history + ' ' + tpl.Occ.strip().lower() + ' '
         + tpl.Pr.strip().lower() + ' ' + tpl.D.strip().lower() + ' '
         + tpl.G.strip().lower() + ' ' + tpl.D.strip().lower() + ' '
         + tpl.HomeOwner.strip().lower())

# Pre-compute normalized embeddings for each side of the triplet
emb_query = np.array(model.encode(sentences=query, normalize_embeddings=True, show_progress_bar=True))
emb_positive = np.array(model.encode(sentences=positive, normalize_embeddings=True, show_progress_bar=True))
emb_hneg = np.array(model.encode(sentences=hard_negative, normalize_embeddings=True, show_progress_bar=True))

# Store the embeddings instead of the raw text
train_data['anchor'].append(emb_query)
train_data['positive'].append(emb_positive)
train_data['negative'].append(emb_hneg)
train_dataset = Dataset.from_dict(train_data)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    loss=train_loss,
    evaluator=dev_evaluator,
)
trainer.train()
Now I get an "unhashable type: list" error. I have tried converting to a numpy array and even explored using a tuple, but by the time I call train_dataset = Dataset.from_dict(train_data), everything is converted back to nested lists: [[0.1, 0.1], [0.1, 0.1]]. If I use raw text it works fine, but I want to maximise my max_seq_len, and I don't have the luxury of going beyond a max_seq_len of 128.
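Here is a minimal reproduction of the conversion I mean (standalone, just numpy and datasets; the values are made up, and the same thing happens when I store tuples):

import numpy as np
from datasets import Dataset

data = {'anchor': [np.array([0.1, 0.1]), np.array([0.1, 0.1])]}
ds = Dataset.from_dict(data)
print(ds['anchor'])           # [[0.1, 0.1], [0.1, 0.1]] -- back to plain Python lists
print(type(ds[0]['anchor']))  # <class 'list'>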