I trained a classification model on skills using BERT.
The model gives different results when it predicts on a single skill vs. a DataFrame of skills.
This is the code I use for inference on single words:
def predict_skill(skill: str):
    # tokenize the single skill with the same settings used for training
    encoded_dict = tokenizer.encode_plus(
        skill,
        add_special_tokens=True,
        max_length=32,
        pad_to_max_length=True,
        return_attention_mask=True,
        return_tensors='pt')
    skill_input_ids = encoded_dict['input_ids']
    skill_attention_masks = encoded_dict['attention_mask']
    # run the model and take the highest-scoring class
    skill_pred = model(skill_input_ids, skill_attention_masks)['logits'].cpu().detach().numpy()
    skill_pred_label = np.argmax(skill_pred, axis=1)
    return CategoriesD[skill_pred_label[0]]
PS: CategoriesD is the dictionary that maps label indices to skill category names.
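It looks roughly like this (the category names here are just placeholders to show the shape):

# placeholder example of the label-index -> category-name mapping
CategoriesD = {0: 'programming', 1: 'design', 2: 'management'}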
And this is the code I use when predicting on a DataFrame of skills:
test_preds = []
for input_ids, attention_masks in test_dataloader:
    input_ids = input_ids.to(device)
    attention_masks = attention_masks.to(device)
    # get the predicted labels for this batch of test data
    test_logits = model(input_ids, attention_masks)['logits'].cpu().detach().numpy()
    test_preds.append(np.argmax(test_logits, axis=1).flatten())
test_preds = np.concatenate(test_preds, axis=0)
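For context, the test_dataloader is built with the same tokenizer settings as predict_skill, roughly like this (simplified sketch; df['skill'] and the batch size are placeholders):

import torch
from torch.utils.data import TensorDataset, DataLoader

# tokenize every skill in the DataFrame with the same settings as predict_skill
input_ids, attention_masks = [], []
for skill in df['skill']:
    encoded_dict = tokenizer.encode_plus(
        skill,
        add_special_tokens=True,
        max_length=32,
        pad_to_max_length=True,
        return_attention_mask=True,
        return_tensors='pt')
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

# stack into tensors and wrap them in a DataLoader
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
test_dataset = TensorDataset(input_ids, attention_masks)
test_dataloader = DataLoader(test_dataset, batch_size=32)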
I am using the same parameters and the same tokenizer in both cases, yet the model assigns the same words different categories depending on whether they are fed in as a DataFrame or as single words, and I can't figure out why!
PS: the classification is better when the skills are predicted one at a time!
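For example, a quick check like this (test_skills is a placeholder for the list of skills fed to the DataLoader, in the same order) is how the mismatches show up:

# compare the two inference paths on the same skills
for i, skill in enumerate(test_skills):
    single_label = predict_skill(skill)       # single-word path
    batch_label = CategoriesD[test_preds[i]]  # DataFrame/batch path
    if single_label != batch_label:
        print(skill, single_label, batch_label)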