My classification model gives different predictions for the same word when it's alone and when it's in a DataFrame

I trained a classification model on skills using BERT.
The model gives different results when it predicts on a single skill vs. a DataFrame of skills.
This is the code I use for inference on single words:

def predict_skill(skill: str):
    # NOTE: the text itself must be passed to encode_plus; without it this call fails
    encoded_dict = tokenizer.encode_plus(
                        skill,
                        add_special_tokens = True,
                        max_length = 32,
                        padding = 'max_length',   # pad_to_max_length is deprecated
                        truncation = True,
                        return_attention_mask = True,
                        return_tensors = 'pt')
    skill_input_ids = encoded_dict['input_ids']
    skill_attention_masks = encoded_dict['attention_mask']
    # forward pass, then move the logits to CPU as a numpy array
    skill_pred = model(skill_input_ids, skill_attention_masks)['logits'].cpu().detach().numpy()
    skill_pred_label = np.argmax(skill_pred, axis=1)
    return CategoriesD[skill_pred_label[0]]

PS: CategoriesD is the skill-categories dictionary.

And this is the code I use when predicting on a DataFrame of skills:

test_preds = []

for input_ids, attention_masks in test_dataloader:
    # move the batch to the model's device
    # (these two lines were blank in the original post; .to(device) is assumed)
    input_ids = input_ids.to(device)
    attention_masks = attention_masks.to(device)

    # get the predicted labels for the test data
    test_logits = model(input_ids, attention_masks)['logits'].cpu().detach().numpy()
    test_preds.append(np.argmax(test_logits, axis=1).flatten())

test_preds = np.concatenate(test_preds, axis=0)

I am using the same parameters and the same tokenizer, yet it gives the same words different categories depending on whether they are fed to the model in a DataFrame or as single words, and I can't figure out why!
PS: it classifies better when it's given single skills!
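As a sanity baseline (a standalone numpy sketch, not the actual BERT model): with a fully deterministic forward pass, feeding samples one at a time or as a batch must give identical argmax predictions, so any disagreement has to come from the inputs (e.g. different tokenization or padding between the two paths) or from train-mode layers such as dropout. The weight matrix and encodings below are made-up stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 4))  # stand-in for a deterministic classifier head

def predict(batch):
    # purely deterministic forward pass: logits, then argmax per sample
    return np.argmax(batch @ W, axis=1)

X = rng.normal(size=(8, 32))  # 8 fake encoded "skills"

batched = predict(X)
single = np.array([predict(x[None, :])[0] for x in X])

# A deterministic model agrees sample-by-sample with the batched run,
# so a real disagreement points at the inputs or at train-mode randomness.
assert (batched == single).all()
```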

Is your model using layers which use some batch stats such as batchnorm layers and if so, did you call model.eval() before comparing the outputs?
If that’s the case, could you compare the output of a single sample vs. a batch using the same repeated sample?
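To make the batch-stats point concrete, here is a minimal numpy sketch (not the actual model) of what a batchnorm-style layer does in train mode: it normalizes each sample with the statistics of the whole batch, so the same sample comes out differently alone vs. inside a batch.

```python
import numpy as np

def batchnorm_train(x, eps=1e-5):
    # train-mode batchnorm: normalize with the *batch* mean and variance
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

batch = np.array([[1.0, 2.0],
                  [3.0, 6.0],
                  [5.0, 10.0]])

out_in_batch = batchnorm_train(batch)[0]   # first sample, normalized with batch stats
out_alone = batchnorm_train(batch[:1])[0]  # same sample, normalized by itself

# The same sample yields different activations, because the batch
# statistics change with the batch it happens to be in.
print(out_in_batch, out_alone)
```

In eval mode the layer would use fixed running statistics instead, and the two outputs would match.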

Thanks for the reply @ptrblck! Initially I was using model.eval() and the predictions were still different, so I ran the comparison without model.eval() as well. I used the same code in both cases; the only difference is that one path batches the samples in a DataLoader, and the samples are just text!