Reduction in accuracy of quantized BERT model

I am using dynamic quantization on a fine-tuned BERT model. When I run inference on the quantized model before saving it, I get nearly the same results (accuracy score) as the unquantized model, along with a reduction in inference time.

However, when I save the quantized model, load it back, and run inference on it, there is a significant difference in the results (around a 30 to 40% decrease in accuracy). Is this because of the way the quantized model is loaded?

Any leads would be appreciated.
Thanks

Following is the code:

import os

import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.quantization as quantization
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer

# read_tsv_file, create_dataloader and args are defined elsewhere in my script.

def load_model(args):
    config = BertConfig.from_pretrained(args.model_dir)
    tokenizer = BertTokenizer.from_pretrained(args.model_dir, do_lower_case=args.do_lower_case)
    model = BertForSequenceClassification.from_pretrained(args.model_dir, config=config)
    return model, tokenizer

def predict_label(model, inputs):
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs[0]
    logits = F.softmax(logits, dim=1)
    logits_label = torch.argmax(logits, dim=1)
    labels = logits_label.detach().cpu().numpy().tolist()
    label_confidences = [
        confidence[label].item() for confidence, label in zip(logits, labels)
    ]

    return labels, label_confidences


def predict(eval_dataloader, model, examples, device):
    index = 0
    labels_for_evaluations = []

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        input_ids = batch["input_ids"]
        mask_ids = batch["mask_ids"]
        token_type_ids = batch["token_type_ids"]

        input_ids = input_ids.to(device, dtype=torch.long)
        mask_ids = mask_ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        inputs = {"input_ids": input_ids, "attention_mask": mask_ids}
        predicted_labels, label_confidences = predict_label(model, inputs)

        for confidence, pred_label in zip(label_confidences, predicted_labels):
            labels_for_evaluations.append(str(pred_label))

    return labels_for_evaluations

if __name__ == "__main__":
    examples, labels = read_tsv_file(args.data_file)
    bert_model, tokenizer = load_model(args)
    bert_model.to(args.device)

    # perform dynamic quantization
    quantized_model = quantization.quantize_dynamic(
        bert_model, {nn.Linear}, dtype=torch.qint8
    )

    dataframe = pd.DataFrame({"text": examples})
    batch_size = 1

    print("quantized model ", quantized_model)
    eval_dataloader = create_dataloader(
        dataframe, tokenizer, args.max_seq_length, batch_size, test_data=True
    )

    # inference
    labels_for_evaluations = predict(
        eval_dataloader, quantized_model, examples, args.device
    )

    # serialize the quantized model
    quantized_output_dir = args.model_dir + "_quantized_batch1"

    if not os.path.exists(quantized_output_dir):
        os.makedirs(quantized_output_dir)
        quantized_model.save_pretrained(quantized_output_dir)
        tokenizer.save_pretrained(quantized_output_dir)

    print("accuracy score ", accuracy_score(labels, labels_for_evaluations))

Update

I found that many people are facing a similar issue: when you load a quantized BERT model, there is a huge decrease in accuracy. Here are the related issues on GitHub:

Dynamic Quantization on ALBERT (pytorch) #2542
Quantized model not preserved when imported using from_pretrained() #2556

Hi Ramesh, would you be able to provide some more information? What is your model definition, and what code are you using to save and load the model?

Hi @Vasiliy_Kuznetsov, thanks for your response. Please check the code above.

I am loading the quantized BERT model the same way we load the pre-trained BERT model. When I quantize the model and measure the accuracy, it is pretty much the same as the unquantized model. However, when I load the quantized model after saving it, there is a lot of variation in the results.

@Vasiliy_Kuznetsov I have updated the post with the GitHub issue links too; many people are facing a similar issue.


Hi @Ramesh_Kumar, could you also provide the code used to load the quantized model? Are you using the same load_model function, and how are you calling it?

Hi @Vasiliy_Kuznetsov, thanks for your response. Yes, I am using the same load_model function to load the quantized model.

Ah, I see. In that case, one place to check would be BertForSequenceClassification.from_pretrained: it may assume a floating-point model, so you would have to modify the loading code.
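
For reference, one sketch of what such modified loading code could look like (untested; the load_quantized_model name and the quantized_checkpoint path are placeholders, and it assumes the pytorch_model.bin written by save_pretrained contains the quantized state_dict). The idea is to build the float model from the config alone, apply the same quantize_dynamic call as at save time, and only then load the saved weights:

import torch
import torch.nn as nn
from transformers import BertConfig, BertForSequenceClassification

def load_quantized_model(model_dir, quantized_checkpoint):
    # Build a float skeleton from the config only (no pretrained weights),
    # apply the same dynamic quantization as at save time, and then load
    # the quantized weights into the already-quantized module structure.
    config = BertConfig.from_pretrained(model_dir)
    model = BertForSequenceClassification(config)
    quantized_model = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    state_dict = torch.load(quantized_checkpoint, map_location="cpu")
    quantized_model.load_state_dict(state_dict)
    quantized_model.eval()
    return quantized_model

If this loads cleanly, it would confirm that from_pretrained is rebuilding a float model and dropping the quantized weights, which would explain the accuracy drop.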

Hi,
I’m facing a similar issue when quantizing EfficientNet.
I opened a thread about it here, but I was wondering if you found any solution to your problem.

Hi @Vasiliy_Kuznetsov, could you please advise what modifications I have to make? I cannot find any leads regarding this. Thanks.

Hi @kfir_goldberg, no, I am still looking for a solution.

Does the solution posted in https://github.com/huggingface/transformers/issues/2542 solve your problem?

Hi @jerryzh168,

Thanks for your response. I have already checked that solution, but it is specifically for ALBERT, and I am aiming to quantize BERT.

Can you try the updated technique mentioned in https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html#serialize-the-quantized-model to save and load the quantized model?
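
For reference, a minimal sketch of that approach (tracing the quantized model and saving/loading it with TorchScript instead of save_pretrained/from_pretrained; here quantized_model is the model from the script above, and the sequence length, dummy tensor values, and file name are placeholders):

import torch

# Depending on the transformers version, you may need torchscript=True in the
# model config (or return_dict=False) so that the forward pass returns plain
# tuples that torch.jit.trace can handle.
max_seq_length = 128  # e.g. args.max_seq_length from the script above

dummy_input_ids = torch.zeros(1, max_seq_length, dtype=torch.long)
dummy_attention_mask = torch.ones(1, max_seq_length, dtype=torch.long)
dummy_token_type_ids = torch.zeros(1, max_seq_length, dtype=torch.long)

traced_model = torch.jit.trace(
    quantized_model, (dummy_input_ids, dummy_attention_mask, dummy_token_type_ids)
)
torch.jit.save(traced_model, "quantized_bert_traced.pt")

# Later, load the serialized quantized model directly with TorchScript:
loaded_quantized_model = torch.jit.load("quantized_bert_traced.pt")
loaded_quantized_model.eval()

Because torch.jit.load restores the traced module as-is, it avoids the float-model reconstruction that from_pretrained performs.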