Reduction in accuracy of quantized BERT model

I am using dynamic quantization on a fine-tuned BERT model. When I run inference on the quantized model before saving it, I get nearly the same results (accuracy score) as the unquantized model, along with a reduction in inference time.

However, when I save the quantized model, load it back, and run inference on it, there is a significant difference in the results (around a 30 to 40% decrease in accuracy). Is this because of the way the quantized model is loaded?

Any leads would be appreciated.
Thanks

Following is the code:

import os

import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.quantization as quantization
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer

# read_tsv_file, create_dataloader and args are defined elsewhere in my script.

def load_model(args):
    config = BertConfig.from_pretrained(args.model_dir)
    tokenizer = BertTokenizer.from_pretrained(args.model_dir, do_lower_case=args.do_lower_case)
    model = BertForSequenceClassification.from_pretrained(args.model_dir, config=config)
    return model, tokenizer

def predict_label(model, inputs):
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs[0]
    logits = F.softmax(logits, dim=1)
    logits_label = torch.argmax(logits, dim=1)
    labels = logits_label.detach().cpu().numpy().tolist()
    label_confidences = [
        confidence[label].item() for confidence, label in zip(logits, labels)
    ]

    return labels, label_confidences


def predict(eval_dataloader, model, examples, device):
    index = 0
    labels_for_evaluations = []

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        input_ids = batch["input_ids"]
        mask_ids = batch["mask_ids"]
        token_type_ids = batch["token_type_ids"]

        input_ids = input_ids.to(device, dtype=torch.long)
        mask_ids = mask_ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        inputs = {"input_ids": input_ids, "attention_mask": mask_ids}
        predicted_labels, label_confidences = predict_label(model, inputs)

        for confidence, pred_label in zip(label_confidences, predicted_labels):
            labels_for_evaluations.append(str(pred_label))

    return labels_for_evaluations

if __name__ == "__main__":
    examples, labels = read_tsv_file(args.data_file)
    bert_model, tokenizer = load_model(args)
    bert_model.to(args.device)

    # perform dynamic quantization
    quantized_model = quantization.quantize_dynamic(
        bert_model, {nn.Linear}, dtype=torch.qint8
    )

    dataframe = pd.DataFrame({"text": examples})
    batch_size = 1

    print("quantized model ", quantized_model)
    eval_dataloader = create_dataloader(
        dataframe, tokenizer, args.max_seq_length, batch_size, test_data=True
    )

    # inference
    labels_for_evaluations = predict(
        eval_dataloader, quantized_model, examples, args.device
    )

    # serialize the quantized model
    quantized_output_dir = args.model_dir + "_quantized_batch1"

    if not os.path.exists(quantized_output_dir):
        os.makedirs(quantized_output_dir)
        quantized_model.save_pretrained(quantized_output_dir)
        tokenizer.save_pretrained(quantized_output_dir)

    print("accuracy score ", accuracy_score(labels, labels_for_evaluations))

Update

I found that many people are facing a similar issue: when you load a quantized BERT model, there is a huge decrease in accuracy. Here are the related issues on GitHub:

Dynamic Quantization on ALBERT (pytorch) #2542
Quantized model not preserved when imported using from_pretrained() #2556

Hi Ramesh, would you be able to provide some more information? What is your model definition, and what code are you using to save and load the model?

Hi @Vasiliy_Kuznetsov, thanks for your response. Please check the code above.

I am loading the quantized BERT model the same way we load the pre-trained BERT model. When I quantize the model and measure the accuracy, it is pretty much the same as the unquantized model. However, when I load the quantized model after saving it, there is a lot of variation in the results.

@Vasiliy_Kuznetsov I have updated the post with the GitHub issue links too; many people are facing a similar issue.


Hi @Ramesh_Kumar, could you also provide the code used to load the quantized model? Are you using the same load_model function, and how are you calling it?

Hi @Vasiliy_Kuznetsov, thanks for your response. Yes, I am using the same load_model function to load the quantized model.

Ah, I see. In that case, one place to check would be BertForSequenceClassification.from_pretrained: it may assume a floating-point model, so you would have to modify the loading code.
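
For reference, one sketch of what such modified loading code could look like (untested; the load_quantized_model name and the quantized_checkpoint path are placeholders, and it assumes the pytorch_model.bin written by save_pretrained contains the quantized state_dict). The idea is to build the float model from the config alone, apply the same quantize_dynamic call as at save time, and only then load the saved weights:

import torch
import torch.nn as nn
from transformers import BertConfig, BertForSequenceClassification

def load_quantized_model(model_dir, quantized_checkpoint):
    # Build a float skeleton from the config only (no pretrained weights),
    # apply the same dynamic quantization as at save time, and then load
    # the quantized weights into the already-quantized module structure.
    config = BertConfig.from_pretrained(model_dir)
    model = BertForSequenceClassification(config)
    quantized_model = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    state_dict = torch.load(quantized_checkpoint, map_location="cpu")
    quantized_model.load_state_dict(state_dict)
    quantized_model.eval()
    return quantized_model

If this loads cleanly, it would confirm that from_pretrained is rebuilding a float model and dropping the quantized weights, which would explain the accuracy drop.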

Hi,
I’m facing a similar issue when quantizing EfficientNet.
I opened a thread about it here, but I was wondering if you found any solution to your problem.

Hi @Vasiliy_Kuznetsov, could you please advise what modifications I have to make? I cannot find any leads regarding this. Thanks.

Hi @kfir_goldberg, no, I am still looking for a solution.

Does the solution posted in https://github.com/huggingface/transformers/issues/2542 solve your problem?

Hi @jerryzh168,

Thanks for your response. I have already checked that solution, but it is specifically for ALBERT, and I am aiming to quantize BERT.

Can you try the updated technique mentioned in https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html#serialize-the-quantized-model to save and load the quantized model?
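
For reference, a minimal sketch of that approach (tracing the quantized model and saving/loading it with TorchScript instead of save_pretrained/from_pretrained; here quantized_model is the model from the script above, and the sequence length, dummy tensor values, and file name are placeholders):

import torch

# Depending on the transformers version, you may need torchscript=True in the
# model config (or return_dict=False) so that the forward pass returns plain
# tuples that torch.jit.trace can handle.
max_seq_length = 128  # e.g. args.max_seq_length from the script above

dummy_input_ids = torch.zeros(1, max_seq_length, dtype=torch.long)
dummy_attention_mask = torch.ones(1, max_seq_length, dtype=torch.long)
dummy_token_type_ids = torch.zeros(1, max_seq_length, dtype=torch.long)

traced_model = torch.jit.trace(
    quantized_model, (dummy_input_ids, dummy_attention_mask, dummy_token_type_ids)
)
torch.jit.save(traced_model, "quantized_bert_traced.pt")

# Later, load the serialized quantized model directly with TorchScript:
loaded_quantized_model = torch.jit.load("quantized_bert_traced.pt")
loaded_quantized_model.eval()

Because torch.jit.load restores the traced module as-is, it avoids the float-model reconstruction that from_pretrained performs.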