The model size with respect to its parameters won't change, since only the operations (and intermediate activations) are cast to FP16 for "safe" ops.
So while you might be able to increase the batch size during training or inference, the state_dict won't get smaller.
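For example, a quick check (a minimal sketch, assuming a CUDA device and a toy `nn.Linear` model) shows that autocast changes the dtype of the outputs but leaves the stored parameters in FP32:

```python
import torch

# Minimal sketch: autocast affects the dtype of intermediate computations,
# not the stored parameters. Assumes a CUDA device is available.
model = torch.nn.Linear(10, 10).cuda()
x = torch.randn(8, 10, device="cuda")

with torch.cuda.amp.autocast():
    out = model(x)

print(out.dtype)           # torch.float16 (autocast-eligible op)
print(model.weight.dtype)  # torch.float32 (parameters unchanged)
print(next(iter(model.state_dict().values())).dtype)  # torch.float32
```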
Which batch size are you using for inference? If you are seeing a speedup during training, you should also see one during inference. However, if your batch size is small (e.g. a single sample), the performance gain might be too small compared to the overhead of launching all the kernels.
Automatic mixed precision is implemented for CUDA operations (and thus lives in the torch.cuda namespace). By applying amp, your GPU could use TensorCores for certain operations, which would yield a speedup. I don't know if anything like that is implemented for CPU operations (and, if I'm not mistaken, not all operations are implemented for HalfTensors on the CPU).
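In case it's useful, this is roughly how amp is typically used on the GPU, following the usual autocast + GradScaler recipe (the model, optimizer, and loss below are just placeholders):

```python
import torch

# Sketch of a typical torch.cuda.amp training step; model, data, and
# loss_fn stand in for your own setup.
model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(32, 10, device="cuda")
target = torch.randint(0, 2, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    output = model(data)            # eligible ops run in FP16 (TensorCores)
    loss = loss_fn(output, target)
scaler.scale(loss).backward()       # scale loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
```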
Thanks for your response. No, I am not using dynamic quantization. But since I cannot do mixed precision on the CPU, I guess I have to switch to dynamic quantization for CPU inference. Am I right?
I've unfortunately never profiled the quantized models, so I'm unsure what the expected speedup is.
However, please let us know if you end up using post-training quantized models and how large the performance gain is.
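As a starting point, a minimal sketch of post-training dynamic quantization on the CPU could look like this (the Sequential model below is just a stand-in for your own):

```python
import torch

# Hedged sketch of post-training dynamic quantization for CPU inference;
# the Linear layers stand in for whichever modules your model uses.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
model.eval()

# Replace Linear layers with dynamically quantized (int8) versions.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    out = quantized(x)
print(out.shape)
```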