I have fine-tuned the bert-base model with amp and without amp using
MAX_SEQ_LEN=512. I compared the performance of the two models in terms of:
- Fine-tuning time
- Inference time on CPU/GPU
- Model size
While conducting the first experiment, I observed that in terms of fine-tuning time,
the bert model with amp performs better compared to the model without amp.
However, when I compare inference time and model size, both models have the same inference time and model size.
Could anyone please explain why this is the case?
Are you using PyTorch dynamic quantization for the model?
The model size regarding its parameters won’t be changed, as the operations (and intermediates) will be cast to FP16 for “safe” ops.
So while you might be able to increase the batch size during training or inference, the state_dict won’t be smaller in size.
Which batch size are you using for inference? If you are seeing a speedup during training, you should also see it during inference. However, if your batch size is low (e.g. a single sample), the performance gain might be too small compared to the overheads of launching all kernels.
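To illustrate the point about model size, here is a minimal sketch (using a toy Linear layer as a stand-in for the actual BERT model) showing that autocast only changes the precision of eligible operations, while the parameters, and therefore the state_dict you save, stay in FP32:

```python
import torch

# Toy model as a stand-in for BERT; parameters are created in FP32.
model = torch.nn.Linear(8, 8)
assert all(p.dtype == torch.float32 for p in model.parameters())

# Inside autocast, "safe" ops run in lower precision (bfloat16 on CPU here),
# but the parameters themselves are untouched.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(torch.randn(2, 8))

assert out.dtype == torch.bfloat16
assert all(p.dtype == torch.float32 for p in model.parameters())
```

Since `state_dict()` just serializes those FP32 parameters, the checkpoint is byte-for-byte the same size with and without amp.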
@ptrblck thanks for your answer.
I have tried using different batch sizes, e.g. 8, 16, 64, and 128, but I do not see any difference.
Regarding the code, I am following examples here: https://pytorch.org/docs/stable/notes/amp_examples.html
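For context, the recipe from that page, condensed into a short loop (with a toy Linear model standing in for BERT; this is a sketch of the pattern, not the actual fine-tuning code):

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(16, 2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# GradScaler is a no-op when enabled=False, so the loop also runs on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

for _ in range(3):
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randint(0, 2, (8,), device=device)
    optimizer.zero_grad()
    # autocast runs eligible ops in FP16 on CUDA; disabled on CPU here.
    with torch.autocast(device_type=device, enabled=use_cuda):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    # Scale the loss before backward to avoid FP16 gradient underflow,
    # then unscale inside step() before the optimizer update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```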
Update: I am able to see the difference in inference time on GPU when running:
outputs = model(**inputs)
But when I compare the inference time on CPU, I do not notice any difference.
Automatic mixed precision is implemented for CUDA operations (and is thus in the
torch.cuda namespace). By applying amp your GPU could use TensorCores for certain operations, which would yield a speedup. I don’t know if anything like that is implemented for CPU operations (and, if I’m not mistaken, not all operations are implemented for
HalfTensors on the CPU).
thanks for your response. No, I am not using dynamic quantization. But since I cannot do mixed precision on the CPU, I guess for CPU I have to switch to dynamic quantization. Am I right?
So do you think post-training quantization is a better idea if we want to reduce inference time on the CPU?
I’ve unfortunately never profiled the quantized models, so I’m unsure what the expected speedup is.
However, please let us know once you are using the post-training quantized models and how large the performance gain is.
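For anyone trying it, a minimal dynamic-quantization sketch (again with a toy model standing in for BERT; for the real case you would pass the loaded transformers model instead):

```python
import torch

# Stand-in model; with BERT, the nn.Linear layers inside the transformer
# blocks are the ones that get quantized.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())

# Replaces every nn.Linear with a dynamically quantized version:
# weights are stored as int8, activations are quantized on the fly on CPU.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Inference API is unchanged; outputs are still regular FP32 tensors.
out = quantized(torch.randn(4, 128))
```

Because the Linear weights are stored in int8, the serialized state_dict shrinks, which is where the CPU model-size reduction comes from.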
@ptrblck just to give you an update regarding dynamic quantization on the CPU. There is an issue with the quantized bert model which I and many others are facing. Here is the GitHub issue link: https://github.com/huggingface/transformers/issues/2542