No Difference in Model size of BERT fine-tuned with amp and without amp

Hello Everyone,

I have fine-tuned the bert-base model with amp and without amp using MAX_SEQ_LEN=512. I compared the performance of these two models in terms of:

  1. Fine-tuning time
  2. Inference time on CPU/GPU
  3. Model size

While conducting the first experiment, I observed that in terms of fine-tuning time, the BERT model with amp performs better compared to the one without amp.

However, when I compare inference time and model size, both models have the same inference time and the same model size.

Could anyone please explain why this is the case?

Are you using PyTorch dynamic quantization for the model?

The size of the model’s parameters won’t change, as only the operations (and intermediates) are cast to FP16 for “safe” ops.
So while you might be able to increase the batch size during training or inference, the state_dict won’t be smaller in size.
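
As a quick sanity check (just a rough sketch, assuming `model` is your fine-tuned BERT model and the filenames are placeholders), you could verify that the stored parameters are still FP32 after amp training:

    import torch

    # amp only casts operations/activations during the forward pass;
    # the stored parameters stay FP32, so both checkpoints end up the same size.
    print({p.dtype for p in model.parameters()})  # expected: {torch.float32}
    torch.save(model.state_dict(), "bert_finetuned.pt")

    # Only an explicit cast would (roughly) halve the checkpoint size:
    # torch.save(model.half().state_dict(), "bert_fp16.pt")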

Which batch size are you using for inference? If you are seeing a speedup during training, you should also see it during inference. However, if your batch size is low (e.g. a single sample), the performance gain might be too small compared to the overheads of launching all kernels.
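
A rough way to measure the inference speedup would be something like this (a sketch, assuming `model` is your bert-base model already on the GPU; the dummy inputs and iteration count are placeholders):

    import time
    import torch
    from torch.cuda.amp import autocast

    batch_size, seq_len = 64, 512
    inputs = {
        "input_ids": torch.randint(0, 30522, (batch_size, seq_len), device="cuda"),  # 30522 = bert-base vocab size
        "attention_mask": torch.ones(batch_size, seq_len, dtype=torch.long, device="cuda"),
    }

    def benchmark(use_amp, iters=10):
        torch.cuda.synchronize()
        start = time.time()
        with torch.no_grad():
            for _ in range(iters):
                if use_amp:
                    with autocast():
                        model(**inputs)
                else:
                    model(**inputs)
        torch.cuda.synchronize()
        return (time.time() - start) / iters

    print("fp32: {:.4f}s, amp: {:.4f}s".format(benchmark(False), benchmark(True)))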

@ptrblck thanks for your answer.

I have tried different batch sizes, e.g. 8, 16, 64, and 128, but I am not seeing any difference.

Regarding the code, I am following the examples here: https://pytorch.org/docs/stable/notes/amp_examples.html
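
Concretely, my training step roughly follows that pattern (simplified; `model`, `optimizer`, and `dataloader` are my own fine-tuning objects):

    from torch.cuda.amp import autocast, GradScaler

    scaler = GradScaler()
    for batch in dataloader:
        optimizer.zero_grad()
        # Run the forward pass in mixed precision
        with autocast():
            outputs = model(**batch)
            loss = outputs.loss  # or outputs[0], depending on the transformers version
        # Scale the loss to avoid FP16 gradient underflow, then step
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()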

Update: I am able to see a difference in inference time on GPU using:

    from torch.cuda.amp import autocast

    with autocast():
        with torch.no_grad():
            outputs = model(**inputs)

But when I compare the inference time on CPU, I do not notice any difference.

Automatic mixed precision is implemented for CUDA operations (and thus lives in the torch.cuda namespace). By applying amp your GPU could use TensorCores for certain operations, which would yield a speedup. I don’t know if anything like that is implemented for CPU operations (and, if I’m not mistaken, not all operations are implemented for HalfTensors on the CPU).
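
As a side note, you can quickly check whether your GPU has TensorCores by looking at its compute capability (TensorCores are available on Volta and newer devices, i.e. compute capability >= 7.0):

    import torch

    # Check the current CUDA device for TensorCore support
    major, minor = torch.cuda.get_device_capability()
    print(torch.cuda.get_device_name())
    print("compute capability: {}.{}".format(major, minor))
    print("TensorCores available:", major >= 7)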

Thanks for your response. No, I am not using dynamic quantization. But since I cannot do mixed precision on the CPU, I guess I have to switch to dynamic quantization for CPU inference. Am I right?

So do you think post-training quantization is a better idea if we want to reduce inference time on the CPU?
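
The approach I am considering is roughly the following (just a sketch based on the PyTorch dynamic quantization docs; the checkpoint path is a placeholder, the sequence classification head is an assumption, and I haven't verified this on my model yet):

    import torch
    from transformers import BertForSequenceClassification

    # Load the fine-tuned FP32 model (path is a placeholder)
    model = BertForSequenceClassification.from_pretrained("path/to/finetuned-bert")
    model.eval()

    # Replace the Linear layers with dynamically quantized int8 versions;
    # weights are stored as int8, activations are quantized on the fly at runtime
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # CPU inference with the quantized model (`inputs` as before, on the CPU)
    with torch.no_grad():
        outputs = quantized_model(**inputs)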

I’ve unfortunately never profiled the quantized models, so I’m unsure what the expected speedup is.
However, please let us know once you are using the post-training quantized models and how large the performance gain is. :slight_smile:

@ptrblck just to give you an update regarding dynamic quantization on the CPU: there is an issue with the quantized BERT model which I and many others are facing. Here is the GitHub issue link: https://github.com/huggingface/transformers/issues/2542