Quantization of T5-small

Hi all, I have applied dynamic quantization to my T5-small model (which I built using the AllenNLP framework). However, I see inconsistent latency numbers on different hardware with the same dataset and batch size. On an m5 instance I observe a 10-15% speedup, while on a c5.9xlarge I actually see an increase in latency compared to the unquantized model. I chose dynamic quantization because I expected it to more or less work out of the box.
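
For reference, this is roughly how I applied it. A minimal sketch: it uses the Hugging Face `t5-small` checkpoint as a stand-in for my AllenNLP build, just to show the `quantize_dynamic` call:

```python
import torch
from transformers import T5ForConditionalGeneration  # stand-in for my AllenNLP model

model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

# Dynamic quantization: nn.Linear weights are converted to int8 ahead of time,
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```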

Has anyone run into this issue before?

On servers we use fbgemm, which I believe uses Intel SIMD intrinsics to speed up the runtime. cc @dskhudia: do we know if fbgemm always gives a speedup on AWS machines?
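
You can check which quantized engine your PyTorch build supports and is actually using with something like:

```python
import torch

# Engines compiled into this PyTorch build,
# e.g. ['none', 'fbgemm', 'qnnpack'] on an x86 server build.
print(torch.backends.quantized.supported_engines)

# fbgemm is typically the default on x86, but it can be set explicitly.
torch.backends.quantized.engine = 'fbgemm'
```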