I am trying to profile quantized models using torch.profiler APIs.
Are the outputs of those APIs (`cpu_memory_usage`, `cpu_time`, …) accurate?
(From what I understand, torch.profiler was designed for nn.Module-based models.)
If that is not the case, are there other methods to profile quantized models?
I feel profiling this on CPU should be OK, but we have not extensively tested it. The profiler works at the aten operator level, so you will see ops like `quantized::linear`.
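For reference, here is a minimal sketch of what that looks like (the toy model, layer sizes, and batch size are made up for illustration): statically quantize a model with the eager-mode workflow, then run it under torch.profiler.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy model with explicit quant/dequant boundaries (eager-mode static quantization).
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.linear = torch.nn.Linear(64, 64)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.linear(self.quant(x)))

model = M().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)
model(torch.randn(8, 64))  # one calibration pass so the observers see data
torch.ao.quantization.convert(model, inplace=True)

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    model(torch.randn(8, 64))

# The table lists aten-level ops; the quantized kernels show up as rows
# like quantized::linear, with CPU time and CPU memory columns.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```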
@jerryzh168 How does one profile quantized CPU models to actually understand the dtypes being passed around and the quant/dequant conversions (if they happen anywhere), and to see the backend (fbgemm/qnnpack/onednn) kernel calls beyond `quantized::linear`, so as to understand what got fused and how exactly?
Should we use Linux's low-level perf? Is there an example anywhere? Or can one use nsys to see these CPU function calls?
Yeah, nsys works. In general we don't have a ready-made tool for visualizing all of this; usually torch.profiler or nsys are effective methods for debugging performance degradation.
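One way to see what got fused and where the quant/dequant boundaries sit, as a sketch under the same eager-mode workflow as above (model, names, and sizes are again made up): print the converted model, and profile with shapes and stacks enabled so ops can be mapped back to modules.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical Linear+ReLU pair; eager-mode fusion must be requested explicitly.
m = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
m = torch.ao.quantization.fuse_modules(m, [["0", "1"]])  # fuse Linear+ReLU

model = torch.nn.Sequential(
    torch.ao.quantization.QuantStub(),
    m,
    torch.ao.quantization.DeQuantStub(),
).eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)
model(torch.randn(8, 64))  # calibration
torch.ao.quantization.convert(model, inplace=True)

# The printed module tree shows fusion results (e.g. QuantizedLinearReLU
# instead of separate Linear and ReLU) and the quant/dequant placement.
print(model)

with profile(activities=[ProfilerActivity.CPU],
             record_shapes=True,   # per-op input shapes
             with_stack=True) as prof:  # Python stacks, to map ops to modules
    model(torch.randn(8, 64))

print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by="self_cpu_time_total", row_limit=15))
```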
The profiler would also be useful for novices to understand what is actually going on, what actually got called, and where quant/dequant happens, since there are a lot of layers of indirection… and torch.profiler only shows high-level information, which is only useful if you already understand what is supposed to happen.
I just wanted to know the maximum memory requirement of a model during inference, to validate the point of using a quantized model instead of a SOTA model.
Quantization does have an impact on CPU memory, right?
Yeah, quantization will have an impact on memory. If the main memory usage during inference comes from the model weights, and all of those weights are quantized from fp32 to int8, you should see roughly a 4x memory reduction.
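The 4x comes from fp32 weights taking 4 bytes per parameter versus 1 byte for int8. A rough sketch to sanity-check this (the helper name and layer size are made up; uses dynamic quantization of a single Linear): serialize the weights before and after and compare sizes.

```python
import io
import torch

def state_dict_size_mb(model):
    # Serialize the state_dict to an in-memory buffer; its size is a
    # rough proxy for the weight memory footprint.
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

fp32 = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).eval()
int8 = torch.ao.quantization.quantize_dynamic(
    fp32, {torch.nn.Linear}, dtype=torch.qint8)

# ~1.05M parameters: ~4.2 MB in fp32 vs ~1.1 MB in int8
# (scales, zero-points, and the fp32 bias add a small overhead).
print(f"fp32 weights: {state_dict_size_mb(fp32):.2f} MB")
print(f"int8 weights: {state_dict_size_mb(int8):.2f} MB")
```

Note this only measures the weights; peak inference memory also includes activations, which this sketch does not capture.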
I do get that CPU time increases, since I am running the quantized model with the qnnpack config on an x86 machine (while the non-quantized model runs with the native x86 config), but this much of a drop in memory is hard to believe just from changing the config.
Why such a sudden drop?
Also, when considering the memory footprint, should we take the maximum across all layers or the sum?