Quantized model profiling

Hello,

I am trying to profile quantized models using the torch.profiler APIs.
Are the outputs of those APIs (cpu_memory_usage, cpu_time, …) accurate for quantized models?
(From what I understood, torch.profiler is designed for nn.Module types.)

If not, are there other methods to profile quantized models?

Thank you,

I feel profiling this on CPU might be OK, but we have not extensively tested it. The profiler works at the aten operator level, so you will see ops like “quantized::linear”.
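
For reference, a minimal sketch of what that looks like end to end. The Toy module below is just a stand-in for a real quantized network, not anything from the tutorials:

import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert
from torch.profiler import profile, ProfilerActivity

# Toy eager-mode model, standing in for the real quantized network
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.fc = nn.Linear(64, 64)
        self.dequant = DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

m = Toy().eval()
m.qconfig = get_default_qconfig("fbgemm")
prepare(m, inplace=True)
m(torch.randn(8, 64))          # calibration pass
convert(m, inplace=True)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    m(torch.randn(8, 64))

# Quantized kernels appear at the operator level, e.g. quantized::linear
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))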

@jerryzh168 How does one profile quantized CPU models to actually understand the dtypes being passed around and the quant/dequant conversions (if they happen anywhere), and to see the backend (fbgemm/qnnpack/onednn) kernel calls (beyond quantized::linear), so as to understand what got fused and how exactly?

Should we use Linux’s low-level perf? Is there an example anywhere? Or can one use nsys to see these CPU function calls?

Yeah, nsys works. In general we don’t have a ready-made tool for visualizing all of this; usually torch.profiler or nsys are effective methods for debugging perf degradation.
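
In case it helps, torch.profiler itself can surface a bit more than the default table. A sketch, reusing the converted model m from the sketch above: record_shapes=True attaches input shapes (and, in the exported trace, input types) to each op, and the Chrome trace shows the per-op call sequence, including aten::quantize_per_tensor and dequantize boundaries:

import torch
from torch.profiler import profile, ProfilerActivity

# m: the converted quantized model from the earlier sketch
with profile(
    activities=[ProfilerActivity.CPU],
    record_shapes=True,   # record per-op input shapes (types show up in the exported trace)
    with_stack=True,      # attribute ops to Python source lines
) as prof:
    m(torch.randn(8, 64))

# Grouping by input shape helps spot where quantize/dequantize happens
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=20))

# Open in chrome://tracing or Perfetto to inspect the call sequence on a timeline
prof.export_chrome_trace("quantized_trace.json")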

A profiler would also be useful for novices to understand what’s actually going on, what actually got called, and where things get quantized/dequantized, since there are a lot of layers of indirection… and torch.profiler only shows high-level information (which, to be useful, already requires understanding what’s supposed to happen).

I wonder if you have tried using perf directly?

Hi @jerryzh168, I was using the torch profiler to check the CPU memory usage of the ResNet model vs. the quantized ResNet model (int8, trained via QAT) using

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

However the profiler gives me this result:

Non-quantized ResNet:


                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  

                  aten::empty         0.49%       1.166ms         0.49%       1.166ms       2.982us      53.60 Mb      53.60 Mb           391  
                aten::resize_         0.10%     244.000us         0.10%     244.000us       7.394us      30.43 Mb      30.43 Mb            33  
aten::max_pool2d_with_indices         3.80%       8.960ms         3.80%       8.960ms       8.960ms       2.30 Mb       2.30 Mb             1  
   aten::_slow_conv2d_forward        33.26%      78.356ms        33.51%      78.944ms       2.392ms      31.39 Mb     980.00 Kb            33  
             aten::batch_norm         0.21%     495.000us         5.80%      13.657ms     257.679us      42.40 Mb     392.00 Kb            53  
                   aten::mean         0.02%      37.000us         0.07%     154.000us     154.000us       8.00 Kb       8.00 Kb             1  
                  aten::addmm         0.02%      48.000us         0.03%      60.000us      60.000us          36 b          36 b             1  
          aten::empty_strided         0.00%       5.000us         0.00%       5.000us       5.000us           4 b           4 b             1  
                 aten::conv2d         0.19%     451.000us        86.44%     203.648ms       3.842ms      42.40 Mb           0 b            53  
            aten::convolution         0.63%       1.479ms        86.25%     203.197ms       3.834ms      42.40 Mb           0 b            53  

Quantized ResNet:


                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  

                  aten::empty         2.32%       8.175ms         2.32%       8.175ms     151.389us      42.40 Mb      42.40 Mb            54  
aten::_empty_affine_quantized         0.37%       1.295ms         0.37%       1.295ms      17.986us      16.20 Mb      16.20 Mb            72  
    aten::quantize_per_tensor         0.90%       3.157ms         0.90%       3.157ms       3.157ms     147.00 Kb     147.00 Kb             1  
                  aten::addmm         0.02%      71.000us         0.03%      90.000us      90.000us          36 b          36 b             1  
                   aten::item         0.00%      10.000us         0.00%      16.000us       8.000us           0 b           0 b             2  
    aten::_local_scalar_dense         0.00%       6.000us         0.00%       6.000us       3.000us           0 b           0 b             2  
             aten::contiguous         0.01%      19.000us         0.09%     323.000us     323.000us     147.00 Kb           0 b             1  
                  aten::clone         0.08%     285.000us         0.09%     304.000us     304.000us     147.00 Kb           0 b             1  
                aten::qscheme         0.02%      69.000us         0.02%      69.000us       1.353us           0 b           0 b            51  
           aten::q_zero_point         0.02%      72.000us         0.02%      72.000us       0.679us           0 b           0 b           106  

I don’t see much of a difference in the memory usage. Is this correct?

Why is aten::empty taking most of the memory? Maybe that just means the model is not memory bound, so the reduction in weight memory is not important?

I just wanted to know the maximum memory requirement of a model during inference, to validate the point of using a quantized model instead of a SOTA model.
Quantization does have an impact on CPU memory, right?

Yeah, quantization will have an impact on memory. If the main memory usage in inference comes from the model weights, and all those weights are quantized from fp32 to int8, you should see a 4x memory reduction.
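
As a quick sanity check on the weight side alone, separate from the profiler’s per-op numbers, one rough option is to compare serialized state_dict sizes. A sketch, where float_model and quant_model are placeholders for your fp32 and converted int8 ResNets:

import os
import tempfile
import torch

def serialized_size_mb(model):
    # Save the state_dict and use the file size as a proxy for weight memory
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
        path = f.name
    torch.save(model.state_dict(), path)
    size = os.path.getsize(path)
    os.remove(path)
    return size / 2**20

# float_model / quant_model are placeholders for your fp32 and int8 models
print("fp32:", serialized_size_mb(float_model), "MB")
print("int8:", serialized_size_mb(quant_model), "MB")  # close to 4x smaller if weights dominate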

Then why the results above? I am using the standard transfer learning method with QAT: (beta) Quantized Transfer Learning for Computer Vision Tutorial — PyTorch Tutorials 2.0.1+cu117 documentation.

Is there any other way to check the RAM usage for a sample tensor during inference?
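
Not a torch.profiler feature, but one process-level option on Linux is the stdlib resource module; a rough sketch, where model and sample are placeholders for your network and input tensor:

import resource
import torch

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print("peak RSS before inference:", peak_rss_mb(), "MB")
with torch.no_grad():
    out = model(sample)   # model / sample are placeholders
print("peak RSS after inference:", peak_rss_mb(), "MB")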

@jerryzh168 when I switch from x86 to qnnpack in torch.backends.quantized.engine, I get this result for the quantized model:


                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  

            quantized::conv2d        28.73%     119.856ms        28.75%     119.973ms       5.999ms       6.70 Mb       6.70 Mb            20  
          quantized::add_relu         1.23%       5.137ms         1.26%       5.252ms     328.250us       5.26 Mb       5.26 Mb            16  
       quantized::conv2d_relu        69.50%     289.980ms        69.67%     290.665ms       8.808ms       4.05 Mb       3.90 Mb            33  
aten::_empty_affine_quantized         0.02%      88.000us         0.02%      88.000us      29.333us     345.23 Kb     345.23 Kb             3  
    aten::quantize_per_tensor         0.09%     367.000us         0.09%     367.000us     367.000us     147.08 Kb     147.08 Kb             1  
                  aten::empty         0.01%      21.000us         0.01%      21.000us      21.000us       8.00 Kb       8.00 Kb             1  
                  aten::addmm         0.02%      84.000us         0.03%     116.000us     116.000us          36 b          36 b             1  
                   aten::item         0.01%      31.000us         0.01%      53.000us      26.500us           0 b           0 b             2  
    aten::_local_scalar_dense         0.01%      22.000us         0.01%      22.000us      11.000us           0 b           0 b             2  
             aten::contiguous         0.01%      24.000us         0.13%     555.000us     555.000us     147.08 Kb           0 b             1  

while the non-quantized table remains the same.

I do get that CPU time increases, since I am running the quantized model with the qnnpack config on an x86 machine while the non-quantized model still uses the x86 config, but such a large drop in memory is hard to believe just from changing the config.

Why the sudden drop?
Also, when considering the memory footprint, do we take the maximum across all layers or the sum?

Can you print the quantized model for both x86 and qnnpack? I think we need to check if they are the same model first.

They are exactly the same; I compared them. Also, I am loading the same model and just changing

torch.backends.quantized.engine = 'x86'

to

torch.backends.quantized.engine = 'qnnpack'
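
For completeness, a small sketch of one way to double-check that, where load_quantized_resnet is a hypothetical helper standing in for however the converted int8 model gets rebuilt and loaded:

import torch

def compare_across_engines(load_quantized_resnet):
    reprs = {}
    for engine in ("x86", "qnnpack"):
        torch.backends.quantized.engine = engine
        model = load_quantized_resnet()  # hypothetical loader for the converted int8 model
        reprs[engine] = str(model)
    # Identical printed structure means any profiling difference comes from
    # the backend kernels, not from the model definition itself
    print("same structure:", reprs["x86"] == reprs["qnnpack"])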