Why does quantization after pruning result in worse performance?

I have a ResNet18 model which I quantized using graph quantization. Each forward pass on my system takes about 100 ms (on CPU), and its size shrank from 85 MB to 45 MB.
I then pruned this model from the initial 25M parameters down to 14.5M, which shrank its size from 85 MB to 58 MB, and quantized the resulting model hoping for further improvements.
But in fact I'm seeing diminishing returns. That is, I don't get the inference speed I expected: instead of being faster, the new model is simply worse than the initial model (quantized without pruning).
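For reference, this is roughly the pipeline I'm following. The snippet below is only a simplified eager-mode sketch for illustration (the actual run uses graph quantization, and `calibration_loader` is a placeholder for my calibration data):

```python
import torch
from torchvision.models.quantization import resnet18

# Load the quantizable ResNet18 variant (has QuantStub/DeQuantStub built in)
model = resnet18(pretrained=True, quantize=False).eval()
model.fuse_model()  # fuse conv+bn+relu so they quantize as single units

# Static post-training quantization with the x86 (fbgemm) backend
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(model, inplace=True)

with torch.no_grad():
    for images, _ in calibration_loader:  # placeholder calibration data
        model(images)

torch.quantization.convert(model, inplace=True)
torch.jit.script(model).save("r18_quantized.pt")  # saved for libtorch on Windows
```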

Is this expected behavior?
Here are the two models for comparison:

and these are their runtime benchmark results:

pruned-then-quantized model:

1>[ RUN      ] EmbedderModelForwardFixture.ModelEmbedderBench (10 runs, 10 iterations per run)
1>[     DONE ] EmbedderModelForwardFixture.ModelEmbedderBench (5778.116020 ms)
1>[   RUNS   ]        Average time: 577811.602 us (~26296.168 us)
1>                    Fastest time: 537246.562 us (-40565.040 us / -7.020 %)
1>                    Slowest time: 617859.662 us (+40048.060 us / +6.931 %)
1>                     Median time: 585275.362 us (1st quartile: 554311.262 us | 3rd quartile: 594753.362 us)
1>
1>             Average performance: 1.73067 runs/s
1>                Best performance: 1.86134 runs/s (+0.13067 runs/s / +7.55054 %)
1>               Worst performance: 1.61849 runs/s (-0.11218 runs/s / -6.48174 %)
1>              Median performance: 1.70860 runs/s (1st quartile: 1.80404 | 3rd quartile: 1.68137)
1>
1>[ITERATIONS]        Average time: 57781.160 us (~2629.617 us)
1>                    Fastest time: 53724.656 us (-4056.504 us / -7.020 %)
1>                    Slowest time: 61785.966 us (+4004.806 us / +6.931 %)
1>                     Median time: 58527.536 us (1st quartile: 55431.126 us | 3rd quartile: 59475.336 us)
1>
1>             Average performance: 17.30668 iterations/s
1>                Best performance: 18.61343 iterations/s (+1.30675 iterations/s / +7.55054 %)
1>               Worst performance: 16.18491 iterations/s (-1.12177 iterations/s / -6.48174 %)
1>              Median performance: 17.08597 iterations/s (1st quartile: 18.04041 | 3rd quartile: 16.81369)

quantized from the normal model (no pruning done beforehand):

1>[ RUN      ] EmbedderModelForwardFixture.ModelEmbedderBench (10 runs, 10 iterations per run)
1>[     DONE ] EmbedderModelForwardFixture.ModelEmbedderBench (5672.357520 ms)
1>[   RUNS   ]        Average time: 567235.752 us (~31674.053 us)
1>                    Fastest time: 530900.462 us (-36335.290 us / -6.406 %)
1>                    Slowest time: 640024.562 us (+72788.810 us / +12.832 %)
1>                     Median time: 561095.762 us (1st quartile: 548392.562 us | 3rd quartile: 577176.062 us)
1>
1>             Average performance: 1.76294 runs/s
1>                Best performance: 1.88359 runs/s (+0.12066 runs/s / +6.84409 %)
1>               Worst performance: 1.56244 runs/s (-0.20050 runs/s / -11.37282 %)
1>              Median performance: 1.78223 runs/s (1st quartile: 1.82351 | 3rd quartile: 1.73257)
1>
1>[ITERATIONS]        Average time: 56723.575 us (~3167.405 us)
1>                    Fastest time: 53090.046 us (-3633.529 us / -6.406 %)
1>                    Slowest time: 64002.456 us (+7278.881 us / +12.832 %)
1>                     Median time: 56109.576 us (1st quartile: 54839.256 us | 3rd quartile: 57717.606 us)
1>
1>             Average performance: 17.62935 iterations/s
1>                Best performance: 18.83592 iterations/s (+1.20657 iterations/s / +6.84409 %)
1>               Worst performance: 15.62440 iterations/s (-2.00495 iterations/s / -11.37282 %)
1>              Median performance: 17.82227 iterations/s (1st quartile: 18.23511 | 3rd quartile: 17.32574)

Or to put it simply after 10 iterations:

r18_default           : 805.72 ms (mean)
quantized_model       : 560 ms (mean)
r18_pruned            : 7,466.78 ms
pruned_then_quantized : 578 ms (mean)

Not only is the second model not faster, it's worse: it has become slower!
You can also see that the pruned model is extremely slow: 10x slower than the default model!

Note:
In case it matters, training (pruning and fine-tuning the model) is done using PyTorch 1.5.1, and the final graph quantization is done on Windows using PyTorch 1.6.

Note 2:
This is being tested and evaluated using libtorch (1.6) on a Windows 10 machine.

I’d greatly appreciate any kind of feedback on this.
Thank you all in advance

Pruning is a good way to reduce model size, but it won’t automatically give you performance improvements.

The reason is that when pruning a model you're introducing zeros into its tensors (i.e. they become sparse). Those tensors can be compressed (with some extra overhead) without much problem, hence the reduced model size.

However, the underlying kernels (i.e. the code that operates on the tensors holding the values, like a matrix multiply) are by default written assuming dense tensors, because hardware provides instructions to make those dense operations fast (e.g. via vectorized operations on adjacent values in memory).

Unfortunately there’s not a lot of hardware, including CPUs, that can benefit directly from sparse Tensors.

There are some workarounds that can still speed up sparse operations in some cases. On CPU, for example, you can do structured pruning (i.e. removing blocks of values rather than individual ones) in a way that matches the size of the registers used in those vectorized operations, so that entire blocks of contiguous zeros can be skipped. For example, pruning blocks of 1x16 or 4x4 and quantizing afterwards matches the register size of vectorized CPU instructions (128-bit).
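To make that concrete, here is a minimal sketch (not an official PyTorch API, just an illustration) of zeroing out 1x16 blocks of a weight by block magnitude, so that the zeros land in contiguous runs of 16:

```python
import torch

def prune_1x16_blocks(weight: torch.Tensor, amount: float = 0.5) -> torch.Tensor:
    """Zero out the `amount` fraction of 1x16 blocks with the smallest L1 norm."""
    out_f, in_f = weight.shape
    assert in_f % 16 == 0, "illustration assumes in_features divisible by 16"
    blocks = weight.reshape(out_f, in_f // 16, 16)        # group columns into 1x16 blocks
    scores = blocks.abs().sum(dim=-1)                     # one score per block
    k = int(amount * scores.numel())
    threshold = scores.flatten().kthvalue(k).values       # k-th smallest block score
    mask = (scores > threshold).unsqueeze(-1).to(weight.dtype)
    return (blocks * mask).reshape(out_f, in_f)

w = torch.randn(64, 128)
w_pruned = prune_1x16_blocks(w, amount=0.5)
print((w_pruned == 0).float().mean())  # ~0.5 of entries zeroed, in runs of 16
```

A block mask built this way could also be applied to a module with torch.nn.utils.prune.custom_from_mask, but the key point is only the layout of the zeros.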

However, even if you use structured pruning, you still depend on the underlying kernels for the operations used in your model being implemented to take advantage of the structured nature of your tensors (i.e. knowing that they can skip entire runs of 16 values when doing things like a matrix multiply).

Why is it slower?
In theory a pruned model shouldn't be slower, since the dense operations could simply ignore incoming tensors that happen to contain a lot of zeros. However, if a special representation is used for the sparse tensor (i.e. to avoid storing the zeros), as is likely the case here, then operating on those special sparse tensors introduces inefficiencies (e.g. some copying of values may be needed to restore the original dense layout of the tensor before going through the dense kernels).
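A quick way to see that overhead on CPU: with unstructured sparsity, even at 90% zeros, a sparse matmul via torch.sparse.mm is often slower than just running the dense kernel (a rough illustration, exact numbers depend on your machine):

```python
import time
import torch

dense = torch.randn(2048, 2048)
dense[torch.rand_like(dense) < 0.9] = 0.0   # 90% unstructured zeros
sparse = dense.to_sparse()                  # COO representation (zeros not stored)
x = torch.randn(2048, 256)

def bench(fn, warmup=3, iters=10):
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per call

print("dense  matmul: %.2f ms" % bench(lambda: dense @ x))
print("sparse matmul: %.2f ms" % bench(lambda: torch.sparse.mm(sparse, x)))
```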

Hope this helps.


Thanks a lot, I really appreciate it.
But when whole channels are removed, we are basically left with a dense model at the end, so we should still benefit from the efficiency of dense operations! In my case I'm using Torch-Pruning, which, as far as I understand, removes entire channels rather than just setting weights to zero. So to me it should run at least as fast.
Also note that the final pruned model is then retrained, so even if some weights are merely zeroed at first, we should be dealing with a dense model at the end that shouldn't be performing like this.
Am I missing something here?
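Just to illustrate what I mean by channel pruning leaving a dense model (a manual sketch, not the actual Torch-Pruning call I use):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Keep the 96 filters with the largest L1 norm, drop the other 32 entirely
keep = torch.topk(conv.weight.detach().abs().sum(dim=(1, 2, 3)), k=96).indices

smaller = nn.Conv2d(64, 96, kernel_size=3, padding=1)
smaller.weight.data = conv.weight.data[keep].clone()
smaller.bias.data = conv.bias.data[keep].clone()

# The next layer's in_channels must be sliced accordingly; dependency-aware tools
# like Torch-Pruning handle that bookkeeping across the whole graph.
print(conv.weight.shape, "->", smaller.weight.shape)  # (128, 64, 3, 3) -> (96, 64, 3, 3)
```

The result is a physically smaller layer that is still dense, which is why I don't see where a slowdown would come from.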

Could you print out and check the dimensions of the weights of the pruned, quantized model during inference? If they are the same as before pruning, you should see similar performance.
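For example, something along these lines should work (a small hypothetical helper, assuming the eager-mode Python model is at hand; note that quantized conv/linear modules expose their weight through a weight() method rather than a .weight attribute):

```python
import torch

def dump_weight_shapes(model):
    # Print the weight shape of every module that has one, covering both
    # float modules (.weight attribute) and quantized modules (weight() method).
    for name, module in model.named_modules():
        w = None
        if hasattr(module, "weight"):
            w = module.weight() if callable(module.weight) else module.weight
        if isinstance(w, torch.Tensor):
            print(f"{name}: {tuple(w.shape)}")

dump_weight_shapes(pruned_quantized_model)  # placeholder name for your model
```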


Here is the live run of the process, showing that the pruned model is indeed faster, but the quantized model is, for some reason, extremely slow:


I also tested this with another model, and the outcome is the same: the quantized model takes much longer to finish:

Note:
As you can see, we do not fine-tune the pruned model here; we are just testing whether pruning by itself results in faster inference in our case, and as the results show, it does.
Fine-tuning an already-pruned model, for some reason, results in a severe slowdown at inference, as you can see.