Deploy Pruned Models

Hi, I want to deploy a pruned model but from what I’ve seen, when you prune a neural net, the library creates a mask and sets some weights to zero. What has to be done to actually reduce model size and inference time of the deployed pruned model?

Thanks in advance!

You’re correct: pruning on its own won’t reduce your model size, and whether a pruned model runs faster depends on the support you get from your hardware and runtime.
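To make this concrete, here's a minimal sketch with `torch.nn.utils.prune` showing why: pruning attaches a mask and zeroes entries, and even after you make it permanent with `prune.remove`, the weight tensor keeps its original dense shape, so nothing gets smaller on disk or in memory.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Small layer just to demonstrate what pruning actually stores.
layer = nn.Linear(8, 4)

# L1 unstructured pruning: zeroes the 50% smallest-magnitude weights
# via a mask (adds weight_orig and weight_mask under the hood).
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Make pruning permanent: drops the mask reparametrization, but the
# weight is still a dense 4x8 tensor with zeros stored explicitly.
prune.remove(layer, "weight")

print(layer.weight.shape)                      # unchanged shape
print(int((layer.weight == 0).sum()))          # zeros are just stored zeros
```

Actually shrinking the model would require structured changes (removing whole channels/neurons) or a sparse-aware runtime that exploits the zeros.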

If you’re more interested in reducing model size and decreasing latency, you’re better off using quantization, distillation, or exporting to a runtime like TensorRT or IPEX.
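Of those options, dynamic quantization is the quickest to try in plain PyTorch. A minimal sketch (the toy model here is just an illustration): `quantize_dynamic` converts the weights of the listed module types to int8, which genuinely shrinks the stored weights (roughly 4x for those layers) and often speeds up CPU inference.

```python
import torch
import torch.nn as nn

# Toy float32 model standing in for your real network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Convert Linear weights to int8; activations are quantized on the fly
# at inference time, so no calibration data is needed.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = qmodel(torch.randn(1, 128))
print(out.shape)
```

Static quantization and quantization-aware training can squeeze out more, but they need calibration data or retraining.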

Examples of each are here: serve/experimental/torchprep at master · pytorch/serve · GitHub and GitHub - pytorch/serve: Serve PyTorch models in production


I’m guessing you could prune as a first step and then inspect the pruned model to get a feel for which parts of your pipeline could be shrunk without loss in performance. That is, if a bunch of the parameters at a certain layer get zeroed out by pruning, you can probably create an alternative network where that layer is smaller, and it should perform just as well (while being genuinely smaller and faster to train and run). I wonder if anyone has built a utility to do this.
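The inspection step above is easy to sketch. This is a hypothetical helper, not an existing utility: it prunes every `Linear` layer, then reports the fraction of zeroed weights per layer, so you can spot layers that tolerate heavy pruning and might be replaced with narrower ones.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model; substitute your own network here.
model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 4))

# Prune 70% of weights in every Linear layer by L1 magnitude.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.7)

# Fraction of zeroed weights per layer: layers that stay accurate at a
# high zero fraction are candidates for a smaller replacement width.
zero_frac = {
    name: float((module.weight == 0).float().mean())
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
}
for name, frac in zero_frac.items():
    print(f"layer {name}: {frac:.0%} zeroed")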
