I'm not sure whether I'm doing something wrong or whether this is expected behavior.
My experiment is to take a trained ResNet-101 and prune it using the PyTorch functionality in torch.nn.utils.prune. I'm currently using the code provided in the official tutorial, so nothing fancy yet.
The issue is that, as far as I can tell from my monitoring, the inference speed and the memory footprint remain exactly the same as for the original unpruned model, which seems odd to me.
import torch
import torch.nn.utils.prune as prune

# model has been loaded properly before
for name, module in model.named_modules():
    # prune 20% of connections in all 2D-conv layers
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name='weight', amount=0.2)
        prune.remove(module, 'weight')  # make it permanent
    # prune 40% of connections in all linear layers
    elif isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.4)
For timing I use:
import time

start = time.time()
# get predictions
output = model(inputs)
end = time.time()
print(end - start)
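A single time.time() measurement around one forward pass tends to be noisy, and on a GPU it can return before the kernels have finished because CUDA calls are asynchronous. Below is a minimal sketch of a steadier measurement, assuming a few warm-up calls and an average over several runs are acceptable. The helper name `benchmark` and its parameters are hypothetical, not part of PyTorch; on CUDA one would pass `torch.cuda.synchronize` as `sync`.

```python
import time
from typing import Callable, Optional


def benchmark(fn: Callable[[], object],
              warmup: int = 3,
              iters: int = 10,
              sync: Optional[Callable[[], None]] = None) -> float:
    """Return the mean wall-clock seconds per call of fn (hypothetical helper)."""
    # Warm-up calls are not timed (caches, lazy initialization, autotuning).
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure pending asynchronous work is done before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to actually finish
    return (time.perf_counter() - start) / iters


# Stand-in workload so the sketch runs without a model.
mean_seconds = benchmark(lambda: sum(range(10_000)))
print(f"{mean_seconds:.6f} s per call")
```

With a model this would look like `benchmark(lambda: model(inputs), sync=torch.cuda.synchronize)`, ideally inside `torch.no_grad()` and after `model.eval()`.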
For memory I use:
memory = 0
for p in model.parameters():
    memory += 32 * p.nelement()  # size in bits, assuming 32-bit floats
Am I doing something wrong? How can I achieve a speed-up using torch.nn.utils.prune?
As far as I am aware, the pruning module does not prune models by physically removing weights. Instead, it replaces weight with a parameter called weight_orig, which stores the original weights, and a buffer called weight_mask. The mask is a tensor of 0s and 1s marking which weights are kept, and it is applied to weight_orig before each forward pass (via a forward_pre_hook), so the pruned weights are zeroed out in the forward and backward passes. So pruning currently requires a bit more memory and a bit more computation. Calling prune.remove only folds the mask back into a single dense weight tensor of the original shape, with zeros in the pruned positions, so the parameter count and the cost of the dense convolution stay the same.
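The behavior described above can be checked on a single layer. This is a small sketch, assuming a toy Conv2d layer instead of the ResNet-101 from the question; the variable names are only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy layer standing in for one conv layer of the real model (assumption).
conv = nn.Conv2d(3, 8, kernel_size=3)
dense_numel = conv.weight.nelement()

# Unstructured L1 pruning of 20% of the weights, as in the question.
prune.l1_unstructured(conv, name="weight", amount=0.2)
param_names = [n for n, _ in conv.named_parameters()]  # contains 'weight_orig'
buffer_names = [n for n, _ in conv.named_buffers()]    # contains 'weight_mask'
print(param_names, buffer_names)

# Make the pruning permanent: the mask is folded into a dense 'weight'.
prune.remove(conv, "weight")
names_after = [n for n, _ in conv.named_parameters()]
zero_fraction = float((conv.weight == 0).sum()) / conv.weight.nelement()

# Same number of elements as before; only the values changed to zero.
print(names_after, conv.weight.nelement(), dense_numel, zero_fraction)
```

The element count is unchanged and roughly 20% of the entries are zero, which is why counting p.nelement() reports the same footprint and why a dense convolution kernel does the same amount of work.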
Since the msg board is full of pruning today… a while back, a researcher submitted a PR to one of my repos with some compelling pruned models (including a ResNet101 variant). They used a technique based on this paper: https://arxiv.org/abs/2002.08258
The pruned variants are created by applying an overlay in a txt file that defines the modifications to make to the original network, which was really easy to integrate. The end result is a network that is faster than the original and, typically, than the model one step down in the same family, while maintaining higher accuracy than that step-down model. Unfortunately the code to generate the pruned models wasn't included, but perhaps if someone asked they'd be willing to share.