How is weight pruning different from dropout?
Dropout drops certain activations stochastically, i.e. a new random subset of them for any data passing through the model. Typically this is undone after training (although there is a whole theory about test-time dropout).
Pruning drops certain weights permanently, i.e. it removes parts of the model deemed “uninteresting”.
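To make the contrast concrete, here is a minimal sketch (the module sizes and the 50% rates are arbitrary choices for the demo, not anything from the thread):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
x = torch.ones(1, 8)

# Dropout zeroes a fresh random subset of *activations* on each forward
# pass in training mode, and is a no-op in eval mode.
drop = nn.Dropout(p=0.5)
drop.train()
train_out = drop(x)   # some entries zeroed, survivors scaled by 1/(1-p)
drop.eval()
eval_out = drop(x)    # identity at test time

# Pruning zeroes a fixed subset of *weights*; the same mask applies to
# every input until you prune again (or remove the reparametrization).
lin = nn.Linear(8, 8)
prune.l1_unstructured(lin, name="weight", amount=0.5)
pruned_frac = (lin.weight == 0).float().mean().item()  # ~0.5
```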
Thank you for the quick response. I applied global pruning to my model and then retrained it, but during testing the weight parameter is not there. I think those parameters are renamed. How can I test the pruned model?
What is your intuition about why (and when) dropout works?
I’m a Bayesian at heart, so to me the Gal and Ghahramani interpretation seems to be the thing.
In a later paper they show that for RNNs it makes sense to keep the dropout mask fixed between the timesteps to use this interpretation, and Gal’s thesis has a lot of material on the theme (including, I think, why we cannot just optimize the dropout probability somewhere along the way). More generally, it would seem that the hypothetical variational model must make sense, as well as the approximation.
What is your take? I always love to hear your thoughts on the more theory-grounded topics.
Thanks for the Gal and Ghahramani link. I can’t say that I’ve absorbed much of it yet.
I’m not well-studied on dropout, and I just don’t have any intuition about it that feels “right” to me, so I don’t really have a take.
Can anyone please help with this?
I applied global pruning to my model and then retrained it, but during testing the weight parameter is not there. I think those parameters are renamed. How can I test the pruned model?
Sorry for diverting your thread.
Yes, pruning renames the weights (appending
_orig is probably the most common convention); you can get the current set of parameter names using
print([n for n, p in module.named_parameters()])
There is an equivalent
named_buffers for non-parameters, which holds the masks.
Also, the section on removing the reparametrization in the pruning tutorial might be of use.
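As a quick illustration of the renaming (using a plain Linear module as a stand-in for your model):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

module = nn.Linear(4, 4)
prune.l1_unstructured(module, name="weight", amount=0.5)

# "weight" is no longer a parameter; the original values now live in
# "weight_orig", and the binary mask is the buffer "weight_mask".
param_names = [n for n, p in module.named_parameters()]
buffer_names = [n for n, b in module.named_buffers()]
print(param_names)   # contains 'weight_orig' but no longer 'weight'
print(buffer_names)  # contains 'weight_mask'
```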
Thank you, Thomas, for providing the relevant content. I will try this and share the result.
I tried to apply global pruning to my model, which has one LSTM layer. I am getting
UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
Then I put flatten_parameters() in the forward of the module, but I am still getting the same warning. Can you please help?
I think this is inherent in the way pruning is implemented using reparametrization.
One way to overcome the warning could be to have a workflow where you do the pruning and then remove the reparametrization.
The other option is to make it work with the non-flattened weights, i.e. either
- to silence or avoid the warning using Python’s warnings machinery,
- to disable CuDNN, if the flattening and Python warning overhead is larger than the savings from CuDNN optimizations, or
- to roll your own JIT-fused RNN, though sadly that has never become as much of a standard as one might have hoped.
I can see how none of these options is terribly attractive, but it seems that things are not working together as well as they should.
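The first workflow (prune, then remove the reparametrization) could look roughly like this; the layer sizes and the 50% amount are made up for the sketch:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

lstm = nn.LSTM(input_size=4, hidden_size=4)

# prune one of the LSTM's flat weight tensors
prune.l1_unstructured(lstm, name="weight_ih_l0", amount=0.5)

# bake the mask into the tensor and drop weight_ih_l0_orig / _mask, so
# weight_ih_l0 is an ordinary (sparse-ish) parameter again ...
prune.remove(lstm, "weight_ih_l0")
# ... and can be made contiguous for CuDNN without the warning
lstm.flatten_parameters()

out, _ = lstm(torch.randn(3, 1, 4))  # (seq_len, batch, input_size)
```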
OK. Theoretically, removing the reparametrization is appealing, so I am trying that and will report the results when done.
Removing the reparametrization worked well for me. Thanks, Thomas.
The next issue I am facing: if I globally prune the model by 90% in one shot, it is pruned approximately 90%, with different pruning rates in different layers. But iteratively pruning the model with a factor of 0.1 for 10 iterations, the model only ends up approximately 65.6% pruned. What is the reason behind this? I thought it would be pruned approximately 99%.
Are you randomly pruning? If so, I would suspect that the pruning considers all weights (including previously pruned ones) in the computation: if I randomly set 0.1 of the weights to 0 for 10 iterations, I expect (in lieu of LaTeX support in the forum)
sum([0.1 * (0.9**i) for i in range(10)]) ~ 65.1% of the weights to be set to 0.
This appears to be consistent with the documentation in Pruning Tutorial — PyTorch Tutorials 1.9.1+cu102 documentation. If you have some other criterion, maybe it is applied to the original weights; I have to admit that I don’t know exactly how it works off the top of my head.
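A quick simulation of random pruning over all positions (pure Python, numbers chosen just for the demo) shows the saturation:

```python
import random

random.seed(0)
n = 100_000
zeroed = [False] * n  # track which "weights" have been pruned

# each of the 10 rounds independently picks 10% of *all* positions,
# so already-pruned positions can be picked again and the rounds overlap
for _ in range(10):
    for i in random.sample(range(n), k=n // 10):
        zeroed[i] = True

frac = sum(zeroed) / n
print(frac)  # close to 1 - 0.9**10 ≈ 0.6513, not ~1.0
```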
I am pruning the model weights using
for 10 iterations. Your explanation makes sense: it must be considering the pruned weights as well. How can I make the pruning not consider those again in later iterations? One more question: how can I calculate the inference time of the model?
One thing that might be worthwhile to try is progressively increasing the amount, if that matches what you want to achieve.
I don’t have a good answer, but two thoughts:
- I don’t think unstructured sparsity is something that lends itself easily to performance gains. (Though at 90% or so it might just work, but I would not know how to do it.)
- I would probably measure time.
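For the timing, a minimal sketch using time.perf_counter; the helper name and the lambda stand-in for model(inputs) are made up for illustration:

```python
import time

def average_inference_time(fn, *args, warmup=3, reps=10):
    # rough wall-clock timing of a callable such as model.__call__;
    # on GPU you would also call torch.cuda.synchronize() before each
    # perf_counter() read, since CUDA kernel launches are asynchronous
    for _ in range(warmup):      # warm up caches/allocators first
        fn(*args)
    start = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - start) / reps

# hypothetical stand-in for model(inputs)
avg = average_inference_time(lambda n: sum(i * i for i in range(n)), 10_000)
print(f"{avg * 1e6:.1f} microseconds per call")
```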
Thank you, Thomas. I started experimenting with these ideas, which is why it took me so long to reply. But today I got the results: progressively increasing the amount until I reach the threshold worked well, and I am able to time the model by measuring the time before and after the inference call. My aim was not performance gains but just to reduce the model size, which I was able to achieve. Thanks!