I assumed that using dropout would speed up training significantly, because the network stops computing the dropped units. Empirically, though, that doesn't seem to be the case: training time is roughly the same whether I use a very high dropout rate or none at all.
Dropout doesn't stop calculating parts of the model; it just substitutes zeros in many places and carries out the full calculations with those zeros alongside all the other values.
Calculations with sparse matrices are hard to do efficiently, and PyTorch doesn't use them for dropout.
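You can check this directly: the output of `nn.Dropout` is an ordinary dense tensor in which the dropped entries are zero and the surviving entries are rescaled by 1/(1−p). A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)   # zeroes each element with probability 0.5
x = torch.randn(4, 8)
y = drop(x)                # training mode: survivors are scaled by 1/(1-p) = 2

print(y.is_sparse)         # False: still a plain dense tensor
print((y == 0).float().mean().item())  # roughly half the entries are zero
```

Since the result is a dense tensor full of zeros, the matrix multiplications downstream do exactly as many floating-point operations as without dropout.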
Thanks, that is really helpful.
Also, typical dropout rates (e.g. 50%) don't produce tensors that are sparse enough to benefit from sparse representations. Keeping them as dense tensors often plays much better with the operations that follow.
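To illustrate why 50% zeros isn't "sparse enough": PyTorch's COO sparse format stores explicit integer indices for every nonzero entry, so at 50% density the sparse representation actually takes more memory than the dense one. A rough sketch (the tensor size here is just illustrative):

```python
import torch

torch.manual_seed(0)

x = torch.randn(1000, 1000)
mask = torch.rand_like(x) > 0.5   # zero out ~half the entries, as 50% dropout would
dense = x * mask

sparse = dense.to_sparse()        # COO format: an indices tensor plus a values tensor

# Dense storage: 1,000,000 float32 values (~4 MB).
# Sparse COO storage: ~500,000 float32 values (~2 MB) plus a 2 x nnz int64
# index tensor (~8 MB) -- larger than the dense tensor at this density.
dense_bytes = dense.numel() * dense.element_size()
sparse_bytes = (sparse.values().numel() * sparse.values().element_size()
                + sparse.indices().numel() * sparse.indices().element_size())
print(dense_bytes, sparse_bytes)
```

Sparse formats only pay off when the density is far lower (and the following ops have efficient sparse kernels), which typical dropout rates never reach.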