Let’s say we have dropout, or something else that zeros out part of the state space. Would a forward pass and a backward pass still take the same time as they do for the non-zeroed input?

And if so, why? We should at least be able to skip the matrix multiplications and just sum the biases. There should be a shortcut.

Note: I am not talking about ‘mostly zero’ sparse matrices. I am talking about something more fundamental; you do not need to multiply by a matrix if the input is zero.

In my honest opinion, it comes down to a trade-off. If you want to compute some operation conditionally, then the cost of evaluating the condition has to be counted as well.
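
To make the trade-off concrete, here is a minimal pure-Python sketch of one affine layer computed two ways: always multiplying, and skipping inputs that are zero. It is only an illustration, not how any real framework implements it; the names `dense_forward` and `skip_zero_forward` and the toy weights are made up for this example.

```python
def dense_forward(W, b, x):
    """Plain affine layer y = W x + b: every weight is touched, zero input or not."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def skip_zero_forward(W, b, x):
    """Same result, but only columns with a non-zero input are multiplied."""
    # This scan is the "conditional check": it runs once per input element,
    # whether or not that element turns out to be zero.
    active = [j for j, x_j in enumerate(x) if x_j != 0]
    return [sum(row[j] * x[j] for j in active) + b_i
            for row, b_i in zip(W, b)]


# Hypothetical toy layer: 3 inputs, 2 outputs.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
b = [0.5, -0.5]

x_dropped = [0.0, 2.0, 0.0]    # two of three inputs zeroed, e.g. by dropout
x_all_zero = [0.0, 0.0, 0.0]   # fully zeroed input

y_dense = dense_forward(W, b, x_dropped)
y_skip = skip_zero_forward(W, b, x_dropped)
y_zero = skip_zero_forward(W, b, x_all_zero)   # reduces to the bias alone
```

Both versions return the same output, and for the all-zero input the result is just the bias, which is the shortcut the question asks about. The saving is paid for by the scan over `x`.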

Total cost of operations = (cost of conditional check * number of elements) + (cost of multiply-add * number of non-zero elements)
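
The formula above can be written as a small Python sketch and compared against the cost of always multiplying. The unit costs used here (7 for a check, 10 for a multiply-add) are hypothetical numbers picked only to show how the comparison flips; they are not measurements of any hardware.

```python
def cost_with_checks(n_elements, n_nonzero, check_cost, mac_cost):
    """Check every element, then multiply-add only the non-zero ones."""
    return check_cost * n_elements + mac_cost * n_nonzero


def cost_dense(n_elements, mac_cost):
    """No checks: multiply-add every element, zero or not."""
    return mac_cost * n_elements


# Hypothetical unit costs (assumption, not a measurement).
CHECK_COST = 7
MAC_COST = 10
N = 1000

dense = cost_dense(N, MAC_COST)                               # 10000
half_zero = cost_with_checks(N, 500, CHECK_COST, MAC_COST)    # 12000: checks lose
mostly_zero = cost_with_checks(N, 200, CHECK_COST, MAC_COST)  # 9000: checks win
```

With half the elements zero, the checked version is more expensive than the dense one; with 80% zero it is cheaper.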

In most cases, the reduction in multiply-add cost is not enough to compensate for the additional cost of the conditional checks.

For sparse tensors, where more than 70% of the elements are zero, the trade-off favors having the checks.

Assumption: the cost of a multiply-add on 32-bit numbers is O(32 squared), while the cost of checking whether a 32-bit number is non-zero is O(32).
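
As a rough check on that assumption: checks pay off when check_cost * N + mac_cost * N_nonzero < mac_cost * N, which rearranges to a zero fraction greater than check_cost / mac_cost. The sketch below just evaluates that ratio; the function name `break_even_zero_fraction` is made up for this example, and the O(...) bounds are treated as exact costs, which is itself an assumption.

```python
def break_even_zero_fraction(check_cost, mac_cost):
    """Fraction of zero elements above which checking beats always multiplying."""
    return check_cost / mac_cost


BITS = 32
check_cost = BITS        # O(32), taken literally as 32 units (assumption)
mac_cost = BITS ** 2     # O(32 squared), taken literally as 1024 units (assumption)

threshold = break_even_zero_fraction(check_cost, mac_cost)
print(threshold)  # 0.03125
```

Taken literally, this bit-level model puts the break-even point at about 3% zeros, far below 70%. A 70% break-even corresponds to a check costing about 0.7 of a multiply-add, which may be closer to real hardware, where multiply-adds are heavily pipelined and branches and irregular memory access are comparatively expensive. Both figures should be read as rough.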