A well-known solution for making Neural Networks (NNs) work with inputs of varying length is to pad the inputs (commonly with zeros, as is typical in natural language processing) and then mask the padded positions. It is often said that this approach makes the NN skip the padded inputs, but I want to understand the mechanism behind such skipping. To me, NNs are essentially matrix multiplication. So here are my questions:
- Is masking done by forcing some weights to zero?
- How is it different from simply feeding in the zero-padded values without any masking? (Zero times anything is zero, so I suppose the answer to my first question should be no: there is no point in forcing weights to zero when the input itself is already zero.)
- I assume masking did not suddenly emerge out of thin air, so who invented the method? Are there any academic papers that discuss it?
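
To make the question concrete, here is a minimal sketch of the kind of padding + masking setup I mean. I use Keras purely as an example; the specific framework and layer choices are my own assumption, not part of the question:

```python
# Minimal sketch: zero-padding variable-length sequences, then masking them
# so downstream layers ignore the padded timesteps (Keras, as an example).
import tensorflow as tf

# Three sequences of different lengths, zero-padded to the longest one.
sequences = [[3, 7, 2], [5, 1], [4, 9, 8, 6]]
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, padding="post")  # shape (3, 4); shorter rows end in 0

model = tf.keras.Sequential([
    # mask_zero=True tells downstream layers to ignore timesteps whose
    # input token is 0 (the padding value).
    tf.keras.layers.Embedding(input_dim=10, output_dim=8, mask_zero=True),
    tf.keras.layers.LSTM(16),   # the LSTM receives and respects the mask
    tf.keras.layers.Dense(1),
])

out = model(padded)  # padded timesteps should not affect the LSTM's output
print(out.shape)     # (3, 1)
```

My confusion is about what the mask actually does inside this kind of model, as opposed to just letting the zeros flow through the matrix multiplications.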