How does masking work to let neural networks handle variable-length inputs?

A well-known way to make neural networks (NNs) work with variable-length input is to pad the inputs to a common length (usually with zeros, as is common in natural language processing) and then mask the padded positions. It is often said that this makes the NN "skip" the padded inputs, and I want to understand the mechanism behind that skipping. To me, NNs are essentially matrix multiplications.
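For concreteness, here is a minimal NumPy sketch of what I mean by "padding and masking". The sequences, feature dimension, and boolean-mask convention are made up purely for illustration; this is not any particular library's API.

```python
import numpy as np

# Three "sentences" of different lengths; each token is already a 2-d feature vector.
seqs = [
    np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),  # length 3
    np.array([[7.0, 8.0]]),                          # length 1
    np.array([[9.0, 0.5], [0.2, 0.3]]),              # length 2
]

max_len = max(len(s) for s in seqs)

# Zero-pad every sequence to max_len -> one dense batch of shape (batch, max_len, features).
padded = np.zeros((len(seqs), max_len, 2))
# Boolean mask: True where a timestep holds real data, False where it is padding.
mask = np.zeros((len(seqs), max_len), dtype=bool)
for i, s in enumerate(seqs):
    padded[i, : len(s)] = s
    mask[i, : len(s)] = True

print(padded.shape)  # (3, 3, 2)
print(mask)
# [[ True  True  True]
#  [ True False False]
#  [ True  True False]]
```

So here are my questions: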

  1. Is masking done by forcing some weights to zero?
  2. How is masking different from simply feeding in the zero-padded values without a mask? (Zero times anything is zero, so I suppose the answer to my first question should be no; there seems to be no point in forcing weights to zero when the input is already zero.) The second sketch, after this list, shows the comparison I have in mind.
  3. I assume masking didn't emerge out of thin air, so who introduced the method? Are there any academic papers that describe it?
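To make question 2 concrete, the sketch below continues the one above (it reuses the `padded` and `mask` arrays defined there) and shows the two computations I am comparing: a dense layer plus mean pooling applied to the zero-padded batch as-is, versus the same layer with the mask used to exclude the padded timesteps. The layer, the pooling, and the random weights are all made up for illustration.

```python
# Continues the sketch above: reuses `padded` and `mask` from it.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))   # arbitrary dense-layer weights
b = rng.normal(size=(4,))     # arbitrary dense-layer bias

# Apply the dense layer to every timestep: shape (batch, max_len, 4).
h = padded @ W + b

# (a) No masking: pool (average) over ALL timesteps, padded ones included.
pooled_no_mask = h.mean(axis=1)

# (b) Masking: pool only over the timesteps the mask marks as real.
m = mask[..., None].astype(float)             # broadcast mask over the feature axis
pooled_masked = (h * m).sum(axis=1) / m.sum(axis=1)

# Print the pooled vectors for the second sequence (which has two padded timesteps).
print(pooled_no_mask[1])
print(pooled_masked[1])
```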