Gradient is not zero at a position where the input is masked

During DDP training, I prepend my input sequence with 23 pad tokens and set their attention mask to 0. However, when I print the gradient of the hidden state at the 22nd position (zero-based), i.e. the last pad token, it is not 0, while the gradients at the preceding 22 positions are all 0. Why is that? (A minimal repro is sketched at the end of this post.)

I also noticed that the gradient of the hidden state corresponding to the last input token is always 0. Is there a reason for that?
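
For concreteness, here is a minimal single-process sketch of what I am doing. Treat the details as illustrative: "gpt2" stands in for my actual model, the single process stands in for my DDP run, and masking the pads out of the loss with `-100` labels stands in for my actual loss, but the gradient pattern I describe is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Single-process repro; my real run uses DDP, and "gpt2" stands in
# for my actual model.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer("Hello world, how are you?", return_tensors="pt").input_ids
pad_id = tokenizer.eos_token_id  # gpt2 has no dedicated pad token

# Prepend 23 pad tokens (positions 0-22) and mask them out.
pad = torch.full((1, 23), pad_id, dtype=torch.long)
input_ids = torch.cat([pad, ids], dim=1)
attention_mask = torch.cat(
    [torch.zeros(1, 23, dtype=torch.long), torch.ones_like(ids)], dim=1
)
# Ignore the pad positions in the LM loss.
labels = input_ids.masked_fill(attention_mask == 0, -100)

out = model(
    input_ids,
    attention_mask=attention_mask,
    labels=labels,
    output_hidden_states=True,
)
hidden = out.hidden_states[-1]  # (1, 23 + T, hidden_size)
hidden.retain_grad()            # keep the grad of this non-leaf tensor
out.loss.backward()

# Positions 0-21 print 0, position 22 does not, and the very last
# position is always 0.
print(hidden.grad.norm(dim=-1))
```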