Reverse Padding

I am currently implementing an LSTM for tweet classification and noticed something strange: if I pad sequences on the right (i.e., with zeros after the last word index of a sentence, up to the maximum sentence length), the model does not learn anything and always predicts the majority class. Conversely, if I pad on the left (i.e., with zeros before the first word index of the sentence), it works perfectly. Note that this happens with both a uni-directional and a bi-directional LSTM. Can you help me understand the reason for this?
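To make the two schemes concrete, here is a minimal sketch of what I mean by right (post) and left (pre) padding; this is illustrative only, not my actual training code (in Keras the same effect comes from `pad_sequences(..., padding='post')` vs. `padding='pre'`):

```python
def pad(seq, max_len, side):
    """Pad a list of word indices with zeros to max_len on the given side."""
    zeros = [0] * (max_len - len(seq))
    return seq + zeros if side == "post" else zeros + seq

sentence = [4, 17, 9]  # hypothetical word indices for a three-token tweet

# Right padding: zeros come after the last word index.
print(pad(sentence, 6, "post"))  # [4, 17, 9, 0, 0, 0]

# Left padding: zeros come before the first word index.
print(pad(sentence, 6, "pre"))   # [0, 0, 0, 4, 17, 9]
```

With right padding, the zeros are the last inputs the (forward) LSTM sees before its final state is read; with left padding, the actual words are.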