I’m using the TransformerEncoder module and have noticed that in eval mode, I get the same output sequence regardless of input. I’ve tracked this down to the MultiheadAttention module, specifically line 3112:
q, k, v = linear(query, in_proj_weight, in_proj_bias).chunk(3, dim=-1)
where q gets squashed to the same values for every position in the input sequence. This does not happen when dropout is active (i.e., during training). I suppose this could just be due to overfitting, but I was wondering if anyone had encountered a similar problem or had ideas for what else could be causing it.
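For reference, here is a minimal sketch of how I’d isolate this, using a freshly initialized model rather than my trained one (names like `d_model`, `nhead`, and the sizes are arbitrary). It checks two things: whether eval-mode outputs really are input-independent, and whether the q projection (the same computation as the quoted line) collapses across positions. With random weights neither should happen, so if your trained checkpoint fails these checks, it points at the learned weights rather than the module itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Arbitrary small configuration for the sketch.
d_model, nhead, seq_len = 16, 4, 10
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
model = nn.TransformerEncoder(layer, num_layers=2)
model.eval()  # dropout off, matching the failure mode described above

# Two different random input sequences, shape (seq_len, batch, d_model).
x1 = torch.randn(seq_len, 1, d_model)
x2 = torch.randn(seq_len, 1, d_model)

with torch.no_grad():
    y1 = model(x1)
    y2 = model(x2)

    # Check 1: with the described bug, y1 and y2 would be (nearly) identical.
    outputs_identical = torch.allclose(y1, y2, atol=1e-5)

    # Check 2: reproduce the quoted in-projection and look at q directly.
    attn = layer.self_attn
    q, k, v = F.linear(
        x1, attn.in_proj_weight, attn.in_proj_bias
    ).chunk(3, dim=-1)
    # If q is "squashed", its variation across sequence positions is ~0.
    q_spread = q.std(dim=0).max()

print("outputs identical across inputs:", outputs_identical)
print("max std of q across positions:", q_spread.item())
```

If check 2 shows near-zero spread only with the trained weights loaded, that would support the overfitting/weight-collapse theory rather than a bug in the attention code.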