bias_k, bias_v, and add_zero_attn in MultiheadAttention

I need some help catching up on these. I don't recall seeing these parameters in the original paper, but they appear in the PyTorch code. Could someone enlighten me on where they originated and/or why we need them? Thank you.
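
For reference, here is a minimal sketch of where these show up in `torch.nn.MultiheadAttention`. The sizes (`embed_dim`, `num_heads`, `seq_len`, `batch`) are arbitrary values picked for illustration, and the shapes in the comments are what I would expect from the current implementation rather than anything the docs guarantee:

```python
import torch
import torch.nn as nn

# Arbitrary illustrative sizes (assumptions, not from any paper or doc).
embed_dim, num_heads = 8, 2
seq_len, batch = 5, 3

# Enable both options in question.
mha = nn.MultiheadAttention(
    embed_dim, num_heads, add_bias_kv=True, add_zero_attn=True
)

# bias_k and bias_v exist as learnable parameters only when add_bias_kv=True;
# otherwise both attributes are None.
bias_k_shape = tuple(mha.bias_k.shape)
bias_v_shape = tuple(mha.bias_v.shape)

# Self-attention on random input, laid out as (L, N, E) because
# batch_first defaults to False.
x = torch.randn(seq_len, batch, embed_dim)
out, weights = mha(x, x, x)

out_shape = tuple(out.shape)          # same layout as the query: (L, N, E)
weights_shape = tuple(weights.shape)  # (N, L, S'), averaged over heads

print("bias_k:", bias_k_shape, "bias_v:", bias_v_shape)
print("output:", out_shape)
print("attention weights:", weights_shape)
```

The part that puzzles me is the last dimension of the attention weights: it comes out as `seq_len + 2` instead of `seq_len`, so each option seems to append one extra position to the key/value sequence.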
