In typical sequence-to-sequence and language-modeling tasks, the decoder is usually masked with a triangular matrix to enforce that any position in the target sequence can only attend to the positions before it. Is this mainly for computational efficiency?
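For concreteness, here is a minimal sketch (plain Python, purely illustrative) of the triangular mask I mean: disallowed future positions get a weight of zero after the softmax.

```python
import math

def causal_mask(n):
    # Lower-triangular mask: position i may attend only to positions j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    # Masked-out (future) positions contribute zero probability mass,
    # equivalent to setting their scores to -inf before the softmax.
    out = []
    for row_s, row_m in zip(scores, mask):
        exps = [math.exp(s) if m else 0.0 for s, m in zip(row_s, row_m)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy example: uniform attention scores over a length-4 target sequence.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
# First position attends only to itself; last position attends to all four.
```

So with this mask, the representation of each earlier token never depends on tokens that come after it.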
E.g., my target sequence is ABCDEFG (the source doesn't matter). When the decoder tries to predict F, why can't A/B/C/D/E attend to each other in both directions to create a better context for F?
One obvious drawback of that approach is that to predict the next word G, all of the previous attention would need to be recomputed, since every earlier token's representation changes once a new token is appended. So is this why we mask the target sequence in the decoder to allow attention in only one direction? If we allowed the already-decoded target tokens to attend to each other bidirectionally, we could achieve a better prediction for the next token (at a much higher cost, though), right?