Do any of the recent transformer models work like this?

'I ate __ apple' # the task is to predict which word fills the blank

                      I ate a apple      # but the correct phrase is 'I ate an apple', so we get a loss and backprop
                    emb(I ate a apple)

               I ate __     (there is a blank, so we do softmax on all words)       apple
            emb(I ate __)   (suppose 'a' got the highest probability        )     emb(apple)
                                  
    I ate     softmax(   a     an     ...(list of all possible words))          a      p       p      l     e
  emb(I ate)         ( emb(a) emb(an) ...                            )      emb(a)  emb(p) emb(p) emb(l) emb(e)
                      
  I      ate              a        n 
emb(I) emb(ate)         emb(a)    emb(n)

        a      t      e
      emb(a) emb(t)  emb(e) 

So it makes use of character-, word-, phrase-, and sentence-level embeddings.
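
Roughly, as a hypothetical PyTorch sketch, just to make the picture concrete — the module names, the concat-plus-linear combiner, and the candidate word list are all made up for illustration, not taken from any existing model:

# all names here (CharToWord, Combine, the candidate list) are made up for illustration
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64

class CharToWord(nn.Module):
    # word embedding built from its character embeddings (mean pool + linear)
    def __init__(self, n_chars, dim):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, char_ids):                      # char_ids: (word_len,)
        return self.proj(self.char_emb(char_ids).mean(dim=0))   # (dim,)

class Combine(nn.Module):
    # phrase embedding built from two child embeddings (concat + linear here)
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, left, right):
        return torch.tanh(self.proj(torch.cat([left, right], dim=-1)))

char_to_word = CharToWord(n_chars=128, dim=DIM)       # 128: assume an ASCII charset
combine = Combine(DIM)

def word(s):                                          # emb(word) built from its characters
    return char_to_word(torch.tensor([ord(c) for c in s]))

left_context = combine(word("I"), word("ate"))        # emb(I ate) from emb(I) and emb(ate)

candidates = ["a", "an", "the", "my"]                 # list of possible fillers for the blank
cand_embs = torch.stack([word(w) for w in candidates])
logits = cand_embs @ left_context                     # score each candidate against the context
probs = F.softmax(logits, dim=-1)                     # suppose 'a' comes out highest

target = torch.tensor([candidates.index("an")])       # correct filler is 'an'
loss = F.cross_entropy(logits.unsqueeze(0), target)   # loss because the model preferred 'a'
loss.backward()                                       # backprop through the whole tree
print(dict(zip(candidates, probs.tolist())), loss.item())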

One way to combine two embeddings could be to put another weight matrix in the middle, like

emb(I ate) = emb(I) * weight_matrix * emb(ate)
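
If I read that literally with d-dimensional vectors and a single d x d matrix, emb(I) * W * emb(ate) collapses to a scalar, so to get a d-dimensional emb(I ate) back the thing in the middle would presumably have to be a third-order tensor (one d x d slice per output dimension), similar in spirit to recursive neural tensor networks. A rough sketch of both readings (W and T below are just random placeholders, not from any particular model):

import torch

d = 64
emb_I, emb_ate = torch.randn(d), torch.randn(d)

W = torch.randn(d, d)
score = emb_I @ W @ emb_ate                                 # collapses to a single scalar score

T = torch.randn(d, d, d)                                    # one d x d slice per output dimension
emb_I_ate = torch.einsum('i,kij,j->k', emb_I, T, emb_ate)   # (d,) phrase embedding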