I’ve recently been working on Transformer components (Self-Attention, Multi-Head Attention).
But I am genuinely curious
whether the Embedding layer in the Transformer is trained to capture similarity, like Skip-gram or CBOW,
or whether it's just initialized with random vectors and learned jointly with the rest of the model.
Does it use pre-trained word vectors?
Thanks for reading.