BERT outputs two things:
last_hidden_state contains the hidden representation of every token in each sequence of the batch, so its size is
(batch_size, seq_len, hidden_size).
pooler_output contains a "representation" of each sequence in the batch, and is of size
(batch_size, hidden_size). It is basically obtained by taking the hidden representation of the [CLS] token of each sequence in the batch and passing it through an additional pooling layer.
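As a sketch of the shapes involved (random numbers stand in for real BERT activations, and the dimensions are toy values rather than BERT Base's hidden_size of 768):

```python
import numpy as np

# Toy dimensions for illustration (BERT Base uses hidden_size=768)
batch_size, seq_len, hidden_size = 2, 8, 4
rng = np.random.default_rng(0)

# last_hidden_state: one hidden vector per token per sequence
last_hidden_state = rng.standard_normal((batch_size, seq_len, hidden_size))
print(last_hidden_state.shape)  # (2, 8, 4) = (batch_size, seq_len, hidden_size)

# pooler_output is built from the [CLS] (first) token of each sequence,
# so there is one vector per sequence
cls_hidden = last_hidden_state[:, 0, :]
print(cls_hidden.shape)         # (2, 4) = (batch_size, hidden_size)
```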
Shouldn't the first vector of last_hidden_state, i.e. the embedding of the [CLS] token, be the same as pooler_output, since pooler_output is also supposed to be the embedding of [CLS]?
I checked in code, and they are not the same. Why is that? Am I missing something?
BERT includes a linear + tanh layer as the pooler, so pooler_output is not the raw [CLS] hidden state. I recently wrote a very compact implementation of BERT Base that shows what is going on: at L354 you have the pooler, and below it the BERT model. The
last_hidden_state is the output of the transformer blocks; you can set the pooler to
torch.nn.Identity() to get these directly, as shown in the test, which also demonstrates how to import BERT from the HF transformers library into this simple transformer implementation. (I also have a DistilBERT version, but it is not yet pushed to GitHub. DistilBERT drops the pooler. Interestingly, the two models are coded quite differently internally, even though they perform the same computation…)
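Here is a minimal sketch of what the pooler computes, with random numbers standing in for BERT's learned pooler weights and for the [CLS] hidden states. It shows why pooler_output differs from the raw [CLS] vector:

```python
import numpy as np

hidden_size = 4
rng = np.random.default_rng(0)

# Stand-ins for BERT's learned pooler parameters (random here, learned in the real model)
W = rng.standard_normal((hidden_size, hidden_size))
b = rng.standard_normal(hidden_size)

# Raw [CLS] hidden state for a batch of 2 sequences, i.e. last_hidden_state[:, 0, :]
cls_hidden = rng.standard_normal((2, hidden_size))

# The pooler is a dense (linear) layer followed by tanh
pooler_output = np.tanh(cls_hidden @ W + b)

# Because of the extra linear + tanh, the two are not equal
print(np.allclose(cls_hidden, pooler_output))  # False
```

Replacing the linear + tanh with an identity function (as with torch.nn.Identity() above) would make pooler_output equal to cls_hidden again.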
Excuse me, what if I want to use BERT in an image captioning model instead of an LSTM? How can I make these changes?