Keeping all BERT embeddings of the input sequence instead of only [CLS]

Hi there,

I’m currently implementing a model in which I need to compute cross-attention between sequences from different modalities. For the textual sequence, I want to use BERT to embed my raw sentences into textual embeddings.
I initialize my encoder with the transformers package like this:

from transformers import BertConfig, BertModel

# ask the model to also return all intermediate hidden states
bertconfig = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
self.bertmodel = BertModel.from_pretrained('bert-base-uncased', config=bertconfig)

Since I want the cross-attention between modalities to operate over the whole sequence (and to observe which words the attention weights respond to most), I currently keep the embeddings of all input tokens rather than only the one corresponding to the [CLS] token. Is it OK to proceed this way?
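Concretely, here is roughly what I do, written as a standalone sketch outside my model class for brevity (the sentence and variable names are just for illustration, and I assume a recent transformers version where the model returns a ModelOutput):

from transformers import BertConfig, BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bertconfig = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
bertmodel = BertModel.from_pretrained('bert-base-uncased', config=bertconfig)

inputs = tokenizer(["an example sentence for the textual modality"], return_tensors='pt')
with torch.no_grad():
    outputs = bertmodel(**inputs)

token_embeddings = outputs.last_hidden_state   # (batch, seq_len, 768): one vector per (sub)word token
cls_embedding = token_embeddings[:, 0, :]      # the [CLS] vector I deliberately do not restrict myself to
# token_embeddings is what I feed as keys/values to the cross-attention with the other modality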

Also, is it OK if I then slice my input sequence (composed of all the BERT embeddings) into smaller subsequences and work locally on each of them? I am asking about the meaningfulness of the BERT embeddings within these subsequences, since the initial embeddings are computed contextually over the whole sequence, with the corresponding positional encodings.
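By slicing I mean something like the following (the window size is just an illustrative value; in practice the boundaries come from my other modality):

# token_embeddings: (batch, seq_len, 768) from the sketch above
window_size = 10  # illustrative subsequence length
subsequences = [
    token_embeddings[:, start:start + window_size, :]
    for start in range(0, token_embeddings.size(1), window_size)
]
# each element has shape (batch, <=window_size, 768); the cross-attention is then
# applied locally to each subsequence instead of to the whole sequence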

I ask these questions because my model does not seem to learn properly with this approach, whereas it learns successfully when I use GloVe embeddings.

Thank you in advance for your answers!