LSTM returns NaN after using pretrained BERT embeddings as input

Hello everyone,
I'm working on a personality detection model. I'm using a pretrained BERT model (from pytorch-transformers) to get contextual embeddings of a written text, and I sum the outputs of its last 4 hidden layers (I read that concatenating the last four layers usually produces the best results).
Then I feed these word embeddings into an LSTM layer with attention to get a paragraph-level embedding.
The output should be a score for a specific personality trait, say extraversion, with scores in the range [-1, 1] (my training data is texts with a score for each trait).
I tried the RMSprop, Adam, … optimizers with MSELoss, and after just one or a few batch iterations the LSTM layer always produces NaN values.
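For context, here is roughly how I get the summed embedding (a minimal sketch; the model name `bert-base-uncased` and the variable names are just placeholders):

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

# Sketch only: model name and variable names are placeholders.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

text = "a paragraph from my training data"
input_ids = torch.tensor([tokenizer.encode(text)])   # token ids: dtype int64, shape [1, seq_len]

with torch.no_grad():
    outputs = model(input_ids)
    hidden_states = outputs[2]                        # tuple of 13 tensors: embedding output + 12 layers

# Sum of the last four layers -> one float vector per token
word_embeddings = torch.stack(hidden_states[-4:]).sum(dim=0)   # dtype float32, shape [1, seq_len, 768]
# word_embeddings is what goes into the LSTM-with-attention layer
```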
Here is an example of an embedding I got after summing the BERT outputs:
tensor([ 3, 5310, 26897, 944, 479, 19437, 944, 167, 4813, 2147,
2249, 8082, 229, 24490, 21, 115, 128, 788, 5044, 2036,
1080, 870, 3165, 1979, 2036, 127, 2197, 128, 7155, 8708,
11977, 4813, 7774, 155, 20062, 14, 309, 17868, 27, 2036,
30, 870, 1632, 309, 7746, 1475, 4813, 295, 14309, 30,
7812, 2036, 6076, 260, 2249, 7230, 22330, 309, 12269, 19437,
5034, 1882, 194, 8026, 106, 7173, 479, 9979, 4526, 2036,
267, 7931, 5603, 14207, 10703, 2036, 260, 2249, 4385, 1450,
430, 261, 1480, 14303, 26898, 2036, 1480, 7513, 962, 2036,
17396, 9964, 542, 107, 11977, 4813, 482, 4835, 1480, 3168,
235, 1419, 81, 4526, 2036, 2896, 417, 260, 12143, 1788,
81, 474, 2036, 2896, 417, 142, 13398, 15424, 26902, 309,
1786, 4813, 261, 18286, 944, 1471, 478, 260])

The value range looks too wide. Do you think this is the reason for these NaN results? Any suggestions would be greatly appreciated.

The BERT output you posted looks unreasonable; it does not look like an embedding (it looks more like token ids). Why:
1. the values are all of integer type, while BERT hidden states are floats
2. the value range is far too wide and the values are far too big for summed hidden states
A quick check is sketched below.
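
A minimal sketch of that check, assuming `summed` is just a placeholder name for the tensor you printed:

```python
# token ids:                 dtype torch.int64,   shape [seq_len],         values up to the vocab size (~30k)
# summed BERT hidden layers: dtype torch.float32, shape [1, seq_len, 768], values are small floats
print(summed.dtype, summed.shape, summed.abs().max())
```

If the dtype comes out as int64, the tensor is most likely the tokenizer output (token ids) rather than the summed hidden states, and feeding those large integers to the LSTM would explain the unstable training.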