Why does the training loss fluctuate so much?

batch size = 40
number of GPUs = 2
learning rate = 3e-5
The model is based on BERT and is used for a simple binary classification task.

In previous experiments I set the batch size to 16, and the loss fluctuated in a similar way.
Does anyone have a suggestion for how to reduce this fluctuation?
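
For anyone who wants to judge whether the fluctuation is only per-batch noise, here is a minimal sketch of smoothing the logged per-step loss with a bias-corrected exponential moving average. The function name ema_smooth, the beta value, and the loss numbers are all made up for illustration; they are not taken from my training run.

```python
def ema_smooth(losses, beta=0.98):
    """Return a bias-corrected exponential moving average of per-step losses.

    `beta` close to 1 means heavier smoothing. This is a generic smoothing
    helper (an assumption for illustration), not part of any training library.
    """
    smoothed = []
    avg = 0.0
    for step, loss in enumerate(losses, start=1):
        # Standard EMA update on the raw per-step loss.
        avg = beta * avg + (1.0 - beta) * loss
        # Bias correction so the early values are not pulled toward zero.
        smoothed.append(avg / (1.0 - beta ** step))
    return smoothed


# Hypothetical noisy per-step losses, invented for this example only.
raw = [0.70, 0.45, 0.82, 0.40, 0.75, 0.38, 0.66, 0.35]
smooth = ema_smooth(raw, beta=0.9)

for step, (r, s) in enumerate(zip(raw, smooth), start=1):
    print(f"step {step}: raw={r:.3f} smoothed={s:.3f}")
```

If the smoothed curve still trends downward, the jumps in the raw curve are probably just variance between mini-batches; if the smoothed curve itself is flat or oscillating, the problem is more likely in the optimization settings.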