Why does the training loss fluctuate so much?

DoubtWang · January 8, 2020, 4:19am

Smoothing = 0.65
batch size = 40
number of GPUs = 2
learning rate = 3e-5
The model is based on BERT and is used for a simple binary classification task.
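
To make the smoothing setting concrete: assuming the value above is a TensorBoard-style smoothing weight (the post does not name the plotting tool), the displayed curve would be a debiased exponential moving average of the raw per-step loss. The sketch below illustrates that assumed scheme; `smooth_curve` is a hypothetical helper name, not a library function.

```python
def smooth_curve(values, weight=0.65):
    """Debiased exponential moving average of a list of scalar losses.

    Assumption: `weight` plays the role of the smoothing value quoted above.
    weight=0 returns the raw values; values close to 1 smooth more heavily.
    """
    smoothed = []
    running = 0.0
    for step, value in enumerate(values, start=1):
        # Standard EMA update on the raw value.
        running = weight * running + (1.0 - weight) * value
        # Divide by (1 - weight^step) so early points are not biased toward 0.
        smoothed.append(running / (1.0 - weight ** step))
    return smoothed


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the experiment described here.
    raw_loss = [0.7, 0.3] * 10
    for raw, avg in zip(raw_loss, smooth_curve(raw_loss, weight=0.65)):
        print(f"raw={raw:.3f}  smoothed={avg:.3f}")
```

With a weight of 0.65 each plotted point still gives 35% of its weight to the newest step, so a noisy per-batch loss remains visibly noisy after smoothing.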

In earlier experiments I set the batch size to 16, and the loss fluctuated in a similar way.
Does anyone have suggestions for reducing the fluctuation?