I am using the excellent HuggingFace implementation of BERT to do multi-label classification on some text. I basically adapted the author's code to a Jupyter notebook and slightly modified the BERT sequence classification model to handle multi-label classification. However, my loss tends to diverge and my outputs are either all ones or all zeros.
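To be concrete about the modification I mean (this is a simplified sketch, not my exact notebook code; the hidden size of 768 and the 13 labels match BERT-base and my dataset), the change amounts to swapping the single-label cross-entropy head for a `BCEWithLogitsLoss` over multi-hot label vectors:

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Sketch of the classification head change: a linear layer over the
    pooled BERT output, trained with BCEWithLogitsLoss (multi-label)
    instead of CrossEntropyLoss (single-label)."""

    def __init__(self, hidden_size=768, num_labels=13):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output, labels=None):
        logits = self.classifier(self.dropout(pooled_output))
        if labels is not None:
            # labels must be float multi-hot vectors (shape [batch, num_labels]),
            # not integer class indices; no sigmoid here, BCEWithLogitsLoss
            # applies it internally for numerical stability
            loss = nn.BCEWithLogitsLoss()(logits, labels.float())
            return loss, logits
        return logits

# example usage with a dummy pooled output
head = MultiLabelHead()
pooled = torch.randn(2, 768)
labels = torch.randint(0, 2, (2, 13)).float()
loss, logits = head(pooled, labels)
```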
No input in my dataset has all-zero labels, and the per-label positive counts in my training set are:
array([ 65, 564, 108, 17, 40, 26, 306, 195, 25, 345, 54, 80, 214])
I am using the Adam optimizer with BCEWithLogitsLoss and I am unable to figure out where the problem comes from. Should I add weights to my loss function? Am I using it correctly? Is my model wrong somewhere? I am attaching a notebook of my test to this post. Maybe someone has encountered the same problem before and could help me?
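Regarding the weighting question: what I have in mind is something like `BCEWithLogitsLoss(pos_weight=...)`, using the label counts above. Here is a minimal sketch (the total sample count of 1000 is a placeholder assumption; it should be the actual size of the training set):

```python
import torch
import torch.nn as nn

# per-class positive counts from my training set (the array above)
pos_counts = torch.tensor(
    [65, 564, 108, 17, 40, 26, 306, 195, 25, 345, 54, 80, 214],
    dtype=torch.float,
)
num_samples = 1000  # placeholder: replace with len(train_dataset)

# pos_weight = (#negatives / #positives) per class, so rare classes
# contribute more to the loss and the model is less likely to collapse
# to predicting all zeros
pos_weight = (num_samples - pos_counts) / pos_counts

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# logits straight from the model (no sigmoid), targets as float multi-hot
logits = torch.randn(4, 13)
targets = torch.randint(0, 2, (4, 13)).float()
loss = criterion(logits, targets)
```

Is this the right way to counter the imbalance visible in the counts (e.g. 17 positives for one class versus 564 for another)?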