Essentially, we can tune BERT for sentences for our classes. But how can we tune BERT for words specially if they could even belong to many or all of our classes (e.g. a set of 10 classes)?
Not tuning the BERT seems to yield unfavorable results.
A question by me and @lais823