How to tune BERT for words if they could belong to many (or even all) of our classes?

Essentially, we can tune BERT on sentences for our classes. But how can we tune BERT for words, especially if a word could belong to many or even all of our classes (e.g. a set of 10 classes)?

Not tuning BERT at all seems to yield unfavorable results.

A question by me and @lais823

You can use this for fine-tuning BERT for words (specifically, tokens). I think the default there is single-label (multi-class) classification per token, not multi-label, but you can change it to multi-label relatively easily.
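A minimal sketch of what that change could look like, assuming a PyTorch / Hugging Face `transformers` setup: a pretrained BERT encoder with a per-token linear head, trained with `BCEWithLogitsLoss` (one sigmoid per class) instead of the usual softmax cross-entropy. The class name, the `(batch, seq_len, num_labels)` label shape, and `num_labels=10` are illustrative assumptions taken from the question, not a specific library default.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertForMultiLabelTokenClassification(nn.Module):
    """Hypothetical multi-label token classifier on top of BERT."""

    def __init__(self, model_name="bert-base-uncased", num_labels=10):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        # One logit per (token, class); a sigmoid per class replaces the
        # softmax used in single-label token classification.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        # BCEWithLogitsLoss scores each class independently -> multi-label.
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        logits = self.classifier(self.dropout(outputs.last_hidden_state))
        if labels is None:
            return logits
        # Only score real tokens; padding positions are masked out of the loss.
        mask = attention_mask.unsqueeze(-1).bool().expand_as(logits)
        loss = self.loss_fn(logits[mask], labels[mask].float())
        return loss, logits

# Tiny usage example with dummy all-zero labels, just to show the shapes.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultiLabelTokenClassification()
enc = tokenizer(["BERT tags every token"], return_tensors="pt")
labels = torch.zeros(enc["input_ids"].shape[0], enc["input_ids"].shape[1], 10)
loss, logits = model(enc["input_ids"], enc["attention_mask"], labels)
print(loss.item(), logits.shape)  # per-token, per-class logits
```

At prediction time you would threshold `sigmoid(logits)` per class (e.g. at 0.5), so a single token can be assigned several, or even all, of the classes.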