Yes, I trained my model as a binary classifier: an entity is either a gene/protein or it is not. I simply want to calculate the ratio of medical words to all words to decide whether a document can be classified as medical. I uploaded my whole code, with a high-level explanation of every step, here: https://github.com/marcelbra/DocTagger
You can easily use a different data set or stick with the same one. If you want to use GENETAG and differentiate between gene and protein, you need to change the labels the model uses.
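To make the medical word / word ratio idea concrete, here is a rough sketch. It is not the code from the repo: the function names and the 0.1 threshold are made up for illustration, and it assumes you already have one binary label per token (1 = gene/protein, 0 = anything else).

```python
from typing import Sequence


def medical_ratio(labels: Sequence[int]) -> float:
    """Fraction of tokens tagged as medical (1) among all tokens."""
    if not labels:
        # An empty document has no medical words by definition.
        return 0.0
    return sum(labels) / len(labels)


def is_medical(labels: Sequence[int], threshold: float = 0.1) -> bool:
    """Call a document medical if the ratio reaches the threshold.

    The default threshold is an arbitrary placeholder; tune it on your data.
    """
    return medical_ratio(labels) >= threshold


# Example: 10 tokens, 3 of them tagged as gene/protein.
labels = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
print(medical_ratio(labels))  # 0.3
print(is_medical(labels))     # True
```

Which threshold makes sense depends on the corpus, so it is worth checking the ratio on a few documents you know are medical and a few you know are not.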
run_ner and utils_ner are from the transformers repo; with these you can train a model.
doc_builder is where the actual tagging of a document happens (I'm working on CORD-19).
pred_ner does the actual prediction. I may have modified the script I posted earlier!
Hope it helps!