Using transformers (BERT, RoBERTa) without an embedding layer

I’m looking to train a RoBERTa model on protein sequences, which is in many ways similar to normal NLP training, but quite different in others.

In the language of proteins, I have 20 characters instead of the normal 26 characters used in English (it is 26, right? :D), so that is rather similar. The big difference is that you don’t really combine the characters in proteins into actual words, but rather keep each character as a distinct token or class.

Hence, essentially, my input to the transformer model could just be a list of numbers ranging from 0 to 19. However, that would mean my input only has a single feature, and I’m not sure a transformer can work with that?
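To make it concrete, here is roughly what I have in mind (just a minimal sketch assuming the Hugging Face transformers library; the special-token names and config values are placeholders I made up):

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Character-level "vocabulary": 20 amino acids plus a few special tokens
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
token_to_id = {"<pad>": 0, "<mask>": 1, "<cls>": 2, "<eos>": 3}
token_to_id.update({aa: i + 4 for i, aa in enumerate(amino_acids)})

config = RobertaConfig(
    vocab_size=len(token_to_id),   # 24 entries instead of the usual ~50k
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=514,
    pad_token_id=token_to_id["<pad>"],
)
model = RobertaForMaskedLM(config)

# A protein sequence becomes a plain list of small integers
seq = "MKTAYIAKQR"
input_ids = torch.tensor(
    [[token_to_id["<cls>"]] + [token_to_id[aa] for aa in seq] + [token_to_id["<eos>"]]]
)
out = model(input_ids=input_ids)
print(out.logits.shape)  # (1, sequence_length, vocab_size)
```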

I’m thinking of just doing a one-hot encoding of these characters, which would give me 20 input features. However, this is of course still very low compared to how normal transformers are trained, where d_model is somewhere in the range of 128-512, if I understand correctly.
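Concretely, the one-hot idea would look something like this (again just a sketch assuming Hugging Face transformers and PyTorch; the linear projection is something I made up to get from 20 features up to d_model, and I bypass the model’s own word embeddings via inputs_embeds):

```python
import torch
import torch.nn as nn
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig(
    vocab_size=24, hidden_size=256, num_hidden_layers=6, num_attention_heads=8
)
model = RobertaModel(config)

# Dummy batch: 2 integer-encoded sequences of length 50, values 0-19
ids = torch.randint(0, 20, (2, 50))
one_hot = nn.functional.one_hot(ids, num_classes=20).float()  # (2, 50, 20)

# Project the 20-dim one-hot vectors up to hidden_size and feed them
# directly, skipping the model's built-in word embedding lookup
project = nn.Linear(20, config.hidden_size)
embeds = project(one_hot)                # (2, 50, 256)
out = model(inputs_embeds=embeds)
print(out.last_hidden_state.shape)       # (2, 50, 256)
```

Though writing this out, I realize a linear layer applied to one-hot vectors is basically just an embedding lookup in disguise, so maybe this buys me nothing over simply using the built-in embedding with a small vocabulary?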
Does anyone have experience with anything like this? Any good advice on how it is most likely to work?